
Multi-thread gas/temperature updates (actual implementation proposal)



Based on gameplay evidence, revealing the map slows down the game, and the updates in TheSIM.dll appear to be what is causing it. Gas movement and temperature movement seem to be the main culprits. That makes this an interesting part of the code to optimize. Multi-threading would be nice, but it requires a careful approach, because the "by the book" approach is likely to cause critical races, and we don't want those.

In the following I will assume that updating a cell accesses the cell and the 8 cells around it. It's OK if it accesses fewer cells than that, but if it accesses more, then the following might not work at all. I also assume the current code loops through all cells from cell ID 0 to the end, because that is the approach which makes the most sense judging from the memory layout. However, I can't actually verify this, because TheSIM.dll is unmanaged code and the decompiler only works on managed code. In other words, I can't actually see what goes on inside that file.

The proposed setup:

Assign a name to each row of cells. We name them A, B and C, then repeat, giving A, B, C, A, B, C, A... Instead of looping over all cells, loop through all A rows, doing each row from left to right. Once done, do all B rows, and finally all C rows.

The idea is that since there are 2 rows between each A row, and cells only affect neighboring cells, we will be able to do the A rows in any order without affecting the result. In fact, because no A row will ever access cells which are also accessed by any other A row, doing all A rows in parallel should be possible. The same goes for B and C. What can't be done is running A, B and/or C at the same time: the order of those three has to be fixed, and they must run one at a time.

This means the current loop-all-cells code should be replaced with: start all A threads, wait for the threads to finish, start all B threads, and so on.
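To make the call order concrete, here is a rough sketch. The real cell data and update rule are hidden inside TheSIM.dll, so the `Grid` type, the averaging rule and every name below are made up purely for illustration:

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of the proposed phased update. The real cell data and
// update rule live inside TheSIM.dll; here a cell is just a float and the
// "update" averages a cell with its vertical neighbors.
struct Grid {
    int w, h;
    std::vector<float> cells;            // row-major, like ONI's cell arrays
    float& at(int x, int y) { return cells[y * w + x]; }
};

// Update one row left to right. May read the rows directly above and below,
// but only ever writes its own row.
void updateRow(Grid& g, int y) {
    for (int x = 0; x < g.w; ++x) {
        float sum = g.at(x, y);
        int n = 1;
        if (y > 0)       { sum += g.at(x, y - 1); ++n; }
        if (y < g.h - 1) { sum += g.at(x, y + 1); ++n; }
        g.at(x, y) = sum / n;
    }
}

// Phase 0 = all "A" rows (0, 3, 6, ...), phase 1 = "B", phase 2 = "C".
// Rows within one phase are at least 3 apart, so their neighborhoods never
// overlap and they can run in parallel; the phases run one at a time.
void simTick(Grid& g) {
    for (int phase = 0; phase < 3; ++phase) {
        std::vector<std::thread> workers;
        for (int y = phase; y < g.h; y += 3)
            workers.emplace_back(updateRow, std::ref(g), y);
        for (auto& t : workers) t.join();    // barrier before the next phase
    }
}
```

Spawning one thread per row is of course naive; a real implementation would hand the rows of each phase to a fixed thread pool instead. The point of the sketch is only the phase/barrier structure.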

Single-threaded operation can easily be implemented: run the called method in the main thread instead of moving it to another thread. This will result in the same functionality, but with everything in the same thread. While likely not very useful for players, it will be a huge help for anybody who has to debug the code. Make single-threaded operation a game option in case somebody wants to play with just one active CPU core. It's easy to implement, and we never know whether it will be good to have at some point in the future if something unexpected happens. Maybe it's enough to have the setting in kplayerprefs.yaml and not in the game UI.

Don't use hardcoded values. Instead, provide full support for mods which change the map size by reading the current map size at runtime.

The benefits of this approach:

  • Same behavior as the current one (or at least hopefully close enough)
  • Allows a near unlimited number of CPU cores (up to map height / 3 cores)
  • Minimal thread overhead (entire rows for each thread)
  • CPU cache friendly (loops through rows in a cache optimized way)
  • Single core friendly (debugging friendly)
  • No need for thread locks (the cores should work, not wait for each other)
  • Doesn't change the currently working implementation of what to do with each cell, just the call order

Interesting idea, but one concern I'd have is that you need to wait for the longest A row to finish before you can do any B calculations, then the longest B row before any C calculations. If a player explores one long band left and right, then we are sure to have one very long row for each, which becomes a bottleneck (most threads idle a bit waiting on these few threads to finish). I would instead suggest a 2-dimensional approach using a repeating 3x3 grid of cell labels (A, B, C, D, E, F, G, H, I), so that each thread can be given an equal share of the cells of a given label, ditching the left-to-right marching approach and thus ensuring that, in general, each thread takes the same amount of time to complete no matter how the map has been explored.

However, this could cause CPU cache issues if a thread is given too wide a range of cells. It would need to be tested.

2 hours ago, Nebbie said:

I would instead suggest a 2-dimensional approach using a repeating 3x3 grid of cell labels (A, B, C, D, E, F, G, H, I)

I thought about a layout like this:

AB
CD

If each letter is at least 2x2 and only one letter is executed at once, then the effect will be the same: no cell will be accessed by two different threads at the same time.

However, after thinking about this for a while, I realized it would cause a lot of random cell access. If the map is accessed row by row, then each thread accesses the map data in the order the cells are stored, maximizing cache utilization. This would most likely result in less time needed to run through a full map.

2 hours ago, Nebbie said:

Interesting idea, but one concern I'd have is that you need to wait for the longest A row to finish before you can do any B calculations, then the longest B row before any C calculations. If a player explores one long band left and right, then we are sure to have one very long row for each, which becomes a bottleneck (most threads idle a bit waiting on these few threads to finish).

It's a valid concern, and yes, the player can end up exploring in a way where one or two rows are much longer than the rest. However, is that a problem? If the player has only explored a few rows, then most of the map will be unexplored. If most of the map is unexplored, then odds are the player will not have FPS issues. It doesn't matter if the CPU can provide 90 or 95 FPS if it's capped at 60 anyway.

The low framerate becomes a problem when the player has explored a lot of the map, perhaps even all of it. For this reason, if we can only optimize for one of the two cases, the code should be optimized for a mostly explored map rather than one with just a small fraction explored.

You are theorizing without knowing what exactly is happening in sim.dll.

I suspect that most tile changes use a cellular automaton algorithm, and the thing about cellular automata is that they can be perfectly multi-threaded (GPU pipelines work in a similar way, to some extent, and there are no race conditions or any of that nasty stuff). In a few words: there must be two arrays, the current state of the world and the future state. The first one is read-only, the second write-only, and each of its cells is written only once. After calculating the next step, swap the addresses of the previous and future arrays, and repeat this every sim tick. I could be wrong, but the behavior of some mechanics in the game looks like there really is a cellular automaton running.
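A minimal sketch of that double-buffer scheme, with a toy averaging rule standing in for whatever sim.dll actually computes (everything here is hypothetical):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch of the double-buffered cellular-automaton scheme (hypothetical; we
// can't see what sim.dll actually does). "current" is only read, "next" is
// only written, and each cell of "next" is written exactly once per tick, so
// any number of threads (or GPU lanes) could split the cell range between
// them without races.
struct CellSim {
    int w, h;
    std::vector<float> current, next;

    CellSim(int width, int height, float init)
        : w(width), h(height), current(width * height, init),
          next(width * height, 0.0f) {}

    float read(int x, int y) const {
        // Clamp at the borders so edge cells reuse their own value.
        if (x < 0) x = 0;
        if (x >= w) x = w - 1;
        if (y < 0) y = 0;
        if (y >= h) y = h - 1;
        return current[y * w + x];
    }

    void tick() {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                // Toy rule: average the cell with its 4-neighborhood. All
                // reads hit "current", the single write hits "next".
                next[y * w + x] = (read(x, y) + read(x - 1, y) + read(x + 1, y)
                                   + read(x, y - 1) + read(x, y + 1)) / 5.0f;
        std::swap(current, next);   // flip the buffers for the next tick
    }
};
```

Because no iteration of the inner loop depends on any other iteration's output, the cell range could be cut into chunks of any size without changing the result.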

15 minutes ago, D.L.S. said:

You are theorizing without knowing what exactly is happening in sim.dll.

Correct. However, considering there is a performance issue and the game apparently runs very single-threaded, I assume there is a problem to solve. The best I can do is make assumptions about why it isn't multi-threaded and then say: if that's the case, then here is an idea.

15 minutes ago, D.L.S. said:

I suspect that most tile changes use a cellular automaton algorithm, and the thing about cellular automata is that they can be perfectly multi-threaded (GPU pipelines work in a similar way, to some extent, and there are no race conditions or any of that nasty stuff). In a few words: there must be two arrays, the current state of the world and the future state. The first one is read-only, the second write-only, and each of its cells is written only once. After calculating the next step, swap the addresses of the previous and future arrays, and repeat this every sim tick. I could be wrong, but the behavior of some mechanics in the game looks like there really is a cellular automaton running.

If you are right, then why is the game still single threaded?

If you are wrong, then why isn't the game rewritten to match what you just wrote? It sounds like a clear winner.

If there are two arrays with a current and a next state, then it really does sound like the perfect setup for a SIMD approach. Even if the code is written naively, like for a CPU (if/else etc.), and hence loses half the cycles or more to masked-out lanes, it still sounds like it would be way better than what we have right now.
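For illustration, here is a two-array update written as a plain branch-free loop, the kind of code a compiler can auto-vectorize or that could be hand-written with intrinsics (the 1D diffusion rule is made up, not taken from the game):

```cpp
#include <cassert>

// Made-up 1D diffusion rule over the two-array layout described above. The
// loop has no branches and every iteration is independent (it reads only
// "cur" and writes only "next"), which is exactly what auto-vectorizers and
// SIMD intrinsics want.
void diffuse(const float* cur, float* next, int n, float k) {
    for (int i = 1; i < n - 1; ++i)
        next[i] = cur[i] + k * (cur[i - 1] + cur[i + 1] - 2.0f * cur[i]);
}
```

With this shape, each SIMD instruction processes several cells at once instead of one, and no lanes are wasted on if/else masking.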

Simple. Flexible. Efficient. Pick one.

ONI has been under development for a long time. Many systems have changed, sometimes repeatedly, sometimes dramatically. Settling down and grinding out super-efficient code only makes sense once the design is set in stone.

Threads may be cheap, but they're not THAT cheap, and reckless spam will add overhead to the program. A lean, mean single thread will always be more efficient than many threads with tangled dependencies.

The ONI asteroid contains roughly 100000 map tiles. Let's take a very modest 2.5Ghz CPU. If we were to run the game at the default 1x speed (5 updates/sec), a single process has around 25000 CPU cycles per tile to get stuff done. Plenty of room for even a messy setup. When that same simulator is cranked up to 50 updates/sec, there's 2,500 CPU cycles available per tile. Not a lot of space to goof around, but it seems pretty workable. As long as all the data is neatly lined up and the CPU can knock out operations one after another, there's no immediate reason to believe a single thread can't handle the effort for world updates. 

The developers of the game Factorio, which is programmed in C++ and very heavily optimized for performance, have determined that there is not much to gain from extensive multithreading, because the main bottleneck is not so much the processor itself, but rather memory bandwidth.

I found this post by (ex-)Factorio developer Harkonnen (he was still a Factorio developer at the time he wrote it) especially interesting.

Here are some Factorio benchmark results:

https://www.reddit.com/r/factorio/comments/4h647g/factorio_performance_test_cpuram_based_fpsups/

However, I am not sure whether these results would also apply to ONI. As far as I know, Factorio does not have to iterate over all tiles/cells, as ONI does with its temperature and gas calculations. Therefore, ONI may profit a lot more from multithreading.

9 hours ago, bobucles said:

Threads may be cheap, but they're not THAT cheap, and reckless spam will add overhead to the program. A lean, mean single thread will always be more efficient than many threads with tangled dependencies.

Starting a thread for each row is a waste of time; it should be done more cleverly. The best solution would be something like Threading Building Blocks, which sadly is C++-only, but I assume it's possible to find something similar for C#. The idea is that it starts one thread per core and each core gets a queue of tasks. A queue watcher can move tasks from one queue to another in case one thread by chance ended up with all the fast tasks. Since this avoids multitasking between active threads and greatly reduces the need to start and stop threads, it becomes possible to gain performance from running much smaller tasks in parallel than would otherwise be worthwhile.
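Real work stealing is more involved, but the load-balancing effect can be approximated with a fixed pool of threads pulling row indices from a shared atomic counter. This is a hypothetical sketch, not TBB:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Not real TBB: a minimal stand-in for its load balancing. A fixed pool of
// worker threads (one per core) pulls row indices from a shared atomic
// counter, so a thread that happens to finish its short rows early simply
// grabs more work instead of idling.
void parallelRows(int firstRow, int rowCount, int stride,
                  const std::function<void(int)>& updateRow) {
    std::atomic<int> nextIndex{0};
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c) {
        pool.emplace_back([&] {
            for (;;) {
                int i = nextIndex.fetch_add(1);   // claim the next row
                if (i >= rowCount) return;        // nothing left to do
                updateRow(firstRow + i * stride);
            }
        });
    }
    for (auto& t : pool) t.join();   // wait for the whole phase to finish
}
```

The A phase would then be something like `parallelRows(0, numARows, 3, updateRow)`, followed by the B and C phases. A real task scheduler would keep the pool alive between phases instead of recreating it each time.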

8 hours ago, Tekky said:

The developers of the game Factorio, which is programmed in C++ and very heavily optimized for performance, have determined that there is not much to gain from extensive multithreading, because the main bottleneck is not so much the processor itself, but rather memory bandwidth.

A newer link than anything you provided is FFF-271, which talks about fluids in pipes and parallelism. It describes putting all the needed data in fluidboxes to optimize CPU cache usage, hence reducing the need for memory I/O. Another benefit of fluidboxes is isolating the pipes from each other, and those two combined mean it's both possible and beneficial to multithread the calculations.

What's interesting here is that ONI has cell data in arrays meaning the memory layout is much more like what Factorio is today than what it was prior to this update.

11 hours ago, bobucles said:

 around 25000 CPU cycles per tile 

...

2,500 CPU cycles available 

Haha, whoops. Maybe one day I'll be good at math. A 2.5Ghz CPU pumping 100k tiles at 50 updates/sec only has 500 CPU cycles available per tile. That's getting into pretty lean territory, especially when the less destructive cache misses can cost a few dozen cycles. 

There is rarely any point in gaming history where throwing more CPU cycles at a problem ends up solving it better. Thanks to the magic of O-notation, any base code that isn't O(log N) or O(N) is going to crush a CPU, and more processing power will not fix it. If a solution can cut CPU expenses in half, it is better to take that, because a solution that doubles the available CPU cycles will never come close.

14 hours ago, Nightinggale said:

If you are right, then why is the game still single threaded?

If you are wrong, then why isn't the game rewritten to match what you just wrote? It sounds like a clear winner.

I can only guess: at the beginning there was no thermodynamics in this game, and I guess at that point even a simple single-threaded approach to simulating the world was fast enough. After a bunch of new mechanics were added, performance became poor, but by then it was too late to rewrite everything from the ground up; it is cheaper to patch holes than to build a new boat. Also, I don't assume sim.dll is the only cause of the FPS drop; if you speed things up there, you would still have poor performance because of other things on the Unity side of the game.

 

P.S. I think sim.dll is a C++ library. Ask the devs what is going on there, or disassemble it and try to figure it out yourself.

1 hour ago, D.L.S. said:

P.S. I think sim.dll is a C++ library. Ask the devs what is going on there, or disassemble it and try to figure it out yourself.

I did something else: I opened the .pdb file (the symbol file) as plain text, and it starts with "Microsoft C/C++ MSF 7.00". Sometimes even binary files are useful to read as plain text, particularly the first line.

Since it is apparently C++, Threading Building Blocks is the obvious choice for any task which uses multithreading for performance reasons. The brilliance of this library is that you essentially only need two lines: one is a function call you need another thread to handle, and the other waits for all tasks in other threads to complete. There is no real setup, and the library deals with threading overhead to reduce it to a minimum. It's also a free library with the option of paid corporate support. I have always viewed TBB as Intel's attempt to make developers encourage end users to buy CPUs with more cores rather than trying to profit from developers themselves. Despite being developed by Intel, it also works on any AMD CPU that is able to run ONI.
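To illustrate the "two lines" shape with something that runs without TBB installed, here is a standard-library analogue using std::async; with TBB the same pattern would be a `tbb::task_group` `run()` plus `wait()`, or a single `tbb::parallel_for`, with far lower per-task overhead:

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Illustration of the "two lines" shape using std::async from the C++
// standard library as a stand-in for TBB's task scheduler. TBB would reuse
// its per-core worker threads instead of potentially creating one per call.
long long sumRange(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(lo),
                           v.begin() + static_cast<std::ptrdiff_t>(hi), 0LL);
}

long long parallelSum(const std::vector<int>& v) {
    std::size_t mid = v.size() / 2;
    // Line 1: hand half the work to another thread.
    auto upper = std::async(std::launch::async, sumRange,
                            std::cref(v), mid, v.size());
    long long lower = sumRange(v, 0, mid);
    // Line 2: wait for (and collect) the other thread's result.
    return lower + upper.get();
}
```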

Moving the calculations to the GPU would be even better, but we can't tell for certain whether that's an option.

1 hour ago, D.L.S. said:

if you speed things up there, you would still have poor performance because of other things on the Unity side of the game.

Unity is a fairly fast engine and not prone to low FPS by itself. If Unity is the cause of the slow gameplay, then the task is to find the flawed Unity calls. Luckily the Unity calls are in C# and not TheSIM.dll, meaning we can actually search for them if we really want to. I didn't say it's an easy task, just that it's possible if somebody is determined enough to look through the code to identify the calls.

2 hours ago, Nightinggale said:

I have always viewed TBB as Intel's attempt to make developers encourage end users to buy CPUs with more cores rather than trying to profit from developers themselves. Despite being developed by Intel, it also works on any AMD CPU that is able to run ONI.

Generally, I am very distrustful of Intel when it comes to its software supporting AMD CPUs.

For example, 10 years ago, it was found that Intel had designed its C++ compiler in such a way that it made full use of SSE features on Intel CPUs, but deactivated this optimization on all non-Intel CPUs, even in cases where the CPU identified itself as supporting SSE. See this Wikipedia article for further information.

I consider this a very unfair business practice of Intel, because this makes AMD processors appear significantly slower than Intel CPUs, when they are not.

3 minutes ago, Tekky said:

Generally, I am very distrustful of Intel when it comes to its software supporting AMD CPUs.

We should always assume companies don't want the best for their competitors, particularly when we can't really verify what they are doing. However, TBB is open source. For once we have a chance to see what is going on, and so far it appears nobody has found any "anti-AMD" performance features.

3 hours ago, Nightinggale said:

Unity is a fairly fast engine and not prone to low FPS by itself. If Unity is the cause of the slow gameplay, then the task is to find the flawed Unity calls. Luckily the Unity calls are in C# and not TheSIM.dll, meaning we can actually search for them if we really want to. I didn't say it's an easy task, just that it's possible if somebody is determined enough to look through the code to identify the calls.

I had FPS drops because of lots of pickable items lying around; I think the lag starts because of the many entities in the game world.

 

On 31.07.2019 at 6:12 PM, Nightinggale said:

Based on gameplay evidence, revealing the map slows down the game, and the updates in TheSIM.dll appear to be what is causing it.

Btw, did you profile it? I mean, did you measure the execution time of a simulation tick, and does it get slower over time? Also, does revealing the map affect simulation time?

13 minutes ago, D.L.S. said:

Btw, did you profile it? I mean, did you measure the execution time of a simulation tick, and does it get slower over time? Also, does revealing the map affect simulation time?

No, because I don't know how to profile without access to the source code. Sure, I can try different games and measure FPS, but that's not particularly useful for pinpointing the slow parts of the code. If you can tell me how to profile ONI, I would love to hear about it.

Writing this, I realize it would be possible to write a prefix and a postfix for a method call and then measure how much time passes between those two methods being called (System.Diagnostics.Stopwatch would be ideal for this). It is, however, a lot of work to set up if you want to measure more than just a few methods. Actually, now that I think about it, maybe I will do something just like this. It would, however, require a good strategy for choosing which methods to measure.

18 minutes ago, D.L.S. said:

I had FPS drops because of lots of pickable items lying around; I think the lag starts because of the many entities in the game world.

Try measuring FPS (or CPU usage) when the game starts, then use debug mode to reveal the entire map. You will notice a drop in frames, which won't come from pickupables (yes, that's the name in the source code) lying around. I'm not saying the map is the only issue; in fact, we have indications that pathfinding is also an issue. Also, I did make the black hole mod to get rid of items lying around, precisely for performance reasons.

However, if we focus on all possible slowdowns at once, we lose focus on what to do about each of them. I have written about one such location in the code because I came up with something which might boost performance. If it makes that code execute twice as fast, then great. If that code is just 10% of the total CPU time, then it still saves 5%.

People have tried replacing gases with neutronium and claim to gain better performance. If that's the case, then gas movement/temperature does play a role in the FPS issue, and that makes it a valid target for brainstorming about optimization approaches.

14 hours ago, Nightinggale said:

Writing this I realize it would be possible to write a prefix and a postfix to a method call and then measure how long time passes between those two methods being called (System.Diagnostics.Stopwatch would be ideal for this). It is however a lot of work to set up if you are to measure more than just a few methods. Actually now that I think about it, maybe I would do something just like this. It would however require a good strategy for which methods to measure.

Yes, I mean something like this. Look for UnsafeSim200ms(float dt) in Game.cs; measuring the execution time of this method will roughly show the execution time of the world simulation. I think something as simple as "startTime = DateTime.Now.Ticks; endTime = DateTime.Now.Ticks - startTime;" and logging endTime would do.
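On the C++ side of such a measurement, a monotonic clock is preferable to DateTime-style wall-clock ticks, which can jump when the system clock changes. A hypothetical sketch:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical timing wrapper for a sim tick. steady_clock is monotonic, so
// the measurement can't go negative or jump when the system clock changes.
template <typename Fn>
long long measureMicros(Fn&& tick) {
    auto start = std::chrono::steady_clock::now();
    tick();   // the code being profiled
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}
```

An equivalent Stopwatch-based wrapper in C# would then go around something like UnsafeSim200ms, logging the result each tick.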

 

I heard that replacing any tile with neutronium affects performance, probably because there is no temperature simulation for neutronium; replacing tiles with vacuum or void tiles should work similarly. Also, I guess replacing a gas/liquid tile with a solid tile should gain some performance, because there is no physics simulation (movement/mass equalization/gravity) for solid tiles. All of these guesses can be tested once you set up measurement of the simulation time.

It is very unlikely that this particular use of threading will yield a significant increase in performance; more likely it will result in a net drop in speed. The main reason is memory access. No matter which line you pick to start processing, the entire map's sim state is required to complete the computation. Processing the effect from row A means grabbing temperature data from rows B and C, and this is true when picking those rows as well.

For example, row A1 finishes. There is no immediate need to store B1 or C1; they can be released from memory because A2 needs the data from B2 and C2. Complete the cycle, start processing the B rows, and suddenly B1 needs to reload A1 and C1 to do it. Instead of the data being loaded, used once, and put away, you had to reload it THREE times and juggle the threading red tape to get the same outcome. What a waste.

TLDR: Threading is not a panacea. All threading optimizations do is make THREADING faster. Threading bad code does not fix bad code.

11 minutes ago, bobucles said:

For example, row A1 finishes. There is no immediate need to store B1 or C1; they can be released from memory because A2 needs the data from B2 and C2. Complete the cycle, start processing the B rows, and suddenly B1 needs to reload A1 and C1 to do it. Instead of the data being loaded, used once, and put away, you had to reload it THREE times and juggle the threading red tape to get the same outcome. What a waste.

You are right, except for one part. (and no, I'm not talking about how A2 is technically between C1 and B2, not between B2 and C2).

The map data is read 3 times instead of once. If the game is throttled by memory access, then you are correct. However, if the game is throttled by how fast a CPU core can work, then putting more cores to work will speed up the calculations.

The number of cells calculated per second on each CPU core will go down, but I would be happy even if it is cut in half, because that means 4 cores will still end up being double the speed of single-core performance.

What you are saying is essentially that using more cores will thrash the level 3 cache. It's a known issue and has been a topic for years. It's also an argument against hyper-threading. However, we still use multi-core CPUs and hyper-threading, because they can speed up execution even if they don't double the speed every time you double the number of cores.

Getting the best out of the CPU cache is still a concern, and there are ways to tweak the layout to try to maximize this. We could use rows instead of 3x3 boxes on the map. We could also consider making each thread handle more than one row, e.g. the rows go A, B, A, B etc., where A and B are 5 rows each.

How much performance there is to gain here, and which approach is best, is impossible to predict, particularly for people without access to the source code in question. There is really only one way to find out, and that is trial and error. There is potentially a lot to gain, but we can't be certain how much until we have a result.

On 8/2/2019 at 8:10 PM, Nightinggale said:

If the game is throttled by memory access, then you are correct.

CPU design is by nature throttled by memory access. A single memory access can cost dozens of CPU cycles. So it makes sense to read once, process everything, and finish in one pass. It makes extra sense for a straightforward set of rules such as the temperature sim.

You can spend all day running core management meetings, function diversification quotas and write miles of threaded red tape to create a modern middle management hell in CPU form. It's trash. It's absolute garbage. OR, you can spend the exact same amount of developer energy to create good, clean algorithms. Once the task is done, it's done. No more CPU cycles need to be spent on it. It becomes a surprise how much work 1Ghz can actually do, if you tell it to actually do work.

3 hours ago, bobucles said:

CPU design is by nature throttled by memory access. A single memory access can cost dozens of CPU cycles.

It's true, but if everything were throttled by memory access speed, then CPU speed shouldn't really matter. In the benchmark thread I tried running the same CPU at 4, 4.2 and 4.4 GHz. The resulting FPS measurements (averaged over 60 seconds) painted a really clear picture: CPU speed influences FPS significantly. Remember, the memory speed was unchanged.

There are two parts to memory bottlenecks: latency and throughput. If the CPU is bottlenecked by throughput, then adding more cores won't boost performance. If the bottleneck is latency, then more cores can benefit performance, because it's possible to increase throughput without changing the latency. Ok, you can get slightly increased latency, but that matters far less than the increased CPU power and throughput.

Also, if CPUs were bottlenecked by memory alone, then why do people spend fortunes on CPUs with many cores? All they would need is fast memory, not fast CPUs. Well, the benchmarks for Blender rendering show that 32 cores are faster than 16 cores. Clearly there is more going on than just memory bottlenecks.

3 hours ago, bobucles said:

You can spend all day running core management meetings, function diversification quotas and write miles of threaded red tape to create a modern middle management hell in CPU form. It's trash. It's absolute garbage.

This thread started out about how to add threads without adding thread-management issues. Adding TBB makes it even easier, and likely more efficient too.

3 hours ago, bobucles said:

OR, you can spend the exact same amount of developer energy to create good, clean algorithms. Once the task is done, it's done. No more CPU cycles need to be spent on it. It becomes a surprise how much work 1Ghz can actually do, if you tell it to actually do work.

Optimizing single-threaded performance is great. It's always best to reduce the number of CPU cycles needed, for a whole lot of reasons (battery life, heat, fan noise, etc.). However, there is a limit to how well something can be optimized. If you need to add up 100 numbers, you can't optimize it down to just 3 additions. At some point you max out a CPU core, and if you have enough to calculate, you may have to resort to multiple cores for performance reasons. A nice bonus is that optimizing single-core performance also optimizes multi-core performance.

 

Also keep in mind that I can't really comment on optimizing the algorithm, because I don't know the current algorithm or precisely what it needs to accomplish. However, I can make reasonable assumptions and then say "if those assumptions are true, then multithreading is possible by...". This means the thread isn't about optimizing single- vs multi-threaded operation, but rather an attempt to brainstorm concepts which could work. It beats the alternative, which is to not write anything.

I would also add that this isn't an order to Klei or anything. It's brainstorming about what could be a good idea. It doesn't have to be implemented, or implemented the way I propose here. In fact, I find the talk about a cellular automaton algorithm running on the GPU the most interesting. If ONI speeds up, then I'm happy. If I influenced it directly (my proposal) or indirectly (somebody else replying here), then that's a bonus, but ONI FPS is the main goal.

Archived

This topic is now archived and is closed to further replies.

Please be aware that the content of this thread may be outdated and no longer applicable.
