
Multi-Core Support


Recommended Posts

Please Klei, the only thing that spoils this game is the lack of Multi-Core Support.

Getting 10 FPS in the mid-to-late game while your state-of-the-art i9 or 7900X CPU sits at 9% load is very painful.

 

I know you have said that this requires a game-engine rewrite, but now that multi-core CPUs are the norm, and you keep adding more and more excellent features, this is the one thing the game desperately needs.

Even if it were paid DLC, please find some way of adding it.


On 1/29/2023 at 11:29 PM, SGT_Imalas said:

in theory, unity engine can support multicore.
in the case of ONI, this would however require a rewrite of the entire game.

I said this in my original post, and it still needs it.  Even one core can't get much above 85% load because too many threads require processing before the main game completes a frame.  The game is simply too calculation-heavy to run on so few cores.


Multi-core is not inherently impossible, nor inherently trivial, nor inherently a single problem.

Concurrent execution has one and only one concern:  data contention.  That's all.  There is literally no other problem.

Data contention means the state of the thing can become invalid at some point if operations occur simultaneously.  For example, when incrementing a counter, the computer reads, then adds, then writes.  Absent atomic operations that read-increment-write in one access (which block memory access from all other cores, serializing other read-add-write instructions), the machine must ensure that between the read and the write, no other actions are taken.  This is done by locking—mutexes, read-write locks, and other semaphores.  Locks can be built purely in software with some clever logic, but modern processors have atomic operations (such as compare-and-swap) that let a programmer swap a value without a race condition, which is enough to build a lock; and modern operating systems handle contention by de-scheduling threads that hit a held semaphore (they yield to the kernel and wait on the mutex).
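The read-add-write hazard can be sketched in a few lines.  This is a minimal Python illustration (ONI is a C#/Unity game; the language here is only for demonstration): four threads bump a shared counter, with a lock guarding the three-step read-add-write sequence so no increment is lost.

```python
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(100_000):
        with lock:           # protects the read-add-write sequence
            counter += 1     # read, increment, write: three steps, not one

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: no increments lost
```

Even in CPython, `counter += 1` compiles to several bytecodes, so increments can be lost without the lock; in native code the same hazard is far more severe.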

Data contention can also be handled by lockless data structures.  A lockless data structure is consistent no matter what order parallel threads are executed in.  Consider two threads A and B, where thread A writes to a value TotalCount and thread B reads from the value TotalCount and takes an action based on the count.  The action for thread B changes if TotalCount changes, but does that matter?

Let's do this with locking.

In the first situation, thread A takes a lock; thread B reaches the lock and waits.  Thread A updates TotalCount, releases the lock, and now thread B uses the new TotalCount value.

In the second situation, thread B takes the lock; thread A reaches the lock and waits.  Thread B acts on the old value of TotalCount, and then releases the lock, and now Thread A updates TotalCount.

Do you see anything odd here?

In no place does thread B do anything relying on TotalCount and some other value that must be updated with TotalCount.  The value of TotalCount is correct in either situation.  That being the case, why do you need the lock?  You only need the lock if TotalCountX and TotalCountY must both be updated together, such that if one is updated and the other is not then the state is invalid and any use of the two will produce invalid results.  Even if that's the case, you can put a lock around the update to TotalCountX and TotalCountY, but not bother checking the lock for read-only accesses to either in operations that do not require the other variable (i.e. operations for which the two are essentially independent).
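The TotalCountX/TotalCountY rule above can be sketched the same way (names are hypothetical, carried over from the example): a single lock wraps the paired update so the invariant—the two values move together—always holds, while nothing else needs to touch that lock.

```python
import threading

pair_lock = threading.Lock()
total_count_x = 0
total_count_y = 0   # invariant: x == y after every complete update

def update_pair(n):
    global total_count_x, total_count_y
    for _ in range(n):
        with pair_lock:          # both must change together, so lock the pair
            total_count_x += 1
            total_count_y += 1

a = threading.Thread(target=update_pair, args=(50_000,))
b = threading.Thread(target=update_pair, args=(50_000,))
a.start(); b.start()
a.join(); b.join()
```

A read that needs only one of the two values, and no relationship between them, can skip the lock entirely—exactly the point made above.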

Concurrent programming maximizes performance by producing as much lockless code as possible.  This requires rethinking how data structures are accessed and how certain operations are achieved.  This can be done in various ways, such as by accumulation.  When I wrote a ballot counter for ranked ballots (STV, Ranked Pairs, IRV, Tideman's alternative, and others), I separated the ballots out into piles and ran counts in a loop, with each thread using a separate index in an array; I then summed all the entries in the array to produce the final value after this concurrent count finished.  Basically, I moved the logic away from the variable that in the naive method needs to be locked, and thus avoided a lock.
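The accumulation trick described above—in the spirit of the ballot-counter anecdote, not its actual code—looks roughly like this: each thread owns one slot in an array, so there are no shared writes and no lock, and a single sum at the end produces the total.

```python
import threading

NUM_THREADS = 4
partial = [0] * NUM_THREADS   # one slot per thread: no shared writes, no lock

def count(slot):
    for _ in range(100_000):
        partial[slot] += 1    # each thread touches only its own index

threads = [threading.Thread(target=count, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(partial)          # single fan-in sum after the concurrent count
```

The logic is moved away from the one variable that the naive method would have to lock; the only serialization left is the final sum.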

When that fails, concurrent programming must rely on locks.

Locks involve making the program stop a calculation and wait.

Locks take time to create, lock, and unlock, as they operate across threads and involve context switches, which are slow.  Code that makes heavy use of locking takes a major performance hit.

Single-threaded code, like ONI's simulator, can't just have locks scattered about and make performance gains.  It could actually be slower running on 32 cores than it would be running thread-unsafe on one core (taking all the locks while running on a single core would also be ungodly slow).

The challenge therefore would be to identify what can be made independent, make those things independent, figure out where and how those can be broken down into independent operations, rewrite those operations to run (mostly) lockless, then spread the threads around appropriately—independent but serialized operations as big serial threads, and independent non-serialized operations farmed out to arrays of worker threads, one per core (plus one).

That's not just a rewrite; it's re-engineering.  It can be done in pieces, sure, and it's not an immensely difficult challenge; but it's not a trivial task, and the precise behavior of ONI's simulation isn't going to be preserved.

I'm not saying Klei shouldn't do it—I fully support improving multi-core performance by reworking the simulator—just that it's a big task, and that it can (and should) be taken in small parts.  Things like only recomputing networks when a network is attached to another network, incrementally updating the networks, and then from there having this recomputation happen first and network execution happen with one network per thread, because networks are completely independent.  If you serialize your work as first splitting, combining, and altering networks when builds happen, and then computing the networks, you can suddenly fan out into threads when you get to computing the networks, and then converge back to your single thread of execution after all the networks are independently computed.

I haven't looked into the code; this may indeed be how Klei is doing it now.  If not, it's one relatively straightforward improvement: making network changes event-driven and network updates multi-threaded, with one thread per network.  But that would still be one very crude improvement to one small part of the entire simulation, a tiny piece of the monumental task of optimizing for multi-threaded environments.
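The serial-phase / fan-out / converge shape described above can be sketched as follows.  The `networks` structure here is a made-up stand-in, not Klei's data model: the point is only that once networks share no state, each can be computed on its own thread before execution converges back to a single thread.

```python
import threading

# Hypothetical stand-in for ONI's wire/pipe networks: each is independent.
networks = [{"wires": i, "load": 0.0} for i in range(8)]

# Serial phase: splitting, combining, and altering networks after builds
# would happen here, on the single main thread.

def compute(net):
    net["load"] = net["wires"] * 1.5   # stand-in for the real network update

# Fan out: one thread per network; they share no state, so no locks.
threads = [threading.Thread(target=compute, args=(n,)) for n in networks]
for t in threads:
    t.start()
for t in threads:
    t.join()   # converge back to the single thread of execution
```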

Improving the gas movement simulation may very well involve rethinking how the simulation functions entirely.  If the simulation operates by updating one or two squares (swapping, exchanging mass, exchanging temperature) and then taking the next step from the new state, that's impossible to make concurrent; a completely different approach may produce a similar (but not identical) simulation that can be broken down into smaller chunks and parallelized, requiring a full re-engineering of that particular mechanic.  A cruder approach would be to identify gas barriers—like identifying rooms—and make the gas movement (mass exchange etc.) independent for each, threading them.  Every time an airlock opens or closes, contiguous gas chunks would be combined or split.  Players would be able to optimize the performance of their base by splitting gas areas, possibly by using the transit tubes as the exclusive method of movement between large chunks of the base and thus providing a gas barrier.

Same with liquids—because liquids are affected by gravity, it makes sense to calculate whether a liquid will move and change its boundary with gases first, adjust the gas barriers if they change, and then compute all your separate liquid and gas pools.  Temperature movement still involves the whole map, barriered only by vacuum.
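The "identify gas barriers, then thread the independent chunks" idea can be sketched as a toy: a BFS flood fill labels the connected open regions of a grid (walls acting as barriers), then one worker per region runs without locks, since regions share no cells.  The grid and the per-region "update" are stand-ins, not ONI's actual simulation.

```python
import threading
from collections import deque

# 0 = open cell, 1 = wall/airlock (a gas barrier).  The middle column of
# walls splits this 5x3 map into two independent gas pools.
W, H = 5, 3
grid = [0, 0, 1, 0, 0,
        0, 0, 1, 0, 0,
        0, 0, 1, 0, 0]

region = [-1] * (W * H)   # region label per cell, -1 = unlabeled
regions = 0
for start in range(W * H):
    if grid[start] == 1 or region[start] != -1:
        continue
    queue = deque([start])            # BFS flood fill from this cell
    region[start] = regions
    while queue:
        c = queue.popleft()
        x, y = c % W, c // W
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < W and 0 <= ny < H:
                n = ny * W + nx
                if grid[n] == 0 and region[n] == -1:
                    region[n] = regions
                    queue.append(n)
    regions += 1

# One worker per region; regions share no cells, so no locks are needed.
cell_count = [0] * regions

def simulate(r):
    # Stand-in for a real gas update confined to region r.
    cell_count[r] = sum(1 for c in range(W * H) if region[c] == r)

threads = [threading.Thread(target=simulate, args=(r,)) for r in range(regions)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Opening an airlock would merge two labels; closing one would re-run the fill locally—the event-driven splitting and combining described above.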

Again, I haven't looked into how the engine works.  Maybe Klei does it this way already, maybe not.  These are the kinds of considerations that would go into making the engine take more advantage of multiple cores.


On 2/14/2023 at 9:21 PM, johninbaltimore said:

Multi-core is not inherently impossible, nor inherently trivial, nor inherently a single problem.

[…]

Again, I haven't looked into how the engine works.  Maybe Klei does it this way already, maybe not.  These are the kinds of considerations that would go into making the engine take more advantage of multiple cores.

There is one problem with your long text: I never read it all.

You talk about multi-core mixed together with single core,

but real multi-core means core A doesn't care what core B does.

Archived

This topic is now archived and is closed to further replies.

Please be aware that the content of this thread may be outdated and no longer applicable.
