Wednesday 5 October 2022

How we can learn parallel computing from computer games

 I really like computer games like Factorio or Satisfactory In those games you have to mine your planet resources e.g iron/copper/uranium/coal and put an intricate supply chain of factory machines/betls until more complicated resources come out e.g copper wire, engine, fuel, etc'. 

Here are a few screenshots:

Those games are really fun. It always felt like programming sometimes. It can bring your creative side to life, when reality sometimes does not have that. 

There many games like that, but what most have in common is the performance issues once you get into the "megabase" territory. The megabase is where your factory grows large, and usually, everything starts to slow down once you get there.

Mostly, those games are single threaded, at least the simulation of the factory is single-threaded. Up until recently I thought that it was single thread because of programming simplicity, but then I read Factorio forum post about their performance issues, and it seems they tried multi-threading but said it did not improve drastically(15%), and claimed the issues was memory bottlenecked. 

It seems weird to me, how my 32/48 threads ThreadRippers can be a bottleneck when using two or more cores so easily? Does it really make sense?

Lets try to write the concept of those games in Go language, a very simple, very high parallisem programming language. Lets make each factory/entity a go-routine, and the go-channels are the input/output of the factory.

This is basically the implementation of what we described earlier, and well? its behaving pretty garbage, CPU is quite low (but much more than a single thread), but even on a very powerful ThreadRipper it does not really get us any good performance, probably not even better than the single thread, but why? how can it be?

Using AMD uProf to profile the process, it seems it get stuck a lot on some timer function and opcode. It might be a Windows thing, so I switched to linux.

In linux things still seems bad, but different. 

For example the findrunnable function seems to be related to Go:

Might be realted to the that fact they combine (or not) timers?

I remeber Akka having some kind of timer schduler, so lets try it:



We dont see really the JVM, only Kernel stuff, which is good.
Its probably means Akka handles Timers better:

The surprise was the M1 (both regular and pro) which behaved much better and got much better performance (for both Go and Akka) than my 3960x/2950x ThreadRippers. Whicih I further tried to explain here:

It seems that those benchmarks basically profile timers/scheduler/event-loop performance in some way or another, which is not ideal. 

I also wanted to test CUDA: 

The cuda approach is quite different. We basically have matrix which go through all the time in a loop and we check if enough time has passed for each factory. This means no insane scheduling really. This with the fact that GPUs are really good at traversing a lot of data means my 1080 ti could really crash this test, and get a decent result (100M "manufacturers").

The limit of the 1080 TI might be a memory. I wanted to test it with 32GB M1 Max (with Apple Metal) but still did not manage to get to it. 

I learn a lot in this process (as probably there is much more to learn). Hope you as well!

I am not sure I can fully answer the original question I asked. CUDA for sure can do it, but it might not really I was looking for. 

No comments:

Post a Comment