Multithreaded Simulations

Hi all.
Sorry if this has already been suggested, if it’s already implemented and I just need to download the latest version, or if I shouldn’t be making a thread about it at all. I’m sure I’m guilty of at least one of those.
Anyway, as the title suggests, I would like to see multi-threading support for all simulations. But before you chastise me - because obviously communicating between cores slows things down so much you might as well use a single thread - I have a couple of ideas that could make this work. Perhaps it could be extended to GPUs too, but hey, I’m not all that good at programming in general, so I don’t know if it’s easy or next to impossible (probably the latter). I’m putting my idea up for the experts to judge.
The main problem is that every bit of a simulation, whether it’s the voxels in smoke or the vertices in cloth, needs to interact with its surroundings. Regardless of how you divide it, you would end up with inconsistencies along the edges - either that, or the simulation would be slowed down by the communication between cores, plus synchronization issues, among others.
The first thing to do is to not compare with/reference the current subframe/step, but the one before.
Obviously this sacrifices a bit of accuracy and requires increasing the number of subframes/steps to accommodate (worst case, I think, is logically double), but you now have all cores working instead of one. Each core compares its own chunk with the entire previous subframe/step, which should already be cached anyway (there’s no reason to get rid of it other than to free up memory). Not much memory used, no need for communication between cores, and overall much faster. A bit of efficiency is lost indirectly, but it’s a big leap forward.
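To show what I mean, here’s a rough C++ sketch (I’m no expert, so the names and the 1D grid of cells are just made up for illustration): each thread writes its own chunk of the next step while reading only the previous, already-finished step, so nothing needs to lock or communicate mid-step.

```cpp
// Toy sketch: each thread updates its own chunk of `next` while reading
// only the fully cached `prev` buffer, so no locking or communication
// is needed during the step. All names here are hypothetical.
#include <cstddef>
#include <thread>
#include <vector>

void simulate_step(const std::vector<float>& prev, std::vector<float>& next,
                   unsigned num_threads) {
    const std::size_t n = prev.size();  // `next` must also be sized n
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = n * t / num_threads;
            const std::size_t end   = n * (t + 1) / num_threads;
            for (std::size_t i = begin; i < end; ++i) {
                // Read neighbours from the *previous* step only,
                // never from `next`.
                const float left  = (i > 0)     ? prev[i - 1] : prev[i];
                const float right = (i + 1 < n) ? prev[i + 1] : prev[i];
                next[i] = 0.5f * prev[i] + 0.25f * (left + right);
            }
        });
    }
    for (auto& w : workers) w.join();
    // Afterwards the buffers are swapped: `next` becomes the new `prev`.
}
```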

Still, though, there’s an obvious bottleneck. Let’s say I have 4 cores (which I actually do) and, with a bunch of other programs running at the same time, each is effectively running at a different speed. The least loaded one finishes its chunk first, followed by the next, and so on until the last. You could go back more frames, but that drastically reduces the efficiency and at the end of the day doesn’t solve the problem. This would become more pronounced with more cores and would really slow things down.
Instead you could do something like the tiles used in rendering, except with 3D chunks (or however it would be divided). Each core deals with one little chunk (group of hairs, whatever) at a time and doesn’t need to sync or communicate, because it’s comparing with the previous, already cached, subframe/step. Multithreading accomplished.
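Roughly like this (again, just a made-up sketch building on the previous one): threads pull small chunks off a shared counter, so a core that’s busy with other programs simply ends up taking fewer chunks.

```cpp
// Toy sketch of render-tile-style scheduling: threads grab small chunks
// off a shared atomic counter, so a core slowed down by other programs
// just ends up processing fewer chunks. Names are hypothetical.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

void simulate_step_tiled(const std::vector<float>& prev,
                         std::vector<float>& next,
                         unsigned num_threads,
                         std::size_t chunk_size = 1024) {
    const std::size_t n = prev.size();
    std::atomic<std::size_t> next_chunk{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t begin =
                    next_chunk.fetch_add(chunk_size);  // claim a chunk
                if (begin >= n) break;                 // no work left
                const std::size_t end = std::min(begin + chunk_size, n);
                for (std::size_t i = begin; i < end; ++i) {
                    const float left  = (i > 0)     ? prev[i - 1] : prev[i];
                    const float right = (i + 1 < n) ? prev[i + 1] : prev[i];
                    next[i] = 0.5f * prev[i] + 0.25f * (left + right);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```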
There are other technicalities as well (how the chunks would be determined, how the physics would be adjusted to retain accuracy while staying fast, etc.), plus the question of how to extend this to GPUs, but in theory at least this would be a massive leap forward. No more CPU idling, and people with a butt ton of cores would no longer have to suffer from painfully slow simulations.
If it can be done, I think this should have rather high priority.
Thoughts?

If something moves fast enough, that would indeed be a problem, solvable by going further back* (referencing more subframes/steps behind). But at some point that becomes even less efficient. If you have such a powerful CPU, though, you can afford to increase the resolution, detail, or whatever. However, like adding more samples to renders, it’s a rudimentary solution.
I don’t see why this bit wouldn’t work, though, since instead of dividing up the timeline it’s dividing within a simulation step. Each core deals with one chunk at a time, using data from one subframe/step ago as the reference. They only need to communicate initially, to divide the simulation into chunks, and at the end, to put it back together, for each subframe/step. Certainly better than the alternative, right?
As someone else said, it’s true that having more people won’t make long division faster. But we can cheat and do something a bit different to accomplish the same thing, something that can be split up. Long division isn’t really comparable to any physics simulation.
*Okay, but isn’t referencing multiple steps back equivalent to running multiple parallel simulations? How would that be any better?
Well, you could reference multiple subframes/steps, but I’m not quite sure how that would work, since simulations are supposed to be sequential. So referencing more than one subframe/step back is probably a bad idea.
How will it be divided? Any sample code to show?

Nope, no idea. But rather than try to give each core the same amount of work, I think having more, smaller chunks would work better. Then it’s comparable to render tiles.
Sorry, I should have made all that clear.

I guess a raw idea like this isn’t really all that useful to the devs, but I didn’t see this approach posted anywhere else so I thought it might be worth mentioning. I probably shouldn’t be advertising this as a solution for multithreading when it’s more of a skeleton of one.

As Lukas pointed out in the link shared by Richard, multi-threading can be really tricky to handle for physics simulations. Now, I don’t know a lot about some of the physics simulations, but I do know fluid dynamics as well as Blender’s smoke simulator. There are steps in a fluid solver that cannot be multi-threaded; however, Blender’s makes use of multi-threading as much as possible. For example, the solver is multi-threaded along the Z-axis, dividing the domain into chunks of more or less equal size between all the cores.
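Roughly speaking, the decomposition looks something like this (a simplified sketch for illustration only, not the actual solver code - the grid layout and the diffusion update are invented):

```cpp
// Simplified sketch of slicing a fluid domain along the Z-axis.
// Illustrative only; not Blender's actual solver code.
#include <vector>

void diffuse_z(std::vector<float>& dst, const std::vector<float>& src,
               int nx, int ny, int nz, float k) {
    // Each Z slab can be processed independently because all reads go
    // to the source grid, so OpenMP hands slabs of roughly equal size
    // to each core.
    #pragma omp parallel for
    for (int z = 0; z < nz; ++z) {
        for (int y = 0; y < ny; ++y) {
            for (int x = 0; x < nx; ++x) {
                const int i = x + nx * (y + ny * z);
                const float up   = (z + 1 < nz) ? src[i + nx * ny] : src[i];
                const float down = (z > 0)      ? src[i - nx * ny] : src[i];
                dst[i] = src[i] + k * (up + down - 2.0f * src[i]);
            }
        }
    }
}
```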

Perhaps it could take better advantage of multi-threading: at the moment it uses OpenMP, which gives worse performance than Blender’s native multi-threading library, but we can’t really replace it since the solver code is basically an external library. From experience, I think that with a properly written task system using work stealing and thread balancing, giving each worker thread a single slice of voxels to compute at a time, we would get better performance.
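To illustrate just the scheduling policy (not the task system itself), handing out one Z slice at a time looks like this with OpenMP’s dynamic schedule - a proper task system would implement the same idea with its own work-stealing queues. The names and per-voxel work here are made up:

```cpp
// Sketch of the "one slice at a time" scheduling policy. Illustrative
// only; a real task system would use its own work-stealing queues.
#include <vector>

void process_slice(std::vector<float>& grid, int z, int nx, int ny) {
    for (int i = 0; i < nx * ny; ++i)
        grid[z * nx * ny + i] *= 0.99f;  // placeholder per-voxel work
}

void solve_step(std::vector<float>& grid, int nx, int ny, int nz) {
    // schedule(dynamic, 1): each worker thread claims a single Z slice
    // at a time, so faster threads naturally end up taking more slices.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int z = 0; z < nz; ++z)
        process_slice(grid, z, nx, ny);
}
```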

Also, the overall algorithm is a bit slow to compute; if we were to switch to more modern solving techniques (like the FLIP method, replacing vorticity confinement with vortex particles, etc.), it might get even faster.

So really, it all depends on the multi-threading approach and the algorithms used. Just trying to multi-thread something (porting to the GPU counts as multi-threading) in the hope of getting it to go faster, without thinking about the efficiency of the algorithm or the cache locality of the data, is a recipe for disaster.

There’s no easy way around this problem, as the whole process is pretty much sequential. Realflow liquid sims are simply broken up into their own independent sims, in the hope that nobody notices that they don’t interact with each other (this might have changed in the current version). You could probably do one better by doing a rough pre-pass to get an idea of how the overall smoke simulation should behave and move, then running independent higher-resolution simulations on multiple cores that use the pre-pass as a guide. That gets you better multicore performance and speed at the sacrifice of accuracy (independent sims that loosely interact with each other). Another, probably easier, method is to put the whole process on the GPU. Turbulence FD uses the GPU to calculate sims, and after memory runs out, it continues the sim on the CPU. That works pretty well for the most part. Of course, it would work even better if the sim could be multi-GPU, which brings us back to our original problem.
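To sketch the pre-pass idea (purely hypothetical code, just to show the shape of it): each high-res chunk is blended toward interpolated values from the coarse guide sim, so the chunks never have to talk to each other directly and can each run on their own core.

```cpp
// Toy sketch of the guided-simulation idea: a coarse pre-pass is computed
// first, then independent high-res chunks are pulled toward the coarse
// result instead of talking to each other. All names are hypothetical.
#include <algorithm>
#include <cstddef>
#include <vector>

void refine_chunk(std::vector<float>& fine, const std::vector<float>& coarse,
                  std::size_t begin, std::size_t end, float guide_strength) {
    const std::size_t ratio = fine.size() / coarse.size();  // upsampling factor
    for (std::size_t i = begin; i < end; ++i) {
        const std::size_t ci = std::min(i / ratio, coarse.size() - 1);
        // Blend the detailed value toward the coarse guide; chunks stay
        // independent and only loosely agree through the shared pre-pass.
        fine[i] += guide_strength * (coarse[ci] - fine[i]);
    }
}
```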

Running things on a GPU is multithreading taken to the extreme. On a CPU there may be 8 to 64 threads; on a GPU there are thousands.

Yes, I agree with that, but isn’t GPU parallelism simpler/finer-grained and fundamentally different from CPU parallelism? And with multiple GPUs, don’t you run into speed problems getting them to sync up quickly?

What approach are you referring to? Grid-based fluid simulations can be trivially parallel, see here for a GPU implementation.

The idea that there are thousands of “threads” on a GPU compared to a CPU is a bit misleading. If you want all the parallelism out of your CPU, you need to use SIMD, not just threads. For instance, an Intel CPU may have four cores with four-wide (float32) SIMD each. An NVIDIA GPU may have 16 SMMs with 4 warps per SMM, each doing 32-wide (float32) SIMD - which is closer to 64 CPU threads. If Intel counted the SIMD lanes as “cores” or “threads” (like NVIDIA does), that would be 4 or 8 times as many.
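To make the lane counting concrete, using the hypothetical chip figures above:

```cpp
// Lane counting for the hypothetical chips described above.
constexpr int cpu_cores      = 4;
constexpr int cpu_simd_width = 4;   // float32 lanes per core
constexpr int cpu_lanes      = cpu_cores * cpu_simd_width;    // = 16

constexpr int gpu_smms       = 16;
constexpr int gpu_warps      = 4;   // per SMM
constexpr int gpu_simd_width = 32;  // float32 lanes per warp
constexpr int gpu_streams    = gpu_smms * gpu_warps;          // = 64, ~64 CPU threads
constexpr int gpu_lanes      = gpu_streams * gpu_simd_width;  // = 2048

// Counted NVIDIA-style (one "thread" per lane), the CPU already has
// 16 "threads" (4x its 4 cores), or 32 with two hardware threads per
// core (8x).
```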

Either way, “threads” on a GPU (no matter how you define them) are completely different from CPU threads - usually you can’t effectively use the same synchronization primitives, because you don’t have the same guarantees on memory consistency. That’s why porting threaded CPU code is often difficult and/or pointless. On the other hand, there’s the possibility of programming CPUs more like GPUs, using OpenCL or ISPC.

It looks like someone is working on multithreading Bullet (if you look at the recent commits).

What if, instead of a bucket arrangement, you refined precision in passes, with each sequential pass starting before the previous one has completed, using the lower-precision data to calculate higher precision, and assigning each pass to an available CPU core as the previous pass finishes?

This would require a totally different solver approach, I’d imagine.