More than 64 threads

Ace_Dragon · June 6, 2016, 9:22am

sundialsvc4:

A critical thing to remember about “threads” is that they do not(!) multiply the CPU resource: they divide it.

It’s perfectly fine to run 128 threads … if(!) you have 64 cores! That’s only two threads per core.

Furthermore, with Cycles you have one “fixed, unchanging, limiting-resource:” the GPU chip(s).

Multithreading does not speed up a so-called “CPU-bound” operation such as graphics rendering, except to the extent that it allows multiple CPU cores to be effectively employed … and then, only to the extent that “those cores, on this motherboard,” can be effectively employed. Each thread will “consume a full time-slice,” almost every time. Like it or not, there are only 1,000 milliseconds of CPU-time available in each second.

A thread-setting much larger than, say, “2x the number of cores in this machine” is, IMHO, basically wasted. Counter-productive.

Agreed. I have actually run tests with 32 to 64 threads (on a machine with 8 native threads), and there are no real time savings to expect from it (the render either takes roughly the same time as with 8 threads or slows down by a few percent).

The only way you can get a faster render with more threads is if you shell out the cash for a machine that either has one of those enthusiast-grade processors or the professional Xeon processors.

sundialsvc4 · June 7, 2016, 6:25pm

Set the “number of threads” to, say, 2x the number of physical cores in your machine … no more!

And in the specific case of Cycles (GPU-based …) rendering, you might need to set it even lower(!) still.

Here’s why:

The basic rationale for using multiple threads is to allow all of the processor cores to be working on the problem. The operating system can be expected to recognize these to be “CPU-intensive” threads, and to dispatch one of them to each available CPU core. They will be dispatched “round-robin.”

There is no advantage in adding “multiple threads” to each core, because each thread is probably going to consume a full time-slice, so the presence of multiple threads won’t make it faster. Instead, the overhead of managing threads will make it slower.

In a GPU-based Cycles render, there is one overwhelming, “ruling” constraint: “the number of GPUs that you have … namely, one.” This hardware device should be served by a single supplicant, not multiple threads who are constantly asking it to do different things, and thus constantly “setting up” GPU-work only to have it be “torn down” by the next young turk.

All of the ruling constraints are physical: how many cores do you have, and how many GPUs do you have?
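
A rough way to picture this is a toy model, not a measurement. The function name, the 800 core-seconds of work, and the `switch_overhead` figure below are all invented for illustration; only the shape of the result matters.

```python
def estimated_render_seconds(work_core_seconds, threads, cores, switch_overhead=0.002):
    """Toy estimate of wall-clock time for a CPU-bound job.

    switch_overhead is a made-up fractional slowdown for each extra thread
    per core; it is an assumption, not a measured value.
    """
    if threads < 1 or cores < 1:
        raise ValueError("threads and cores must be at least 1")
    # Extra threads cannot create extra cores: at most `cores` run at once.
    busy_cores = min(threads, cores)
    ideal = work_core_seconds / busy_cores
    # Threads beyond the core count only add management overhead.
    extra_per_core = max(0, threads - cores) / cores
    return ideal * (1.0 + switch_overhead * extra_per_core)


# Hypothetical job: 800 core-seconds of work on an 8-core machine.
for n in (1, 4, 8, 16, 64):
    print(n, "threads:", round(estimated_render_seconds(800.0, n, cores=8), 2), "s")
```

In this model the time stops improving once the thread count reaches the core count, and then creeps up slightly as more threads are added.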

lukasstockner97 · June 7, 2016, 6:39pm

sundialsvc4:

Set the “number of threads” to, say, 2x the number of physical cores in your machine … no more!

And in the specific case of Cycles (GPU-based …) rendering, you might need to set it even lower(!) still.

Here’s why:

The basic rationale for using multiple threads is to allow all of the processor cores to be working on the problem. The operating system can be expected to recognize these to be “CPU-intensive” threads, and to dispatch one of them to each available CPU core.

There is no advantage in adding “multiple threads” to each core, because each thread is probably going to consume a full time-slice, so the presence of multiple threads won’t make it faster. Instead, the overhead of managing threads will make it slower.

In a GPU-based Cycles render, there is one overwhelming, “ruling” constraint: “the number of GPUs that you have … namely, one.” This hardware device should be served by a single supplicant, not multiple threads who are constantly asking it to do different things, and thus constantly “setting up” GPU-work only to have it be “torn down” by the next young turk.

All of the ruling constraints are physical: how many cores do you have, and how many GPUs do you have?

Why do you keep talking about GPUs? This discussion is about CPU rendering.

Generally, the thing about threading is: A CPU has multiple cores, so it makes sense to run multiple tasks in parallel. To do so, you run multiple threads. However, CPU performance is not as simple as it seems: Modern CPUs have separate integer, float and SIMD units and are highly pipelined, which means that mixing different calculations makes them faster. Also, more importantly, RAM is extremely slow in comparison: It’s not unusual for the CPU to wait 500 or more cycles for data. Most of the time, the cache has the data, but when it doesn’t, the CPU would just idle.
Therefore, modern CPUs have Hyperthreading: Each physical core presents itself as two cores to the system, so it will get two threads. Since it’s a single core, it can only process one of them at a time. However, when one of them waits for data, the other one can be executed. That’s why you want to run two threads per physical core.
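
The latency-hiding argument can be sketched with a toy model. It assumes each thread alternates a burst of computation with a memory stall, and that a stalled thread can be fully overlapped by another hardware thread. The function name and the 500-cycle compute burst are made up for illustration; the 500-cycle stall is the figure mentioned above.

```python
def core_utilization(compute_cycles, stall_cycles, hw_threads=1):
    """Fraction of cycles in which one physical core does useful work (toy model).

    Each thread repeats: compute_cycles of work, then stall_cycles waiting
    for memory. While one thread waits, another hardware thread may run.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # A core can never be more than 100% busy, however many threads it holds.
    return min(1.0, hw_threads * per_thread)


# Memory-bound case: 500 cycles of work per 500-cycle stall.
print(core_utilization(500, 500, hw_threads=1))
print(core_utilization(500, 500, hw_threads=2))
# Purely compute-bound case: a second thread has nothing to hide.
print(core_utilization(1000, 0, hw_threads=1))
```

Under these assumptions, the second hardware thread only helps when there are stalls to fill, which is why the benefit depends on how memory-bound the workload is.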

The problem with this “twice your cores” advice for the thread count is that many people mix up the physical and logical core counts: For example, a Xeon E5-2620 v4 has 8 physical cores, which appear as 16 logical cores due to Hyperthreading. For that chip, you’ll want to run 16 threads in Cycles. Any more will just take more time to manage and won’t really help at all. Remember, your CPU can’t even run 16 threads at one time.
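
To make the physical-versus-logical distinction concrete, here is a minimal sketch. `cycles_thread_count` is a hypothetical helper, not part of Blender; `os.cpu_count()` is the standard-library call that reports logical cores.

```python
import os


def cycles_thread_count(physical_cores, hw_threads_per_core=2):
    """Threads to run: one per logical core.

    hw_threads_per_core is 2 with Hyperthreading enabled, 1 without it.
    """
    return physical_cores * hw_threads_per_core


# Xeon E5-2620 v4 from the example above: 8 physical cores.
print(cycles_thread_count(8))

# os.cpu_count() already reports LOGICAL cores (or None if unknown),
# so this number should be used as-is, not doubled again.
print(os.cpu_count())
```

The doubling applies only when you start from the physical core count; starting from the number the operating system reports and doubling it is the mix-up described above.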

Razorblade · June 8, 2016, 4:06am

Hm, it seems a very exotic machine.
I wonder if multiple machines with fewer cores would be A) cheaper, and B) faster at rendering a movie animation job.
Basically multiple pipelines vs. a single memory wait queue, as Lukas described it.

May I ask what you use all this horsepower for?

cpurender · June 8, 2016, 10:47am

@Razorblade:

Exotic? It’s a workstation. Never heard of it before?
Regarding horsepower: Would you render with your mobile phone?
There is an infinite number of use cases for powerful machines; machines can never be powerful enough.

Managing multiple machines is also a cost. A single machine, or fewer machines, is more efficient => cheaper.

Ace_Dragon · June 8, 2016, 11:27am

There are a few ways that renders in Cycles could be sped up.

Get more powerful hardware to do the job (this is the only one controllable by the user, but incredibly expensive if you plan on having it at home)
The developers optimize and improve performance at the code level (being worked on to an extent)
The developers commit smarter sampling techniques and/or sophisticated denoising techniques (being worked on by Lukas)

With some very sophisticated algorithms related to sampling and denoising being developed, it ultimately reduces the need to spend exorbitant amounts of cash on fancy, top-of-the-line, Xeon-based workstations (that will be obsolete in a couple of years anyway).

cpurender · June 8, 2016, 1:39pm

My Xeon workstation wasn’t expensive btw, the CPUs cost about as much as two GTX 1080s.
The render time (Cycles only) and power consumption are similar to those of the GPUs.

cpurender · June 13, 2016, 11:09am

I have an update for this thread.
My build wasn’t slower than blender.org’s build. The input field that displays the number of threads is just misleading; I would call it buggy.
If I want to use all available threads, I don’t need to change this in the code:

#define BLENDER_MAX_THREADS

Actually, that line of code only affects the input field: it defines the maximum number of threads that the field can show or accept.
Switching to Auto-detect lets Blender use all threads, even above 64. The misleading part is that the deactivated input field still shows at most 64 threads.
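
The behavior I observed can be modeled roughly like this. It is a sketch of what I saw, not Blender’s actual code: the function names and the mode labels are made up for illustration, and 64 is the cap the input field showed me.

```python
UI_MAX_THREADS = 64  # cap observed in the input field (assumed from the behavior above)


def displayed_threads(detected):
    """What the (deactivated) input field shows in Auto-detect mode."""
    return min(detected, UI_MAX_THREADS)


def effective_threads(mode, detected, requested=None):
    """How many threads the render appears to use in this model."""
    if mode == "AUTO":
        # Auto-detect uses every thread the system reports.
        return detected
    if mode == "FIXED":
        if requested is None:
            raise ValueError("FIXED mode needs a requested thread count")
        # A manually entered value cannot exceed the field's maximum.
        return max(1, min(requested, UI_MAX_THREADS))
    raise ValueError(f"unknown mode: {mode!r}")


# On a 72-thread machine: the field shows 64, but 72 threads are used.
print(displayed_threads(72), effective_threads("AUTO", 72))
```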

sundialsvc4 · June 14, 2016, 4:52am

Two comments to the foregoing:

(1) “Logical cores” are the CPU’s own version of time-slicing … courtesy of the microprocessor manufacturer’s Marketing department. They are not real resources. You want to count only things that are physically there, since you’re doing a CPU-intensive activity.

(2) The reason why I mention the GPU in a CPU-discussion is that it, too, is a physical resource. Therefore, it is a “ruling constraint” to the overall problem, if that resource is being used. The GPU is also an unusual resource in that it must be loaded with information (slow …) in order to do its calculations (fast …) in order to produce results that must then be taken off (slow …). If you’re running multiple processes that are all dependent on the GPU (which might not be the case here …), not only are they competing for “just one chip,” but they are constantly loading and unloading information, even if it is the same information being reloaded each time. In a bad situation, the GPU might spend most of its time being loaded and unloaded and loaded again, and very little time computing!

You basically need to think about what physical resources you have available to you, and do not over-commit those resources. If there is no physical advantage to having the OS dispatch multiple processes or threads, then it is a waste of “time that you do not have.”

Multiprocessing divides available resources across time … it does not multiply them. It comes at a price, when CPU-bound workloads such as these are being processed.

lukasstockner97 · June 14, 2016, 7:16am

sundialsvc4:

Two comments to the foregoing:

(1) “Logical cores” are the CPU’s own version of time-slicing … courtesy of the microprocessor manufacturer’s Marketing department. They are not real resources. You want to count only things that are physically there, since you’re doing a CPU-intensive activity.

(2) The reason why I mention the GPU in a CPU-discussion is that it, too, is a physical resource. Therefore, it is a “ruling constraint” to the overall problem, if that resource is being used. The GPU is also an unusual resource in that it must be loaded with information (slow …) in order to do its calculations (fast …) in order to produce results that must then be taken off (slow …). If you’re running multiple processes that are all dependent on the GPU (which might not be the case here …), not only are they competing for “just one chip,” but they are constantly loading and unloading information, even if it is the same information being reloaded each time. In a bad situation, the GPU might spend most of its time being loaded and unloaded and loaded again, and very little time computing!

You basically need to think about what physical resources you have available to you, and do not over-commit those resources. If there is no physical advantage to having the OS dispatch multiple processes or threads, then it is a waste of “time that you do not have.”

Multiprocessing divides available resources across time … it does not multiply them. It comes at a price, when CPU-bound workloads such as these are being processed.

Well, but especially BVH traversal usually isn’t compute-bound, but memory-bound (since the larger scenes don’t fit into caches which causes the pipeline to stall). Therefore, Hyperthreading helps a lot, and you should definitely run one thread per logical core.
Seriously, claiming that hyperthreading is a marketing trick makes me doubt that you know what you’re talking about. No amount of “that’s not real” rhetoric changes that.
Also, well, the GPU is indeed a physical resource you have - but so is your DVD-drive, and I don’t see that mentioned in CPU rendering considerations either - which would be just as valid as bringing up GPUs in a CPU topic.

cpurender · June 14, 2016, 3:18pm

@sundialsvc4: There are actually no questions left on my part regarding HT efficiency. I understand it well enough.
The whole topic was just about how to change the value of that input field to a higher number.

Speaking of HT efficiency, I found that Cinebench is more efficient at using HT than Cycles; maybe Cinebench’s renderer is less complex.

Cinebench (win7) stats:

2 x 18 cores / 72 threads / 2.2 GHz base clock: score 4080
2 x 20 cores / 80 threads / 2.0 GHz base clock: score 4305

Cycles (Ubuntu), BMW27.blend stats:

2 x 18 cores / 72 threads / 2.2 GHz base clock: 39 sec
2 x 20 cores / 80 threads / 2.0 GHz base clock: 41 sec
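
For anyone who wants the relative differences, here is the arithmetic on the numbers above. The “cores x base clock” line is only a crude proxy that I am adding for comparison; it ignores turbo clocks and memory effects.

```python
results = {
    "2x18 cores, 72 threads, 2.2 GHz": {"cinebench": 4080, "cycles_sec": 39},
    "2x20 cores, 80 threads, 2.0 GHz": {"cinebench": 4305, "cycles_sec": 41},
}
small = results["2x18 cores, 72 threads, 2.2 GHz"]
big = results["2x20 cores, 80 threads, 2.0 GHz"]

# Cinebench: higher score is better.
cinebench_gain = big["cinebench"] / small["cinebench"] - 1.0
# Cycles: lower time is better, so compare speeds (1 / time).
cycles_gain = small["cycles_sec"] / big["cycles_sec"] - 1.0
# Crude proxy: total cores multiplied by base clock.
core_ghz_gain = (40 * 2.0) / (36 * 2.2) - 1.0

print(f"Cinebench, 80 vs 72 threads: {cinebench_gain:+.1%}")
print(f"Cycles,    80 vs 72 threads: {cycles_gain:+.1%}")
print(f"cores x base clock:          {core_ghz_gain:+.1%}")
```

So the 80-thread machine is about 5.5% ahead in Cinebench but about 4.9% behind in Cycles on this scene.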

mib2berlin · June 14, 2016, 10:00pm

Hi cpurender, Sergey, the Cycles core developer, is working on a patch to lift the 64-thread limit, but it is not finished yet.
It seems to be a bit more complicated to make Cycles effective on more threads than just changing BLENDER_MAX_THREADS.

Cheers, mib

cpurender · June 15, 2016, 1:58am

If we put efficiency aside, Blender already uses more than 64 threads without changing the BLENDER_MAX_THREADS constant. That means that if I run with 64 threads, I get a worse result compared to 72 threads.
Look at my previous posts in this thread for more information.

If we consider efficiency, my configuration of 2 x 20 cores / 80 threads @ 2.0 GHz should be faster than 2 x 18 cores / 72 threads @ 2.2 GHz by about 7%. All other benchmarks confirm this.
According to your information, Sergey is probably working on this matter.