Cycles: CPU+GPU renders slower than GPU only?

I’ve always been under the impression that using both CPU and GPU to render is faster than the GPU alone. To my surprise it isn’t. Maybe my settings are incorrect?

GPU only: 43.90 seconds
GPU+CPU: 57.49 seconds

Not exactly a small difference to be ignored.

Windows 10 Pro 64
48GB RAM
Xeon, 6 cores @ 3.33 GHz
GTX 1080 Ti
GTX 1070
Blender 2.83
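
For reference, this is how I’m toggling between the two modes from the Python console — a minimal sketch against the 2.83 CUDA device API:

```python
import bpy

prefs = bpy.context.preferences.addons["cycles"].preferences
prefs.compute_device_type = "CUDA"
prefs.get_devices()  # refresh the device list

for dev in prefs.devices:
    # GPU only: skip the CPU entry; for CPU+GPU, enable everything instead
    dev.use = dev.type != "CPU"

bpy.context.scene.cycles.device = "GPU"
```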

If your CPU is much slower than your GPU, it will hold up the render. That’s basically how hybrid rendering is integrated in Cycles: each bucket gets one device, with each CPU core and each GPU counting as a device. If one of those devices is a super slow core, the GPU(s) alone would have buzzed through its bucket in the time it takes to finish, effectively making CPU+GPU slower than GPU(s) alone.
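
A toy model makes the effect easy to see. This isn’t Cycles code — just a made-up greedy scheduler with invented throughput numbers:

```python
import heapq

def render_time(device_speeds, n_tiles):
    # Greedy scheduler: the next tile always goes to the earliest-free device.
    # Returns wall-clock time until the last tile finishes.
    free_at = [(0.0, s) for s in device_speeds]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tiles):
        t, speed = heapq.heappop(free_at)
        done = t + 1.0 / speed  # one tile takes 1/speed seconds
        finish = max(finish, done)
        heapq.heappush(free_at, (done, speed))
    return finish

gpus = [20.0, 14.0]        # tiles/second, invented numbers
cpu_cores = [0.4] * 12     # each slow CPU thread acts as its own "device"
print(render_time(gpus, 240))              # GPU only
print(render_time(gpus + cpu_cores, 240))  # hybrid: slow cores drag out the tail
```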


CPUs work best at smaller tile sizes, 16x16 or 32x32.

Also, depending on the scene, a low sample count (you have 128) means the CPU will hold back the overall process.

Try the same scene with 500 samples and a lower tile setting and see if that works as expected.
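
Something like this from the Python console, assuming default Cycles settings otherwise:

```python
import bpy

scene = bpy.context.scene
scene.cycles.samples = 500  # higher sample count per the suggestion
scene.render.tile_x = 32    # small tiles suit the CPU threads
scene.render.tile_y = 32
```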


Hi.
What is the exact model of that Xeon CPU?
What is the CPU-only render time?

Beyond the fact that CPU+GPU uses one CPU thread per GPU (so you have two fewer CPU threads available for rendering), it sounds strange that you get worse times even with small tile sizes.
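
You can sanity-check how many threads are left over from the Python console (a rough sketch; cpu_count() reports hardware threads):

```python
import bpy
import multiprocessing

prefs = bpy.context.preferences.addons["cycles"].preferences
gpus = [d for d in prefs.devices if d.use and d.type != "CPU"]
threads = multiprocessing.cpu_count()
print(f"{threads} threads, {len(gpus)} GPUs ->",
      f"{threads - len(gpus)} threads free for CPU tiles")
```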

The CPU-only render is “too damn long”, lol. I gave up after 7 minutes.

https://ark.intel.com/content/www/us/en/ark/products/47917/intel-xeon-processor-w3680-12m-cache-3-33-ghz-6-40-gt-s-intel-qpi.html

I guess that’s what @staughost said.
You have an old, slow CPU versus two powerful GPUs; you can’t expect magic there.
CPU+GPU rendering works best when the CPU and the GPU have similar render times on their own.


I’ve rendered it with smaller tiles: 32x32, 16x16 and 8x8. The results are about the same in terms of the difference between the GPU render and the CPU+GPU render. I didn’t use 500 samples because I use denoising, so a higher sample count isn’t needed in my case.


That’s good to know, thanks for the input.

GPU rendering actually needs some CPU power to work properly. If your CPU is way slower than the GPU, there can be a stall in CPU-GPU communication while the CPU is busy rendering. That stall can leave the GPU starved, unable to run at 100% speed, which slows the GPUs down and makes the render take longer than GPU only. At least this is what Cycles seems to do. Octane appears to keep everything on the GPU and barely taxes the CPU at all while rendering.

When using fast GPU(s) I rarely find a use case where the CPU+GPU hybrid mode is faster. GPUs love large tiles and have super fast memory; CPUs do better with small tile sizes, and system RAM is much slower than on-card VRAM. Cycles tries to manage memory across this heterogeneous group of devices and buses, but it ends up being quite unoptimized. Often the GPUs are done with their tiles and sit idle while the CPU is finishing its work. With perfect optimization you would theoretically get faster rendering, but because of significant wait states and additional overhead, it’s rarely worth it. In most cases you’re best off just using GPU rendering (NVIDIA RTX cards with OptiX are currently the fastest way to go in general), unless you don’t have enough VRAM and have to render on the CPU.
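
For what it’s worth, switching the backend to OptiX takes a few lines in the Python console (a sketch against the 2.83 API; RTX cards only):

```python
import bpy

prefs = bpy.context.preferences.addons["cycles"].preferences
prefs.compute_device_type = "OPTIX"  # RTX-only backend in 2.83
prefs.get_devices()
bpy.context.scene.cycles.device = "GPU"
```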

I have a Threadripper 1950X and it renders faster than my 980 Ti (which I will eventually replace), which is actually a surprise. This started with the release of 2.83 and adaptive sampling; in 2.82 the GPU still had the edge.
I’m seeing well over a 25% advantage for the CPU, at least in normal use and in tests with the various Mike Pan scenes, with the exception of Classroom.
A tile size of 32 or 64 seems to be best for the CPU, depending on the scene.
For the GPU (using GPU only, I mean), 512 is no longer the best tile size for me; it seems to be 256 or even 128 in the tests I’ve done.

Computers vary, and scenes vary between computers, so check on your own machine, imo — see the timing sketch after the numbers below.

CPU (1950X)
BMW 1:09.75
Classroom 5:28.33
Fishy Cat 2:32.07
Koro 2:08.20
Barcelona 5:45.21

GPU (980 Ti)
BMW 1:45.67
Classroom 4:08.92
Fishy Cat 4:36.89
Koro 5:42.79
Barcelona 9:58.10
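
If you want to check on your own machine, a rough timing harness like this works from the Python console (the device and tile values are just illustrative, not recommendations):

```python
import time
import bpy

def timed_render(use_cpu, tile):
    prefs = bpy.context.preferences.addons["cycles"].preferences
    for dev in prefs.devices:
        # Keep all GPUs on; toggle only the CPU entry
        dev.use = use_cpu if dev.type == "CPU" else True
    scene = bpy.context.scene
    scene.cycles.device = "GPU"
    scene.render.tile_x = tile
    scene.render.tile_y = tile
    t0 = time.time()
    bpy.ops.render.render(write_still=False)
    return time.time() - t0

for use_cpu, tile in [(False, 256), (True, 32)]:
    label = "CPU+GPU" if use_cpu else "GPU only"
    print(f"{label}, {tile}x{tile}: {timed_render(use_cpu, tile):.2f} s")
```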

CPU+GPU renders faster than GPU alone when both are fast. If one of them is much faster than the other, it creates a bottleneck and the render will be slow.

Tile sizes matter too. For example, 32x32 for NVIDIA and 64x64 for AMD are good tile sizes for hybrid rendering.

I realize that this suggestion is better sent to the developers, but it might be a relevant perspective here too.

Currently each GPU is assigned one tile and each CPU thread is assigned one tile. Very few people will have a CPU so powerful that one thread is equal to an entire GPU, if such a CPU even exists.

I propose that for GPU+CPU the entire CPU should be assigned one tile, perhaps subdivided into one piece per core or thread. This would let a CPU tile finish faster and create less of a bottleneck. The CPU would probably finish some tiles during the render, giving CPU+GPU a benefit, and it would lag less on the last tile.

I’m not sure how to send this idea to the devs, but here it is.

My suggestion to fix this in Blender would be to make the CPU tile size a small, even subdivision of the GPU’s. Say you have a 512x512 tile size for the GPU. A CPU with 64 cores would take one GPU tile, with each core getting sqrt((512x512)/64) = 64, i.e. a 64x64 tile, or whatever the smallest CPU tile size was set to. If the smallest CPU tile size was set to 128x128, the CPU would take 4 GPU tiles in total to keep all 64 cores working.

The basic idea is to have a GPU tile size setting plus a smallest-CPU-tile setting, with the CPU claiming as few GPU-sized tiles at a time as it can. If the smallest CPU tile were set to 8x8 with 64 cores and the GPU tile were 512x512, the CPU’s little squares wouldn’t even fill a single GPU tile at any moment: 8x8 pixels x 64 cores = 4,096 pixels, or 1/64 of a 512x512 chunk, but the subtiles within that chunk would progress quickly. Once a core finishes and there is no new subtile left in that 512x512 chunk, it could start on the next one.
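
In code form, the arithmetic above looks something like this (a toy sketch; the function and names are made up, not anything in Cycles):

```python
import math

def cpu_split(gpu_tile, cores, min_cpu_tile):
    # Ideal square subtile: every core gets one piece of a single GPU tile
    ideal = int(math.sqrt(gpu_tile * gpu_tile / cores))  # 512, 64 -> 64
    sub = max(ideal, min_cpu_tile)
    # GPU-sized tiles the CPU must claim so all cores stay busy
    gpu_tiles = max(1, (cores * sub * sub) // (gpu_tile * gpu_tile))
    return sub, gpu_tiles

print(cpu_split(512, 64, 64))    # (64, 1): one GPU tile split among 64 cores
print(cpu_split(512, 64, 128))   # (128, 4): needs 4 GPU tiles
```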

Great minds think alike. I hope the devs do something to make CPU+GPU beneficial.

My fastest renders are CPU+GPU with 32x32 tiles, on an i7-8700K and GTX 1080.

To push the idea further, there could be an option to run a benchmark render so the tile sizes get optimized for the CPU and GPU to finish at the same time, e.g. one GPU tile versus however many tiles the CPU can render in the same span. Sure, different parts of the frame take different times to render, but the goal is just to get the GPU and CPU to finish at roughly the same time. Alternatively, if it’s possible and makes sense, the GPU could start chewing through CPU tiles once it finishes its own.
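
The balancing step could be as simple as splitting the tile count in proportion to measured throughput — a toy sketch with invented rates:

```python
def split_tiles(n_tiles, gpu_rate, cpu_rate):
    # Share tiles proportionally to measured throughput so both
    # device groups finish at roughly the same time.
    cpu_share = round(n_tiles * cpu_rate / (gpu_rate + cpu_rate))
    return n_tiles - cpu_share, cpu_share

# Rates in tiles/second from a hypothetical benchmark pass
print(split_tiles(240, gpu_rate=34.0, cpu_rate=4.8))  # (210, 30)
```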