2.8 Branch update with Cycles CPU/GPGPU rendering together

This is kind of what I would expect. If you have a weak GPU, then adding the CPU will help a lot. If you have a strong GPU (or multiple GPUs), then adding the CPU will not make the render that much faster. It would be interesting to see results from a strong GPU paired with a strong, heavily multicore CPU.

Some things about this …
This is all currently in master/buildbot.

16x16 is the new optimal tile size for GPU in most scenes. If you use the denoiser, 32x32 (or sometimes 64x64) may work better.
This new 16x16 size can also improve performance if you have GPUs with very different speeds. Before, the slowest GPU would usually take the last “big” tile and slow everything down. With the much smaller tiles, if the slower GPU takes the last tile it should not hold everything up for long.
GPU+CPU should always be faster than GPU only or CPU only, unless you have made a mistake in User Preferences by selecting “OpenCL on CPU” instead of plain CPU. This assumes a reasonably recent CPU with at least four threads. Also keep in mind that one CPU thread is reserved (not used for CPU rendering) per GPU when CPU+GPU is used, which matters if you have many NVIDIA GPUs (see the script sketch below).
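
If you prefer to set this up from a script instead of the UI, the sketch below shows the idea for 2.79 (property names are from the 2.79 Python API as I understand it; double-check them on your build):

```python
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'
scene.cycles.device = 'GPU'   # with the new hybrid feature, CPU threads join in too

# 16x16 tiles, the new sweet spot for GPU (try 32x32 or 64x64 with the denoiser)
scene.render.tile_x = 16
scene.render.tile_y = 16

# Pick CUDA (not "OpenCL on CPU"!) and tick every device you want to use
prefs = bpy.context.user_preferences.addons['cycles'].preferences
prefs.compute_device_type = 'CUDA'
for device in prefs.devices:
    device.use = True
```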

Finally, I could be wrong in something I said above :slight_smile:

Edit:
What I said above applies to CUDA. Apparently “Grzesiek” has an AMD GPU, so I do not know what problems there might currently be with it on master.

It may sound like a silly question, but where did you find 2.79.1 with CPU+GPU? When I download builds from blender.org, only 2.8 has the CPU+GPU option, but it is waaay slower than GPU only on official 2.79, no matter the tile size. I have a 3770K and a 1060Ti 6GB.
Thanks!
Edit: OK, found it.

Hi all! I rendered this in Blender 2.79.1 (hash: 2bf3825) on my 4-year-old laptop:


i7-4702MQ
16 GB RAM
Nvidia GTX760M 2GB VRAM
Windows 8.1

100 samples with denoiser (all passes on except diffuse direct and glossy direct):

CPU+GPU (tile 32x32): 2:13
GPU (tile 512x512; I ran several tests, this was the fastest GPU-only result): 4:40

100 samples, no denoiser, same tile sizes as above:

CPU+GPU: 1:58
GPU: 4:18

4K textures and SSS enabled on the Principled shader.
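
In case anyone wants to reproduce that denoiser setup from a script, this is roughly it (a sketch against the 2.79 render layer API; the per-pass property names are my reading of the UI toggles, so double-check them):

```python
import bpy

rl = bpy.context.scene.render.layers.active
rl.cycles.use_denoising = True

# All passes on except diffuse direct and glossy direct
rl.cycles.denoising_diffuse_direct = False
rl.cycles.denoising_glossy_direct = False
rl.cycles.denoising_diffuse_indirect = True
rl.cycles.denoising_glossy_indirect = True
rl.cycles.denoising_transmission_direct = True
rl.cycles.denoising_transmission_indirect = True
rl.cycles.denoising_subsurface_direct = True
rl.cycles.denoising_subsurface_indirect = True
```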

WOW.

OK, I have 2.79.1 and CPU + GPU now; I just had an old build with no combined rendering. But I get far better results with 2.79 and GPU only. With 271x271 tiles (I have found them to be fastest on my hardware), the classroom sample scene is ready in 12:20. Using 2.79.1, no matter the tile size (9 threads: 8 CPU and 1 for GPU), I get render times from 28 to 30 minutes.

Hi.
The fabulous Lukas Stockner has committed a change that improves denoiser performance with small tile sizes:
https://developer.blender.org/rBfa3d50af95fde76ef08590d2f86444f2f9fdca95

This will be included (hopefully) in buildbots builds from tonight.

That’s weird. Could you provide more information? What are your CPU and GPU models? What tile size did you use for the CPU+GPU render in 2.79.1? Tile size matters; use 16x16 in 2.79.1.
What render time do you get with GPU only in 2.79.1 (using the same tile size as in 2.79)?

Make sure you use the most recent 2.79 builds:

Something was wrong with my 2.79.1 build; I downloaded it again after a computer reset and it’s normal now.
GPU on 2.79 with optimal tile size (271): 12:20
GPU on 2.79.1, tile size 271: 7:40
GPU on 2.79.1, tile size 16: 7:20
GPU + CPU on 2.79.1, tile size 16: 6:40
I’m impressed.


Hi,
Are there plans to render with CPU & GPU in the viewport? If so, is there a timeline?
ty

That was Brecht’s first attempt, but the CPU slowed the viewport down a lot, so he decided on GPU only for the viewport for now.

Any OpenCL love?

The small tile denoiser improvement is pretty nice, but small tiles still struggle.

11/27 build (pre denoiser improvement):

16x16 tiles no denoiser - 1:24.01
64x64 tiles no denoiser - 1:30.35
16x16 tiles denoised - 2:25.68 (72% slowdown from denoising)
64x64 tiles denoised - 1:47.70 (18% slowdown from denoising)

12/1 build (includes denoiser improvement):

16x16 tiles no denoiser - 1:21.15
64x64 tiles no denoiser - 1:28.32
16x16 tiles denoised - 2:03.68 (51% slowdown from denoising)
64x64 tiles denoised - 1:42.10 (15% slowdown from denoising)

GPUs: GTX 750 Ti + Quadro K4200
CPU: i7-5930K
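
(The slowdown percentages are just the denoised time divided by the plain time, minus one; I rounded them a bit loosely. A quick check in Python:)

```python
def seconds(t):
    """Parse a 'm:ss.xx' render time into seconds."""
    m, s = t.split(':')
    return int(m) * 60 + float(s)

def slowdown(plain, denoised):
    """Percent slowdown of the denoised render vs. the plain one."""
    return (seconds(denoised) / seconds(plain) - 1) * 100

# 12/1 build, 64x64 tiles: 1:28.32 plain vs 1:42.10 denoised
print(round(slowdown('1:28.32', '1:42.10'), 1))   # -> 15.6 (rounded to 15% above)
```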

There are issues with hybrid rendering, and it may be related to the fact that I use NVIDIA with CUDA. I did several test renders (low samples with denoising) using various tile sizes. At small tile sizes (48x48 or 32x32) the speed is not massively different. However, CPU only and GPU only produce different results, and this becomes more obvious in the hybrid render, where you can see the individual tiles.

The only way I got a clean render was to use GPU optimised tiles, which tripled the render time compared with GPU only. GPU only was still the fastest in all my tests.

I understand the issues with memory limits, but if it’s going to triple render times, I think optimising the scene is still the better choice.


CPU only. 48x48 tiles. Render time = 5:26


GPU only. Tile size automatic (around 256x216). Render time = 0:42


Hybrid. 48x48 tiles. Render time = 0:53. However, this shows the issue with tonal differences.

This is currently in master. Any visible difference between CPU and GPU render results should be reported, along with the corresponding .blend file showing the problem.

By the way, so others can weigh in here, it would be useful if you all indicated which CPU and GPU (or GPUs) you use. Not every hardware combination will see the same speed improvement from adding the CPU.

Yes, that was exactly what Lukas said :slight_smile:

I think if Cycles needs a speed boost, the CPU could do the BVH calculation while the GPUs render frames.
And render tiles should not wait until all tiles of a frame are finished: once the BVH part for the next frame is ready, start on it. The CPU could handle collecting tiles across multiple frames (and if the CPU has spare cores, render some frames entirely on the CPU with its small-tile optimizations). Roughly, I mean a producer/consumer split like the sketch below.
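
Just an illustration of the idea (not how Cycles is actually structured; `build_bvh` and `render_on_gpu` are hypothetical stand-ins for the real work):

```python
import queue
import threading
import time

def build_bvh(frame):
    time.sleep(0.1)                   # stand-in for the CPU-side BVH build
    return "bvh-for-frame-%d" % frame

def render_on_gpu(frame, bvh):
    time.sleep(0.2)                   # stand-in for the GPU render of one frame
    print("rendered frame %d using %s" % (frame, bvh))

# BVHs built ahead by the CPU, waiting for the renderer
bvh_queue = queue.Queue(maxsize=2)

def cpu_bvh_builder(frames):
    for frame in frames:
        bvh_queue.put((frame, build_bvh(frame)))
    bvh_queue.put(None)               # sentinel: no more frames

threading.Thread(target=cpu_bvh_builder, args=(range(1, 6),), daemon=True).start()

while True:
    item = bvh_queue.get()
    if item is None:
        break
    frame, bvh = item
    render_on_gpu(frame, bvh)         # GPU renders while the CPU builds the next BVH
```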

I too get around a 40% speedup with GPU + CPU and small tiles compared to GPU alone (980 Ti).

For info, I filed the bug report and it’s been confirmed and will be looked at.

I bet that in the end denoising will be done as a post-render pass for static images, and as a post-animation pass on animations for cross-frame denoise quality. We’ll see. I’m not sure why it is not done as a post-render pass for a single frame right now; that would kill the problems with small tiles… there must be some reason, I think…

Cheers