Tested: $1300 8-GPU Blender Cycles Rendering - Does Cycles Scale?

Does Cycles love more GPUs? Where is the point of diminishing returns?

I’ve been wanting to know whether I can incorporate raytraced renders into more parts of my creative process, rather than just final renders. I have a particular interest in lighting and its effects (SSS, caustics, AO, etc.). With OptiX and adaptive sampling, the number of samples needed for a good image is rapidly coming down, and Nvidia’s RTX 3000-series cards look to have 4x the RT processing power, further accelerating renders. Cycles will never be an online (realtime) renderer, but it’s getting closer. Even in the online rendering world we’re seeing impressive use of raytracing: Unreal Engine has had RT since v4.25, and v5 has some amazing new features.

After my last test of PCI-e risers and their effect on render times, I wanted to understand how Cycles scales with multiple GPUs. Can I just keep adding more GPUs, or am I wasting money after the first few? I’ve seen many comments on forums (including this one) claiming there is a steep falloff in performance when you add more than 2-3 cards, which is why you don’t often see high-end workstations with more than 4 GPUs. Because Cycles is a tile-based renderer, however, I expected it to scale fairly well with more GPUs, much as it scales with more CPU cores.

For this test I used 8 x Nvidia GeForce GTX 1060 video cards connected to a cryptocurrency mining motherboard that can handle up to 13 video cards. I used a 6th-gen Core i5 quad core CPU and 8GB of RAM running Ubuntu 19.10.

Setup

  • Asrock H110 BTC+ Pro motherboard
  • Intel Core i5 6402P quad core CPU
  • 8GB DDR4 RAM
  • 8 x Nvidia GeForce GTX 1060 (3GB or 6GB)
  • 8 x Generic Chinese PCI-e x1 USB risers
  • 2 x 800W Raidmax Gold PSU
  • Ubuntu 19.10 x64 with Nvidia GeForce drivers 440.xx
  • Blender 2.82a

Method
I wanted to use the Junk Shop splash screen scene from the PCI-e riser test, but that scene needs about 6GB of VRAM, and not all of the cards in this test have 6GB. The most complex demo scene I could find that fits within 3GB of VRAM is the Blender Classroom scene, which uses about 1.2GB. I tested at 50, 150, 250, 500, and 1250 samples with 1 to 8 GPUs.

I set the Cycles Render Devices to CUDA (OptiX is not supported on these cards) and selected however many cards I wanted for each test, starting from 1 and going all the way up to 8.

(Screenshot: Blender Preferences showing the eight GTX 1060s under Cycles Render Devices)

Settings were: Cycles as the render engine, Supported for the feature set, GPU Compute as the device, and Path Tracing as the integrator. Denoising was not enabled. Everything else was left as-is in the demo .blend; I did not adjust resolution or other settings. I ran each timed render at least twice and took the average.
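For repeat runs, it’s handy to set all of this from a script instead of clicking through Preferences each time. Here’s a minimal sketch using Blender’s Python API (property names from the 2.8x API as far as I know; `n_gpus` is the only thing that changes between runs, and the sample count is illustrative):

```python
import bpy

# Use CUDA as the compute backend (OptiX isn't available on GTX 1060 cards).
prefs = bpy.context.preferences.addons["cycles"].preferences
prefs.compute_device_type = "CUDA"
prefs.get_devices()  # refresh the device list

# Enable only the first n_gpus CUDA devices and keep the CPU disabled.
n_gpus = 4  # varied from 1 to 8 between runs
cuda_devices = [d for d in prefs.devices if d.type == "CUDA"]
for i, dev in enumerate(cuda_devices):
    dev.use = i < n_gpus
for dev in prefs.devices:
    if dev.type == "CPU":
        dev.use = False

# Scene settings matching the test.
scene = bpy.context.scene
scene.render.engine = "CYCLES"
scene.cycles.feature_set = "SUPPORTED"
scene.cycles.device = "GPU"
scene.cycles.samples = 250  # one of 50/150/250/500/1250
bpy.context.view_layer.cycles.use_denoising = False
```

A script like this can be run headless, e.g. `blender -b classroom.blend --python setup_gpus.py -f 1` (file names hypothetical).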

Caveats

  • The Classroom scene is a benchmark scene, which may not be representative of the work you do
  • This test gives a general idea of scaling, but some scenes may scale differently
  • I used PCI-e x1 connections for all cards. As found in my PCI-e riser test, performance is considerably lower at very low sample counts (e.g. 50) when using x1, but above ~325 samples the penalty drops to around 6% and stays there. In this case I wasn’t looking at overall speed, but scaling; using PCI-e x1 links shouldn’t distort the scaling picture, though it did reduce absolute performance in all scenarios accordingly.
  • Blender uses increasing amounts of CPU as GPUs are added. Adding a fourth GPU was enough to use 100% of the i5 quad-core CPU. Strangely, on my dual Xeon system, 7 GPUs only used about 12% of the CPU, which is equivalent to about 50% of the i5’s CPU power. I’ve asked the Blender devs about this but have not received an answer. The 100% CPU utilization did not seem to have much impact on render times, but it did make the scaling picture look worse than it should have at 4 GPUs and above.
  • This was tested in Linux, but Windows performance was very similar.

Results

  • A single frame with 8 x GTX 1060 at 1250 samples rendered in 5 minutes. A single RTX 2080 Ti with OptiX acceleration took 6 minutes.
  • Low sample counts (e.g. 50) do not scale well past 3 GPUs, as expected. There’s just not enough rendering work compared to the loading of textures, geometry, etc.
  • High sample counts scale well all the way up to 8 GPUs; the higher the sample count, the more linear the gains. At 1250 samples, 8 cards delivered 7.4x the performance of a single card.
  • For renders around 250 samples, the sweet spot in this test was 6 GPUs.

Thoughts
First, unless you are doing very low sample renders, Cycles scales quite well when adding more GPUs. There’s no reason to stop at 4 GPUs out of fear of diminishing returns. Unless you typically render at under 200 samples, you’re better off with more than 4 video cards if your motherboard can handle them and you can afford them.

Second, scaling doesn’t need to be expensive. In this test I used a $60 used CPU on a $60 used motherboard with video cards that go for about $130 each on eBay (an especially good value if you get the 6GB versions). The total cost of the 8-GPU test system was $1300. For about $100 more than a single RTX 2080 Ti, you get a full system that is 20% faster than the 2080 Ti at moderate to high sample counts, even with the 2080 Ti using OptiX acceleration. Eight cards do consume more power than even the power-hungry 2080 Ti, but the draw isn’t outrageous: the full system with all 8 cards rendering used around 800 watts. Of course, you have much less VRAM than the 2080 Ti, but most Blender scenes seem to fit in 6GB.

Finally, I was happy to see how well Cycles scaled. In a couple of weeks I should have more RTX cards, so I can use raytracing in even more areas of my creative process. Once I get them, I will do a quick test with RTX and OptiX to see if the scaling pattern is similar.

Note that if you try using a x4 connection to each GPU, you’d get better scaling at smaller sample counts.

Another user posted some results of GPU rendering across x1, x4, x8, and x16 connections.

But I do like seeing all that rendering power in one system.

I’ll admit, I would love to see some photos of your setup. :wink:

Also, how much power does it draw at the wall when rendering?

It’s the same user. :wink:

lol… totally missed that… how embarrassing… lol

Yup, I’m the same guy :laughing: I’ll have more posts like this. I have a lot of questions that I’m exploring, and I thought it would be fun and informative to others if I posted my results.

This week I’ll take some pics and post them. I’ll also get a precise reading on the power at the wall.

I’ve set my account to follow your posts so I won’t miss another one. Informative to the max.

I’m working mainly on AMD GPUs and trying to properly do bifurcation of PCI-e slots. Your test results at x4 reaffirmed my goal of getting all GPUs at least x4 bandwidth. :slight_smile:

But finding solid bifurcation devices is hard. Splitting my lanes is easy (I’m on a Threadripper, so splitting a x16 slot into x4/x4/x4/x4 is a simple BIOS switch); I just need the cabling.

Have some and "initial test showed some positive result, but now can’t repeat the experiment (using cheap PCIE extension cables :frowning: … but ordered a new proper cable)

Thanks for sharing the pics. I’m glad that I’m not the only one with a messy test bench :slight_smile:

I’m glad that my posts have been helpful. If you have requests for additional tests, let me know. I have a few more tests in mind, but I’m slowed down by issues like parts on backorder and BIOS issues with my new motherboard.

I haven’t tried bifurcation yet, but I plan to once I get my new motherboard up and running. It’s a server board with a x24 slot that can be split into 3 x8 slots. My plan is to use a Tyan riser card and then run ribbon cables from there. Getting it all mounted is going to be the tricky part; I’m not sure how to do that yet and might have to build a custom rig for it.

Work is crazy busy for me now through Wednesday, but after that I’ll take pics of the 8-GPU rig and the other parts I’m working on and upload them.

BTW, those anti-static bags are somewhat conductive. Be careful using them under your components.

If you’re doing an animation, the scaling should be really good if you can set Blender to render one frame per GPU. This removes most of the per-frame startup time, which is what kills linear scaling when multiple cards work on a single frame.

That is cool. I didn’t know you could split a PCI-e slot like that. Now I just need a case that can fit it all.

If you’re doing low samples, setting a range of frames per GPU would be ideal. I’m not sure how to do that within a single machine; I only know how it works with network rendering/render farms. At 200 samples and above, you don’t lose much by having several GPUs work on a single frame; there’s enough to do within each tile that each GPU is effectively rendering a mini-frame.

This is generally referred to as PCI-e bifurcation. I know that all Threadrippers have it, and some Ryzen 2000/3000 CPUs do too, but with Ryzen it’s very board-specific.

As for a case, I built a desk to house it all :slight_smile:

The best option, I think, is to start one Blender instance per GPU, have all instances render to the same folder, and set them up to create a “dummy” file while a frame renders.

This way Blender1 renders frame.000.
Blender2 tries to render frame.000, sees a file is already there, and starts rendering frame.001 instead.
The same goes for all the other Blender instances.

This of course only works on a per-frame basis, not for a preassigned set of frames.

Another option is simply to launch each Blender instance from the command line with its own render range. That works too; the only issue is that once one range is finished, the corresponding GPU sits idle.

I usually stick with the first option if I want to use 100% of all GPU resources… but I’ll admit I more often use all GPUs on one frame because I want to see “instant” results of the quality…

More details on the relevant command line arguments:

blender -f, --render-frame <frame>

    Render frame <frame> and save it. A range of frames can be expressed using the .. separator between the first and last frames (inclusive).

https://docs.blender.org/manual/en/latest/advanced/command_line/arguments.html
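To make the frame-range option concrete, here’s a rough sketch that launches one Blender instance per GPU, each with its own contiguous range (using the -s/-e/-a flags, which do the same job as a -f range). It assumes Blender’s CUDA enumeration respects CUDA_VISIBLE_DEVICES, which is how each instance gets pinned to a single card; the file name and frame count are made up for the example:

```python
import os
import subprocess

BLEND_FILE = "animation.blend"  # hypothetical scene file
TOTAL_FRAMES = 240              # hypothetical animation length
N_GPUS = 4
chunk = TOTAL_FRAMES // N_GPUS

procs = []
for gpu in range(N_GPUS):
    start = gpu * chunk + 1
    # The last instance picks up any leftover frames.
    end = TOTAL_FRAMES if gpu == N_GPUS - 1 else (gpu + 1) * chunk
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # pin to one card
    procs.append(subprocess.Popen(
        ["blender", "-b", BLEND_FILE,
         "-s", str(start), "-e", str(end),  # frame range start..end
         "-a"],                             # render the animation
        env=env,
    ))

for p in procs:
    p.wait()
```

As noted above, whichever instance gets the lightest range finishes first and its GPU then sits idle, which is why the placeholder method is usually the better choice.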

Thanks for providing this. I’ll try it this weekend. As you mentioned, the issue is that not all frames are equal, so even with multiple cards of the same spec, one GPU can finish first. But that may not be a big deal.

Some render farms are able to split frames; I’m not sure how that works. Are they sending tiles to different compute devices? It would be cool if Cycles could do something like that. It would also be nice if you could easily tell Cycles to render one frame per device versus one tile per device, without having to use the command line or the render-to-folder/overwrite method.

Splitting a frame is called “bucket rendering”. Blender doesn’t have it by default, but you can “trick it” using border renders.

The issue is that at the end you have to combine the individual “buckets” into a single image. I bet that’s doable via the command line as well, since it’s simply compositing x parts of a frame into a single image.
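For illustration, here’s roughly how one such “bucket” could be set per instance with a script (2.8x bpy property names; each instance would get a different horizontal slice):

```python
import bpy

# This instance renders only the left half of the frame; a second
# instance would use border_min_x = 0.5 and border_max_x = 1.0.
scene = bpy.context.scene
scene.render.use_border = True
scene.render.use_crop_to_border = False  # keep full-size output so the slices overlay cleanly
scene.render.border_min_x = 0.0
scene.render.border_max_x = 0.5
scene.render.border_min_y = 0.0
scene.render.border_max_y = 1.0
```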

Still, the best approach (the most efficient on a single system) is the placeholder method:

In the Blender scene, turn “Placeholders” on and disable “Overwrite” in the Output settings.

Launch as many Blender instances as you have GPUs (note that each instance has to be pointed at an individual GPU).

Since all instances point at the same folder, an instance creates a temporary “placeholder” image as soon as it starts a frame. The second instance then can’t render that frame and automatically starts rendering the second one, the third starts the third, and so on.

When the first instance finishes, it automatically moves on to the next frame, skipping ahead until it finds one without a placeholder image.

The only “loss” is toward the end of the animation: e.g. with 5 GPUs, up to 4 of them will sit idle while the last frames finish.
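If you’d rather flip those two switches from a script than from the Output panel, a minimal sketch (2.8x bpy property names; the output path is made up):

```python
import bpy

scene = bpy.context.scene
# Write an empty placeholder file the moment a frame starts rendering...
scene.render.use_placeholder = True
# ...and never overwrite a file (or placeholder) that already exists.
scene.render.use_overwrite = False
# Every instance must point at the same output path for this to work.
scene.render.filepath = "//render/frame_"  # hypothetical shared folder
```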

Thanks for the detailed explanation of the placeholder/overwrite method. I’d seen something about it before but didn’t know how to do it. This would be a good method if you want to render a full raytraced animation at low samples and then denoise it, either as a draft or for that strange painterly look the denoiser produces on very noisy images.

You’re probably right about bucketing (glad I know that term now). I’d guess render farm companies have tons of scripts to do everything by command line, including recombining split frames. I’m sure the Blender devs have more important things to do, but it doesn’t seem like it would be that hard to add these features to Blender; you don’t really need to touch the actual render engine code, just the render management settings. I wish I knew how to code. I might give it a try!