Tested: Cycles GPU Rendering with PCIe x1 Risers (and RTX Optix support)

Todd_Takehana · May 18, 2020, 9:30am

Are PCI-e x1 Risers Usable for Cycles GPU Rendering? Does Optix work?

I’ve been looking to add more GPUs for rendering in Cycles, but finding the space is an issue. On my dual Xeon motherboard, there are 7 PCI-e slots (4 x16, and 3 x8 slots). Without risers there’s no practical way to add more than 2 GPUs with this board’s layout. Good quality riser cables, like the ones from Taiwanese manufacturer Li-Heat are $50-$60 each. They’re high quality, but pricey considering that you’re halfway to the price of a water block (which again isn’t possible because there are no single slot RTX cards with compatible water blocks). And what if I want to add a GPU to a board like my Ryzen 3900x X470 system? Could those cheap Chinese USB PCI-e x1 risers work?

Reading through the forums, many people have commented that using x1 would not work. They say flat out that Nvidia cards or features like Optix do not work with less than x4. Others say that you need at least x8 because at x1 there’s so little bandwidth that you’ll take a major hit to performance.

I set out to test this for myself and collect some data, which I am presenting here.

Setup

Supermicro X11DPH-I dual Xeon motherboard
2 x Xeon Platinum 8160 24-core CPUs
64GB (quad channel) 2133MHz DDR4 ECC RDIMM RAM
2 x Gigabyte RTX 2070 Gaming OC 8GB (one at x1, one at x16)
1 x Gigabyte RTX 2080 Ti Gaming OC 11GB at x16
Generic Chinese PCI-e x1 USB riser
1200W Raidmax Gold PSU
Windows 10 1909 x64 and Nvidia GeForce drivers 445.87
Blender 2.82a

Method
With no other apps running on Windows, I loaded the Junk Shop splash screen scene in Blender 2.82a.

I set the Cycles Render Devices to Optix and selected either the 2080 Ti (in x16 mode), the 2070 (in x16 mode) or the 2070 (in x1 mode). All three were connected to the computer at the same time.

Blender-Preferences

I set the render engine to Cycles, Feature Set to Experimental, and Device to GPU Compute. The sample count was set the same for each card in a given run. I enabled Denoising with Optix AI Denoiser. I did a first render that let the kernel compile, then ran the timed renders after that. Everything else was as-is with the .blend scene. I did not adjust resolution or any other settings. I ran each timed render twice and took the average (the times were always nearly identical).
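For anyone who wants to repeat this kind of run, here is a minimal Python sketch of the timing procedure described above: one untimed warm-up render, then the average of the timed renders. The function name `time_renders` and its parameters are hypothetical and not something from the original test, which was timed by hand in the Blender UI.

```python
import statistics
import time
from typing import Callable


def time_renders(render: Callable[[], None], timed_runs: int = 2) -> float:
    """Run one untimed warm-up render, then return the mean of the timed runs in seconds."""
    # Warm-up: absorbs one-time costs such as kernel compilation.
    render()
    durations = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        render()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

Inside Blender’s Python environment, the callable could be something like `lambda: bpy.ops.render.render(write_still=False)`, assuming the scene and render device are already configured.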

Caveats

This is only one scene, a complex one with distinct attributes. It may not be representative of the work you do, and other scenes may give different results.
Tests were performed on a dual Xeon motherboard with a lot of PCI-e lanes. Using an x1 riser on an X470 motherboard with a Ryzen 3900x gave me trouble when Optix AI Denoising was enabled, possibly because of PCI-e lane issues when also having an NVMe x4 drive and a GPU at x16.
The BIOS in the Supermicro board (a server board) is annoying, so it ended up booting to the x1 riser card as the display device. I may try another run with a fourth GPU as the display device so the x1 card can be compute only. It’s possible that the x1 performance penalty is partly because the card was also driving the display.
I have not tested this extensively with multi-GPU rendering… yet.

Results

1. Using a card with the x1 riser as the display device introduces major lag. There just is not enough bandwidth for the data flow needed by the display device.
2. Optix rendering and Optix denoiser work with an x1 riser. I encountered no issues on the Xeon board (but some with the X470 board).
3. At very low samples (e.g. 50) there is no practical difference in performance between a 2080 Ti and 2070 using x16. This is because most of the time is spent loading the data, not rendering. Even at 250 samples, the difference is quite small.
4. At very low sample counts the x1 card takes 50% longer than the same card connected via x16 due to loading data into VRAM at only x1 speed.
5. At 250 samples, the performance delta between x1 and x16 drops to 23%, then just 8% at 500 samples, and flattens out around 5% beyond that.
6. The value of a 2080 Ti really shows only when you get to much larger samples (>1000).
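
One way to see why the x1 penalty shrinks as samples go up is a simple fixed-cost model: the x1 link adds a roughly constant amount of load time, while render time grows with the sample count. The sketch below is only an illustration in Python. The function `x1_penalty` and all of the timing numbers are hypothetical, not measurements from this test.

```python
def x1_penalty(samples: int, load_x16_s: float,
               extra_load_x1_s: float, per_sample_s: float) -> float:
    """Relative slowdown of x1 vs x16 when x1 only adds a fixed load cost."""
    t_x16 = load_x16_s + samples * per_sample_s
    t_x1 = t_x16 + extra_load_x1_s
    return t_x1 / t_x16 - 1.0


# Hypothetical timings: 20 s load at x16, 12 s extra at x1, 0.1 s per sample.
for samples in (50, 250, 500, 1000, 5000):
    penalty = x1_penalty(samples, 20.0, 12.0, 0.1)
    print(f"{samples:>5} samples: x1 is {penalty:.1%} slower")
```

A pure fixed-cost model trends toward zero penalty at very high samples, whereas the measured penalty flattened out around 5%, so this model does not capture everything that is going on.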

Conclusions
As always, your mileage will vary depending on what scenes you are rendering, resolutions, passes, textures, geometry, etc. What I was trying to determine was if using a PCI-e x1 riser would work (especially with RTX/Optix), and if it did work, the typical performance cost.

There’s no guarantee that x1 risers will work for everyone. It works fine for me in CUDA mode on both of the boards I tried, but with Optix denoising it gave me trouble on the X470 board. That being said, x1 risers work with Nvidia cards (I also did try it on a GTX 1060 without trouble). Also, Optix rendering and Optix denoiser both work with x1 risers.

Regarding performance, it depends again on how you render. Putting the card you run your display with on PCI-e x1 is going to be painful, so don’t do that. Using a GPU at x1 as a compute device will slow you down at very low samples. With this scene, at somewhere between 250 and 500 samples the performance difference becomes minor and flattens out around 5%. My renders are 800 to 5000 samples, depending on the scene and what I’m going for. That means that buying cheap x1 risers will free up more money to buy more GPUs. The performance loss is small enough that I’m much better off with more GPUs than investing in boards with more x8 or x16 slots or buying expensive risers.

A final thought is that the performance benefit of a 2080 Ti over a 2070 is minor with small samples.

Even with samples >1000 you only get about 38% faster renders from a card that is 2x the price of a 2070 Super (which would be much faster than my 2070s). Unless you need the extra VRAM or money is no object, you are better off with multiple 2060 Super/2070/2070 Super cards.

Hope this was helpful to some of you out there who have wondered about riser cables. If you have suggestions for more tests, let me know and I’ll try to test it. Also, let me know if you’d like to see a video of this.

Update: The X470 issues appear to be due to the way my board (Crosshair VII Hero) handles its PCI-e lane allocation. Depending on whether you’re using NVMe M.2 drives, the board appears to put some slots in PCI-e 2.0 mode. It looks like RTX features don’t work with PCI-e 2.0.

Update 2: Risers of Different Lane Widths

I finally got x4 and x8 risers, so I was able to run the tests again with x1, x4, x8, and x16 at various samples. I ran the tests as above with the same scene on my dual Xeon system. Tests were run with the Gigabyte RTX 2070 Gaming OC at stock clocks and the same driver version. Total VRAM used was 5.9GB, which you may want to keep in mind if you are looking at a card with 6GB or less (e.g. RTX 2060). 8GB of VRAM seems to be a good minimum for decently complex scenes. I have not seen examples that exceed 11GB yet.

Tests show that there is no difference in performance between x16 and x8 or even x4. If you render a lot of simple scenes/frames with low samples, x4 and above will be identical in performance and the most efficient, but you do not need x16 or even x8. As with previous tests, x1 is slower with low samples because it’s slower to load. Once the data are loaded, though, performance is nearly identical to higher lane counts.
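
As a rough sanity check on the load-time explanation, here is a back-of-the-envelope Python sketch of the best-case time to move the scene data over PCI-e 3.0 at each lane width. It rests on several assumptions: the links run at PCI-e 3.0 speeds (8 GT/s per lane with 128b/130b encoding), the full 5.9GB VRAM footprint is what crosses the bus, and protocol overhead is ignored. The function name `transfer_seconds` is made up for illustration.

```python
# PCI-e 3.0: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
# That works out to roughly 0.985 GB/s per lane in each direction.
PCIE3_GBPS_PER_LANE = 8.0 * (128 / 130) / 8


def transfer_seconds(data_gb: float, lanes: int) -> float:
    """Best-case time to push data_gb over a PCI-e 3.0 link with the given lane count."""
    return data_gb / (PCIE3_GBPS_PER_LANE * lanes)


# Assumption: the 5.9GB of VRAM used is roughly the amount uploaded.
for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: {transfer_seconds(5.9, lanes):.2f} s")
```

Under these assumptions x1 needs about six seconds for the upload, while x4 and wider all finish in under two seconds, which is at least consistent with x4, x8, and x16 being hard to tell apart in practice.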

The conclusion is the same: If you do Cycles rendering of moderate to complex scenes/animations, buy more GPUs, even if you have to run them at x1, before getting a motherboard and setup that can do x8 or x16 to multiple GPUs. Your render times will only be slower by about 6%.
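
The arithmetic behind that conclusion can be sketched in a few lines of Python. This assumes identical GPUs, that each extra GPU adds a full GPU’s worth of work (for example, one frame per GPU in an animation), and that running at x1 makes each render about 6% slower. The function `relative_throughput` is hypothetical and ignores power, cooling, and multi-GPU scaling losses.

```python
def relative_throughput(gpu_count: int, slowdown: float = 0.0) -> float:
    """Throughput relative to one full-speed GPU, assuming perfect scaling.

    slowdown is the fractional increase in render time, e.g. 0.06 for 6%.
    """
    return gpu_count / (1.0 + slowdown)


# Two GPUs at x16 versus four of the same GPU on x1 risers.
print(f"2 GPUs at x16: {relative_throughput(2):.2f}x")
print(f"4 GPUs at x1:  {relative_throughput(4, 0.06):.2f}x")
```

Even with the penalty applied to every card, four GPUs at x1 come out well ahead of two or three GPUs at full speed under these assumptions.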

birdnamnam · May 19, 2020, 4:58am

Very useful post. Thanks for your efforts.

Felix_Kutt · May 19, 2020, 5:09am

The results reflect what I expected, but very interesting to read and always good to see confirmation.

Well presented.

@Bart maybe feature this on BN as well? I’m sure this can be of help to many wondering about this.

Todd_Takehana · May 19, 2020, 5:22am

Thanks for the feedback. It was interesting to delve into this. I’m planning to do a couple more on multi-GPU scaling, hybrid rendering, and EEVEE.

watercycles · May 19, 2020, 7:17am

To fully utilize a computer with more than 3 GPUs, a new instance of Blender should be opened for each GPU, rendering the same scene with different frame numbers. I’m kind of surprised a one- or two-GPU-per-frame option is not already built in. Probably because it’s so fast. Anyway, if you go beyond 4 GPUs your speed per extra card seems to fall off quite a bit according to this video. Probably because of the start-up time for each card compared to the render time. Someone should easily be able to code this option into Blender. Maybe hit up the eCycles guy. Anyway, with the option of one frame per card on machines with like 20 GPUs, the render speed for animations could be way higher. It would also be a way to render animation on Eevee with multiple GPUs.
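
The one-instance-per-GPU idea can be approximated today from the command line. Below is a hedged Python sketch that only builds the commands and does not launch anything. It uses Blender’s background-render arguments (`-b`, `-s`, `-e`, `-j`, `-a`) to interleave frames across instances. Pinning each instance to one card with the `CUDA_VISIBLE_DEVICES` environment variable is an assumption about how Cycles will behave on a given setup, and the function name `per_gpu_commands` is made up for illustration. The .blend file is assumed to be saved with GPU Compute already selected.

```python
from typing import Dict, List, Tuple


def per_gpu_commands(blend_file: str, start: int, end: int,
                     gpu_count: int) -> List[Tuple[Dict[str, str], List[str]]]:
    """Build one background Blender command per GPU, interleaving frames."""
    jobs = []
    for gpu in range(gpu_count):
        # Assumption: this restricts the instance to a single CUDA device.
        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        # Instance N starts at frame start+N and steps by gpu_count frames.
        cmd = ["blender", "-b", blend_file,
               "-s", str(start + gpu), "-e", str(end),
               "-j", str(gpu_count), "-a"]
        jobs.append((env, cmd))
    return jobs


for env, cmd in per_gpu_commands("scene.blend", 1, 100, 4):
    print(env, " ".join(cmd))
```

Each job could then be started with something like `subprocess.Popen(cmd, env={**os.environ, **env})`. The sketch does not handle the case of more GPUs than frames.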

Todd_Takehana · May 19, 2020, 7:41am

That’s kind of what I was expecting. At some point if you’ve got a lot of GPU power from multiple cards working on a single frame, then loading is what you’re mainly waiting for. Each GPU just won’t have a lot to work on unless it’s super complex with super high samples. That’s probably why a lot of render farms parcel out animations on a per-frame basis rather than throw a ton of compute at one frame at a time.

In a couple of weeks I’ll be getting more GPUs, so I’ll give this a try. It’ll probably be similar to the video you linked.

This could change drastically in the next two years though. It’s likely that we’ll see the memory that the PlayStation 5 is using come to video cards. In that scenario you’d have maybe 16GB of VRAM but also 256GB of PCI-e 4.0 NVMe storage on the GPU running at 7GB/s. If you are loading from a pool of memory that fast, load times might be dramatically reduced, allowing you to efficiently use more GPUs on a single frame.

Grzesiek · May 19, 2020, 8:06am

Big thanks for the test. Wonder how a PCIe x4 connection would work.

Either way, thanks for posting this review.

Todd_Takehana · May 20, 2020, 10:19pm

I ordered x4 and x8 risers to test those as well. I suspect that even x4 will be enough to eliminate all or nearly all of the performance difference. Not many games or apps use the full x16 on PCI-e 3.0, which is why you don’t see a difference in performance when you have an x8/x8 setup.

watercycles · May 22, 2020, 4:50am

Thanks to the tech of Optix I hear the ram on your computer might be usable like it was on the GPU. This would mean you could have like 2TB of ram depending on what the motherboard and cpu allow at a speed even faster than NVMe. It would not speed up renders though. It would only allow bigger scenes with larger textures. Probably should keep those optimized anyway though as a hard drive can get full fast if it is full of a bunch of 8k textures.

Todd_Takehana · May 23, 2020, 4:50am

To my knowledge there’s no capability for that in Optix, nor plans. If you have a source, please send it to me. Optix is part of CUDA and Gameworks, and is an API for ray tracing, related accelerators and AI de-noising specifically. CUDA itself handles the memory management.

There is a way for a GPU to use system RAM. CUDA Unified Memory allows an app to transfer data between system RAM and GPU RAM. There are a number of problems with that, though. DDR4 at 2133MHz is around 17GB/s per channel, whereas GDDR5 is around 500GB/s, and any data transfers have to traverse PCI-e 3.0 and the CPU, which is at most (x16) 32GB/s counting both directions and adds a lot of latency. This is why Nvidia went from SLI over PCI-e to NVLink, which is either 50 or 100GB/s and cuts out all the latency of going through the host system. It’s still much slower than GDDR or HBM, but 2-3x what the PCI-e 3.0 bus can do and very low latency. Even with PCI-e 4.0 you’ll have lots of latency and only 64GB/s, which is 10% of GDDR6 bandwidth.

What will be interesting is to see what Nvidia will do by having NVMe memory on the GPU with direct access to the GPU DRAM. You’ll only get 7GB/s, which is around DDR3 DRAM speed, but you have almost zero overhead and no latency because it’s on the card, like what the PS5 is doing. The trick though is how to intelligently load and offload data. With such limited bandwidth you will need to program in when to load/offload far enough in advance that the GPU chip is not waiting on data to get into the GDDR memory. Right now that’s done quite manually, which is fine for games because it’s fairly predictable (e.g. loading a level), but for 3D apps it’s ad hoc. Maybe Nvidia will have some AI tricks up its sleeve to use Tensor cores for smart predictions.

watercycles · May 23, 2020, 9:36am

I think I heard it in this video, but Octane pulls some tricks.

Here is a good talk about the NVlink.

It does look like there will have to be some special code to use anything not on the video card. It’s going to be slower, but the Unreal 5 demo showed there are ways around it.

Todd_Takehana · May 24, 2020, 3:13am

@Grzesiek I ran tests at x4 and updated the post. tl;dr, x4 performance is the same as x8 or x16.

Todd_Takehana · May 24, 2020, 3:34am

The Puget Systems article is interesting. It shows how NVLink can be quite complicated. TCC mode is really for compute tasks like AI/ML and only works on Quadro cards. GeForce cards don’t support it, and their P2P connectivity and pooling is primarily enabled through DirectX and support of the app that you’re using. However, Redshift can use NVlink and pooling without SLI enabled or through these methods, and of course it works under Linux, which does not have DirectX.

I’m not that familiar with Octane renderer. I prefer Redshift both as a product and a company, but will probably shift to Arnold for most uses as its GPU capabilities stabilize. It’s hard to beat Arnold’s quality.

Looking through the Octane forums, it seems like they have some special code of their own to utilize host system memory. The hierarchy seems to be to use the memory of a single GPU, then use NVLink pooled memory if needed, then go to host system memory. I don’t see any tests of the performance cost of doing that, but it seems high.

watercycles · May 24, 2020, 3:37am

E-Cycles also just got memory sharing through NVlink.

Todd_Takehana · May 24, 2020, 3:41am

Thanks for reminding me. I should test E-Cycles.

Grimm · May 24, 2020, 7:02pm

Octane does support both out-of-core geometry and textures, and there is a slight performance hit if you enable it (about 5% to 10% last I heard). NVLink is slightly faster and you are correct, it doesn’t get activated unless you run out of vram.

Octane’s quality is very similar to Arnold, but it isn’t able to handle the insane amount of geometry and textures that Arnold can work with.

Todd_Takehana · May 24, 2020, 10:19pm

Interesting. As geometries get bigger and textures more commonly exceed 8K, we’re going to need advanced memory management capabilities. There are a lot of interesting solutions ranging from heterogeneous memory like Octane to Nvidia’s AI memory compression in Ampere to on-card NVMe. I don’t think we’ve seen a killer solution yet, but the consoles are showing us a glimpse into the next 4-5 years.

Octane seems quite popular and used for a lot of production projects. From the small amount of testing that I did, beyond scaling, I couldn’t get the look I was going for with things like hair, particles (smoke), and subsurface scattering. Somehow I just couldn’t get it to look as good as Arnold, but maybe someone who’s a pro at it can make it look just as good.

charisti · April 3, 2021, 6:13am

Thx for your test and the detailed review. May I ask which x4 or x8 risers you used?