Setting up a render farm

We run several crypto-mining computers (3 rigs, 9 GPUs each), but we’d also like the option to switch them over to Blender for network rendering. A couple of questions:

Has anyone ever actually tested Cycles with 9 GPUs simultaneously? We’re using 6 GB GTX 1060s. Do they need to be full-speed PCIe slots, or does x4 work (we’re using gen 2 risers to fit all the GPUs on one board)? I’d imagine it depends on the complexity of the scene.

We’d also like to know if there are any scripts for remotely controlling the rendering, so that we don’t have to manually set up the master and slaves every time we do a render. The rigs are running Ubuntu. In a perfect world, I’d be able to hit one button and (1) stop the mining, (2) boot into the Blender OS, and (3) start Blender to begin rendering over the network.

Crazy questions I know, so thanks for bearing with me. Looking forward to hearing everyone’s thoughts.

EDIT: sorry for the redundant post. I see there are already a few threads about this.

A small update:
After running the BMW benchmark, we found that our render time was 1 minute 10 seconds.
Keep in mind, the system has 8x GTX 1060s, so while that’s fast, the result was slightly disappointing. I see on Blenchmark that the top speed is 13 seconds, and that’s with 5x GTX 980s. So why am I seeing such slow speeds? My only guess is that the gen 2 risers are the bottleneck, but even then it only takes 10 seconds to synchronize. Anyone have any thoughts on this?

Thanks for any advice or help

You sure you had all GPUs rendering? Did you see a tile for each GPU being rendered? What is your tile size?

Cool. We have quite a few computers and can probably share some wisdom.

Has anyone ever actually tested Cycles with 9 GPUs simultaneously? We’re using 6 GB GTX 1060s. Do they need to be full-speed PCIe slots, or does x4 work (we’re using gen 2 risers to fit all the GPUs on one board)? I’d imagine it depends on the complexity of the scene.

The way Cycles works, multiple GPUs cannot work on the same tile at once, and GPUs love large tile sizes. So if you have multiple GPUs in the same computer, the best thing to do is to run multiple Blender instances, each rendering on a single GPU. Python scripting is the easiest way to set this up; there’s a sketch after the numbers below. Our results with 2x 1080s:

1x 1080 = 100%
2x 1080, rendering the same frame = 140%
2x 1080, rendering different frames in separate Blender instances = 189%
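Not our exact script, but a minimal sketch of the idea: one background Blender per GPU, each pinned to a single card via CUDA_VISIBLE_DEVICES and given its own slice of the frame range. It assumes Blender 2.7x with CUDA already selected in User Preferences; the paths, frame numbers and GPU count are placeholders.

```python
# Sketch: launch one background Blender instance per GPU, each rendering
# its own slice of the frame range. CUDA_VISIBLE_DEVICES hides all but
# one card from each instance, so every Blender only "sees" one GPU.
import os
import subprocess

BLEND_FILE = "/path/to/scene.blend"     # placeholder
OUTPUT = "//render/frame_####"          # placeholder (Blender-relative path)
NUM_GPUS = 2
FRAME_START, FRAME_END = 1, 200

frames = list(range(FRAME_START, FRAME_END + 1))
chunk = len(frames) // NUM_GPUS

procs = []
for gpu in range(NUM_GPUS):
    start = frames[gpu * chunk]
    # The last instance picks up any leftover frames.
    end = frames[-1] if gpu == NUM_GPUS - 1 else frames[(gpu + 1) * chunk - 1]
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    cmd = [
        "blender", "-b", BLEND_FILE,
        "-o", OUTPUT,
        "-s", str(start), "-e", str(end),
        "-a",
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```

If you render with placeholders and no overwrite (see further down), you don’t even need to split the frame range by hand: each instance just skips frames that have already been claimed.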

You also ask whether they need to be full-speed PCIe slots. Right now, half of our render grid is running at PCIe 1.0 x1 speeds, and that costs about 1-2 seconds per frame compared to a full PCIe 2.0 x16 slot. So no, it doesn’t make a huge difference.

We’d also like to know if there are any scripts for remotely controlling the rendering, so that we don’t have to manually set up the master and slaves every time we do a render. The rigs are running Ubuntu. In a perfect world, I’d be able to hit one button and (1) stop the mining, (2) boot into the Blender OS, and (3) start Blender to begin rendering over the network.

Hmm, it seems like you are using netrender? We tried it out back in the day and couldn’t get it to work efficiently. We tend to use the stock-standard placeholder/no-overwrite options, with everything rendering to the same output location.
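For reference, a rough sketch of how one of those renders can be kicked off from Python. The blend file and output paths are placeholders, and it assumes every machine mounts the same shared output folder; the --python-expr line just flips the placeholder/overwrite settings before rendering.

```python
# Sketch: background render that writes placeholder files and never
# overwrites existing frames, so several machines can safely share
# one output folder and pick up whatever frames are still free.
import subprocess

BLEND_FILE = "/mnt/shared/project/scene.blend"    # placeholder path
OUTPUT = "/mnt/shared/project/frames/frame_####"  # shared output, placeholder

expr = (
    "import bpy; "
    "bpy.context.scene.render.use_placeholder = True; "
    "bpy.context.scene.render.use_overwrite = False"
)

subprocess.run([
    "blender", "-b", BLEND_FILE,
    "--python-expr", expr,  # set the placeholder/no-overwrite options
    "-o", OUTPUT,
    "-a",                   # render the scene's full frame range
])
```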

The way we set it going over 18 computers is a Windows batch script that logs onto each computer and runs a shell script (you could probably do this via SSH as well), which then does whatever you want; in your case it would kill any crypto mining and start Blender in slave mode. I am 90% confident there is a Python command you can run as soon as it loads up.
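And a very rough sketch of that “one button” idea, done from Python over SSH instead of a batch file. The hostnames, the miner’s process name and the start_render.sh script are all made up for the example, it assumes passwordless SSH to each rig, and it skips the reboot step (i.e. mining and Blender living on the same Ubuntu install).

```python
# Sketch: fan out over the rigs via SSH, stop the miner, then start rendering.
# Hostnames, the miner process name and the render script are hypothetical.
import subprocess

RIGS = ["rig01", "rig02", "rig03"]   # hypothetical hostnames
MINER = "ethminer"                   # hypothetical miner process name
RENDER_CMD = "nohup ~/start_render.sh > /dev/null 2>&1 &"  # hypothetical script

for host in RIGS:
    # Stop mining; '|| true' keeps ssh happy if nothing was running.
    subprocess.run(["ssh", host, "pkill -f {} || true".format(MINER)])
    # Kick off the render in the background and disconnect.
    subprocess.run(["ssh", host, RENDER_CMD])
```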

After running the BMW benchmark, we found that our render time was 1 minute 10 seconds.
Keep in mind, the system has 8x GTX 1060s, so while that’s fast, the result was slightly disappointing. I see on Blenchmark that the top speed is 13 seconds, and that’s with 5x GTX 980s. So why am I seeing such slow speeds? My only guess is that the gen 2 risers are the bottleneck, but even then it only takes 10 seconds to synchronize. Anyone have any thoughts on this?

Blenchmark is not a reliable reference for benchmarking.

Hi!

Take a look at Loki Render; with that you can set up a render farm, and yes, it is possible to make it work with GPUs. See the link below.

https://sourceforge.net/p/loki-render/discussion/support/thread/cfccd7ba/

I think you can have all the grunts ready and on standby, so when you switch over to rendering, it is already running on each computer. I have tried it and it works very well; I have run it with four computers on CPU and one on GPU.

//W

kesonmis: You sure you had all GPUs rendering? Did you see a tile for each GPU being rendered? What is your tile size?

Definitely. 8 tiles for 8 GPUs. The tile size is whatever the default for that benchmark is. 256x256 I think.

@doublebishop

Good to know about the PCIe slots.
In your 1080 comparison, what do those percentages refer to? That 2x 1080s are 40% faster than one, and two in separate Blender instances are 89% faster? If so, that’s a pretty significant increase just from running one GPU per frame. I know that when each tile is completed, it goes back to the CPU to be combined into an image. So why are fewer GPUs per frame faster? Wouldn’t it need to go through the same process no matter how many tiles were rendering simultaneously?

Whether we run the renders through netrender or via the manual placeholder/no-overwrite method doesn’t matter much to me; I just want to get the most out of the GPUs. If running multiple instances of Blender is the way to do it, so be it, as long as we can automate it via Python or something. I don’t know anything about Python or coding in general.

@wilnix
Thanks for the link. I’ll look into Loki Render.

Zeke, get back to me if you run into any problems with Loki Render and I will try to help you. //W

We usually have 4 tiles in total for an image, since 1080s love large tile sizes. So when it’s rendering and has completed three tiles, and the fourth one is a complex tile, you have one GPU sitting there idle while it waits for the rest. The CPU is used for three things: preprocessing (BVH creation, image loading, sync with the GPU), GPU management, and post-processing (compositor). The pre- and post-processing are quite CPU intensive, yes, but if the CPU is pretty much idle while the GPUs render, wouldn’t it be better put to use getting another frame ready and processing it?

Give it a whirl yourself and see if there’s a difference.
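If you want to force a large tile size from a script instead of clicking it every time, something along these lines should work inside Blender 2.7x (run it from the text editor or via --python-expr); 256x256 is just a typical large-tile starting point for GPUs, not a magic number.

```python
# Sketch: set a large tile size for GPU rendering (Blender 2.7x API).
import bpy

render = bpy.context.scene.render
render.tile_x = 256
render.tile_y = 256
```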

Whether we run the renders through netrender or via the manual placeholder/no-overwrite method doesn’t matter much to me; I just want to get the most out of the GPUs. If running multiple instances of Blender is the way to do it, so be it, as long as we can automate it via Python or something. I don’t know anything about Python or coding in general.

The other thing with netrender (when we were trying it out) was that for each chunk it would resend the blend file and assets, instead of sending the blend file once over the network and rendering it many times. Not sure if that functionality has changed since then, but it does affect large renders with many computers.

I have different results than doublebishop when it comes to render time improvements, though my system consists of AMD cards, so maybe there is better optimization in OpenCL than in CUDA?

Render of Blender’s Classroom scene (all default settings, except tile-size changes when rendering on the CPU).

Blender 2.79 RC1. Each result is the average of two renders of the same scene.

Single Xeon E5-2687w - 22m 43s (8 core 16 thread) (16x16 tiles) (second CPU in another rig)
1x RX 480 - 9m 32s (with monitors connected)
1x RX 480 - 8m 56s (no monitors connected)
2x RX 480 - 4m 30s

As soon as I get the remaining two RX 480s into the system I’ll do more testing, but so far, using Windows 10, I got nearly half the render time with two GPUs.

I’ll test larger tile sizes to see if that has an effect.

And thanks, doublebishop, for the PCIe info. A 1-2 second penalty is nearly nonexistent.

And the 1-2 seconds would make sense: outside of the initial data transfer to the card, rendering only works on data that is already in the GPU’s memory. Out of curiosity I’ll also test my Xeon setup, which has an x4 PCIe (gen 2) slot, since my 4th GPU will be sitting in that slot for now.

Try the official Blender benchmark files and report your results here. I would be very interested to see how your 8 GPUs handle these files: https://code.blender.org/2016/02/new-cycles-benchmark/

I’m not sure how well Cycles scales across multiple GPUs. With 2 GPUs the scaling is almost linear, but I don’t know how it scales beyond 2 GPUs.

With perfect scaling it should work like this: 1 GPU renders a scene in 100 min, 2 GPUs should cut the time to 50 min, 3 GPUs should cut it to 25 min, 4 GPUs should cut it to 12.5 min, and so on.

Try rendering some benchmark scene with only 1 GPU and divide that time by 8 (you have 8 GPUs). That is the theoretical time you should get with 8 GPUs.

I’m quite certain that there is something wrong in your math… Why would a third GPU halve the render time of 2 GPUs? And why would 4 GPUs suddenly only need one eighth of the original render time?

I think it should be:
1 GPU = 100 min,
2 GPUs = 100/2 = 50 min,
3 GPUs = 100/3 = 33.3 min,
4 GPUs = 100/4 = 25 min,
5 GPUs = 100/5 = 20 min,
6 GPUs = 100/6 = 16.6 min,
7 GPUs = 100/7 = 14.3 min,
8 GPUs = 100/8 = 12.5 min.

No?
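In other words, the ideal time is just the single-GPU time divided by the number of GPUs. A quick sanity check of the 100-minute example above:

```python
# Ideal (linear) scaling: total time = single-GPU time / number of GPUs.
single_gpu_minutes = 100
for n in range(1, 9):
    print("{} GPU(s): {:.1f} min".format(n, single_gpu_minutes / n))
```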

If every GPU cut the time in half, as in staner’s calculations, I’d only buy the last one, because that one would give the biggest absolute gain :smiley:
Ikari’s logic is correct: ideal scaling is at most linear, not exponential.

Hmmm… I made a mistake; I think you are right about this math! :smiley:

So I rebuilt/finished setting up my main workstation and got the 3rd RX 480 in. The 4th isn’t working, as there aren’t sufficient PCIe resources… still waiting for ASRock to verify what the issue is.

Single Xeon E5-2687w - 22m 43s (8 core 16 thread) (16x16 tiles)

1x RX 480 - 8m 59s (539 s)
2x RX 480 - 4m 35s (275 s) - 1.96x scaling (2x perfect)
3x RX 480 - 3m 05s (185 s) - 2.91x scaling (3x perfect)
4x RX 480 - … (…) - … (4x perfect)

So my third card is still within the range of almost-linear improvement.

As soon as ASRock provides support, 4th GPU testing will follow.