Multiple Blender processes for multiple GPUs?

mfiocca · May 15, 2015, 8:47am

I’ve got an Ubuntu 14.04 headless server running right now for rendering on Cycles, across 4 GPUs.

Specs:
(4) ASUS GTX 970s, 4GB
1300W PSU
Core i7
16GB RAM
250GB SSD

Animations are rendering beautifully across the 4 GPUs, but my question is concerning parallelizing processes to those cards:

Is it possible for me to have 4 separate blender processes running in parallel targeting one specific GPU each?

Running all 4 cards as “one” card (CUDA_MULTI_0 in a .py setup) is not actually a full 4x speed increase over running just one card, though it is still quite fast, and amazing, i might add, having come from mostly CPU rendering. Anyway, it turns out to be more like 3x speed, and i want to cut render times further by rendering one animation frame per card in parallel, giving me a true 4x speed.

Theoretically, it should just be a matter of running 4 separate blender processes on the command line, while having python scripts setup to target specific cards like this:

$ blender -b file.blend -P cuda1.py -s 1 -e 10 -a &
$ blender -b file.blend -P cuda2.py -s 11 -e 20 -a &
$ blender -b file.blend -P cuda3.py -s 21 -e 30 -a &
$ blender -b file.blend -P cuda4.py -s 31 -e 40 -a &

and the cuda[x].py scripts look like this, where each one is targeting a specific CUDA device:

import bpy, _cycles

bpy.context.scene.cycles.device = 'GPU'
bpy.context.user_preferences.system.compute_device_type = 'CUDA'

# this is different in each cuda[x].py file: CUDA_0, CUDA_1, CUDA_2, CUDA_3
bpy.context.user_preferences.system.compute_device = 'CUDA_0'
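
As a rough sketch, the four command lines above could be generated from a list of frame ranges rather than typed by hand. The function name build_commands and the script_pattern parameter are hypothetical; the Blender flags are the ones used in this post. The example only prints the commands, and four non-overlapping ranges are assumed.

```python
import shlex


def build_commands(blend_file, ranges, script_pattern="cuda{n}.py"):
    """Return one Blender argument list per (start, end) frame range.

    script_pattern is a hypothetical naming scheme that matches the
    cuda1.py .. cuda4.py files described in the post.
    """
    commands = []
    for n, (start, end) in enumerate(ranges, start=1):
        commands.append([
            "blender", "-b", blend_file,
            "-P", script_pattern.format(n=n),
            "-s", str(start), "-e", str(end), "-a",
        ])
    return commands


if __name__ == "__main__":
    ranges = [(1, 10), (11, 20), (21, 30), (31, 40)]
    for cmd in build_commands("file.blend", ranges):
        # Print only; to launch for real, pass cmd to subprocess.Popen.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the commands as lists (not one shell string) avoids quoting problems if the .blend path contains spaces.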

So, this all technically works right now, but here’s the big issue i’m trying to figure out. When i do this, only the first blender process gets GPU access. Processes 2-4 fall back to the CPU, ignoring the GPU instructions. All four cards work just fine, and as I mentioned, one blender process using the CUDA_MULTI_0 device in the python script is leveraging all four cards as expected.

Is there some known limitation to blender, or CUDA/NVIDIA for that matter, that only lets a single process access any or all of the GPUs at one time, that maybe i’m just not aware of?

Any advice on this is greatly appreciated.

p.s. i’ve attached a quick photo of this server build for anyone interested.

mfiocca · May 20, 2015, 2:18pm

It turns out all I needed was to update the NVIDIA driver to v346.72 for the ASUS GTX 970 Strix cards to be able to handle multiple processes in tandem.

ARRELL · June 14, 2015, 12:45pm

Do you get true 4x speed over a single card this way?

Does 3 cards render nearly 3x faster than 1… ?

mfiocca · June 14, 2015, 3:41pm

In my benchmarks with my scenes and rig, it’s always been faster to render animations one card per frame and to parallelize processes as much as possible.

Example: my 970s each render my scene at 30 minutes per frame individually. With all four together as one card, i get 8 minutes per frame. So i can render 4 frames in four parallel processes in 30 minutes, whereas 4 cards together as one ‘mega’ card rendering 4 frames in serial would come back in 32 minutes. So it’s marginally faster for animations to split the cards up into individual, parallel rendering.

Striping cards together is great for single-frame renders, but it’s never a full 4x faster with 4 cards, or a full 3x faster with 3 cards. It’s more like 3.?x faster with 4 cards, and 2.?x faster with 3 cards.

mfiocca · June 14, 2015, 4:43pm

To extend this into another example, since one of my scenes is actually 180 frames long, that scene breaks down like this in total render time:

Single Card = 5,400 mins
4 Cards as ‘one’, rendering frames serially = 1,440 minutes
4 Cards rendering in parallel, each one rendering its own frame = 1,350 mins
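
The three totals above follow from simple arithmetic, sketched here. The function total_minutes is a hypothetical helper. It assumes the per-frame time on one card stays at 30 minutes while four processes run side by side (as reported in this thread) and ignores startup and file-saving overhead.

```python
import math


def total_minutes(frames, minutes_per_frame, parallel_jobs=1):
    """Wall-clock minutes when frames are rendered in batches of parallel_jobs."""
    batches = math.ceil(frames / parallel_jobs)
    return batches * minutes_per_frame


frames = 180
single = total_minutes(frames, 30)                      # one card, one frame at a time
grouped = total_minutes(frames, 8)                      # four cards acting as one device
parallel = total_minutes(frames, 30, parallel_jobs=4)   # one frame per card

print(single, grouped, parallel)  # 5400 1440 1350
```

With these numbers the grouped setup is 5400 / 1440 = 3.75x faster than a single card, while one frame per card reaches the full 4x.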

mfiocca · June 14, 2015, 5:01pm

sorry, i also have to post one more bench, and that is GPU vs CPU in my test case. This same scene we’re talking about would take

40,500 minutes to render on one ~$350 CPU (quad i7)

vs

5,400 minutes to render on one ~$350 GTX 970

ARRELL · June 16, 2015, 1:50pm

Thank you for this information! It really answers my question.

openprivacy · November 25, 2015, 12:44pm

What motherboard are you using? I’m getting ready to build a headless blender renderer also with i7 and GTX 970. I’m looking at the Gigabyte GA-Z97X that can hold two GPUs. Do you have any advice?

mfiocca · November 26, 2015, 6:10pm

This system is built on the ASUS RAMPAGE IV Black Edition gaming motherboard, with a Republic of Gamers BIOS. It has 6 PCIe slots, and i’ve got (4) ASUS GTX 970 Strix Cards on PCI riser cables (unpowered). I’ve also got a 1300 watt PSU to give those cards enough juice. The 970 is particularly efficient on power consumption, so if we were using any other card, i would have opted to put the cards on their own dedicated PSU, while using a lower rated PSU for just the core system peripherals. Attached are some updated photos of the home-made chassis built to accommodate the risers.

This system is running very well as a headless blender “farm”. I’ve written some job queue software that treats each card as a separate render node, so for animations, we have 4 frames rendering simultaneously across the 4 cards on this one machine, vs using all 4 cards on a single task. This has proven to be a little faster in render times across larger animation shots, since it’s a literal 4x speed compared to a single card, whereas the cards grouped give you more of a 3.5x speed.

After having built this, I already have the itch to build another one. I’m probably going to go with Titans next time (though very expensive when thinking about buying four of them $$$$). One Titan (5k CUDA cores) is the CUDA equivalent to all four of my 970’s combined. Having 4 Titans @ 20,000 CUDA cores on one machine is … well… quite a dream to think about experimenting with. That machine will definitely have two PSUs. One for the standard peripherals, one to power just the GPUs. We have a pretty consistent and well-powered electrical grid at our office, so the wattage isn’t really too much of a concern for us there. Not something I would run out of your basement if you don’t have the headroom on your circuit panel.

mfiocca · November 26, 2015, 6:15pm

Sorry, i lied … I was typing from memory before … but just verified that the ASUS Rampage IV Black MOBO has 4 slots, not 6. Our cards fill up all the slots.

mfiocca · November 26, 2015, 6:39pm

And … I completely ignored your real question, it appears. Advice …

Ok, so what I know is based on headless Linux systems. If you are planning your system around Windows, i’m afraid I can’t be of much help there. One thing I do know is that on Windows it will probably be a heck of a lot easier to configure your cards if you plan to overclock them, or need to do some Nvidia management. The open source linux drivers took a minute to get working right, since the 970 cards are considered old by nvidia now, and I had to track down “obsolete” drivers from nvidia in order to get CUDA working the way I wanted, where each card could handle separate tasks simultaneously. But I chose linux because in the end, it’s still far more powerful and allows me to tweak any aspect of the system. And it’s free, and who wants to pay for software licenses when you’re spending all that cash on server parts?

Honestly though, if you plan to overclock the 970, don’t. I’ve tried overclocking my cards using nvidia’s linux drivers, and it made absolutely no impact on render speeds. The Strix 970 in particular already ships in an overclocked state, so messing with OC settings really doesn’t give you anything above what you get from the factory. That’s for the Strix cards, though; I can’t say the same will be true with a different card type.

The only other advice I can give you is the headless blender rendering part. You’ll want to send renders on the cli, or in a script or app that you make, using the -P option with a python script on the command. You will want to launch each render job with a python script that configures your render session to use all your desired GPU settings for that particular task. We send each frame as a separate render job, vs rendering a whole animation through. This allows us to put 4 separate rendering processes on the machine at once, each configured to use a different GPU in parallel. Our job queue software manages the queue of frames that need to be rendered across the sequence.
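
The job queue software itself is not shown in this thread, so the following is only a minimal sketch of the idea: one worker per GPU, each pulling single frames from a shared queue and rendering them with -f. All names (run_queue, blender_frame_cmd, the runner parameter) are hypothetical, and the per-device scripts are assumed to be the cuda[x].py files from the first post.

```python
import queue
import subprocess
import threading


def blender_frame_cmd(blend_file, device_script, frame):
    """Command line for rendering a single frame with a given device script."""
    return ["blender", "-b", blend_file, "-P", device_script, "-f", str(frame)]


def run_queue(blend_file, frames, device_scripts, runner=subprocess.call):
    """Render every frame once, using one worker thread per device script.

    runner is called with the command list; it defaults to subprocess.call
    and can be swapped for a stub when testing. Returns {frame: device_script}.
    """
    jobs = queue.Queue()
    for frame in frames:
        jobs.put(frame)

    done = {}
    lock = threading.Lock()

    def worker(device_script):
        while True:
            try:
                frame = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to render
            runner(blender_frame_cmd(blend_file, device_script, frame))
            with lock:
                done[frame] = device_script

    threads = [threading.Thread(target=worker, args=(script,))
               for script in device_scripts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # Dry run: print the commands instead of starting Blender.
    scripts = ["cuda1.py", "cuda2.py", "cuda3.py", "cuda4.py"]
    run_queue("file.blend", range(1, 9), scripts,
              runner=lambda cmd: print(" ".join(cmd)))
```

Because each worker takes the next frame as soon as it finishes, a card that gets an easy frame does not sit idle waiting for the others.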

openprivacy · November 28, 2015, 7:02am

@mfiocca - Thank you for your excellent and informative replies! We’re going to stick with two 970’s (2 PCI slots) for now with thoughts of possibly upgrading to two Titans (ha!) in the future. When you render animations in parallel, do you render out individual PNGs? Would you share your python script with us so that we could use the same mechanism? If you cannot share this, could you point us to a script that allows us to utilize graphics cards instead of just the CPUs? Additionally, if we used said python scripts, would we still have to use -a in the CLI, or would just a -P be enough?

Also, you mentioned that the 970 is considered “old” by nvidia. Is there value to using a newer card? Or can you just point us to the old version that we should use?

By the way, when I say “we,” it’s mostly my 14 year old son who is the CG artist in the family. He’s been on blender for six months and has some experience with CLI. The current headless Xubuntu server (the family backup/web server) he uses has no GPUs, so this is new territory for him/us. He will be buying this machine with his money from mowing lawns.

openprivacy · November 28, 2015, 7:31am

Oh - I just saw (again) your cuda[x].py scripts in the initial post - thank you (again!).

Is all the rendering managed by the GPUs with very little load on the i7 and RAM? Is much RAM recommended?

And, where you “import bpy, _cycles”, is ‘_cycles’ a standard python library?

openprivacy · November 28, 2015, 7:49am

And this is my current parts list:
https://pcpartpicker.com/user/openprivacy/saved/2Z2scf

Yes, only 1 970 right now (all he can afford at the moment) and the extra disks are for an 8TB RAID 10 array as this machine will also be replacing our ancient family backup server.

mfiocca · November 29, 2015, 1:41pm

@openprivacy, glad to help! to try and answer your questions:

Regarding Still Images

We always render to still sequences, then composite them together in adobe after effects. This has a couple benefits:

multiple processes can render different parts of the animation simultaneously without having to merge any frames back into a single movie file.
lets you render to much higher quality image formats, like 32bit floating point .exr (for when we do work for film), or 16 bit PNG (if we’re doing standard HD work). You can do some googling on “Linear Color Workflow”. This is the process of using 32 bit floating point imagery throughout the film creation process, so that color correction and other post-operations gives you as much color data to work with as possible. If you are familiar at all with RAW image formats that a lot of DSLR cameras save to, this is pretty much the same thing. Rendering to a 32bit .exr format is like creating a “RAW” image that lets you manipulate that in post in ways that would be impossible with standard formats.

But, if compositing software like after effects or Nuke isn’t available to you, there’s a number of ways you can convert image sequences into movie files. Actually, you could fire up blender and use it as a post tool with its compositing features and combine the frames and export out to a movie file.

If you just have one GPU for now, there’s no reason you can’t just render directly to quicktime or something like that too. It all depends on what the plans are for the rendered frames; If you plan to do some compositing, or other post work, still sequences are the way to go. If you just want to spit out your renders and use as-is from Blender, and have just one render process running at one time, then just render directly to quicktime.

Regarding the Python Script

As you caught above, i’ve got the python script up there in my first post, and example usage of how to fire up blender on that script. For redundancy, to start a render job from the command line, while using the python script, you would type this out on the cli:

$ blender -b file.blend -P the_script.py -s 1 -e 10 -a &

This will launch blender in the background, using the python script as “start up” instructions, open file.blend, and begin rendering frames 1 through 10. The python startup instructions above are what tells blender to use the GPU instead of the CPU for rendering. And, when you have multiple cards, you have some options in how you want to use those cards. If you look at the last line of the python example, you see:

bpy.context.user_preferences.system.compute_device = ‘CUDA_0’

Since i have 4 GPUs, this tells blender to use just the first card in the CUDA array for the job, and ignore the other three. CUDA_1 would be the second card in the array, CUDA_2 being the third card in the array, and so on. CUDA_MULTI_0 will tell blender to use all four cards together as one massive card.
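
Instead of keeping one script per card, a single script could read the card index from the command line. This is only a sketch: Blender leaves arguments after a lone -- for the script to read from sys.argv, the helper name device_from_argv is hypothetical, and the device names and user_preferences calls are the Blender 2.7x ones quoted in this thread.

```python
import sys


def device_from_argv(argv, default="CUDA_0"):
    """Return the device name requested after '--', e.g. 'CUDA_2' for '-- 2'."""
    if "--" not in argv:
        return default
    extra = argv[argv.index("--") + 1:]
    if not extra:
        return default
    if extra[0] == "multi":
        # Name of the combined device as reported in this thread.
        return "CUDA_MULTI_0"
    return "CUDA_{}".format(int(extra[0]))


try:
    import bpy  # only importable inside Blender
except ImportError:
    bpy = None

if bpy is not None:
    bpy.context.scene.cycles.device = 'GPU'
    prefs = bpy.context.user_preferences.system
    prefs.compute_device_type = 'CUDA'
    prefs.compute_device = device_from_argv(sys.argv)
```

Assuming the file is saved as gpu.py (a made-up name), the third card would be selected with: blender -b file.blend -P gpu.py -s 1 -e 10 -a -- 2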

Regarding -a VS -P

You need both on the cli. -a just tells blender you are asking it to render an animation, and the corresponding -s (start frame) and -e (end frame) are also to be provided. -f can be used instead of -a to render a single frame. So, say you wanted just a single frame render using the GPU:

$ blender -b file.blend -P the_script.py -f 5 &

This will render just frame 5 of file.blend, using the python startup instructions.

Regarding CPU

Actually, what i’ve noticed on my machine is that each GPU will use up exactly one entire thread on the CPU, pegging it @ 100% while rendering. So, using htop on linux, I can see that 4 of the 8 threads are pegged to 100% on my i7, while the other 4 threads are completely dead, 0% utilization. I’ve also tested this with just two cards in use during rendering, and only two corresponding threads were pegged to 100%. So, with this I think that an i7 might even be overkill for a GPU machine. You just want to make sure that you have enough threads on the CPU for the number of GPUs you plan to run. The chip itself doesn’t need to be that beefy since the computation happens on the GPU. The CPU is just there to catch the results from the GPU and do its thing with converting the results into image data and saving to disk, etc.

970 Being “Old”

It’s certainly not “old” since these are still being sold quite a bit, but when it comes to linux drivers, nvidia seems to only support the absolute latest and greatest cards (because they want you to keep buying the latest and greatest). In order to get CUDA working properly with my 970s in Ubuntu Server 14.04 LTS, it took me some trial and error in finding specific driver numbers and downloading archived run packages from NVIDIA. I think v346.72 is what I landed on for this system.

Now, card choice I’ve found has boiled down to three things for me: Video Memory, CUDA core count, and Linux Driver support. Get the card with the greatest number of CUDA cores and memory you can afford and can efficiently power. Most GTX cards draw somewhere in the 200-240 watts of power per card. The 970 draws in the 140’s (last i checked, but typing from memory), which allowed me to put 4 cards on the same PSU as the rest of the system.

Regarding RAM

Memory is important for large blend scenes, but for rendering on the GPU, it’s actually the video memory in your GPU that is used for this process. The video memory on the card is the “RAM” that buffers texture and polygon data during rendering. My 970’s each have 4GB of video memory, which means each card can hold up to 4GB of texture and mesh data during rendering. If you’ve got billions of polygons and gigs of texture images in a single render, you’ve got to take this into consideration. Low poly scenes that have only a few meg of textures aren’t really a threat to most cards’ memory sizes. Traditional RAM is used when rendering on the CPU, and plays only a minor role during GPU rendering. RAM, whether it’s in the GPU or on the MOBO, is just there to stash and cache texture data mostly. So you just have to have enough memory to hold however much you’re using in your scene.

Regarding _cycles

When you startup blender on the command line with a startup script (-P), _cycles is a python module that is made available to your python script, since it lives with the blender runtime. No extra steps or installations are required to make that work. It is specific to blender though, and _cycles is only available to blender startup scripts. I’m sure there’s probably some way in python to get at the _cycles module externally, but it does live with the python version bundled with blender.

Regarding your Son

FRIGGIN AWESOME!!!

openprivacy · December 1, 2015, 7:12am

@mfiocca - Once again, great thanks for all your information. I’ve been programming on UNIX-based machines for 40 years (now supporting DevOps and Security at CivicActions, a F/OSS web services company). I do backend stuff and - wouldn’t you know it? - my son is interested in front-end graphics and animations with blender, which is something I can’t help him with at all. As I needed to upgrade the 8-year-old basement backup/web/minecraft server, Stevo asked if it could also be enhanced as a render farm. (He’s already run some renders that have taken a week on the old machine.) Your description of a headless, Linux-based render farm gave me/us the confidence to move forward, and we put the orders in for parts yesterday. Give us a couple weeks to build the system and try it out, and we may be back with more questions - though you’ve laid out a lot for us in this thread.

Here’s his YouTube page with some of his animations (and Minecraft “Only One Commands”): https://www.youtube.com/user/lebostevo (most of the accompanying music was created by him with GarageBand). A “Random Animations 2” is now nearly complete. We would love to see some of the results of your work (if any is available for publication).

Happy December!

mfiocca · December 2, 2015, 2:32pm

Oh that’s cool. We also do backend web development at my company, Snapdragon Studios. Mostly MEAN stack and php development, some iOS apps and games.

Here’s a link to an mp4 of some compositing shots i’m currently working on for a film. It’s really rough, and there is quite a bit to clean up, but this shows some 3d camera tracking in blender, rendered with cycles, with compositing and post done in after effects. This is a scene where the actors are in an underground lab, and there is some mechanical equipment with glowing laser beams and humming generators that I added into this footage. There are also some background mattes of an underground cave behind the actors in a couple of the shots near the end. No audio, and like I said, this is still a major work in progress.

https://s3-us-west-2.amazonaws.com/snapdragon.one14/revelators.mp4

mfiocca · December 2, 2015, 2:42pm

Also in that shot, there are two screens that have temporary animated HUD displays. Those were tracked in blender, and composited in after effects. There was also quite a bit of rotoscoping done on these shots using both Mocha and After Effects.

doublebishop · December 2, 2015, 4:02pm

What you may want to do instead of this
$ blender -b file.blend -P cuda1.py -s 1 -e 10 -a &
is enable placeholder + no overwrite in the blend file and then you should be able to just go
$ blender -b file.blend -P cuda1.py -a &

At the start of each frame, it will check if there is a frame there. If there isn’t, it creates a placeholder and then starts rendering. If there is a frame there (placeholder or finished), it will skip that frame and go on to the next. It simplifies the process a lot.

The only time this starts to break down is when frames render faster than one second per frame.
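
To illustrate the placeholder idea outside of Blender, here is a small sketch in which each process claims a frame by creating its output file exclusively, so two processes cannot both take the same frame. This is not Blender's own code: the helper names and the NNNN.png naming are assumptions. Inside Blender, the two options correspond to the render settings use_placeholder and use_overwrite.

```python
import os
import tempfile


def claim_frame(output_dir, frame):
    """Try to claim a frame by creating its output file exclusively.

    Returns the file path if this process got the frame, or None if a
    placeholder or finished image already exists.
    """
    path = os.path.join(output_dir, "{:04d}.png".format(frame))
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.close(fd)
    return path


def next_unclaimed(output_dir, start, end):
    """Claim and return (frame, path) for the first free frame, or None."""
    for frame in range(start, end + 1):
        path = claim_frame(output_dir, frame)
        if path is not None:
            return frame, path
    return None


if __name__ == "__main__":
    out = tempfile.mkdtemp()
    print(next_unclaimed(out, 1, 10))  # first call claims frame 1
    print(next_unclaimed(out, 1, 10))  # second call skips to frame 2
```

The exclusive create makes the check and the claim a single step; a separate "check, then write" leaves a short window in which two processes can pick the same frame, which may be what shows up with very fast frames.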

mfiocca · December 2, 2015, 4:33pm

@doublebishop, that’s pretty cool. We have 4 parallel blender processes running in our setup though, and our software picks different points in the timeline to render based on which process is requesting frames.

say if our animation is 100 frames:

CUDA 0: 1-25
CUDA 1: 26-50
CUDA 2: 51-75
CUDA 3: 76-100

all rendering in parallel
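
A split like the one above can be computed for any frame count and number of cards. The helper split_frames is a hypothetical name, and this sketch simply hands out contiguous chunks, with any leftover frames going to the first cards.

```python
def split_frames(start, end, parts):
    """Split the inclusive range start..end into up to `parts` contiguous chunks."""
    total = end - start + 1
    base, extra = divmod(total, parts)
    ranges = []
    first = start
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more cards than frames
        ranges.append((first, first + size - 1))
        first += size
    return ranges


for device, (s, e) in enumerate(split_frames(1, 100, 4)):
    print("CUDA {}: {}-{}".format(device, s, e))
# CUDA 0: 1-25
# CUDA 1: 26-50
# CUDA 2: 51-75
# CUDA 3: 76-100
```

Fixed chunks are simple, but if some parts of the timeline render slower than others, one card finishes late while the rest sit idle; handing out frames one at a time avoids that.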