Cycles Performance

rawalanche · August 3, 2018, 10:18am

Hi,

I am creating a new thread to not pollute Cycles Development Updates thread, but this thread is a spin off from the discussion here:

I’ve done some more tests, mainly to showcase how much performance benefit can be gained from having the ability to use cached GI for secondary bounces.

In my previous post, I have done a comparison between Corona and Cycles in pure path tracing scenario (no caching), and they both achieved relatively similar performance. Both got to some reasonable result in about 5 minutes in 1280x720 resolution. However I think it’s hard to appreciate just how much time has to be spent rendering using pure path tracing to get to a result with noise level low enough for us to be able to hand over an image to a client:

So, I have rendered the same comparison again, but the goal here was to achieve a noise level acceptable enough to consider image at least close to final quality:
Corona:

Cycles:

Corona took 1h 15m to achieve reasonable noise level

Cycles took 1h 5m to achieve slightly noisier level. Overall the performance is again comparable.

Both renderers ran on CPU only, with the same pure path tracing settings, same max ray intensity, same ray depth, etc… Cycles seems to have slightly more aggressive Russian roulette as it terminates more rays in deep areas, resulting in a very slightly darker result at the same ray depth.

These days, standard delivery resolution for an archviz render is around 3508x2480 (A4 format), so to extrapolate rendertimes from 1280x720 (921 600 pixels) to 3508x2480 (8 699 840 pixels), it’d take 9.44 times longer to get reasonable quality at that resolution, that’s about 11h 15m. 11 hours and 15 minutes per image on i7 5930k is nowhere near acceptable unfortunately.
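The extrapolation above can be sketched in a few lines of Python. This is only a rough check under the same assumption the post makes, namely that render time at a fixed noise level grows linearly with pixel count; the function names are made up for illustration, and the 75 and 65 minute inputs are the Corona and Cycles preview times quoted earlier.

```python
def pixel_count(width, height):
    """Total number of pixels in a frame."""
    return width * height


def scale_render_time(minutes, src_res, dst_res):
    """Scale a render time linearly with pixel count (an assumption)."""
    return minutes * pixel_count(*dst_res) / pixel_count(*src_res)


def format_minutes(minutes):
    """Format a duration in minutes as 'Xh YYm'."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h {mins:02d}m"


PREVIEW = (1280, 720)   # 921 600 pixels
FINAL = (3508, 2480)    # 8 699 840 pixels

ratio = pixel_count(*FINAL) / pixel_count(*PREVIEW)
print(f"pixel ratio: {ratio:.2f}")

# Preview render times on the i7 5930k, in minutes, as quoted in the post.
for name, preview_minutes in (("Corona", 75), ("Cycles", 65)):
    final_minutes = scale_render_time(preview_minutes, PREVIEW, FINAL)
    print(name, format_minutes(final_minutes))
```

Depending on which of the two preview times is used as the starting point, this lands somewhere between roughly 10 and 12 hours per image, the same ballpark as the figure above.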

Now let’s take a look at another set of renders. This is Corona with caching for secondary GI (UHD cache):

We have achieved slightly cleaner quality than the PT+PT result in 3 minutes. That is 25x speed up in this particular case.

Cycles on the other hand has a benefit of adding GPU to the work. I have used latest master and used both i7 5930k as well as my high end GTX1080Ti to get this:

Same result in 16 minutes, so that’s a speed up of 4x compared to pure PT.

Now, what’s often confusing is comparing CPUs and GPUs of different prices, generations, and performance tiers. People often make the mistake of claiming how much faster the GPU is when they are comparing a shiny new GPU they bought to a rusty 5 year old CPU they have. So let’s do a little conversion of my i7 5930k to Threadripper 1950x, for the following reasons:
1, 1950x and GTX1080Ti release dates are very close
2, 1950x and GTX1080Ti are equivalent generations of the product
3, 1950x and GTX1080Ti retailed at roughly the same price
4, 1950x and GTX1080Ti were the representative state of the art CPU and state of the art GPU at their release
5, They achieve nearly the same performance in Cycles:

6, 1950x has almost precisely 3x the multithreaded performance of my i7 5930k (3180 vs 1071) in Cinebench, which makes the conversion super easy.

SO:
To get the approximate rendertimes on Threadripper 1950x, all I have to do is divide my i7 rendertime by 3. This means that on 1950x, pure PT rendertimes in both Corona and Cycles would be somewhere around 23.5 minutes for 1280x720 preview, and around 3h 40m for final resolution image. 3h 40m is still somewhat slow, but starts to get close to acceptable. Again, keep in mind you would have to get almost the latest, very expensive CPU.

For cached secondary GI, 1950x rendertime would be 1 minute for 1280x720 preview and 9.44 minutes for final 3.5k resolution. Yes, that’s correct! With good CPU and good secondary cached GI, you can get final quality 3508x2480 render of this average interior scene in less than 10 minutes!
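As a quick sanity check, the CPU conversion can be written out in Python. This is a sketch only: it assumes render time is inversely proportional to the Cinebench multithreaded score, uses 70 minutes as a midpoint between the Corona and Cycles pure PT preview times, and reuses the 9.44 resolution factor from earlier in the post. The function and constant names are illustrative.

```python
CINEBENCH_I7_5930K = 1071   # multithreaded score quoted in the post
CINEBENCH_TR_1950X = 3180   # multithreaded score quoted in the post
RESOLUTION_FACTOR = 9.44    # 3508x2480 vs 1280x720 pixel ratio


def convert_render_time(minutes, src_score, dst_score):
    """Estimate render time on another CPU, assuming time ~ 1 / score."""
    return minutes * src_score / dst_score


# Pure path tracing: roughly 70 minutes for the preview on the i7 5930k.
pt_preview = convert_render_time(70, CINEBENCH_I7_5930K, CINEBENCH_TR_1950X)
pt_final = pt_preview * RESOLUTION_FACTOR

# Cached secondary GI (Corona UHD cache): 3 minutes for the preview.
cached_preview = convert_render_time(3, CINEBENCH_I7_5930K, CINEBENCH_TR_1950X)
cached_final = cached_preview * RESOLUTION_FACTOR

print(f"pure PT: {pt_preview:.1f} min preview, {pt_final / 60:.1f} h final")
print(f"cached:  {cached_preview:.1f} min preview, {cached_final:.1f} min final")
```

Using the actual score ratio (about 2.97) instead of a flat 3 shifts the estimates only slightly, so the cached final render still comes in under 10 minutes.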

Now, let’s add one more render to the mix. Cycles with GTX1080Ti only, without i7 5930k:

Took 18m 6s for pure PT, while I estimated 23.5 minutes for Threadripper 1950x. This is an interior scenario, and if you look above at the GTX1080Ti vs 1950x graph, you can see that the scene where GPU won over the CPU most was the interior scene, classroom, so this estimate again sounds about right. It also shows that performance of CPU and GPU is not simply additive.

Now, with all these estimates established, let’s get to the conclusion. Let’s say you were a person who happened to get a new PC, and spend roughly the same money on top end CPU and top end GPU of the latest generation. You would get Threadripper 1950x and GTX1080Ti, a great baseline for CPU vs GPU comparison for all the reasons listed above.

IF Cycles had secondary cached GI that’s comparable to Corona, you could render preview resolution of this scene in 1 minute at a noise level low enough to show this to a client.

Without cached secondary GI, but with GPU of the same power as your CPU, you could get the same quality in 18 minutes. With adding your CPU to the work as well, you could perhaps cut it down to 12.

Bottom line is that to get to the performance of cached secondary GI, you would have to add 12 more GTX1080Tis into your computer to match that. And even then, you would still be limited by GTX1080Ti’s 11GB of VRAM vs the average 64GB of CPU RAM.

Tangent Animation has stated they could not use GPUs to render their movies as the scenes just would not fit in VRAM. They also said that rendering performance was very important to them to get the job done, hence they have their own Cycles developer.

My conclusion is that the best thing which can be done to Cycles, in terms of getting more performance, is definitely caching for secondary GI.

Now, I know that GI caching on GPU is extremely difficult to do, but I would be more than happy to have it on CPU only. It’s not like GPU/CPU feature parity is a stone carved rule, since, for example, we already have OSL support specific to CPU only. I would be very thankful for CPU only secondary caching.

I do not expect this to happen anytime soon, but I just wanted to display a practical example of how much performance benefit can be gained from this. A software level optimization, which can quite literally add performance comparable to adding a dozen top end GPUs to your system. And in turn, save people and studios a LOT of money.

YAFU · August 3, 2018, 11:34am

Hi.
It would be good to use a scene that can be shared so that others can do their own tests. Much better if it is a scene containing textures, and it would be ideal also to be able to test animated scenes.
Anyway, is it possible to download the scene that you used from somewhere?

marcoG_ita · August 3, 2018, 12:12pm

@rawalanche - so, as I said in Cycles thread at the beginning of the side topic, all the biz about

I have tested almost every renderer in the existence and rendering has been my main focus for past 9 years. Statement that Cycles is significantly slower than other mainstream renderers is factually true.

is again…pure bulls**t.

Don’t want to sound rude but you should work more with Cycles before throwing these bombs about its (fake) slowness. Of course optimization and more speed are always welcome, but Cycles got extremely usable in interior rendering, i do it professionally every day, all the time.

The comparison with cached GI is apple vs orange, sorry but I’ve asked for it for years, but there are two problems AFAIK:

Cycles is aimed primarily at studios doing animations and PT “Arnold style” is superior for such a target
There is no manpower (developer time) to cover such a big feature, a basic implementation is somewhat trivial but a consistent system such Corona is a big topic. No developer → no feature

If you still want to compare apple vs orange you should be aware that you left out two pretty important features you can throw in to speed up interiors in Cycles, namely:

AO bounce (2-3 bounce), in the Simplify panel, with 0 radius for AO to avoid darker corners, I use it quite a lot and if you use a low factor (0.02-0.05) you can get a big speed gain while preserving an “unbiased look”.
Denoiser, Cycles denoiser is capable of doing a great job preserving details and remove noise, i can’t think about not using it.

With GPU, Filmic LUT, denoiser, AO bounces, i can get high quality interiors in 30-40 minutes @ 4000-5000 pixels wide. (A3 paper at 300dpi…)

EDIT: of course i’m not considering render times on animations with Motion Blur, smoke etc…where i suppose Cycles could still make huge gains in speed

YAFU · August 3, 2018, 12:18pm

What about Light Portals? When it was implemented, many people expressed their “Wow!”. Is it currently useful for interior scenes?

marcoG_ita · August 3, 2018, 12:21pm

Yes a lot! I always use them for interiors, the speed of sampling is slightly slower but the noise is much less for equal rendertime.

rawalanche · August 3, 2018, 1:27pm

I’m sorry, but so far you are the only one bullshitting:

This is simply not true anymore, as I wrote down below in the development thread, the retracing techniques for cached GI got so good in recent years they can now handle any scene, regardless of how dynamic the geometry or lighting is. It’s in fact so good that V-Ray, in recent versions, pretty much relies on usage of cached GI also to speed up direct light sampling, dome light sampling and so on, to the point where pure PT is not feasible anymore.

If you can get more performance and at the same time more lighting accuracy, yet without any artifacts, why in the world would you not choose that?

I do not dispute it, in fact I acknowledge it in one of my posts in the Cycles Development thread, which you probably did not read.

You can’t possibly be serious about calling me out for bullshit, and at the same time suggesting AO as a replacement for proper GI. This is not even a good joke.

There’s also no such thing as an unbiased look. That has been proven on many fronts already. There is a ground truth, but that rarely contributes to making good-looking images. There’s almost no truly unbiased renderer on the market, not even Maxwell.

Do you think Corona doesn’t have denoiser too? So if I get a good result in Cycles in 15 minutes, and I can denoise it to get clean image, I can apply the same denoiser on the result I get in Corona 12 times faster, to make even that cleaner. Denoiser is a divider of a rendertime at the cost of some detail. But you still have a base rendertime to divide, and it makes a big difference if it’s 1 or 12 minutes.

Yes, if you are using AO and ambient light, you can get the results much faster, but at a lot worse quality and accuracy, and there are some scenarios where you can not hide deficiencies of AO and ambient light based workflow at all.

I am having a very hard time respecting you after what you wrote here today. You can’t seriously claim I compare apples and oranges and then start suggesting AO as a replacement for GI. That’s like shouting at me that I am doping at a bicycle race while passing right in front of me on a motorcycle.

You probably did not even bother to read the post properly, and instead proceeded to glance over every 3rd word and then make your assumption. If you look at the very top, the results clearly show that the pure Path Tracing performance between Corona and Cycles is nearly identical. I did not come here to shit on Cycles. The conclusion of the post was that it may well be worth investing some development time in the future to research secondary GI caching for Cycles, because it may result in a performance boost comparable to adding up to a dozen high end GPUs to your system.

I can quite easily imagine Tangent Animation had relatively high rendertimes for their scenes partially (perhaps even mostly) because of how limited the performance of pure Path Tracing is in shots where the majority of the rendered surface is hit mostly by indirect light.

marcoG_ita · August 3, 2018, 2:28pm

@rawalanche - you clearly never used “AO bounce” feature as i intend it, you’re probably thinking about old 90’s AO replacing GI and making dark occluded areas.

What I said is that you can get a great speedup with AO bounce, which is computed AFTER the 2, 3 or n bounces you set. The value of this AO is really low and there are no dark corners since you must set the radius to 0.

Unfortunately i can’t show my daily job projects at the moment as they’re under NDA, but with the coming holidays i will try to make a personal scene and let you judge. (without knowing, you would never say there is ANY AO involved)

Do note though, i’m not saying you’re forced to use this method to get reasonable results in Cycles, i’m saying it cuts a lot of rendertime while the resulting render is 90-95% equal to full global illumination.

Stuntkoala · August 3, 2018, 3:26pm

Aaand it got personal again.

@rawalanche can you provide a download for the scene you used so we can do our own tests?
I’m all for cached GI - maybe it could be tackled with the next big “Code Quest” after the transition to 2.8 and the new Depsgraph have stabilized. Even if it’s just CPU supported at the beginning. Maybe Mitsuba’s code could serve as a reference.

Also VCM could be very helpful for complex light situations - caustics, interiors etc. Renderman has a very good implementation of it.
It would probably fit better with the existing Cycles architecture since it’s just another Method of Path Tracing. But I know too little about the underlying Math to judge this.

rawalanche · August 3, 2018, 3:36pm

Sure, here you go. You will have to put in your own HDRI since I can’t share it. However, if you touch nothing else than the HDRI texture path, you’ll get 1:1 result. Both scenes are set up with the exactly same conditions, e.g. pure Path Tracing, Max ray intensity of 10, max ray depth of 12, gray diffuse material.

As for the VCM, it helps in some cases, especially with caustics, and also a bit with indirectly lit surfaces, but it’s still nowhere near the efficiency of cached GI.
TestScene_Max.zip (2.8 MB)

TestScene_Blender.zip (4.1 MB)

But these threads usually end up with every expert posting their own “special” setup with ridiculous workarounds and compromises, which often loses the point. It’s not about if you can fake something with AO and ambient light to get speed up. This thread was primarily intended to show how much value cached GI has, especially these days, when it can handle dynamic animations.

marcoG_ita · August 3, 2018, 4:04pm

Man, it’s not about ridiculous setups nor faking stuff, it’s a check box and a value slider. The result is nearly 1:1 with full gi.

Two days ago you were sure cycles vs corona would have lost tremendously even with PT vs PT. You tested and it’s not true, so maybe you can consider the fact you are not the mega master of all the render engines out there and try to listen to others too?

You are asking for a feature i asked for 4 to 5 years ago, i know it would be helpful for lots of users including myself, spoken with Brecht multiple times, got Lukas Stockner making a rough patch with IC which he tested with scenes i provided…so do not take for granted that every cycles user is a blind amateur and a fanboy who never used other engines.

The topic is cycles performance, and within the last years of working with it i found how to get comparable if not better results compared to commercial renderers, with totally fine render times for clients. (Tens of minutes for print ready resolutions). I’m obviously speaking about still images, archviz and similar, not animated feature films.

If you don’t want to go further with tests to make cycles faster, and keep asking for cached secondary rays, that’s up to you, I won’t stop it for sure.

rawalanche · August 3, 2018, 4:14pm

What? I don’t get why you keep making it so personal. If you take a look at that thread, without any emotions, you’ll see the following: I did a test some years ago, and made an assumption based on that. I used that assumption to make estimates. When I encountered skepticism, the first thing I did was to re-run the test using as controlled conditions as possible, and then I immediately posted the results, simply admitting that my previous assumption was incorrect, and I adopted a new assumption, which I also presented right at the top of this thread - that pure Path Tracing performance between Cycles and Corona is very similar.

I did exactly what an adult, intelligent person would do. I approached it with open mind and had absolutely no problem admitting my assumption was wrong and changing my mind.

Again, this is not a childish race of who requested that feature first. The fact of the matter is that today, there is still no secondary GI caching in master branch of Cycles. And another fact is that secondary cached GI is very beneficial. That’s why I made this thread. It’s not about what once was, but about what could be beneficial in the future.

On a sidenote, if IC you are referring to means Irradiance Cache, then that’s mostly used as a GI method for primary rays, not secondary ones. And caching primary GI rays is indeed not stable in animations, so perhaps you are misunderstanding what I am requesting. I am not in any way proposing GI caching for primary GI rays.

I’ve posted my scene just one post above. Feel free to grab it and post a comparison between pure PT and your setup. If it’s not significantly different from the pure PT setup, then I have no problem accepting it as a good solution. I would of course then proceed to try it on various different use cases, but if it holds up, then why not.

marcoG_ita · August 3, 2018, 5:19pm

Trust me, it’s not about the childish race about who requested it first, i wanted to explain that I’m totally fine with cycles getting a proper cached gi, to make clear you’re not arguing with a guy who doesn’t know how nice it could be.

About primary rays, no, i actually made clear only for secondary ones, similar to brute force + light cache in vray, or the hd cache in corona.

I will try the corona bench scene converted to cycles as soon as i can in the coming days. Of course it won’t match the 3 minutes from corona with hd cache, but that’s nearly impossible considering how different the two methods are

razin · August 3, 2018, 8:26pm

I tested the scene, here the information :
resolution 1280 x 720 at 70% ( rendering on my laptop )
Hardware: using both i7 4720hq + gtx 920m

Color management : filmic with base contrast
The hdri is from HdriHeaven here’s the link

i use the 2k version and i rotated it at 240° on the z axis and i set the saturation to 0 with the hue saturation node

light path : Bounces at 12
samplings : 512

the first render is without the simplify feature

now with simplify set at 2 bounces & distance 0

3 bounces

4 bounces

if you compare the first render and the last one you can see that the last one has brighter shadows, less render time and maybe a little less noise ?

marcoG_ita · August 3, 2018, 8:40pm

I got the same hdr as rawalanche so we can compare a little better, i was wondering though, @rawalanche how many samples did you set for the very first cycles render? Need to test at the same noise level.

Stuntkoala · August 3, 2018, 8:57pm

The HDRI rawalanche used is currently free for download
http://illuminatedtools.com/freeprobes/ext_LateAfternoon_Mountains_CSP/

marcoG_ita · August 3, 2018, 9:44pm

With the setup i usually do for my interiors, i got this done in 4:00 minutes (same resolution as rawalanche’s tests) with a gtx1080. (not Ti, with your 1080Ti you can get this done in 3 minutes i suppose, same time as HD cache)

IMHO, the slight difference to ground truth is more than welcome considering the huge difference in rendertimes.

HD cache example is more noisy but that’s because i used a little denoising, the render was quite clean though.

YAFU · August 3, 2018, 9:53pm

Hi.
Since the scene has no textures with fine details, I would recommend not using denoising in any of the tests.
One question, do the other render engines clamp indirect light somehow in those tests?

rawalanche · August 4, 2018, 5:03am

Alright, let’s not use denoising, that completely defeats the purpose of any comparison.

I mean, this is not a war of denoisers, and if Cycles had to compete with the Corona one, it’d probably lose (especially in scenarios with advanced textured materials, refraction, MB, DoF and so on). If I throw Corona’s denoiser on 1 minute render with PT+UHD, I’d get this:

Keep in mind this would be 20 seconds on Threadripper 1950x

The difference from ground truth is unfortunately a bit too much for my taste. It actually heavily modifies scene lighting. Also, I can easily come up with several scenarios where the solution breaks completely. And in production, you need universal solution you can rely on.

But regardless of that. Let’s say that we would compare your non-Ti GTX1080 to something like Threadripper 1920x (comparable price and generation) running a renderer with cached GI. Then you would get your 4 minutes whereas 1920x would get around 30 seconds. That’s still a speed up factor of 8x. I would have to add 7 more GTX1080’s to the system, then I would have to use a technique compromising lighting accuracy, and on top of that I would still be bound by a very small VRAM amount.
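The "7 more GPUs" figure follows from simple arithmetic, sketched below in Python. It assumes idealised, perfectly linear multi-GPU scaling, which real setups rarely reach, so treat the result as a lower bound; the function name is illustrative.

```python
import math


def extra_gpus_needed(gpu_seconds, target_seconds):
    """Extra identical GPUs needed to hit a target time.

    Assumes perfectly linear multi-GPU scaling (an idealisation).
    """
    total_gpus = math.ceil(gpu_seconds / target_seconds)
    return max(total_gpus - 1, 0)


gpu_seconds = 4 * 60   # GTX1080 render with the AO setup: 4 minutes
cpu_seconds = 30       # estimated 1920x render with cached GI

speedup = gpu_seconds / cpu_seconds
extra = extra_gpus_needed(gpu_seconds, cpu_seconds)
print(f"speed-up factor: {speedup:.0f}x, extra GPUs needed: {extra}")
```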

On top of that, here is a GIF with a difference between Cycles’ pure PT and your AO trick:
PT_vs_AO

And here’s how much difference switching secondary GI from PT to cached one in Corona makes:
PT_vs_UHD

While your trick is already modifying scene lighting (and as I said, I can come up with scenarios making it more obvious, and those actually appear in production), you can see that caching secondary GI in Corona only adds little more light bounces in hard to reach areas in the curtain folds.

marcoG_ita · August 4, 2018, 10:48am

Well there are some arguments that need a bit more explanation, i don’t have very much time today but here is a summary that IMHO should be considered:

it’s not entirely true that the denoiser should be avoided completely: Cycles can also denoise just the indirect rays, smoothing out GI while preserving all the direct rays and thus all the texture details. IMHO that’s much closer to the behaviour of hd cache than full brute force tests; hd cache is really accurate but still, it’s an interpolation of fewer samples…
Your example of the corona denoised render has completely removed the floor planks while Cycles preserved them, so the amazing render time is quite biased because of too aggressive denoising
in October, a new series of Nvidia gtx cards will start to hit the market, giving a new boost in performance and more memory available (i would not be surprised if 1180Ti will be equipped with 24 or 32 GB), while CPUs won’t get the same boost in the very near future

I’d like to elaborate more but i can’t at the moment, all in all i wanted to show that Cycles is capable of clean outputs in a very reasonable amount of time. Is it faster than Corona with HD cache? No it’s not, but that was quite a no brainer in the first place…is Cycles an order of magnitude slower than other engines? Nope it’s not

Can Cycles deliver high resolution images for clients in a reasonable amount of time? Yes. This is the conclusion i tried to explain because i deal with it every day, just this.

Cycles speedups and optimizations are more than welcome but i don’t think a big project like corona hd cache could hit Cycles in the next year. That was my other point

r_fletch_r · August 5, 2018, 9:59am

Cycles is aimed primarily at studios doing animations and PT “Arnold style” is superior for such a target

Having produced a full animation series with Corona I can tell you that it does this job very well. The Cache reduced render times massively and handled complex animations with very few issues ( every engine has issues )

With regards to de-noising, people seem to think it’s some sort of magic bullet. It really isn’t: you always lose features and introduce funny artifacts. Once you’re in full production you want something that solidly delivers, and I haven’t personally seen de-noise do that in animation. It’s a fantastic tool for getting rid of the last little bit of grain or doing a quick one-off render. I recall having a chat with one of the Arnold engineers and they told me denoising wasn’t even on their radar, they were focusing on better sampling. (They also did a demo using Suzanne which gave me a giggle, I presume Brecht was behind that.)