Anisotropic shading in Cycles

Pure speculation disclaimer - if Arnold turns out to be as attractive in various markets as we all hope, there will be an immense number of use cases to support, which means an army of tech support staff. Dealing directly with studios for now seems a very smart intermediate step to get quality feedback and iron out a lot of that use-case support without the overhead of supporting every request from artists all over the globe.

That's my hypothetical 2c.

Heh, don’t worry about me, I can keep an NDA. :slight_smile:

No worries. Just thought it worth mentioning in all the excitement.

Oh man. This feels a bit like a loaded question, but that’s okay. Warning: long post ahead. :stuck_out_tongue:

Unfortunately, seeing source code might help answer maybe 1/20th of the question; there are so many factors at play when deciding if a software product is on the right path. Although both Cycles and Arnold are production renderers, they are aimed at slightly different submarkets at this point. Arnold is targeted at covering the small-to-large studios (favoring the mid-size and greater), and Cycles is more targeted at the individual person to small-sized studios where cost is one of the top 2 or 3 factors.

That said, I could have given you an assessment of Cycles with or without seeing any of Arnold's internals. Cycles has a lot that is right about it. The architecture is incredibly interesting to me; the core rendering kernel can compile for CPU, CUDA, or OpenCL at the same time. That is both a blessing and a curse. The good part is that you can get it to work just about anywhere (in CPU mode), and it's nice that you can write once and run on multiple devices. That's just slick. The curse is that the code must fit the lowest common denominator (the GPU), with all of the headaches of synchronizing data and limited agility. The GPU limitations definitely handicap Cycles. For example, huge production scenes are pretty much out of the question (we're talking scenes where the assets touched by rendering a single shot's frame range can cover terabytes). Smaller scenes where people use their cleverness to keep them lean, however, can often look nearly as good.

Honestly, there’s a huge amount of waste and extraneous data that larger studios generate. Artists at those studios tend to turn up the dials and load the scenes to the hilt. It seems to me that no matter what resources are available, artists will manage to push average render times to 1-3 hours per frame, no matter how fast or good you make the renderer. That seems to be the pain threshold, for whatever reason.

In contrast to Cycles: my own renderer (RenderSpud), Arnold, and many other renderers can manage memory and huge production scenes by using a few tricks.

Caching: all production renderers that I know of have a texture cache, plus 3 or 4 other kinds of data that get cached as well; data is loaded when requested, but only a fixed amount of memory is dedicated to the cache. When the cache is full and more is loaded, old entries can be dropped, possibly to be reloaded later if requested again. Think textures: there are easily many gigabytes of textures we routinely point at for a given frame to render at Tippett Studio. No way you’re going to load all of those and put them on a GPU, and you’d kill GPU performance so badly as to make the CPU seem much faster if you were to load/unload caches on the GPU. It just ain’t gonna happen.
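To make the caching idea concrete, here's a minimal sketch of a fixed-budget texture tile cache along the lines described above. Everything in it (the TextureCache class, tile keys, the byte budget) is hypothetical, not actual Cycles or Arnold code:

```cpp
// Minimal sketch of a fixed-budget texture tile cache: tiles are loaded on
// demand and the least recently used ones are evicted once over budget.
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

struct TileKey {
    std::string file;  // texture file on disk
    int tileX, tileY;  // which tile of the texture
    int mipLevel;      // which resolution level
    bool operator==(const TileKey& o) const {
        return file == o.file && tileX == o.tileX &&
               tileY == o.tileY && mipLevel == o.mipLevel;
    }
};

struct TileKeyHash {
    size_t operator()(const TileKey& k) const {
        return std::hash<std::string>()(k.file) ^
               (k.tileX * 73856093u) ^ (k.tileY * 19349663u) ^
               (k.mipLevel * 83492791u);
    }
};

class TextureCache {
public:
    explicit TextureCache(size_t budgetBytes) : budget_(budgetBytes) {}

    // Return pixel data for a tile, loading it from disk if necessary.
    const std::vector<float>& lookup(const TileKey& key) {
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            // Cache hit: move to the front of the LRU list.
            lru_.splice(lru_.begin(), lru_, it->second.lruPos);
            return it->second.pixels;
        }
        // Cache miss: load, insert, then evict old tiles if over budget.
        Entry e;
        e.pixels = loadTileFromDisk(key);  // stub, would hit the filesystem
        used_ += e.pixels.size() * sizeof(float);
        lru_.push_front(key);
        e.lruPos = lru_.begin();
        auto ins = entries_.emplace(key, std::move(e)).first;
        evictIfNeeded();
        return ins->second.pixels;
    }

private:
    struct Entry {
        std::vector<float> pixels;
        std::list<TileKey>::iterator lruPos;
    };

    void evictIfNeeded() {
        // Always keep the most recently used tile resident.
        while (used_ > budget_ && lru_.size() > 1) {
            const TileKey& victim = lru_.back();
            auto it = entries_.find(victim);
            used_ -= it->second.pixels.size() * sizeof(float);
            entries_.erase(it);
            lru_.pop_back();
        }
    }

    std::vector<float> loadTileFromDisk(const TileKey&) {
        return std::vector<float>(64 * 64 * 4, 0.5f);  // placeholder data
    }

    size_t budget_;
    size_t used_ = 0;
    std::list<TileKey> lru_;
    std::unordered_map<TileKey, Entry, TileKeyHash> entries_;
};
```

The point is the fixed budget: the renderer can happily point at far more texture data than fits in RAM, as long as the working set per bucket of rays stays reasonable.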

Delayed loading: also known as being lazy (yes, this is the technical term). The renderer loads as little of the scene as possible, with big bounding boxes around the various coarse pieces. Then, when a ray actually hits one of them, it invokes the procedural or loads the file associated with that element of the scene, pulling in one more level of data, and then keeps tracing the ray into it. Things that never get hit by rays? Well, they never get loaded, and thus take no memory or time. On the GPU, you could potentially mark rays that hit unloaded elements to be rescheduled, and then load the data and upload it to the GPU. It’s possible, but it just sounds like a nightmare. I would be very happy if someone proved me wrong and got this to perform reasonably, but it would again probably kill GPU performance so much that the CPU becomes attractive again. Note that what actually happens during the loading of elements can be totally arbitrary (e.g. procedurals); you cannot reasonably run the actual loading part on the GPU.
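To illustrate the delayed-loading idea, here's a hypothetical sketch (not how Cycles or Arnold actually structure it): each coarse scene element starts out as just a bound plus a loader callback, and its real geometry is only pulled in the first time a ray pierces that bound.

```cpp
// Hypothetical sketch of delayed (lazy) loading: a scene element is just a
// bounding box plus a loader until a ray actually hits its bound.
#include <algorithm>
#include <functional>
#include <memory>
#include <utility>

struct Ray { float origin[3]; float dir[3]; };
struct Geometry { /* triangles, curves, points... */ };
struct Hit { const Geometry* geom = nullptr; float t = 1e30f; };

struct BBox {
    float min[3], max[3];
    // Standard slab test (ignores degenerate directions for brevity).
    bool intersects(const Ray& ray) const {
        float tmin = 0.0f, tmax = 1e30f;
        for (int i = 0; i < 3; ++i) {
            float inv = 1.0f / ray.dir[i];
            float t0 = (min[i] - ray.origin[i]) * inv;
            float t1 = (max[i] - ray.origin[i]) * inv;
            if (t0 > t1) std::swap(t0, t1);
            tmin = std::max(tmin, t0);
            tmax = std::min(tmax, t1);
        }
        return tmin <= tmax;
    }
};

class LazyElement {
public:
    using Loader = std::function<std::unique_ptr<Geometry>()>;

    LazyElement(const BBox& bound, Loader loader)
        : bound_(bound), loader_(std::move(loader)) {}

    // Trace a ray against this element, loading its geometry the first time
    // the bound is hit. Elements whose bounds are never hit never load.
    bool intersect(const Ray& ray, Hit& hit) {
        if (!bound_.intersects(ray))
            return false;
        if (!geometry_)
            geometry_ = loader_();  // may run a procedural or read a file
        return intersectGeometry(*geometry_, ray, hit);
    }

private:
    // Real mesh/curve intersection would go here; stubbed for the sketch.
    bool intersectGeometry(const Geometry&, const Ray&, Hit&) { return false; }

    BBox bound_;
    Loader loader_;
    std::unique_ptr<Geometry> geometry_;  // null until first needed
};
```

On the CPU the loader stall is cheap enough to tolerate; on the GPU that same stall is exactly the reschedule-and-reupload nightmare described above.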

Arbitrary shading and parameters: this is one area where Cycles actually impresses me quite a bit. The node system in Cycles shows a ton of promise, and the whole stack-based evaluation is really clever. Props to Brecht for this. However, my experience writing the patch that started this thread demonstrated to me that it’s not a trivial thing to get arbitrary data on the mesh onto the GPU, even with Cycles’ system in place that makes it relatively easy. And it’s certainly still not possible to write arbitrary shaders either. This kind of power and flexibility is one of the hallmarks of the big-studio production renderers, and is the reason RSL, MetaSL, OSL, etc. exist. They get used. A lot.
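For anyone curious what "stack-based evaluation" of a node graph roughly looks like, here's a toy sketch. It is deliberately far simpler than Cycles' actual SVM kernel and every name in it is made up: the graph is compiled into a flat instruction stream, and each instruction reads and writes fixed slots in a small value stack, which is the kind of flat, data-driven structure that maps reasonably well onto a GPU.

```cpp
// Toy sketch of stack-based shader evaluation (much simplified compared to
// Cycles' real SVM): a node graph compiled into a flat instruction stream,
// where each instruction reads/writes fixed slots in a value stack.
#include <array>
#include <cmath>
#include <vector>

struct float3 { float x, y, z; };

enum class Op { TexCoord, Checker, MixColors, Emit };

struct Instruction {
    Op op;
    int inA, inB, out;  // stack slot indices
    float param;        // e.g. checker scale or mix factor
};

struct ShaderGlobals { float u, v; };  // interpolated geometry data

float3 evaluate(const std::vector<Instruction>& program,
                const ShaderGlobals& sg) {
    std::array<float3, 16> stack{};  // fixed-size value stack
    float3 result{0, 0, 0};
    for (const Instruction& ins : program) {
        switch (ins.op) {
        case Op::TexCoord:  // push the UV coordinate as a color
            stack[ins.out] = {sg.u, sg.v, 0.0f};
            break;
        case Op::Checker: {  // simple procedural checker from slot inA
            float3 uv = stack[ins.inA];
            int c = (int)std::floor(uv.x * ins.param) +
                    (int)std::floor(uv.y * ins.param);
            float v = (c & 1) ? 1.0f : 0.0f;
            stack[ins.out] = {v, v, v};
            break;
        }
        case Op::MixColors: {  // lerp two slots by param
            float3 a = stack[ins.inA], b = stack[ins.inB];
            float t = ins.param;
            stack[ins.out] = {a.x + (b.x - a.x) * t,
                              a.y + (b.y - a.y) * t,
                              a.z + (b.z - a.z) * t};
            break;
        }
        case Op::Emit:  // final color/closure
            result = stack[ins.inA];
            break;
        }
    }
    return result;
}
```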

There are a handful of other things Cycles needs, but it is slowly gaining them. It’s got a new BRDF available now (teehee, anisotropic shading!), it has some support for full float textures, and it has some decent importance and multiple importance sampling. It has a good start on AOVs (render layers). But it needs full support for deformation motion blur, vector displacement, thin primitives (curve segments), indoor rendering helpers (bidirectional path tracing, for example), and eventually it’ll need to finagle the major techniques I talked about above that other production renderers have.

Cycles is heading in this direction, so if direction is what you’re concerned with, then it’s pointed in the right direction. But if it keeps a full GPU rendering kernel as a core feature, the delayed loading and caching techniques will be difficult to implement in the near future. GPUs are changing and improving, and Cycles is changing and improving, so this limitation may not be a limitation at all in the future. It very well could be that Cycles will be well-positioned in the future to include these other techniques when GPUs become better able to accommodate them, while other renderers that are CPU-only might be somewhat left out in the cold.

It’s hard to say what the future holds. :wink:

Whew.

You are spot on. This is not speculation, this is pretty much fact. Selling a renderer to individuals is an entirely different beast than marketing it to studios (and supporting them). Often the right way (or only sane way) to handle selling to individuals is to listen to most of the feature requests, but reject many of them and just go for the most important things. Hm…does this sound familiar to anyone? I see so many feature requests for Cycles, and people talking about how Brecht just seems to ignore them. Trust me, he’s doing it to keep his sanity, and because nobody could implement everything asked for. He’s instead letting his biggest clients help him find the most important features to implement, and he’s going for those. That’s about the most rational thing he could do, so I just chuckle when I see people complain about that. Brecht’s a wise man. :wink:

I suspect Solid Angle would get run over in no time if they started selling to individuals right now. Marcos is also a wise man. :eyebrowlift:

I would be very happy if someone proved me wrong and got this to perform reasonably, but it would again probably kill GPU performance so much that the CPU becomes attractive again.

Well, my secret plan for world domination: use a central random generator node (camera seed, for example), huge preprocessed 4D BVH data storage, and an on-demand torrent-like p2p net as a replacement for the renderfarm.fi BOINC 2D tile scheme.

One can install Blender, connect to the 'net, register with the central node (just for statistics), and see a small colored bar like stamina in Elder Scrolls or rogue energy in WoW. If you're just staring at the window, waving the mouse pointer around, your GPU is crunching someone else's scene in the background and the colored bar is "charging". As soon as you change the camera, a material, or other data, it instantly tries to spread that data over P2P connections to other free nodes and starts generating ray seeds, waiting for pixel colors in return. Very obvious, nothing new, but I think it would fit very nicely with current GPU limitations.

The trick is MLT. The central node generates the start seed and is used only for the initial MCMC step; everything else happens on the other nodes. MLT tends to "dance" very locally, and a 4D BSP can cache that efficiently. Nodes can be other computers or a different GPU adapter in the same machine, it doesn't matter. Network traffic is small: a random seed and scene ID as the request, pixels as the answer. Of course, the scene data must be pre-uploaded to the "storage nodes".
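Just to make the traffic pattern concrete, here is a purely illustrative sketch of what the request/response in that scheme might look like. Nothing here is implemented and every name is made up:

```cpp
// Illustrative message shapes for the scheme above: the central node hands
// out a scene ID plus a starting seed, and a worker returns the pixel
// samples produced by the mutation chain it ran.
#include <cstdint>
#include <vector>

struct WorkRequest {
    uint64_t sceneId;      // scene must already be pre-uploaded to the node
    uint64_t startSeed;    // initial seed for the MCMC/MLT chain
    uint32_t numMutations; // how long to run before reporting back
};

struct PixelSample {
    uint32_t x, y;          // image coordinates the mutation landed on
    float r, g, b, weight;  // accumulated contribution
};

struct WorkResponse {
    uint64_t sceneId;
    uint64_t startSeed;                // echoed so results can be de-duplicated
    std::vector<PixelSample> samples;  // small payload: pixels, not scene data
};
```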

Ok ok, stop dreaming.

storm_st, is there any svn or git repo snapshot of your branch?
Also, I was wondering, are you on #blendercoders (at irc.freenode.net)? If not, you should be! You can always get feedback or have discussions with the developers there.

It seems that if Arnold is solely marketed to studios, without the support hinterland for individual users, a full, free, unsupported community edition would be quite a boon to everybody.

Solid Angle would dispel the vaporware myths out there, and at the same time gain an army of budding artists who are used to working with their renderer and its paradigms, without losing its core customers (and maybe gaining a few more when hobbyists decide to open their own studio down the line).

No, unfortunately it's only in my head, nothing except words, and I think the plan has weak points (data-spreading latency; unstable, buggy, or maliciously hacked nodes), but my own experience with renderfarm.fi (one of the first renders took weeks, with all the testers wondering what was going on, whether it was rendering or stuck, etc.) makes me think the idea isn't that bad. It was just an illustration, as an answer to MikeFarny's doubt about the principal limitations of current-generation GPU hardware: that scheme can "soften" the bad GPU features and enhance the good points (huge internal GPU cache bandwidth).

I would tend to agree, actually (or at least a restricted version might be good), but I’ll have fairly limited influence on this issue. I can tell Marcos is thinking about this sort of thing, but it’s hard to say what is going to be the best move. Airing your dirty laundry is often good, but it can also be detrimental; it’s not a trivial decision to make. There are lots of hackers that would try to break it wide open, or for full evals they break the licensing on those, or else they come up with comp tricks to remove watermarks, etc. There are arguments either way as to whether that increases actual sales or not, but these are serious issues to consider.

Not having to deal with those problems is one of the nice aspects of Cycles, of course.

Edit: it occurs to me that the mysteriousness surrounding Arnold has actually played to its advantage. Yes, among individual users there are cries of vaporware (a manifestation of “I’ll believe it when I see it”), but it has been received quite well by the bigger studios. The exclusivity has obviously been a draw for some. That doesn’t mean it wouldn’t be suited for a more public release in the future, but it once again means that Marcos is a wise man. :wink:

Keeping it for large studios in the meantime sounds like a good decision; it is a great render engine, and no one can say otherwise.

I thought Arnold had been finished for many years already, though :stuck_out_tongue:

storm_st is doing some promising things, and he’s quite enthusiastic about it. If I were you guys, I’d cheer him on as well about doing some smaller features and get those patches accepted, while he plugs away at his bidirectional path tracing and volume features.

A big +1 to this. Honestly, I couldn’t care less about Arnold; it will most likely never be practical for me to buy it, and I’m not down with stealing software.

Mike,

my thoughts on GPU and CPU are: use the GPU for tweaking shaders etc. on scenes that only have a subset of your production render textures and geometry, for the fast iteration cycle, then switch to the CPU for finishing touches and final renders.

Also for textures and geometry: I really don’t see a reason not to move to textures and geometry that are always multiresolution, where only the highest resolution actually needed is loaded (as per Mari, but also in GEGL). Even when working with multiple terabytes of textures and geometry, the true geometry and textures sent to the GPU are then only about 4 GB and typically much less. On each frame update, have the CPU check whether the camera moved enough to require a geometry or texture resolution change, and if so send the new data to the GPU.

Also, stream tiles/passes as they are finished and composite them afterwards: as soon as a tile and/or pass takes up a certain threshold of RAM you can stream it to the CPU, free that memory, and begin accumulating more data; when the tile and/or pass are finished rendering they can be combined. Note that a similar process was used at Letwory Interactive for combining multiple renders of the same frame: each machine was doing 16 samples, across 24 machines, combined to give 24*16 samples for the single frame.
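That final combining step is just a sample-count-weighted average of the independent renders. A rough sketch of it (a hypothetical helper, not the actual tooling used at Letwory Interactive):

```cpp
// Rough sketch of merging several independent renders of the same frame:
// each render carries its own sample count, and the merged pixel is the
// sample-count-weighted average (24 machines x 16 samples -> 384 samples).
#include <cstddef>
#include <vector>

struct RenderResult {
    int samples;             // samples per pixel in this render
    std::vector<float> rgb;  // width * height * 3, already averaged
};

std::vector<float> mergeRenders(const std::vector<RenderResult>& renders) {
    if (renders.empty()) return {};
    std::vector<float> merged(renders[0].rgb.size(), 0.0f);
    long totalSamples = 0;
    for (const RenderResult& r : renders) {
        for (size_t i = 0; i < merged.size(); ++i)
            merged[i] += r.rgb[i] * r.samples;  // re-weight by sample count
        totalSamples += r.samples;
    }
    for (float& v : merged)
        v /= totalSamples;  // back to an average over all samples
    return merged;
}
```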

Note that some of this is redundant with what you said.

Regarding lazy loading - the CPU can probably do a crude ray trace and predict what the GPU will need.

Whew, indeed! Thanks for the fantastic answer to my question! I was expecting either a yes it is, or no it isn’t answer because of potential NDA limits, but you hit on nearly every point I was wondering about. I’m interested to see where Cycles goes in the future. I have long suspected that making GPU integration a high bullet point might hinder development down the road, but we shall see I suppose. I’m just now getting into learning the required skills to create/alter/add features to a raytracer, so hopefully within a year I’ll be able to look at Cycles code myself and get the answers I’m looking for, but in the meantime it’s very useful to have sources such as yourself to dumb it down for me.

What is the point of quoting a half-page post? Seriously?

Mike, regarding arbitrary shaders - I got the impression that Arnold is also using OSL (Open Shading Language)

We haven’t yet written anything designed around OpenCL and the idea of being able to target both CPU and GPU interchangeably. If that happened, one of the first places we’d leverage it would be in Open Shading Language, in the shaders that run inside Arnold. That would be a great optimisation for us.

Perhaps I’m mistaken and it is just ImageWorks custom version of Arnold that is using it though.

I really like that last idea you mentioned there: you can get the CPU to predict quite a bit of what the GPU will need, and you can also potentially re-use a previous (frame?) to gather estimates as well. However, even one round of stall+reupload to the GPU can be a performance killer. CPU-based renders suffer a little bit this way too, if you think about it: hit something that isn’t loaded yet, and you have to stall at least that render thread, and other threads may pile up behind it. But the loading latency on CPU is usually small enough that it’s still a win to not have to load it up front.

As for tiling, that helps…some. Until you do many bounces of GI, that is. Typically rays get very incoherent very quickly, and they can and will go almost anywhere. You’ll end up needing so much that it doesn’t actually save you memory, but it does mean you can break up the task among multiple machines/devices.

However…I know that doing tiled renders can be a major, major pain in the rear depending on how it is set up. Oftentimes the overhead of setting up the render on multiple machines will outstrip the benefits.

Chill man, I hit reply with quote instead of just reply. Not really a big deal.

Hmm, we could/should do variable bit rate decompression for textures as well; that should greatly increase the amount of GPU RAM available.

We could also do streaming of certain data.

We might be able to compress output tiles as well.
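Not variable bit rate, but as a simpler illustration of the same memory-saving idea (purely a sketch, with made-up helpers): storing float textures as half floats on the device is a fixed 2x saving.

```cpp
// Sketch: halve the GPU footprint of float texture data by storing it as
// IEEE half floats (round toward zero, no NaN payload handling; good
// enough for texture data in an illustration like this).
#include <cstdint>
#include <cstring>
#include <vector>

uint16_t floatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t exponent = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mantissa = bits & 0x7FFFFFu;
    if (exponent <= 0)  return (uint16_t)sign;              // flush to zero
    if (exponent >= 31) return (uint16_t)(sign | 0x7C00u);  // overflow -> inf
    return (uint16_t)(sign | (uint32_t)(exponent << 10) | (mantissa >> 13));
}

// Convert a full-float pixel buffer into half floats before upload.
std::vector<uint16_t> compressTexture(const std::vector<float>& pixels) {
    std::vector<uint16_t> out(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i)
        out[i] = floatToHalf(pixels[i]);
    return out;
}
```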

Correct, that is the SPI version of Arnold. If any of you follow the OSL mailing lists, you can see all the adventures Larry Gritz et al have had in getting that implemented for Arnold. They have some pretty impressive results, actually.