Impact of modern CPU extensions on Cycles performance?

Hello and good day, I have been wondering to what extent Cycles is sped up by instruction sets like SSE4 and especially AVX. The latter is currently supported by Sandy Bridge and Ivy Bridge processors as well as AMD Bulldozer, on Linux and on Windows 7 SP1 onward; I don't know about MacOS. I know the official builds don't enable these instructions, but unfortunately I can't seem to find any AVX builds on graphicall either. Does anyone know what speed impact they have on Cycles?

Cycles does not at present use any vector extensions in its code. Some compilers can detect suitable loops and automatically translate them to use vector extensions (autovectorization); I've seen that used in this build: http://graphicall.org/946
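To give a rough idea of what autovectorization means (my own toy example, nothing from the Cycles source): a simple loop like the one below can be turned into SSE/AVX instructions by the compiler itself when building with something like GCC/Clang `-O3 -mavx` or MSVC `/arch:AVX`.

```cpp
// Tiny example of code a compiler can autovectorize (illustrative only).
// With optimization and AVX enabled, the compiler may process several
// array elements per instruction instead of one at a time.
void scale(float* data, float factor, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}
```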

This build's performance actually sparked my interest in seeing SSE4/AVX builds :wink: I am a bit of a noob on this, but from my understanding SSE3 should give far less of a speedup over the default SSE2 path (?) than the AVX extensions would. How FPU-bound is Cycles anyway? AVX could seriously affect your buying decision. Maybe a 1090T (6 FPUs) gets destroyed by a now "overlooked" 8120 (4 FPUs) under AVX code?

I don't think the difference between AVX and SSE is big in code that isn't written for it, but I don't know. Cycles is heavy on float math, but that doesn't mean it vectorizes well. Not all of that build's performance is due to autovectorization, either.

I've read that AVX is useless, or even detrimental, for path tracers because of something to do with cache misses. I can't remember where I saw it, but it was in an article directly out of Intel Labs, so I would tend to trust it.

To benefit from AVX instructions, the code must be written for it: load 8 floats into a 256-bit register, perform the same operation (addition, multiplication, …) on them against another 8 floats, and store the 8 results in consecutive places in memory.
Do you see now why you will not get any advantage? Blender is not doing that.
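As a rough illustration of that 8-wide pattern (my own sketch with AVX intrinsics, not anything from Blender's code), this is what code that actually benefits from AVX tends to look like:

```cpp
// Minimal sketch of the 8-wide pattern described above, using AVX intrinsics.
// Illustrative only; assumes n is a multiple of 8.
#include <immintrin.h>

// out[i] = a[i] * b[i] + c[i], processed 8 floats at a time.
void madd_avx(const float* a, const float* b, const float* c, float* out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 r  = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);  // 8 multiplies + 8 adds
        _mm256_storeu_ps(out + i, r);         // store 8 results contiguously
    }
}
```

Unless the data layout and the math are organized so 8 lanes can be processed at once like this, the AVX units mostly sit idle.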

But of course restructuring the BVH to work this way would be useful.
For example, QBVH means doing it 4-wide: the same ray (replicated 4 times) is intersected against 4 triangles or 4 child nodes at once using SIMD instructions.
It would be useful to adapt the QBVH into a wider "VBVH" using AVX instructions. But I don't have AVX, so I don't care.
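For anyone curious what that 4-wide QBVH idea looks like in code, here is a rough sketch (made-up node layout and names, not Cycles code): one ray is tested against the bounding boxes of 4 child nodes in a single pass with SSE. The same pattern could in principle be widened to 8 children with AVX.

```cpp
// Sketch of a 4-wide QBVH node test with SSE intrinsics (illustrative only).
#include <xmmintrin.h>

struct alignas(16) QBVHNode {
    float bmin[3][4];  // min x/y/z of the 4 child boxes, structure-of-arrays
    float bmax[3][4];  // max x/y/z of the 4 child boxes
};

// Returns a 4-bit mask telling which of the 4 child boxes the ray hits.
int intersect4(const QBVHNode& node,
               const float org[3], const float inv_dir[3],
               float t_min, float t_max)
{
    __m128 tmin = _mm_set1_ps(t_min);
    __m128 tmax = _mm_set1_ps(t_max);

    for (int axis = 0; axis < 3; ++axis) {
        __m128 bmin = _mm_load_ps(node.bmin[axis]);   // 4 box minima on this axis
        __m128 bmax = _mm_load_ps(node.bmax[axis]);   // 4 box maxima on this axis
        __m128 o    = _mm_set1_ps(org[axis]);         // ray origin, replicated 4x
        __m128 d    = _mm_set1_ps(inv_dir[axis]);     // 1/direction, replicated 4x

        // Slab test against all 4 boxes in parallel.
        __m128 t0 = _mm_mul_ps(_mm_sub_ps(bmin, o), d);
        __m128 t1 = _mm_mul_ps(_mm_sub_ps(bmax, o), d);
        tmin = _mm_max_ps(tmin, _mm_min_ps(t0, t1));
        tmax = _mm_min_ps(tmax, _mm_max_ps(t0, t1));
    }
    // Lane i hits if its entry distance does not exceed its exit distance.
    return _mm_movemask_ps(_mm_cmple_ps(tmin, tmax));
}
```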

Also, why would someone code for 8 lanes when they can code for hundreds of processors (GPU)? The GPU destroyed the CPU kingdom.

I read the same thing.

@Bao2 I have read several posts from you and you always give very detailed explanations. Are you a developer yourself, or do you do builds? On the GPU side of things: this might be a silly question, but how optimized do you think the GPU code is?