Optimize a 64bit build?

Hello everyone!

I have a short question… is it possible to have an SSE3 optimized 64bit build of Blender? 'Cause up till now I saw either generic 64bit builds or optimized 32bit builds. Or am I getting something seriously wrong here? (Bear with me! ;))

Best regards & thanks in advance!

Yeah, I’m sure it is. Other than possible problems with using SSE, same as any optimisation.

Maybe since blender isn’t fully 64bit safe yet people haven’t distributed any 64 bit optimised builds? (Edit - be a pioneer)

I thought Blender is 64bit safe now (since 2.44)?

You know there really should be a sticky somewhere that gives good information about optimization on blender. It’s a very gray area. That would be perfect in a hardware/system configuration/benchmarking/performance/compatibility forum.

Yes, I would definitely LOVE to see that somewhere…

Well (scalar) SSE2 will always be used for 64bit code, basically because x87 has been declared deprecated by AMD and Microsoft with the introduction of AMD64 architecture.

I don’t think Visual C++ or GCC can automatically generate SSE3 instruction out of regular source code, you have to use the intrinsics (blender doesn’t), so SSE3/SSSE3 for blender will only work with the Intel Compiler, which is the only one featuring advanced auto-vectorization anyway (since it’s the vector instructions that make the huge difference, not scalar SSE/SSE2)

Eventually i’ll try gcc’s loop-vectorization on blender, but for yafray it is definitely useless.

Maybe, I don’t keep up with that area of work, I recall the ongoing work though. I wasn’t sure if it had been finished.

Wish we could have a 64 bit version in Windows x64, 64 bit linux is still in early development (it breaks all the time, and you only get to use about 1/10th of the software titles available in linux :( unless you hack a 32 bit jail.)

I cannot use it in a production environment because of this :(

32 bit blender is still cool for me though, I don’t make huge scenes that require me to use over 2 gigs of RAM (yet).

Hey Mmph, 64bit linux is actually pretty stable… Most apps are available with 64 bit versions. Just check out any of the 64bit live cds. The repos also have quite a few apps. There was something about the ubuntu repos where the 32bit repos had around 11500 apps and the 64bit repos had around 10500. Not much of a dif.

I’m using 64bit linux on my work station and everything works great.

Yea i have to agree with Dan, 64bit Linux works very well, i have used it every day for over 6 months and it didn’t break once… The AMD64 ubuntu repository is virtually the same as the 32bit one. All drivers are available and working too.
The only 32bit app i use regularly is Firefox because of that stupid Macrom… err Adobe Flash plugin.

About Win64, someone just finally needs to fix the long != 64bit problem, without breaking other platforms of course, and check if all required libraries work in 64bit too. But there just doesn’t seem to be a developer for that task…
until then there will only be half-broken unofficial Win64 builds.

Why not start by gathering here 64bit (or other) optimization tips?

Let’s see:

my user-config.py contains these lines:

CCFLAGS.extend(['-march=athlon64', '-O3', '-ffast-math', '-mmmx', '-msse', '-msse2', '-msse3', '-m3dnow'])
CXXFLAGS.extend(['-march=athlon64', '-O3', '-ffast-math', '-mmmx', '-msse', '-msse2', '-msse3', '-m3dnow'])

As far as I understand (I’m no gcc expert):
-march=athlon64 is the important one. Several other gcc parameters are activated or not depending on the CPU type.
-O3 is the optimization level. The higher the number, the higher the optimization and the compile time.
-ffast-math is something I don’t really understand (can someone explain it to me?) but on every graphical app forum they tell you to use it, since, well, it must make some calculations run faster
-mmmx as far as I understood is disabled automatically by the sse2 / sse3 params but it didn’t hurt so I kept it there
-msse is normally enabled automatically by -march=athlon64 but I couldn’t confirm
-msse2 the same
-msse3 i’m confused now, reading the posts above, does it really help?
-m3dnow is the special set of 3d instructions of AMD CPUs

That’s it, actually it gives me a quite fast build… compared to my basic Windows 32bit one, the difference is incredible. Ah, another important thing is to strip the blender executable once it is compiled (use the “strip” command on it); it will reduce its file size a lot.

Can some of you gcc experts comment on this and help improve it? (and explain the sse part better than me)…

-march=athlon64 most importantly implies -mtune=athlon64 which tells the compiler to optimize for the characteristics of Athlon64 (well actually all K8 based ) CPUs.
And yes, to my knowledge it makes -msse, -msse2 and -m3dnow redundant.
Also, the 64bit compiler automatically assumes -msse and -msse2, all x86_64 CPUs feature them. And i don’t think you will persuade the 64bit compiler to use mmx or 3dnow on its own…they use x87 registers too and hence have been declared deprecated together with x87 in 64bit mode, apparently someone hoped to remove necessity of saving those registers on context switching, but so far no OS dared to really do that.
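For anyone who wants to confirm which instruction sets a given -march really switches on, here is a minimal sketch. It relies on GCC printing its predefined macros with `-dM -E` and on the usual `__SSE__`-style macro names; the helper names (`enabled_isa`, `dump_macros`) are made up for this post, and a gcc on the PATH is assumed.

```python
import os
import subprocess

# Macros GCC predefines when the matching instruction set is enabled.
ISA_MACROS = {
    "mmx": "__MMX__",
    "sse": "__SSE__",
    "sse2": "__SSE2__",
    "sse3": "__SSE3__",
    "3dnow": "__3dNOW__",
}


def enabled_isa(macro_dump):
    """Return the instruction sets whose macro appears in `gcc -dM -E` output."""
    defined = set()
    for line in macro_dump.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            defined.add(parts[1])
    return sorted(name for name, macro in ISA_MACROS.items() if macro in defined)


def dump_macros(flags, compiler="gcc"):
    """Ask the compiler for its predefined macros under the given flags."""
    # Preprocess an empty input so only the predefined macros are printed.
    cmd = [compiler] + list(flags) + ["-dM", "-E", "-x", "c", os.devnull]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `enabled_isa(dump_macros(["-march=athlon64"]))` and comparing it with the result for an empty flag list should show what the -march adds on your own compiler, without having to trust anyone’s memory.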

A common misconception is that -msse etc. make the compiler use those instructions wherever possible. This is not the case, the compiler may use those instructions and it allows the use of intrinsic functions available in GCC.
To make the compiler use SSE(2) instructions automatically for normal floating point code, you have to add -mfpmath=sse. But once more, this is the default for the 64bit compiler.
Note that this far, GCC will only use scalar SSE instructions. Those are instructions that take one value per register, no packed values (vectors).

To benefit from vector instructions, you can try -ftree-vectorize. It will try to parallelize loop iterations by using vector instructions. This requires at least gcc 4.0.
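To see whether the vectorizer actually does anything, one option is to compile a tiny loop and read the compiler’s notes. This is only a sketch under assumptions: it uses `-fopt-info-vec`, which needs a much newer GCC (4.9 or later) than the ones discussed here (older releases had `-ftree-vectorizer-verbose=N` instead), and the C loop and function names are invented for illustration.

```python
import os
import subprocess
import tempfile

# A loop simple enough that the vectorizer has a fair chance with it.
C_SOURCE = """
void add(float *a, const float *b, const float *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
"""


def vectorize_command(src, obj, compiler="gcc"):
    """Build the compile command; -fopt-info-vec assumes GCC 4.9 or newer."""
    return [compiler, "-O2", "-ftree-vectorize", "-fopt-info-vec",
            "-c", src, "-o", obj]


def vectorizer_report(source=C_SOURCE, compiler="gcc"):
    """Compile the snippet and return whatever the compiler reported."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "loop.c")
        obj = os.path.join(tmp, "loop.o")
        with open(src, "w") as f:
            f.write(source)
        result = subprocess.run(vectorize_command(src, obj, compiler),
                                capture_output=True, text=True)
        # GCC writes its optimization notes to stderr by default.
        return result.stderr
```

`print(vectorizer_report())` should list the loops the compiler managed to vectorize, if any. Swapping in a loop from real code gives a quick idea of how vectorizer-friendly it is.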

-O3 is the highest optimization level, but it is not necessarily faster than O2, it mainly does aggressive unrolling and inlining which can bloat code size so much that it is slower than before “optimizing”, only trying will tell (there’s some more options to control inlining limit btw…)
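Since “only trying will tell”, a small timing harness helps. The sketch below runs each build a few times on the same job and keeps the fastest run; the function names, binary paths and .blend file are placeholders, not anything from the Blender tree.

```python
import subprocess
import time


def best_time(cmd, runs=3):
    """Run `cmd` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(builds, args, runs=3):
    """Time each build on the same arguments; returns {label: seconds}."""
    return {label: best_time([path] + list(args), runs)
            for label, path in builds.items()}
```

For example, `compare({"O2": "./blender-O2", "O3": "./blender-O3"}, ["-b", "test.blend", "-f", "1"])` would render frame 1 of a (hypothetical) test.blend in background mode with both builds. Taking the minimum of several runs reduces noise from whatever else the machine is doing.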

-ffast-math is not a very safe option, it allows the compiler all kinds of little tricks outside the IEEE specifications in order to make life for the CPU a bit easier. Code relying on those specifications however may break. So it’s definitely a “use at own risk” option.

In addition, you probably should not remove ‘-fPIC’, ‘-funsigned-char’ and ‘-fno-strict-aliasing’, seems the code relies on those, especially omitting the last one causes some not-so-nice warnings…
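Putting the advice so far together, a user-config.py could look roughly like the sketch below. The initial CCFLAGS/CXXFLAGS lists are stand-ins so the snippet runs on its own (in a real user-config.py they are assumed to be defined by the build system already), `add_flags` is a made-up helper, and -O2 versus -O3 is left for you to benchmark.

```python
# Stand-ins for the lists the build system normally provides; in a real
# user-config.py these names are assumed to exist already.
CCFLAGS = ['-fPIC', '-funsigned-char', '-fno-strict-aliasing']
CXXFLAGS = list(CCFLAGS)

# Flags the code is said to rely on; keep them whatever else changes.
REQUIRED = ['-fPIC', '-funsigned-char', '-fno-strict-aliasing']

# Tuning flags from this thread. -march=athlon64 already covers SSE, SSE2
# and 3DNow!; keep -msse3 only if your particular CPU supports SSE3.
TUNING = ['-march=athlon64', '-O2', '-msse3']


def add_flags(existing, extra):
    """Append flags from `extra` that are not already present, keeping order."""
    for flag in extra:
        if flag not in existing:
            existing.append(flag)
    return existing


for flags in (CCFLAGS, CXXFLAGS):
    add_flags(flags, REQUIRED + TUNING)
```

Appending only missing flags avoids the duplicates you get from a plain extend(), and makes it harder to drop the required flags by accident.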

With 4.2.x, GCC added the compiler flags “-march=native” and “-mtune=native”. It mostly covers what your processor can safely do automagically. There is of course room for further optimizations…
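If you are curious what your processor reports before trusting “native”, on Linux the kernel lists the CPU features in /proc/cpuinfo. A minimal sketch follows, with invented function names; note that SSE3 shows up there as “pni”.

```python
# Names used in the "flags" line of /proc/cpuinfo; SSE3 is listed as "pni".
CPUINFO_NAMES = {
    "mmx": "mmx",
    "sse": "sse",
    "sse2": "sse2",
    "sse3": "pni",
    "ssse3": "ssse3",
    "3dnow": "3dnow",
}


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supported_simd(cpuinfo_text):
    """Map each instruction set of interest to True/False for this CPU."""
    flags = cpu_flags(cpuinfo_text)
    return {name: token in flags for name, token in CPUINFO_NAMES.items()}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as f:
        return f.read()
```

`supported_simd(read_cpuinfo())` gives a quick yes/no per instruction set, which is handy when deciding whether a flag like -msse3 is safe for the machine the build will run on.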

This all sounds very interesting, though I have a feeling that someone new to compiling (such as myself) may have trouble understanding what the heck y’all are talking about. :o People who already know about compiling stuff won’t necessarily need an easy step by step walkthrough of what it actually means, but it would def. be nice to have it all explained in layman’s terms how the process of compiling an app actually takes place. Specifically Blender, as I feel Blender is the only app people may be interested in actually compiling, for the performance boost in linux. So they may not need to understand much about compiling - only to know how to build blender.

So, we first need the source code, right, and then we need a compiler called gcc. Would it then be a matter of typing in the instructions for the compiler to make the app? How many instructions does this entail? For example the tar.gz package that blender comes in for linux from blender.org will run directly out of its extracted folder. Presumably that means that it is a binary package?

So maybe if we can collect some easy to follow instructions in this thread, we can put them into some kind of tutorial for compiling and optimising blender? Something that people who’ve never compiled anything before could follow, that is.

yorik: also, doing ‘-mmmx’, ‘-msse’, ‘-msse2’, ‘-msse3’ is all redundant, since sse includes mmx, sse2 includes sse, and so on. Requesting sse3 is all that’s necessary.

For my Core 2 Duo builds on 64 bit linux, I’m using -march=nocona, because Ubuntu is using GCC 4.1.x, and the core2 arch doesn’t come in until GCC 4.2.

finally, if I understand you all well, only specifying -march=athlon64 would, alone, do almost all optimization… the rest would be only very delicate fine-tuning for the real enthusiast…
Well, this is actually very valuable information, thanks!

I think I’ll begin to use -march=athlon64 in everything I compile…

Seems gcc is actually much smarter than it seems at first look…

@Dan: There are a couple of step-by-step instructions for compiling blender on the net, most of the time it depends on the linux distribution you have, since not all have the same tools and libraries… For example, for ubuntu there is a guide on the blender wiki site.

Hey Mmph!

Just wanna tell you I have been using Fedora 64bit on an everyday basis for over a year now, and up till now nothing has broken :)

@all: Thanks for the great response, that was far more than I expected! I’ll try and mess around with the options a little and see what I come up with. Maybe we can post fast & working options here.

Thanks very much & greets!

Well let’s say it this way, -O* and -march=* are pretty much the only general optimizations left on x86-64. The -march does not speed up things terribly, a couple of percent here and there.
All the other things you had to add in x86 environments to squeeze your modern CPU - like -msse2 -mfpmath=sse -fomit-frame-pointer - simply became redundant as all x86-64 CPUs can do that without problems.
What’s left are deeper tweaks, like fiddling with the maximum size of loops to unroll and functions to inline, playing with loop vectorizer etc…
or the less safe math optimizations, which you can enable all together with -ffast-math (kind of the “risk all or nothing” flag)

Things like SSE3/SSSE3 have to be used as intrinsics in code currently to benefit from enabling them, and also the vectorizer is only useful if the code really is written vectorizer-friendly.