What's the trick to P4/AMD optimization?

I hear a lot about code optimization, but I want to know how you do it. I’ve got an Athlon 64 X2 dual-core AMD64 processor, and here are the gcc options that I use:
-O3 -march=k8

It doesn’t seem to make much of a difference. What can I do?

The kind of optimization you’re hearing about has to do with the way your source code is converted to machine code.

Programs written in high-level languages (things humans like to read, like C/C++, Python, etc.) don’t really represent the details of how the computer likes to think of things.

As an example, here is something simple in C:

int Least_Common_Multiple( int a, int b ) {

  int ma = a;
  int nb = b;

  while (ma != nb) {
    if (ma < nb) ma += a;
    else         nb += b;
    }

  return ma;
  }

Here it is in Assembly language:

lcm proc near
    push ebp
    mov  ebp, esp
    push ebx
    mov  ecx, dword ptr [ebp +12]  ; ecx <-- b
    mov  ebx, dword ptr [ebp +8]   ; ebx <-- a

    mov  edx, ebx                  ; edx <-- ma
    mov  eax, ecx                  ; eax <-- nb


    cmp  eax, edx                  ; enter the while loop
    je   short @3

@2:
    cmp  eax, edx                  ; if part
    jle  short @4
    add  edx, ebx
    jmp  short @5

@4:
    add  eax, ecx                  ; else part

@5:
    cmp  eax, edx                  ; while test (again)
    jne  short @2

@3:
    mov  eax, edx                  ; eax <-- result (ma)

    pop  ebx                       ; end function
    pop  ebp
    ret
lcm endp

The first thing to notice is that it looks like it takes a lot more to do it. It’s really the same thing, but the computer can only do little things at a time. For the ‘if part’, the computer first compares the two integers (cmp), then jumps if less than or equal (jle) to the ‘else part’. Pay attention to the fact that register eax holds ‘nb’ and edx holds ‘ma’, so ‘jle’ here means ‘nb <= ma’, which is the opposite of the C test ‘ma < nb’. (A jump is a goto.)
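If it helps, here is roughly how that assembly reads when written back in C with gotos. This is only a sketch: the function name lcm_goto_style and the label names are made up for illustration, and the comments show which instruction each line stands for.

```c
/* Hypothetical name: a line-by-line C rendering of the assembly listing. */
int lcm_goto_style(int a, int b) {
  int ma = a;                    /* edx */
  int nb = b;                    /* eax */

  if (nb == ma) goto done;       /* cmp eax, edx / je  @3  (enter the loop) */
again:
  if (nb <= ma) goto else_part;  /* cmp eax, edx / jle @4  (if part)        */
  ma += a;                       /* add edx, ebx                            */
  goto test;                     /* jmp @5                                  */
else_part:
  nb += b;                       /* add eax, ecx           (else part)      */
test:
  if (nb != ma) goto again;      /* cmp eax, edx / jne @2  (while test)     */
done:
  return ma;                     /* mov eax, edx                            */
}
```

Calling lcm_goto_style(4, 6) walks through ma = 8, nb = 12, ma = 12 and returns 12, the same as the original function.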

The next thing to notice is how some things are optimized. The value of ‘a’ is first fetched from memory (the stack, for you coders out there) and stored in one of the computer’s named, quick-access memories called registers: eax, ebx, ecx, et cetera. After that, the loop works entirely in registers and never touches memory again.

One thing that was not optimized very well was the while loop condition. You’ll notice that it appears twice: once to enter the loop, and again at the end to repeat or exit it.

A better optimization would replace the ‘enter the loop’ part with a single instruction to just jump right away to @5. This makes the code smaller (by a few bytes) and doesn’t impact its speed significantly.
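In C-with-gotos terms, that layout would look something like the sketch below. The name lcm_single_test and the labels are hypothetical, and a real compiler may well arrange things differently.

```c
/* Hypothetical name: the loop test appears once, and the loop is
   entered by jumping straight to it. */
int lcm_single_test(int a, int b) {
  int ma = a;
  int nb = b;

  goto test;                  /* one jump replaces the duplicate compare */
again:
  if (ma < nb) ma += a;       /* if part   */
  else         nb += b;       /* else part */
test:
  if (ma != nb) goto again;   /* the only while test */
  return ma;
}
```

The result is the same as before; the only change is that the comparison is written once instead of twice.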

What you’ve been looking at above is 80386+ assembly code. That’s fine if you’re using a PC, but what if your computer uses a different chip (a PowerPC Mac, a Sun SPARC, etc.)? The assembly code, even though it does the same thing, looks and works differently. Some things work faster on one type of hardware, while trying the same thing on other hardware could actually make it slower.

The Intel-Optimized Blender builds are compiled (converted to assembly --> machine code) to optimize for speed and size (in that order). This is done using compilers that are specifically designed with an Intel processor in mind, instead of a general-purpose compiler like GCC.

Hope this answers your question.

There is a thread regarding this topic:

https://blenderartists.org/forum/viewtopic.php?t=17626

I understand all that, but how do I tell the compiler to optimize with amd64 in mind?
I’ve tried to optimize it before, but it doesn’t seem to speed things up, and I want to know what I’m doing wrong.

You might not be doing anything wrong except trying to use gcc [though I don’t know about the Intel compiler and 64-bit output].

-O3 -pipe -ffast-math -funroll-all-loops -fomit-frame-pointer -momit-leaf-frame-pointer -mfpmath=sse -march=yourcpu

For some reason, scons produces a marginally faster build than plain make.
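If you want to check whether a set of flags actually changes anything, one option is to time a small piece of code built with and without them. Below is a minimal sketch using clock() from the standard library. The helper name seconds_for_lcm_calls and the 1..50 input range are arbitrary choices for illustration, and a toy loop like this says little about how a large program such as Blender will behave.

```c
#include <time.h>

/* Same algorithm as Least_Common_Multiple earlier in the thread. */
static int least_common_multiple(int a, int b) {
  int ma = a;
  int nb = b;
  while (ma != nb) {
    if (ma < nb) ma += a;
    else         nb += b;
  }
  return ma;
}

/* Hypothetical helper: CPU seconds spent on `repeats` passes over
   every pair (a, b) with 1 <= a, b <= 50. */
double seconds_for_lcm_calls(int repeats) {
  volatile int sink = 0;  /* volatile store keeps the calls from being dropped */
  int r, a, b;
  clock_t start, stop;

  start = clock();
  for (r = 0; r < repeats; r++)
    for (a = 1; a <= 50; a++)
      for (b = 1; b <= 50; b++)
        sink = least_common_multiple(a, b);
  stop = clock();

  return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call it from main with a repeat count large enough to run for a second or more, print the result, and compare a build made with plain -O2 against one made with the flags above.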

This article should help you decide which gcc compiler switches to use for amd64:

http://www.coyotegulch.com/products/acovea/aco5k8gcc40.html

GreyBeard