Demystifying the new Xeon Phi: A possible game-changer in highly parallel computing

From the looks of it, the Xeon Phi appears to have several advantages over an equivalent Nvidia or ATI card.

You have OpenCL support (along with numerous other programming models), full double-precision processing capability, higher memory bandwidth, and more advanced math capabilities. The linked article does note that some information may be incomplete because they’re still trying to sort through all of the claims and rumors, but they do seem to be getting somewhere.

I’m guessing that now that there are three players in the consumer supercomputing market, things could really heat up, which would in turn lead to faster advances in the capability of this type of hardware.

Your thoughts?

Hm ace, did you read something wrong? :slight_smile:
A GTX 680 is 3 times faster, 5 times cheaper, and uses less power.
Or did I get something wrong? For scientific calculations it may be different, but for Cycles?

Cheers, mib.

I was kind of in a hurry to read through and post this; I guess I read some of the statistics backwards instead of forwards. O.o

Anyway, the article says that the statistics may not be complete as information is still coming in, so they may trend more or less in favor of the Phi when all is said and done.

It looks to me like the focus on supporting more instruction types may have reduced the speed (the cores are less simple), but that way it’s easier to program parts of an application to work with it.

One of the huge advantages is that the Phi is more or less a co-processor, as I already explained in another thread.
Many, if not most, might not remember, or don’t even know, that the early Intel processors had no floating-point units.

The first FPUs were available as co-processors, so you had to get an Intel 8086+8087, 80286+80287, 80386+80287, or 80386+80387 if you wanted hardware floating-point support…
With the 80486 we finally got an integrated FPU (not the SX model, though).

The Xeon Phi is like a coprocessor to the Xeon processor family and should not be seen as a GPGPU replacement.
Oversimplified: you do not need to write special code for it.
You put the Phi in your system, re-compile your source code with the supplied compiler, and it will make use of the Phi.
I am sure there are optimizations you can make, but they’re not mandatory.
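For those curious what code for the Phi can look like when you do want to steer it by hand, here is a minimal sketch of Intel’s explicit offload pragma; the vector-add example, the array size, and the variable names are my own made-up illustration, not anything from the article.

```c
/* Minimal sketch (my own example) of Intel's explicit offload model for the
   Phi: mark a region, and the Intel compiler builds it for the card and
   handles the data transfers described in the clauses. With another compiler
   the pragma is simply ignored and the loop runs on the host. */
#include <stdio.h>

#define N 1000000

static float a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    /* Copy a and b to the Phi, run the parallel loop there, copy c back. */
    #pragma offload target(mic) in(a, b) out(c)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %g\n", c[42]);   /* expect 126 */
    return 0;
}
```

As far as I know, the “just recompile” claim leans mostly on libraries such as MKL, which can push suitable calls to the card automatically without any pragmas like the one above.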

Back in the day, coprocessors were a huuuuge performance boost; they were really expensive, and it took some time until they got integrated into the processor.
I guess this is development repeating itself to some extent, and before long we’ll have processors with a “Phi-unit” integrated.

Some might argue that the Phi is highly expensive, but as usual that’s firstly a matter of opinion and secondly rather egocentric.
I am sure there are plenty of scientific and industrial applications where the Phi is “just the thing”, and even if not, it’s an interesting step in processor development…

Now some demystifying:

First some generic benchmarking, with no Phi but the professional cards:

LuxMark 2.0, OpenCL, Scene: Room, 2.016 million tris [samples/s]

  • Tesla K20: …140
  • Tesla K20 ECC: …139
  • Quadro 4000: …143
  • Quadro K5000: …192
  • Quadro 5000: …204
  • Quadro K5000 + Tesla K20: …331
  • Quadro K5000 + Tesla K20 ECC: …325
  • Fire Pro W8000: …860
  • Fire Pro W9000: …1073

When asked about that abysmal performance, Nvidia said that their OpenCL driver is not that optimized and pointed towards CUDA raytracers… where a comparison with AMD is impossible, lol… well.

So much for OpenCL. I guess the Phi wasn’t available for the benchmark at that point, but you get the picture.

The interesting parts are the SGEMM/DGEMM and FFT operations. CUDA is highly optimized for them, MKL (the Xeon Phi libraries) is there as well, and AMD is rather… humble.
SGEMM/DGEMM are matrix operations; FFT is the fast Fourier transform.
GEMM = general matrix multiplication; S/D stand for single and double precision respectively.
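For those who want to see what a GEMM call actually is, below is a minimal single-precision example through the CBLAS interface that MKL exposes; the toy 2x2 matrices are my own, and the exact header and link flags depend on your setup, so treat this as a sketch rather than a recipe.

```c
/* Minimal SGEMM sketch: C = alpha*A*B + beta*C in single precision, using the
   standard CBLAS interface. Link flags are not shown and depend on your BLAS. */
#include <stdio.h>
#include <mkl.h>   /* assumption: MKL's header; <cblas.h> with another BLAS */

int main(void) {
    const int m = 2, n = 2, k = 2;
    float A[] = {1, 2,
                 3, 4};
    float B[] = {5, 6,
                 7, 8};
    float C[] = {0, 0,
                 0, 0};

    /* Row-major, no transposition: C (m x n) = 1.0 * A (m x k) * B (k x n) + 0.0 * C */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* expect 19 22 / 43 50 */
    return 0;
}
```

The numbers below essentially measure how fast this one routine runs on very large matrices.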

The performance for those operations looks like this:

SGEMM [GFLOPS]

  • Dual Xeon E5-2690 … ~600
  • FirePro W9000 … ~1700
  • Xeon Phi (60Core 1.1GHz model) … ~1700
  • Radeon HD7970 … ~2300
  • Tesla K20 ECC … ~2400

DGEMM [GFLOPS]

  • Dual Xeon E5-2690 … ~300
  • FirePro W9000 … ~500
  • Radeon HD7970 … ~700
  • Xeon Phi (60Core 1.1GHz model) … ~850
  • Tesla K20 ECC … ~1050

For those unfamiliar with numerical math and linear algebra: matrices are not only used for 3D stuff, they are also used, for instance, to solve complicated systems of equations.
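A tiny made-up example of what that means: the two equations on the left are the same thing as the matrix equation on the right, and dense solvers (LAPACK and friends) for huge versions of such systems are built on top of the very GEMM kernels benchmarked above.

```latex
\begin{aligned}
2x + y &= 5\\
x - 3y &= -1
\end{aligned}
\qquad\Longleftrightarrow\qquad
\underbrace{\begin{pmatrix} 2 & 1 \\ 1 & -3 \end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} x \\ y \end{pmatrix}}_{\mathbf{x}}
=
\underbrace{\begin{pmatrix} 5 \\ -1 \end{pmatrix}}_{\mathbf{b}}
\qquad (x = 2,\; y = 1)
```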

However, it’s clear that the Xeon Phi targets:

  • Double Precision
  • Scientific/industrial applications
  • Bleeding-edge systems where speed is absolutely essential and cost is irrelevant.

The conclusion: the Xeon Phi shouldn’t concern the average Blender user right now, but if one of “Intel’s Knights” is ever commanded to defend the peasants by making a reasonably priced co-processor with good single-precision performance…
I’d get one, just like I got an 80386 with a coprocessor back in the day. Provided there’s software around that utilizes it :wink:

My usual tech babble, but that should clear up some things.

They already have quantum computers.

Interesting numbers. The Xeon Phi should perform well here because it suffers less from the thread divergence and memory access patterns typical of path tracers (i.e. the score should be higher than what you would expect looking just at peak GFLOPS performance).
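A toy sketch of my own (nothing to do with the benchmark itself) of where that divergence comes from: every path terminates after a random, data-dependent number of bounces, so GPU threads that start together in a SIMD group quickly end up with different amounts of work, while the Phi’s more conventional cores are less sensitive to that.

```c
/* Toy illustration of path-length divergence: "Russian roulette" ends each
   path with probability 1/3 per bounce, so every sample does a different
   amount of work -- bad news for wide lockstep SIMD/GPU execution. */
#include <stdio.h>

/* Small portable LCG so the example is self-contained. */
static unsigned next_rand(unsigned *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

static int trace_sample(unsigned *seed) {
    int bounces = 0;
    while (bounces < 16 && next_rand(seed) % 3 != 0)
        bounces++;
    return bounces;
}

int main(void) {
    for (int i = 0; i < 8; i++) {
        unsigned seed = 42u + (unsigned)i;
        printf("sample %d: %d bounces\n", i, trace_sample(&seed));
    }
    return 0;
}
```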

Someone should remind NVIDIA that the whole Kepler family performs worse than the Fermi family with Cycles and Octane too (i.e. CUDA path tracers).

I think this was fixed with a new CUDA toolkit, but I’m not really sure. Even if the 680 is worse than the 580, that doesn’t mean the architecture is wrong for path tracers or GPGPU in general. There are already Teslas with another chip which could be better optimized for these tasks. The GTX 6xx cards are primarily for gaming.
Rumors say that in February Nvidia will release the “Titan”, a $1000 card with the chip used in the Tesla K20. A first benchmark (which might be faked) shows it is faster than a GTX 690 (basically 2x 680).
This card could be interesting for rendering, and it isn’t as overpriced as the Tesla cards. The Xeon Phi will also be quite expensive, I guess.
I really don’t know why some people make such a hype about the Xeon Phi. CUDA works well with Cycles, and you can get a CUDA-capable card for $200. We don’t know if the Xeon Phi will work with Cycles (don’t believe the Intel marketing, “just press compile and it works”, yeah sure), and anyway the Xeon Phi won’t be super powerful for just a few dollars.

This is going to be an interesting development. Apple will probably stop the Mac Pro line because the normal i7 CPUs themselves are already very fast. Intel, as well as other hardware producers, has also noticed a drop in desktop sales because those who don’t need all the power of a PC now have powerful tablets as an option or even a better alternative.

I just recently did a test with an i5 system, and it was only 3 times slower than a 12-core Xeon system.

Fishlike

What you seem to ignore is that code needs to be written specifically to work with CUDA; FPU code is different.
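To make that concrete with a rough sketch of my own (the saxpy function and its name are just an illustration): ordinary FPU code is nothing more than a plain loop that any C compiler turns into floating-point instructions, whereas getting the same work onto a CUDA device means genuinely rewriting it.

```c
/* Ordinary FPU code: just a C loop the compiler turns into x87/SSE/AVX
   instructions, so it runs (or recompiles) as-is on a CPU or, per the
   earlier post, the Phi toolchain. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];     /* plain scalar FP math, no special API */
}

/* To run the same operation under CUDA you cannot just recompile: the loop
   body has to be rewritten as a __global__ kernel, the arrays copied to the
   GPU with cudaMalloc/cudaMemcpy, and the kernel launched with a grid/block
   configuration -- i.e. genuinely different code, which is the point made
   above. */
```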