Difference between CPU and GPU

How come a GPU can render faster than a CPU?

Is it possible to build a CPU with a GPU architecture as a base?
I mean not like an integrated GPU, but a completely new architecture.
Basically, to harness the GPU's advantage on a CPU.

In the beginning, GPUs were specialized hardware that was very efficient at particular operations relevant to 3D rendering, such as triangle rasterization and texture sampling. Only with the advent of programmable shaders did GPU architectures shift more towards general-purpose computing.

GPUs are faster than CPUs if your problem is data-parallel, that is, if you can perform lots of similar operations at the same time on a large amount of data.

Take the example of a list of 1 million numbers where you want to multiply each number by two. No number in the list depends on any other number in the list, so you could task a million workers with independently performing the operation on just one item each, in principle yielding a million-fold speedup over a single worker.
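
As a rough sketch of what that looks like in code (assuming a C++ compiler with OpenMP enabled via `-fopenmp`; the element count and values are just placeholders, not anything from the answer above):

```cpp
#include <vector>

int main() {
    // A list of 1 million numbers; doubling one element never depends
    // on any other element, so every iteration is independent.
    std::vector<float> numbers(1'000'000, 1.5f);

    // OpenMP splits the loop across however many worker threads the
    // machine offers; a GPU does the same thing at a much larger scale.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(numbers.size()); ++i)
        numbers[i] *= 2.0f;

    return 0;
}
```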

It turns out that a lot of problems in graphics can be handled like this, e.g. you can trace a ray for a pixel independently of all other pixels, so you can easily parallelize up to the number of pixels in your image.
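
A toy version of that per-pixel independence; `trace_ray` here is a made-up stand-in for a real ray tracer (not Blender's), and the OpenMP pragma again stands in for what a GPU would do with thousands of threads:

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a real ray tracer: the colour of a pixel depends only
// on that pixel's own ray, never on neighbouring pixels.
static float trace_ray(int x, int y) {
    return static_cast<float>((x ^ y) & 0xFF) / 255.0f;
}

std::vector<float> render(int width, int height) {
    std::vector<float> image(static_cast<std::size_t>(width) * height);
    // Every pixel is independent, so the outer loop can be handed to
    // thousands of GPU threads, or here to a few CPU threads.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = trace_ray(x, y);
    return image;
}

int main() { render(640, 480); return 0; }
```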

However, not all problems are parallel in nature. Often you depend on some previous result before the computation can move forward. That's why many programs (including Blender) still use only one CPU core for the most part. GPUs are in fact terrible at this sort of work.
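
For contrast, here is a tiny example of a computation that resists this kind of parallelism; the recurrence itself is just an illustration, but the point is that every step needs the result of the step before it:

```cpp
// Each iteration uses the state produced by the previous one, so there
// is nothing to hand out to independent workers: the chain is serial.
float simulate(int steps) {
    float state = 1.0f;
    for (int i = 0; i < steps; ++i)
        state = state * 0.5f + 1.0f;  // depends on the result of step i-1
    return state;
}

int main() { simulate(1000); return 0; }
```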

In practice, even CPUs implement some features for data-parallelism on a small scale, like SIMD/SSE, which lets you perform operations on 4 or 8 (or more) items in a single instruction. Many CPUs also speculatively execute code even when it depends on a result that isn't known yet. Most CPUs are also pipelined and superscalar, which means that independent instructions can be in flight at the same time. All this makes CPU cores more complex (and therefore larger).
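
To illustrate the SIMD part, a small sketch using x86 SSE intrinsics (the function name is made up; wider instruction sets like AVX and AVX-512 work the same way on 8 or 16 floats):

```cpp
#include <xmmintrin.h>  // SSE intrinsics, available on x86 CPUs

// Double four floats with a single multiply instruction (128-bit SSE lanes).
void double_four(float* data) {
    __m128 v = _mm_loadu_ps(data);         // load 4 floats at once
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));  // 4 multiplications in one instruction
    _mm_storeu_ps(data, v);                // store the 4 results back
}

int main() {
    float values[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    double_four(values);
    return 0;
}
```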

There are processors that are optimized for data-parallel processing that aren't GPUs: the Xeon Phi is one such device; it's basically a bunch of simple Intel Atom cores with extra-wide SIMD units. If you ran an algorithm that isn't parallel on such a device, it would be quite slow compared to a normal CPU.

On a desktop/workstation, you always want a CPU that is optimized for serial processing, and you might want another processor that is optimized for parallel processing if you have a use for it.

These are not perfect examples by any means, but they should help visualise what Beer Baron is saying.

This is a die shot of an i7 2500k, showing how it has 4 large cores, the cache, controllers and its iGPU.

This is a logical layout of an NVIDIA GTX 1080; notice how it has many, many tiny cores. At the very least, each of the smaller groups in the GPU can only do one type of task at a time, so they must all have the same job, such as calculating 32 rays at once…