It is not correct to say that GPUs are inefficient in terms of performance per watt. On the contrary, they are very efficient in peak arithmetic throughput per watt if the problem is massively parallel and compute bound. This is why many Top500 systems include GPUs.
But not every problem has efficient algorithms that map nicely to GPU architectures. And the more complex your algorithm gets, the more likely it is that it can’t run efficiently in parallel on SIMD/SIMT architectures: code paths diverge, memory access patterns get more scattered, and so on.
And about development: implementing a non-trivial renderer efficiently on a GPU is 5-10 times more complex than just writing it for a CPU.
That holds even if you restrict yourself to one particular GPU brand and technology (e.g. Nvidia/CUDA, the first and most mature GPGPU environment). When you move to OpenCL the problems multiply, because then you have to fight not just one compiler and architecture platform but several.
This is why a lot of GPU renderers first concentrated on CUDA and later struggled hard to reach a good performance and feature set with OpenCL on AMD and Intel GPUs (and even on CPUs, for efficient usage of AVX etc.).
But now to the main topic: I don’t see FPGAs replacing or enhancing CPUs and GPUs for raytracing any time soon. They simply don’t have the user base that would justify the development effort. I also can’t imagine that OpenCL support for FPGAs is good enough to run non-trivial code.
And the compile times will surely be even worse than for GPUs. It already takes a long time to “compile” (mostly place and route) simple Verilog/VHDL designs for FPGAs.
FPGAs are great for doing a lot of bit-twiddling work in parallel. But for more complex operations it is usual to use “soft-core CPUs”: simple in-order RISC architectures defined in an HDL.
These are far behind “real” CPUs in feature set and performance, and are mostly used in embedded designs to drive an SoC (system on a chip) with various special function units that handle specific problems efficiently.
Some people have designed ray traversal/intersection accelerators on FPGAs. But often they could not even use floating-point math, because most FPGAs don’t provide dedicated hardware for it. They may have 18-bit integer multiplier blocks you can use for fixed-point DSP-like work.
But as soon as you need even a simple soft-core FPU, the resource usage of your soft CPU grows considerably.
And we are not even talking about the number-crunching FPUs in GPUs, which have efficient special function units for many of the higher-level math functions needed in graphics shading and HPC. We are just talking about basic operations like add/sub/mul/div.
Even a sqrt with high throughput takes a lot of logic.
Surely high-end FPGAs may include more floating-point hardware over time. But I still doubt that this will be easier to program than GPUs. And as I said: as long as a high-end FPGA is out of reach for the mass market, GPUs have a big advantage.
It’s economies of scale: the high development costs of GPUs are paid for by millions of gamers.
Even Intel gave up on Xeon Phi, because targeting only the HPC market is not enough to pay for its development costs.