It has become a habit to sell graphics cards on the theoretical number of TFLOPS they can deliver. The metric may make sense for comparing models within a single architecture, but it does not allow accurate comparisons with previous architectures, let alone with other brands.
What does the TFLOPS of a GPU measure?
The term FLOPS stands for Floating Point Operations Per Second; the T prefix is tera, so one TFLOPS is 10^12 such operations per second. It is therefore a rate, since it measures work with respect to time.
But what is an operation? Many people confuse it with an instruction, but the two are not the same. An instruction must pass through the entire instruction cycle before it is complete. An operation, on the other hand, is what an execution unit does in each clock cycle, regardless of whether an instruction has been completed or not.
The FLOPS rate we are given assumes conditions that are totally unrealistic compared with how a program actually executes:
- It is based on the incessant repetition of the instruction that takes the fewest cycles and performs the most operations per clock cycle: FMA (fused multiply-add), which multiplies two numbers and adds a third to the product. No program consists of this alone, shader programs included.
- It assumes the data is always in the registers of the shader unit or compute unit. In practice it often is not, which forces the instruction to be delayed while others are issued ahead of it.
- It ignores the fact that not all instructions take the same number of clock cycles to execute.
- It assumes all shader units are busy and working, with no bubbles or idle time.
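As a rough illustration of where the headline number comes from, the sketch below computes the theoretical peak from the number of shader units, the clock speed and two operations per FMA, and then relaxes the ideal conditions with a deliberately simple model. The function names, the GPU figures and the workload figures are all hypothetical, chosen only to make the arithmetic visible; real behavior depends on far more than these two factors.

```python
def peak_tflops(shader_units, clock_ghz, ops_per_cycle=2):
    """Theoretical peak: every unit completes one FMA (2 operations) every cycle."""
    return shader_units * ops_per_cycle * clock_ghz * 1e9 / 1e12


def estimated_tflops(peak, fma_fraction, utilization):
    """Rough estimate once the ideal conditions are relaxed (illustrative model only).

    fma_fraction: share of issued instructions that are FMA (2 ops);
                  the rest are assumed to perform 1 operation each.
    utilization:  share of cycles in which a unit actually issues an instruction.
    """
    ops_per_instruction = 2 * fma_fraction + 1 * (1 - fma_fraction)
    return peak * utilization * ops_per_instruction / 2


# Hypothetical GPU: 2560 shader units at 2.0 GHz.
peak = peak_tflops(2560, 2.0)
print(f"peak:      {peak:.2f} TFLOPS")
# Hypothetical workload: half the instructions are FMA, units busy 70% of the time.
print(f"estimated: {estimated_tflops(peak, 0.5, 0.7):.2f} TFLOPS")
```

Even with these fairly generous assumed figures, the estimate lands at roughly half the advertised peak, and the model still ignores memory stalls and multi-cycle instructions.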
A gaming GPU does not live on TFLOPS alone
Compute applications rely exclusively on the computing power of the shader units. Rendering graphics in real time, on the other hand, uses the classic 3D pipeline for rasterization; hybrid ray tracing is combined with rasterization and is therefore an extension of that pipeline.
What does that mean? The graphics pipeline has stages that depend on the speed of the previous ones, so the number of TFLOPS a GPU can reach makes no difference if the fixed-function units at other stages are not fast enough to do their job, creating a bottleneck for the later stages.
For example, for generations the fixed-function units that subdivide primitives, a process we know as tessellation, performed much worse on AMD GPUs than on NVIDIA ones. This made AMD GPUs slower in scenes with a high geometric load, and on top of that they had a lower rasterization rate than their rivals.
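The bottleneck idea can be sketched with a toy model in which each pipeline stage has its own maximum rate and the frame rate is capped by the slowest one. This is a simplification (real stages overlap and their cost varies from scene to scene), and the stage names and numbers below are hypothetical.

```python
def bottleneck(stage_rates):
    """Return (stage, rate) for the slowest stage of a pipeline.

    stage_rates maps a stage name to the frames per second that stage
    could sustain on its own for a given scene.
    """
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]


# Hypothetical per-stage limits for a geometry-heavy scene.
scene = {"tessellation": 90.0, "rasterization": 140.0, "shading": 240.0}
print(bottleneck(scene))

# Doubling shader throughput (twice the TFLOPS) leaves the result unchanged.
scene["shading"] *= 2
print(bottleneck(scene))
```

In this model the extra shader throughput is simply wasted, because the fixed-function stage ahead of it sets the pace.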
Cycles per instruction
Another important factor is the number of cycles per instruction (CPI): the number of clock cycles a processor of any type needs to complete a specific instruction. Since not all instructions are the same, neither is the number of cycles they require.
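To see how the instruction mix affects throughput, the following sketch computes a weighted average CPI and the resulting instruction rate for a single execution unit. The mix, the cycle counts and the clock speed are hypothetical values, not measurements of any real GPU.

```python
def average_cpi(mix):
    """Weighted average cycles per instruction.

    mix: list of (fraction_of_instructions, cycles) pairs whose fractions sum to 1.
    """
    return sum(fraction * cycles for fraction, cycles in mix)


def instructions_per_second(clock_hz, cpi):
    """Instructions one execution unit completes per second at a given average CPI."""
    return clock_hz / cpi


# Hypothetical mix: 50% 1-cycle, 30% 4-cycle and 20% 8-cycle instructions.
mix = [(0.5, 1), (0.3, 4), (0.2, 8)]
cpi = average_cpi(mix)
print(f"average CPI: {cpi:.2f}")
print(f"instructions/s at 2 GHz: {instructions_per_second(2e9, cpi):.3e}")
```

A peak FLOPS figure implicitly assumes a CPI of 1 for every instruction, so any mix with slower instructions lowers the real rate proportionally.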
The great advantage of GPUs over CPUs is that no software is distributed in their native machine code. This allows vendors to change the ISA even in minor revisions: adding instructions that take fewer cycles, removing less efficient ones, and so on.
Keep in mind that programs written in a shader language are compiled by the driver at run time, the first time they are used on the GPU.