Instruction latencies are not something GPU manufacturers usually publish in their public specifications, since they are mainly of interest to game and application developers. That is why the general public tends to overlook them. Hardware enthusiasts, however, are always interested in this kind of information.
Latency of instructions on a GPU
Instruction latency in a processor is the time, measured in clock cycles, between an execution unit requesting the data it needs for an operation and that data arriving. If the data is not in the registers, the fetch mechanism has to search down through the cache hierarchy, level by level, until it finds it.
Since GPUs contain far more cores than CPUs, they generate a huge number of requests to VRAM. Their cores are therefore organized differently and are designed around keeping many contexts, or threads, in flight rather than around fast serial execution. This lets a core switch to another context while one thread waits for its data.
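The idea of hiding latency by switching contexts can be put into rough numbers. The sketch below uses a deliberately simplified model that is not from the measurements discussed here: each context does a fixed amount of work, then stalls on one memory access. The function name and the figures of 400 and 20 cycles are hypothetical, chosen only for illustration.

```python
import math


def contexts_to_hide_latency(memory_latency_cycles: int,
                             work_cycles_per_context: int) -> int:
    """Estimate how many resident contexts keep one execution unit busy.

    Simplified model (an assumption): every context computes for
    `work_cycles_per_context` cycles and then waits
    `memory_latency_cycles` cycles for its data. While one context
    waits, the others must supply enough work to cover that wait.
    """
    if work_cycles_per_context <= 0:
        raise ValueError("work_cycles_per_context must be positive")
    if memory_latency_cycles < 0:
        raise ValueError("memory_latency_cycles must not be negative")
    # One context is waiting; the rest must cover its whole stall.
    return 1 + math.ceil(memory_latency_cycles / work_cycles_per_context)


if __name__ == "__main__":
    # Hypothetical figures: 400-cycle memory stall, 20 cycles of work.
    print(contexts_to_hide_latency(400, 20))  # prints 21
```

Under this model, the longer the memory latency or the less work each thread does between accesses, the more threads a core needs in flight, which is why a GPU that runs short of pending work becomes sensitive to latency.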
Even so, instruction latency still matters on a GPU, although less than on a CPU, because a GPU does not always have enough pending work to stay busy while threads wait. In addition, some stages of the 3D pipeline, such as pixel (fragment) shaders, access VRAM continuously and need low latency for most of their threads to finish.
AMD RX 6000 vs. NVIDIA RTX 30 in latencies
The Chips and Cheese website measured the latencies of the latest gaming GPUs, NVIDIA RTX 30 and AMD RX 6000, using a pointer chasing test written in OpenCL. This test walks through a block of data in which each element holds the location of the next one to read, so every access has to wait for the previous one to finish. Depending on its size, the block fits in a different cache level of the GPU or spills out to VRAM, so the average access time measured for each block size reveals the latency of the level that holds it.
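To make the test more concrete, here is a minimal sketch of a chain of dependent accesses, written in Python and run on the CPU. It is not the Chips and Cheese code, which runs as an OpenCL kernel on the GPU, and all function names are hypothetical. In Python the interpreter overhead dominates the timings, so the absolute numbers say little about the hardware; the point is only the structure of the test.

```python
import random
import time


def build_chain(n: int, seed: int = 0) -> list:
    """Return a list where chain[i] is the index to visit after i.

    Sattolo's algorithm yields a permutation that forms one single
    cycle, so the walk touches every slot before it repeats.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i keeps everything in one cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain: list, steps: int) -> int:
    """Follow the chain; each access depends on the previous result."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]
    return idx


def ns_per_access(n: int, steps: int = 200_000) -> float:
    """Average time per dependent access for a block of n elements."""
    chain = build_chain(n)
    start = time.perf_counter_ns()
    chase(chain, steps)
    return (time.perf_counter_ns() - start) / steps


if __name__ == "__main__":
    # Growing block sizes spill out of successively larger caches.
    for n in (1 << 10, 1 << 14, 1 << 18):
        print(f"{n:>8} elements: {ns_per_access(n):.1f} ns per access")
```

The random single-cycle order matters: a sequential walk would let hardware prefetchers guess the next address, and a permutation with short cycles would keep revisiting a small part of the block, so neither would measure the latency of the intended cache level.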
The test shows how AMD changed its cache hierarchy in the RDNA architectures, since this was one of the areas where its previous graphics architecture, GCN, was well behind NVIDIA. Keep in mind that, besides the Infinity Cache, the AMD architecture has three cache levels: L0 in the compute unit, L1 in the shader array, and then L2. The Infinity Cache adds only about 20 nanoseconds of access time on top of L2.
As for NVIDIA, its cache structure has not evolved since Maxwell (GTX 900). On the RTX 30, going from the cache inside the SM to the L2 cache takes 100 nanoseconds, while on AMD the equivalent path to L2 takes only 66 ns, even with an extra cache level in between. One possible cause is the much larger size of NVIDIA's GPUs compared to AMD's. The cache is therefore one of the improvements NVIDIA could have made in Lovelace.
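Latency is defined in clock cycles but these results are given in nanoseconds, so it can help to convert between the two. The sketch below does so for the two L2 figures quoted above. The 2.0 GHz clock is an assumption made purely for illustration, not the measured clock of either GPU, and the function name is hypothetical.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles.

    A clock of 1 GHz completes one cycle per nanosecond, so the
    conversion is a plain multiplication.
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return latency_ns * clock_ghz


if __name__ == "__main__":
    ASSUMED_CLOCK_GHZ = 2.0  # hypothetical, not a measured GPU clock
    for label, latency_ns in (("RTX 30 L2", 100.0), ("RX 6000 L2", 66.0)):
        cycles = ns_to_cycles(latency_ns, ASSUMED_CLOCK_GHZ)
        print(f"{label}: {latency_ns:.0f} ns = {cycles:.0f} cycles")
```

Because the two GPUs run at different real clocks, the gap in cycles would differ from the gap in nanoseconds, which is one reason to compare nanoseconds when looking across vendors.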