NVIDIA DLSS, the GPU performance solution or just a myth?

The Boss

If we must speak of the two spearheads of NVIDIA for its GeForce RTX cards, they are clearly Ray Tracing and DLSS. The first is no longer an exclusive advantage since its implementation in AMD's RDNA 2, but the second remains a differentiator and gives NVIDIA a great advantage. However, not everything is as it seems at first glance.

DLSS on RTX depends on the Tensor Cores

The first thing we need to consider is how the different algorithms commonly referred to as DLSS take advantage of the GPU hardware, and there is no better way to do so than to analyze how the GPU works while rendering a frame with DLSS active and without it.

The two screenshots above these lines correspond to the use of NVIDIA's Nsight tool, which measures the usage of each part of the GPU over time. To interpret the graphs, keep in mind that the vertical axis corresponds to the level of use of that part of the GPU and the horizontal axis to the time in which the frame is rendered.

As you can see, the difference between the two Nsight screenshots is that one shows the usage level of each part of the GPU while using DLSS and the other does not. What is the difference? If we look closely, we will see that in the one corresponding to the use of DLSS, the graph for the Tensor Cores is flat except at the very end, which is when these units are activated.

DLSS is nothing more than a super-resolution algorithm: it takes an image at a given input resolution and produces a higher-resolution version of the same image. This is why the Tensor Cores are activated last when DLSS is applied, as they require the GPU to render the image first.

DLSS Operation on NVIDIA RTX

RTX 3070 3080 Ti

DLSS takes up to 3 milliseconds to scale an image, regardless of the frame rate at which the game is running. If, for example, we want to apply DLSS in a game running at 60 Hz, then the GPU will have to resolve each frame in:

(1000 ms / 60) − 3 ms

In other words, in about 13.6 ms. In return, we get a higher frame rate at the output resolution than if the GPU rendered that resolution natively.
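As a quick illustration, here is a minimal Python sketch of that budget calculation; the 3 ms figure is the DLSS cost quoted above and 60 Hz is just the example refresh rate, not a fixed requirement.

```python
# Frame-time budget left for rendering when DLSS is applied,
# assuming the ~3 ms DLSS cost quoted above.
DLSS_COST_MS = 3.0

def render_budget_ms(target_hz: float) -> float:
    """Time the GPU has to render each frame before DLSS runs."""
    return (1000.0 / target_hz) - DLSS_COST_MS

print(render_budget_ms(60))  # ~13.67 ms, matching the figure in the text
```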

DLSS Example of operation

Suppose we have a scene that we want to render at 4K. For this we have an unspecified GeForce RTX which at that resolution reaches 25 frames per second, so it renders each frame in 40 ms. We also know that the same GPU can reach 50 frames per second at 1080p, that is, 20 ms per frame. Our hypothetical GeForce RTX takes about 2.5 ms to go from 1080p to 4K, so if we enable DLSS to get a 4K image from a 1080p image, each frame with DLSS will take 22.5 ms. With this we can render the scene at around 44 frames per second, well above the 25 frames per second obtained at native resolution.
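Expressed as a short sketch, using the hypothetical timings from the example above (they are illustrative numbers, not measurements):

```python
# Hypothetical timings from the worked example above (not measurements).
NATIVE_4K_MS = 40.0      # 25 fps at native 4K
RENDER_1080P_MS = 20.0   # 50 fps at 1080p
DLSS_UPSCALE_MS = 2.5    # time to upscale 1080p -> 4K on this GPU

frame_with_dlss_ms = RENDER_1080P_MS + DLSS_UPSCALE_MS   # 22.5 ms
fps_native = 1000.0 / NATIVE_4K_MS                       # 25 fps
fps_dlss = 1000.0 / frame_with_dlss_ms                   # ~44 fps

print(f"Native 4K: {fps_native:.0f} fps, 1080p + DLSS: {fps_dlss:.0f} fps")
```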

On the other hand, if the GPU takes more than 3 milliseconds to scale the image, then DLSS will not be enabled, as this is the time limit NVIDIA sets in its RTX GPUs for applying the DLSS algorithms. This limits the resolutions at which low-end GPUs can run DLSS.

DLSS benefits from the high speed of the Tensor cores

The Tensor cores are essential for running DLSS. Without them it could not be done at the speed at which it runs on the NVIDIA RTX, since the algorithm used to perform the resolution increase is what we call a convolutional neural network. We will not go into detail in this article; suffice it to say that these networks rely on a large number of matrix multiplications, and tensor units are ideal for working with number matrices because they are the type of unit that performs these operations the fastest.
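To give a feel for the kind of work involved, here is a minimal, hypothetical sketch of the sort of half-precision matrix multiplication a convolutional network is built on. It is plain NumPy running on the CPU, not the actual DLSS network or the Tensor Core path.

```python
import numpy as np

# A convolutional network reduces largely to matrix multiplications like this.
# Tensor Cores accelerate exactly this kind of FP16 matrix math in hardware;
# here we only emulate the data type on the CPU with NumPy.
a = np.random.rand(256, 256).astype(np.float16)
b = np.random.rand(256, 256).astype(np.float16)

c = a @ b  # one of the many matrix products behind a single upscaled frame
print(c.shape, c.dtype)
```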

In the case of a movie, today's decoders generate the source image in the frame buffer several times faster than the speed at which it is displayed on screen, so there is plenty of time to scale it and far less computing power is required. In a video game, on the other hand, the next image is not stored on any medium; it has to be generated by the GPU, which reduces the time the scaler has to do its work.

Ampere SM Subcore

Each of these Tensor cores is located inside an SM unit, and depending on the graphics card we use, the compute capacity will vary with the number of SMs per GPU, and therefore the scaled image will be generated in more or less time. Because DLSS intervenes at the end of rendering, high speed is required to apply it, and that is what sets it apart from other super-resolution algorithms such as those used to scale movies and images.
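As a rough, purely illustrative model (an assumption for the sake of the argument, not an NVIDIA formula), the upscale time can be thought of as the same amount of tensor work spread across however many SMs the GPU has:

```python
# Rough illustrative model (assumption, not an NVIDIA formula):
# the same tensor workload spread over more SMs finishes sooner.
def upscale_time_ms(work_ms_on_one_sm: float, num_sms: int) -> float:
    """Idealized scaling: total tensor work divided evenly across SMs."""
    return work_ms_on_one_sm / num_sms

# A GPU with twice the SMs would, in this idealized model, halve the time
# (the 120 ms single-SM workload is an invented, illustrative figure).
print(upscale_time_ms(120.0, 40), upscale_time_ms(120.0, 80))  # 3.0 vs 1.5 ms
```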

Not all NVIDIA RTX cards perform the same with DLSS

DLSS Performance Table

This table is taken from NVIDIA's own documentation, where the input resolution in every case has 4 times fewer pixels than the output resolution, so we are in Performance mode. It should be clarified that there are two additional modes: Quality mode gives better image quality but requires an input resolution with half the pixels of the output, while Ultra Performance mode performs a 9x scaling but has the worst image quality of all.
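As a minimal sketch of what those pixel-count ratios mean for a 4K output (the ratios are the ones described above; treat the mapping as an approximation of the mode definitions):

```python
# Pixel-count ratios between output and input, as described in the text above.
MODE_RATIO = {
    "Quality": 2,           # input has half the pixels of the output
    "Performance": 4,       # input has a quarter of the pixels (e.g. 1080p -> 4K)
    "Ultra Performance": 9, # 9x scaling, lowest image quality
}

def input_pixels(out_width: int, out_height: int, mode: str) -> int:
    """How many pixels the GPU actually renders before DLSS upscales."""
    return (out_width * out_height) // MODE_RATIO[mode]

for mode in MODE_RATIO:
    print(mode, input_pixels(3840, 2160, mode))
```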

As you can see in the table, performance varies not only depending on the resolution, but also on the GPU we are using, which should come as no surprise after what we explained earlier. The fact that in Performance mode an RTX 3090 can go from 1080p to 4K in under 1 ms is impressive to say the least. This has a downside that stems from a logical conclusion: DLSS on more modest graphics cards will always perform less well.

The cause behind this is clear: a GPU with less power will not only need more time to render the image, but also to apply DLSS. Is the solution the Ultra Performance mode, which multiplies the number of pixels by 9? No, since DLSS requires the input image to have sufficient resolution: the more pixels it contains, the more information is available and the more precise the scaling will be.

Geometry, image quality and DLSS

DLSS fragments

GPUs are designed so that in the Pixel/Fragment Shader stage, in which the pixels of each fragment are colored and textures are applied, they work with 2×2 pixel quads. Most GPUs, when they rasterize a triangle, convert it to a block of pixels which is then subdivided into 2×2 pixel blocks, where each block is sent to a compute unit.
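A minimal sketch of that subdivision, assuming a simple axis-aligned block of pixels rather than a real rasterizer with edge and coverage tests:

```python
# Minimal illustration: subdivide a rasterized pixel block into 2x2 quads.
# This ignores triangle edges and coverage; it only shows the grouping.
def split_into_quads(width: int, height: int):
    """Yield the top-left corner of each 2x2 quad covering the block."""
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            yield (x, y)

# An 8x4 block of pixels becomes 8 quads, each dispatched to a compute unit.
print(len(list(split_into_quads(8, 4))))  # 8
```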

The consequences for DLSS? At the lower rendering resolution the raster unit tends to discard fragments that are too small to fill a 2×2 quad, which often correspond to distant details. This means that details that would be perfectly visible at native resolution do not appear in the image obtained via DLSS, because they were never in the lower-resolution image that gets upscaled.

Since DLSS requires an input image with as much information as possible as a reference, it is not an algorithm designed to generate very high-resolution images from very low ones, as detail is lost in the process.

And what about AMD, can it emulate DLSS?

FidelityFX Super Resolution

Rumors about FidelityFX Super Resolution have been circulating the network for months, but AMD has yet to give us a real-world example of how its DLSS counterpart works. What makes AMD's life so difficult? The fact that the Tensor Cores are crucial for DLSS and the AMD RX 6000 has no equivalent units; instead it uses SIMD over register, or SWAR, in the ALUs of its compute units to get higher throughput in lower-precision FP16 formats. But a SIMD unit is not a systolic array, that is, a tensor unit.

Right off the bat we are talking about a 4x differential in favor of NVIDIA, which means that any similar solution starts from a considerable speed disadvantage, optimizations for matrix calculation aside. We are not discussing whether NVIDIA is better than AMD in this area, but the fact that AMD, when designing RDNA 2, did not give importance to tensor units.

CDNA calculation unit

Is it due to a lack of capability? No, since paradoxically AMD added such units to CDNA under the name Matrix Core. It is still early to talk about RDNA 3, but let's hope AMD does not make the same mistake again by leaving these units out. It makes no sense to do without them when the cost per compute unit or SM is only 1 mm².

So we expect that when AMD releases its algorithm, the lack of tensor units will keep it from matching NVIDIA's precision and speed, and that AMD will instead present a simpler solution, such as a Performance mode that doubles the pixels on screen.
