NVIDIA Lovelace, possible organization and architecture of this GPU

The NVIDIA Lovelace architecture is not expected to be released until 2022, reportedly on the TSMC 5nm node, although this last detail is not yet confirmed: the TSMC 7nm node was also rumored for the current RTX 3000, and in the end NVIDIA used Samsung’s 8nm node. In any case, for the moment it is a big unknown, and the little that we know raises some doubts that have not been taken into account, in particular regarding the configuration of certain parts and the bandwidth.

What we know about the NVIDIA Lovelace architecture

Of the little we know about the future NVIDIA Lovelace, the most surprising part is its configuration, which comes from the NVIDIA insider Kopite7kimi, who, as we recall, is one of the most reliable sources after revealing the specifications of the RTX 3000, based on the GeForce Ampere architecture, one year in advance. His information on GeForce Lovelace? The high-end AD102 chip will have a 12 GPC configuration instead of the GA102’s 7 GPC.

GPU	CUDA cores
GM200	2816
GP102	3584
TU102	4608
GA102	10752
AD102?	18432

What Kopite refers to is the number of GPCs and the number of TPCs per GPC. There are 2 SMs (or Compute Units) per TPC and a total of 128 CUDA cores per SM in the case of GeForce Ampere, which should be inherited by Lovelace. This gave rise to the news of a configuration with 18,432 CUDA cores, or FP32 ALUs, a number that would be the biggest jump in this regard compared to previous generations of NVIDIA GPUs.
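The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: the helper name is hypothetical, and the value of 6 TPCs per GPC is an assumption that is not stated in the leak but follows from the GA102 figures (10752 / (7 × 2 × 128) = 6).

```python
# Hypothetical helper: CUDA core count from the GPC > TPC > SM hierarchy.
SM_PER_TPC = 2       # 2 SMs per TPC
CORES_PER_SM = 128   # FP32 ALUs (CUDA cores) per SM in GeForce Ampere
TPC_PER_GPC = 6      # assumption, implied by GA102: 10752 / (7 * 2 * 128)


def cuda_cores(gpc_count, tpc_per_gpc=TPC_PER_GPC):
    """Return the total CUDA cores for a chip with gpc_count GPCs."""
    return gpc_count * tpc_per_gpc * SM_PER_TPC * CORES_PER_SM


ga102 = cuda_cores(7)    # current high-end Ampere chip
ad102 = cuda_cores(12)   # rumored Lovelace configuration
print(ga102, ad102, round(ad102 / ga102, 2))
```

With these assumptions the 7 GPC and 12 GPC configurations give the 10752 and 18432 cores shown in the table, a jump of roughly 1.7 times.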

What’s wrong with these specs? The problem lies mainly in the bandwidth, which is tied to power consumption: increasing the number of GPCs also increases the number of certain units, which will have to be cut back to make this configuration possible.

A hypothesis with feet of clay

The first thing we need to take into account is that a GPU’s configuration changes with the VRAM assigned to it, and we have no indication that NVIDIA is going to use any special type of memory to feed a GPU with a total of 12 Lovelace GPCs. It is true that, when it comes to GPU size, they can’t go much further than the GA102, which almost hits the size limit allowed by Samsung’s 8nm node, and that the 5nm node could double the number of transistors, but the question is: does a memory exist that can feed such a large number of GPCs? We’re talking about almost doubling the bandwidth, which cannot be done with GDDR6X.
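To put a rough number on that gap, here is a minimal sketch. It rests on two assumptions that are ours and not part of the leak: a 384-bit GDDR6X bus running at 19.5 Gbps per pin, as on the RTX 3090, and a bandwidth demand that grows in proportion to the number of GPCs. The function name is hypothetical.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bytes per transfer times per-pin rate."""
    return bus_width_bits / 8 * data_rate_gbps


# Assumption: 384-bit GDDR6X at 19.5 Gbps per pin (RTX 3090 figures).
current = bandwidth_gb_s(384, 19.5)

# Assumption: demand scales linearly with GPC count (7 GPCs -> 12 GPCs).
needed = current * 12 / 7

print(f"available: {current:.0f} GB/s, needed: {needed:.0f} GB/s")
```

Under these assumptions the current bus delivers 936 GB/s, while 12 GPCs would ask for around 1,600 GB/s.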

It is possible that NVIDIA will use the FG-DRAM it has been researching, but this type of memory seems to be aimed more at the HPC market than at the home market, if NVIDIA ever turns it into a product. NVIDIA will therefore have to settle for GDDR6X, which can become a huge bottleneck.

Continuing with an organization like that of GeForce Ampere means that the load on the VRAM will increase considerably, not only from the SMs but also from the fixed-function units, in particular the texture units and the ROPs, which are the units of this type that read and write the most data from and to VRAM, and whose number would increase significantly in the new configuration.

Possible modifications of ROPs and texture units in Lovelace

If we count the ROPs in the current GA102, we will see that there is a total of 112 ROPs across the 7 GPCs of said GPU, or 16 ROPs per GPC. A configuration of 12 GPCs with 16 ROPs each would raise that amount to 192 ROPs. Since the ROPs are the units that write the finished pixels into VRAM, the necessary bandwidth would nearly double. The solution? Reduce the number of ROPs per GPC from 16 to 8, for a total of 96 ROPs, a figure better suited to a 384-bit GDDR6X bus.
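The three ROP counts discussed above can be checked with a short sketch. The function name is hypothetical, and the 8 ROPs per GPC option is our speculation, not a confirmed specification.

```python
def total_rops(gpc_count, rops_per_gpc):
    """Return the total number of ROPs for a given GPC configuration."""
    return gpc_count * rops_per_gpc


ga102 = total_rops(7, 16)          # current GA102
ad102_full = total_rops(12, 16)    # 12 GPCs keeping 16 ROPs per GPC
ad102_halved = total_rops(12, 8)   # speculated: 8 ROPs per GPC

print(ga102, ad102_full, ad102_halved)
print(round(ad102_full / ga102, 2))  # growth in ROPs if nothing is cut
```

Keeping 16 ROPs per GPC would multiply the ROP count by about 1.7, while halving them per GPC leaves the total slightly below the current GA102.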

The other point concerns the texture units. In Maxwell the ratio was 16 FP32 ALUs per texture unit. Since texture units come in groups of 4 in the SMs, this makes a total of 64 FP32 ALUs within the SMs of these architectures. Ampere broke this ratio, because at certain times its configuration can reach 128 ALUs per SM. What do we believe? We believe that the change will be from a configuration of 4 sub-cores per SM to 8 sub-cores per SM. In this way the load that the texture units place on the VRAM will be much lower than if their number grew along with the ALUs, but the ratio will rise to 64 FP32 ALUs, or CUDA cores, per texture unit.
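The ratios in this paragraph can be worked out as follows. This is a sketch under assumptions: the function name is hypothetical, and the figure of 32 FP32 ALUs per sub-core is taken from Ampere (128 ALUs split across 4 sub-cores) and assumed to carry over to Lovelace.

```python
TEXTURE_UNITS_PER_SM = 4   # texture units come in groups of 4 per SM
ALUS_PER_SUB_CORE = 32     # assumption: Ampere's 128 ALUs / 4 sub-cores


def alus_per_texture_unit(alus_per_sm, texture_units=TEXTURE_UNITS_PER_SM):
    """Return how many FP32 ALUs share each texture unit in one SM."""
    return alus_per_sm // texture_units


classic = alus_per_texture_unit(64)                     # 64 ALUs per SM
ampere = alus_per_texture_unit(4 * ALUS_PER_SUB_CORE)   # 4 sub-cores per SM
lovelace = alus_per_texture_unit(8 * ALUS_PER_SUB_CORE) # speculated 8 sub-cores

print(classic, ampere, lovelace)
```

With 4 texture units per SM, the ratio goes from 16 to 32 in Ampere's best case, and to 64 in the speculated 8 sub-core layout.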

The conclusion we draw from this, in terms of the changes needed, is that the successor to the RTX 3090 or RTX 3080 Ti would dramatically increase its FLOPS rate, but its texture and pixel fill rates would not grow in the same proportion, which makes sense in the transition to GPUs that are more and more focused on ray tracing.