The fastest RTX 40 could be 5 times more powerful than the RTX 3090 Ti

Just over a week ago, on Monday of last week to be precise, we addressed a controversial and entirely speculative subject: three hypotheses about the changes NVIDIA could make to the organization and internal structure of the Ada Lovelace architecture, and how they would affect the RTX 40. Today a leak reveals which direction Huang will take and, more importantly, what performance the RTX 40 could have.

All three hypotheses share the same premise: the SM will change in Ada Lovelace. As we anticipated, the architecture will have little in common with what has been seen in Hopper, which confirms that NVIDIA is taking two entirely different approaches for the two architectures and that the next step is clearly an MCM chip design.

Ada Lovelace’s internal changes for the RTX 40

The leaker Kopite7kimi is on the prowl again, and the leak just revealed contains one of the hypotheses we considered last week. Specifically, the improvements to the architecture that will power the RTX 40 concern an internal reorganization of the FP32 and INT32 units, where NVIDIA’s decision is the most logical and perhaps the least risky: to combine all shaders into a single engine that handles both integers and floats.

That is, there would be a single group of full shaders for FP32 and INT32, which could produce a higher-than-expected, eye-catching shader count that translates less well into real performance, as already happens with the RTX 30.

1. Double the sub-core to improve the efficiency of 2*FP32.
2. There is a 4*FP32 expansion space.
This is my opinion on ADA. pic.twitter.com/HAt48SP5RT

– kopite7kimi (@kopite7kimi) May 5, 2022

To understand the changes, we have to go back to Pascal versus Turing, since that is where the first change took place: NVIDIA gave up integer performance to promote FP32 in every SM. Ampere then left behind the split of 16 FP32 operations and 16 INT32 operations per clock cycle that Turing had and returned to a unified design capable of 32 operations per cycle for both. This gave rise to the controversy over the “false” shader count, since NVIDIA doubled the number of operations, yes, but not the number of shaders as such.
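The double counting can be illustrated with a back-of-the-envelope calculation. The sketch below assumes the usual peak-throughput formula (lanes × 2 FLOPs per fused multiply-add × clock) and uses NVIDIA's published RTX 3090 Ti figures of 84 SMs and a 1.86 GHz boost clock, which are not taken from this article. The 64 dedicated plus 64 shared FP32/INT32 lanes per SM follow from the 16 + 16 split per sub-core described above, across four sub-cores. The function name is hypothetical.

```python
def peak_fp32_tflops(sm_count, fp32_lanes_per_sm, boost_ghz):
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumes every lane retires one fused multiply-add (2 FLOPs) per clock,
    which is an upper bound and not a measure of real-world performance.
    """
    return sm_count * fp32_lanes_per_sm * 2 * boost_ghz / 1000.0

# Assumed RTX 3090 Ti figures (public spec sheet, not from the article).
SM_COUNT = 84
BOOST_GHZ = 1.86

# Counting only the 64 lanes per SM that are always available for FP32.
dedicated_only = peak_fp32_tflops(SM_COUNT, 64, BOOST_GHZ)

# Counting all 128 lanes, including the 64 that are shared with INT32.
with_shared = peak_fp32_tflops(SM_COUNT, 128, BOOST_GHZ)

print(f"Dedicated FP32 lanes only: {dedicated_only:.1f} TFLOPS")
print(f"Including shared lanes:    {with_shared:.1f} TFLOPS")
```

Under these assumptions the headline figure of roughly 40 TFLOPS is only reached when the shared lanes are doing no integer work, which is the root of the "false" count argument.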

Fastest RTX 40 performance

The next step now is to unify the two engines into one with a very clear goal: to improve efficiency. Logically, there will be no FP64, but there will be a dedicated group of FP32 and INT32 units that is also expandable, and this is the really interesting part.

Although the diagram shows only one group of these units, a closer look reveals two; they are unified in function rather than in number. The information leaked today reveals that these two groups could grow to as many as four and, given the ability of the floating-point and integer units to operate at the same time, the speculation points to a huge 100 TFLOPS at worst and up to 200 TFLOPS at best.

This idea is based on some information that I cannot tell you now.
So 100T, 150T or 200TFLOPS is possible.

– kopite7kimi (@kopite7kimi) May 5, 2022

To put this in context, an RTX 3090 Ti currently reaches 40 TFLOPS, and that figure already uses the double-counting system discussed above. If NVIDIA used two unified FP32 and INT32 groups, the supposed RTX 4090 would therefore be more than twice as fast as the company’s current top-of-the-range card, while with four of them performance would rise to as much as 5 times.
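The ratios behind these claims are simple to check. The following sketch divides each of the three leaked figures (100, 150 and 200 TFLOPS) by the 40 TFLOPS quoted for the RTX 3090 Ti; the variable and function names are illustrative only, and the results compare theoretical peaks, not measured frame rates.

```python
# Theoretical FP32 peak quoted in the article for the RTX 3090 Ti.
BASELINE_TFLOPS = 40.0

# The three scenarios mentioned in the leak.
LEAKED_SCENARIOS_TFLOPS = [100.0, 150.0, 200.0]

def speedup(target_tflops, baseline_tflops=BASELINE_TFLOPS):
    """Ratio of two theoretical peaks; says nothing about real games."""
    return target_tflops / baseline_tflops

ratios = {t: speedup(t) for t in LEAKED_SCENARIOS_TFLOPS}

for tflops, ratio in ratios.items():
    print(f"{tflops:.0f} TFLOPS -> {ratio:.2f}x the RTX 3090 Ti")
```

The lowest scenario gives 2.5x, consistent with "more than twice as fast", and the highest gives the 5x in the headline.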

Logically, this would imply a monstrous chip size, so we are unlikely to see it, but it does indicate that NVIDIA has an ace up its sleeve, perhaps not for Ada Lovelace but for its successors.