The architecture discussed in this article is not yet available on the market; it has not even been announced. Rather, it is the product of an analysis of the advances made in recent years, together with the various multi-GPU chiplet patents that AMD, NVIDIA and Intel have filed over the past two years. We have decided to gather this information and synthesize it so that you get a feel for how these types of GPUs work and which graphics problems they solve.
Traditional 3D rendering with multiple GPUs
Using multiple graphics cards and combining their power to render each frame in 3D video games is nothing new; ever since 3dfx's Voodoo 2 it has been possible to distribute the rendering work, totally or partially, across several graphics cards. The most common method is Alternate Frame Rendering, where the CPU sends the display list for each frame to each GPU in turn. For example, GPU 1 handles frames 1, 3, 5 and 7, while GPU 2 handles frames 2, 4, 6, 8, and so on.
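As a rough illustration of the scheduling involved, here is a minimal sketch in C++, with hypothetical types rather than any real driver API (SLI and CrossFire do this internally): Alternate Frame Rendering boils down to round-robin frame assignment.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-ins: a frame's command list and a GPU that consumes it.
struct DisplayList { int frame; };

struct Gpu {
    int id;
    void submit(const DisplayList& dl) {
        std::printf("GPU %d renders frame %d\n", id, dl.frame);
    }
};

// Alternate Frame Rendering: frame N goes to GPU ((N - 1) mod gpuCount).
void submitFrameAFR(std::vector<Gpu>& gpus, const DisplayList& dl) {
    gpus[(dl.frame - 1) % gpus.size()].submit(dl);
}

int main() {
    std::vector<Gpu> gpus{{1}, {2}};   // two cards, as in the example above
    for (int f = 1; f <= 8; ++f) submitFrameAFR(gpus, DisplayList{f});
}
```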
There is another way to render a 3D scene, Split Frame Rendering, in which multiple GPUs render the same frame and divide the work, with the following nuances: one GPU acts as the master GPU, reading the display list and managing the rest. The early stages of the pipeline, prior to rasterization, run exclusively on this first GPU, while rasterization and the subsequent stages are distributed equally across all GPUs.
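Here is a comparable sketch of Split Frame Rendering under the same assumptions (hypothetical names; real drivers handle this internally): the screen is cut into horizontal strips and each GPU rasterizes only its own strip, the geometry work having already run on the master GPU.

```cpp
#include <cstdio>

// Split Frame Rendering: each GPU rasterizes one horizontal strip of the frame.
// The geometry stages are assumed to have run on GPU 0, the master.
struct Strip { int yStart, yEnd; };

Strip stripForGpu(int gpuIndex, int gpuCount, int frameHeight) {
    int rows = frameHeight / gpuCount;
    int y0 = gpuIndex * rows;
    // The last GPU takes any leftover rows.
    int y1 = (gpuIndex == gpuCount - 1) ? frameHeight : y0 + rows;
    return {y0, y1};
}

int main() {
    const int gpus = 2, height = 1080;
    for (int g = 0; g < gpus; ++g) {
        Strip s = stripForGpu(g, gpus, height);
        std::printf("GPU %d rasterizes rows [%d, %d)\n", g, s.yStart, s.yEnd);
    }
}
```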
Split Frame Rendering may seem like a fair way to distribute the work; however, we will now see what issues this method involves and what its limitations are.
The limits of Split Frame Rendering and the potential solution
Each GPU contains two sets of DMA engines. The first pair can simultaneously read and write data in system RAM through the PCI Express port, but many graphics cards that support CrossFire or SLI include a second set of DMA engines, which gives access to the VRAM of the other graphics card. Of course, only at the speed of the PCI Express port, which is a real bottleneck.
Ideally, all the GPUs working together would share the same VRAM, but this is not the case. Instead, the data is duplicated as many times as there are graphics cards involved in the rendering, which is grossly inefficient. Add to this the way graphics cards work when rendering 3D graphics in real time, and the result is that configurations with multiple graphics cards have fallen out of use.
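To put illustrative numbers on both problems (the figures below are typical examples, not measurements of any specific product): two 16 GB cards still behave like a single 16 GB pool, and reaching the other card's VRAM is an order of magnitude slower than local memory.

```cpp
#include <cstdio>

int main() {
    const int    cards       = 2;
    const double vramPerCard = 16.0;   // GB, illustrative card
    const double localBW     = 448.0;  // GB/s, e.g. 256-bit GDDR6 at 14 Gbps
    const double pcieBW      = 32.0;   // GB/s, roughly PCIe 4.0 x16 per direction

    // Every asset is replicated on each card, so capacity does not add up.
    std::printf("Installed VRAM: %.0f GB, usable pool: %.0f GB\n",
                cards * vramPerCard, vramPerCard);
    // Reading the peer card's VRAM crawls over PCI Express.
    std::printf("Local VRAM: %.0f GB/s vs peer over PCIe: %.0f GB/s (%.0fx slower)\n",
                localBW, pcieBW, localBW / pcieBW);
}
```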
Caching tiles on a multi-GPU by chiplets
The concept of Tile Caching began to be used with NVIDIA's Maxwell architecture and AMD's Vega architecture. It borrows some concepts from tile-based rendering, with the difference that, instead of rendering each tile in a separate memory, the work is done in the second-level cache. The upside is that it saves power on certain graphics operations; the downside is that it depends on how much last-level cache the GPU has.
The problem is that a cache does not behave like conventional memory: at any time, and outside the program's control, a cache line can be evicted to the next level of the memory hierarchy. What if we applied the same functionality to a chiplet-based GPU? That is where the extra level of cache comes in. In the new paradigm, each GPU's own last-level cache is no longer used as the memory for tile caching; instead, the multi-GPU's shared last-level cache, located on a separate chip, takes over that role.
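A toy sketch of the tile-caching constraint, with purely illustrative numbers: the tile size has to be chosen so that a tile's color and depth footprint fits in whatever cache budget the GPU has, which is exactly why cache capacity sets the limit.

```cpp
#include <cstddef>
#include <cstdio>

// Pick the largest power-of-two tile side whose render-target footprint
// (bytesPerPixel per pixel) still fits in the cache budget.
int largestTileSide(std::size_t cacheBudgetBytes, std::size_t bytesPerPixel) {
    std::size_t side = 1;
    while ((2 * side) * (2 * side) * bytesPerPixel <= cacheBudgetBytes)
        side *= 2;
    return static_cast<int>(side);
}

int main() {
    // Illustrative numbers: 4 MB of L2 set aside for tile caching,
    // 8 bytes per pixel (32-bit color + 32-bit depth).
    std::printf("Tile side: %d px\n", largestTileSide(4u * 1024 * 1024, 8));
}
```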
LLC on a Multi-GPU by chiplets
The last-level cache (LLC) of a multi-GPU by chiplets brings together a number of common characteristics that are independent of the manufacturer. The following list therefore applies to any GPU of this type, whoever builds it.
- It is not located in any of the GPU chiplets; it is external to them, on a separate chip.
- It uses an interposer with a very high speed interface, such as a silicon bridge or TSV interconnects, to communicate with the L2 cache of each GPU.
- The required bandwidth does not allow conventional interconnects, so it is only feasible in a 2.5D IC configuration.
- The chiplet where the last-level cache sits not only stores that memory; it also houses the entire VRAM access mechanism, which is thus decoupled from the rendering engine (see the sketch after this list).
- Its bandwidth is much higher than that of HBM memory, which is why it requires more advanced 3D interconnect technologies that allow much higher bandwidth.
- In addition, like any last-level cache, it provides coherence to all the elements that are its clients.
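To make the last two points concrete, here is a toy model, with hypothetical names and no claim to match any vendor's design, of how every chiplet's memory requests could funnel through the LLC chiplet, which acts as the single point of coherence and the only path to VRAM.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Toy model: every memory request from every GPU chiplet goes through the
// shared LLC chiplet, which owns both the cached lines and the VRAM access
// mechanism, decoupled from the rendering engines.
struct LlcChiplet {
    std::unordered_map<std::uint64_t, std::uint32_t> lines; // cached lines

    std::uint32_t readVram(std::uint64_t addr) {
        return static_cast<std::uint32_t>(addr * 2654435761u); // fake VRAM data
    }

    std::uint32_t load(int chipletId, std::uint64_t addr) {
        auto it = lines.find(addr);
        if (it == lines.end()) {                 // miss: fetch from VRAM once
            it = lines.emplace(addr, readVram(addr)).first;
            std::printf("chiplet %d: miss at 0x%llx -> VRAM\n",
                        chipletId, static_cast<unsigned long long>(addr));
        }
        return it->second;  // every chiplet sees the same coherent copy
    }
};

int main() {
    LlcChiplet llc;
    llc.load(0, 0x1000);  // chiplet 0 misses and fills the line
    llc.load(1, 0x1000);  // chiplet 1 hits that same line: no duplication
}
```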
Thanks to this cache, each GPU no longer needs its own private VRAM; instead they all share one, which greatly reduces data duplication and eliminates the bottlenecks that communication produces in a classic multi-GPU.
Master and subordinate GPUs
In a graphics card based on a multi-GPU by chiplets, the same configuration as in a classic multi-GPU still exists when the display list is created: a single list is generated and received by the first GPU, which is responsible for managing the rest of the GPUs. The big difference is that the LLC chiplet discussed in the previous section allows the first GPU to coordinate with, and send tasks to, the rest of the processing units of the multi-GPU.
An alternative solution is for all the chiplets in the multi-GPU to lack a command processor entirely, and for it to sit instead on the same chip as the LLC, acting as a conductor and taking advantage of all the communication infrastructure to send the different instruction streams to the different parts of the GPU.
In this second case, we would not have a master GPU with the rest as subordinates; the whole 2.5D IC would be a single GPU which, instead of being monolithic, is made up of multiple chiplets.
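A sketch of this second arrangement, again with invented names: a single command processor sitting alongside the LLC dispatches work to all compute chiplets over the shared fabric, so software sees one GPU.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical model: no master GPU; a command processor on the LLC chiplet
// hands work out to every compute chiplet in the 2.5D IC.
struct Task { int id; };

struct ComputeChiplet {
    int id;
    void run(const Task& t) { std::printf("chiplet %d runs task %d\n", id, t.id); }
};

struct CommandProcessor {
    std::vector<ComputeChiplet>* chiplets;
    int next = 0;
    void dispatch(const Task& t) {   // round-robin across the whole package
        (*chiplets)[next].run(t);
        next = (next + 1) % static_cast<int>(chiplets->size());
    }
};

int main() {
    std::vector<ComputeChiplet> chiplets{{0}, {1}, {2}, {3}};
    CommandProcessor cp{&chiplets};
    for (int i = 0; i < 8; ++i) cp.dispatch(Task{i});
}
```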
Its importance for ray tracing
One of the most important points for the future is ray tracing, which requires the system to build a spatial data structure over the scene's object information in order to represent the transport of light. It has been shown that when this structure sits close to the processor, ray tracing speeds up significantly.
Of course, this structure is complex and takes up a lot of memory. That is why having a large LLC cache will be extremely important in the future, and it is the reason the LLC sits on a separate chip: to offer the highest possible capacity and keep this data structure as close to the GPUs as possible.
Today, a good part of the slowness of ray tracing comes from the fact that much of the data sits in VRAM, with huge access latency. Keep in mind that the LLC in a multi-GPU would bring not only the bandwidth but also the latency advantages of a cache. In addition, its large size, together with the data compression techniques being developed in the laboratories of Intel, AMD and NVIDIA, will allow the BVH structures used for acceleration to be stored within the "internal" memory of the GPU.
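To see why capacity matters so much here, consider a common compact BVH node layout (a typical 32-byte arrangement, not any vendor's exact format) and the footprint of a scene-sized tree.

```cpp
#include <cstdio>

// A common compact BVH node: an axis-aligned bounding box plus child/leaf info.
struct BvhNode {
    float boundsMin[3];
    float boundsMax[3];
    int   leftChildOrPrimitive;   // first child index, or first primitive if leaf
    int   primitiveCountAndFlags;
}; // 32 bytes

int main() {
    // A binary BVH over N primitives has at most 2N - 1 nodes.
    const long long primitives = 4'000'000;          // illustrative scene size
    const long long nodes      = 2 * primitives - 1;
    const double    megabytes  = nodes * sizeof(BvhNode) / (1024.0 * 1024.0);
    std::printf("BVH: %lld nodes, ~%.0f MB\n", nodes, megabytes);
}
```

At roughly 244 MB for such a scene, the tree dwarfs any per-GPU cache, which is why both a large shared LLC and aggressive compression matter for keeping BVH traversal close to the compute chiplets.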