How can we make a large number of GPU chips communicate with each other effectively? A memory is needed to act as the intermediary, and this is where the Rambo Cache comes in. We explain how it works and what its function is.
The Rambo Cache as a difference between Xe-HP and Xe-HPC
As can be seen from the Intel slide, the Rambo Cache is itself a chip that contains memory, and it will be used exclusively in the Intel Xe-HPC for communication between the different tiles / chips. While the Intel Xe-HP supports up to 4 different tiles, the Intel Xe-HPC handles a much higher amount of data, which makes this additional memory chip necessary as a communication bridge for setups of GPU chiplets, or tiles as Intel calls them, that are extremely complex in terms of the amount of data they move.
The Rambo Cache will be placed between multiple Intel Xe-HPC compute tiles to facilitate communication between them. Compute tiles are none other than Intel Xe GPUs specialized for high performance computing, which is why the classic fixed-function units of GPUs will not be present in Intel Xe-HPC: they are not used in high performance computing.
However, the Rambo Cache will be absent from the rest of the Intel Xe range, especially the designs that will not be based on multiple chips, such as the Intel Xe-LP and the Intel Xe-HPG. In the specific case of the Intel Xe-HP, it seems that with 4 chiplets the Rambo Cache is not necessary, because the interposer provides enough bandwidth to interconnect the different chiplets mounted on it.
The goal is to reach the ExaFLOP
We know that the limit on the number of chips on an interposer is 4 GPUs; beyond that number, an interconnection based on an EMIB interposer no longer provides enough bandwidth for communication. This calls for something that unifies memory access, and this is where the Rambo Cache would come in, as it would allow Intel to create a more complex GPU than the maximum of 4 chips it can build with EMIB alone.
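As a rough intuition for why a shared memory hub scales better than wiring every tile to every other tile, here is a toy Python sketch. The topology and the functions in it are our own illustration of the scaling argument, not Intel's actual wiring:

```python
# Toy model (not Intel data): with direct point-to-point links between
# tiles the wiring grows quadratically, while a shared hub like the
# Rambo Cache only needs one link per tile.

def point_to_point_links(tiles: int) -> int:
    """Every tile wired directly to every other tile: n*(n-1)/2 links."""
    return tiles * (tiles - 1) // 2

def hub_links(tiles: int) -> int:
    """Every tile wired only to a shared cache hub: n links."""
    return tiles

for n in (2, 4, 8, 16):
    print(f"{n:2d} tiles: {point_to_point_links(n):3d} direct links "
          f"vs {hub_links(n):2d} links through a shared hub")
```

At 4 tiles the two approaches are comparable, which matches the Xe-HP case above; at 8 or 16 tiles the direct wiring explodes, and a shared hub becomes the sensible design.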
The goal? To create hardware that, combined, can reach 1 PetaFLOP of computing power, or in other words 1000 TFLOPS. That is performance far superior to the GPUs we have in PCs, but we are not talking about a PC GPU: this is hardware designed for supercomputers, with the aim of reaching the ExaFLOP barrier, which is 1000 PetaFLOPS and therefore 1 million TeraFLOPS.
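The unit arithmetic behind these figures can be checked in a few lines of Python; this is pure unit conversion, with no hardware assumptions:

```python
# Pure unit conversion for the figures above; no hardware assumptions.
TFLOPS = 10**12   # 1 TeraFLOP/s
PFLOPS = 10**15   # 1 PetaFLOP/s = 1000 TFLOPS
EFLOPS = 10**18   # 1 ExaFLOP/s  = 1000 PFLOPS

print(PFLOPS // TFLOPS)  # 1000     -> TFLOPS in one PetaFLOP
print(EFLOPS // PFLOPS)  # 1000     -> PetaFLOPS in one ExaFLOP
print(EFLOPS // TFLOPS)  # 1000000  -> TeraFLOPS in one ExaFLOP
```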
The big concern of hardware architects in achieving this is power consumption, especially in data transfers: more calculations mean more data, and moving more data consumes more power. This is why it is important to keep data as close to the processors as possible, and this is where the Rambo Cache comes in.
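To see why locality matters so much, here is a back-of-the-envelope Python sketch. The picojoule figures are illustrative orders of magnitude for on-package versus off-package transfers, not measured Intel values:

```python
# Toy energy model: the pJ-per-word figures are illustrative orders of
# magnitude, chosen only to show the gap, not measured Intel values.
ENERGY_PJ_PER_64BIT_WORD = {
    "local cache (on-tile)":    10,    # data stays next to the processor
    "shared cache (Rambo-like)": 100,  # data crosses to a nearby chip
    "external VRAM":             1000, # data leaves the package entirely
}

words_moved = 10**9  # 10^9 words of 64 bits = 8 GB of traffic

for where, pj in ENERGY_PJ_PER_64BIT_WORD.items():
    joules = words_moved * pj * 1e-12
    print(f"{where:28s}: {joules:.3f} J for 8 GB of traffic")
```

The absolute numbers are assumptions, but the shape of the result is the point: every extra step away from the processor multiplies the energy bill, so a cache that keeps data inside the package pays for itself.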
The Rambo Cache as a top-level cache
When we have multiple cores, whether in a CPU or a GPU, and we want them all to have access to the same memory at both the addressing and the physical level, a last-level cache is required. Its "geographic" location in the GPU is just before the memory controller but after each core's private caches.
GPUs today have at least two levels of cache: the first level is private to the shader units and is usually connected to the texture units, while the second level is shared by all the elements of the GPU. That shared level is their interconnection path, the place where they communicate and access the most recent data, all so as not to saturate the VRAM controller with requests.
But there is an extra level: when we have several full GPUs interconnected with each other under the same memory, an additional level of cache is needed to group the accesses to all the memories. The Rambo Cache is Intel's solution to unify the access of all the GPUs that make up Ponte Vecchio.
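As a conceptual sketch of that lookup order, the following minimal Python model chains a private L1, a per-GPU L2 and a shared top-level cache in front of VRAM. The class, names and fill policy are invented for illustration and are not Intel's design:

```python
# Minimal sketch of the lookup order described above; everything here is
# a simplified illustration, not Intel's actual cache design.
class Level:
    """One level in the memory hierarchy; `backing` is the next level down."""
    def __init__(self, name, backing=None):
        self.name, self.lines, self.backing = name, set(), backing

    def read(self, addr):
        if addr in self.lines:            # hit at this level
            return f"hit in {self.name}"
        if self.backing is None:          # bottom of the hierarchy: VRAM
            return f"fetched {addr:#x} from {self.name}"
        result = self.backing.read(addr)  # miss: go one level further down
        self.lines.add(addr)              # fill the line on the way back
        return result

vram    = Level("VRAM")                   # the memory itself
rambo   = Level("Rambo Cache", vram)      # top level, shared by every GPU
l2_gpu0 = Level("GPU0 L2", rambo)         # shared inside one GPU / tile
l2_gpu1 = Level("GPU1 L2", rambo)
l1_a    = Level("GPU0 core L1", l2_gpu0)  # private to a single core
l1_b    = Level("GPU1 core L1", l2_gpu1)

print(l1_a.read(0x100))  # first touch travels all the way down to VRAM
print(l1_b.read(0x100))  # a different GPU now hits in the shared Rambo Cache
```

The second read is the whole argument in miniature: a core on another GPU finds the data in the shared top level instead of going back to VRAM, which is exactly the unifying role the Rambo Cache plays in Ponte Vecchio.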