In this article we are not going to deal with a specific GPU architecture, but with GPUs in general, so that when you see the diagram a manufacturer publishes for the organization of its next GPU, you can understand it without any problem. It does not matter whether it is an integrated or a dedicated GPU, or how powerful it is.
Organization of a contemporary GPU
To understand how a GPU is organized, think of a Russian doll or matryoshka, which is made up of several dolls nested one inside another. You could also describe it as a set that contains a series of subsets. In other words, GPUs are organized in such a way that the different sets that make them up are, in many cases, located inside each other.
Through this division, we will understand something as complex as a GPU much better, because from the simple we can build the complex. With that said, let’s start with the first component.
Set A in the organization of a GPU: shader units
The first of the sets is the shader units. They are processors in their own right, but unlike CPUs they are not designed to exploit instruction-level parallelism (ILP) but thread-level parallelism (TLP). Whether the GPU comes from AMD, NVIDIA, Intel, or any other brand, in all contemporary GPUs the shader units are made up of the following elements (a minimal sketch of how work is expressed for them follows this list):
- SIMD units and their registers
- Scalar units and their registers
- Scheduler
- Shared local memory
- Texture filtering unit
- First-level data and/or texture cache
- Load/store units to move data to and from the cache and the shared memory
- Ray intersection unit
- Systolic arrays or tensor units
- Export bus, which exports Set A's data to the different components of Set B
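To get a feel for what these elements mean in practice, here is a minimal sketch in CUDA terms, assuming NVIDIA's naming, where an SM plays the role of a shader unit, a warp is the group of threads issued to the SIMD units, and `__shared__` memory corresponds to the shared local memory listed above. The kernel, array size and block size are arbitrary example values: it sums a vector block by block, which exercises the load/store units, the shared local memory and the scheduler's ability to hide latency by switching between groups of threads.

```cuda
// Minimal sketch: thread-level parallelism plus the shader unit's
// shared local memory (here, __shared__). Sizes and data are example values.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];          // shared local memory of one shader unit
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;  // load/store units fill the tile
    __syncthreads();

    // The SIMD units execute the same instruction across the threads of a warp;
    // the scheduler hides memory latency by switching between warps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // write out the block's result
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int i = 0; i < blocks; ++i) total += out[i];
    printf("sum = %.0f\n", total);       // expect 1048576

    cudaFree(in); cudaFree(out);
    return 0;
}
```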
Set B in the organization of a GPU: Shader Array / Shader Engine / GPC
Set B contains Set A inside it, but adds, to begin with, the instruction and constant caches. In GPUs, as in CPUs, the first-level cache is divided into two parts, one for data and the other for instructions. The difference is that in the case of GPUs the instruction cache sits outside the shader units, and therefore belongs to Set B.
Set B in the organization of a GPU therefore comprises a series of shader units, which talk to each other through a common communication interface. The shader units are not alone in Set B, however, as this is also where several fixed-function units for rendering graphics are found, such as:
- Primitive unit: invoked during the World Space Pipeline or geometry pipeline, it is in charge of the tessellation of the scene's geometry.
- Rasterization unit: it performs the rasterization of the primitives, converting triangles into pixel fragments; its stage is the one that begins the so-called Screen Space Pipeline or raster phase.
- ROPs: the units that write to the image buffers. They operate in two stages: in the raster phase, before texturing, they generate the depth buffer (Z-buffer), while in the phase after texturing they receive the result of that step to generate the color buffer or the different render targets (deferred rendering). A simplified software sketch of the rasterization and ROP stages follows this list.
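To make the division of labor between the rasterization unit and the ROPs more concrete, here is a simplified software caricature written as a CUDA kernel: one thread per pixel checks whether the pixel falls inside a triangle (the rasterization test), then performs the depth test against the Z-buffer and, if it passes, writes the color buffer. The triangle, resolution, depth and color are made-up example values; real hardware does all of this in fixed-function units, not in shader code.

```cuda
// Software caricature of the rasterization unit and the ROPs:
// inside-triangle test, then depth test, then color write.
#include <cstdio>
#include <cuda_runtime.h>

struct Vec2 { float x, y; };

__device__ float edge(Vec2 a, Vec2 b, Vec2 p) {
    // Positive if p lies on the left side of edge a->b (counter-clockwise triangle)
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

__global__ void rasterize(Vec2 v0, Vec2 v1, Vec2 v2, float triDepth,
                          float* zBuffer, unsigned* colorBuffer,
                          int width, int height, unsigned color) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    Vec2 p = { x + 0.5f, y + 0.5f };                 // pixel center
    bool inside = edge(v0, v1, p) >= 0 &&
                  edge(v1, v2, p) >= 0 &&
                  edge(v2, v0, p) >= 0;              // rasterization test
    if (!inside) return;

    int idx = y * width + x;
    if (triDepth < zBuffer[idx]) {                   // ROP stage 1: depth test
        zBuffer[idx] = triDepth;                     // update the Z-buffer
        colorBuffer[idx] = color;                    // ROP stage 2: color write
    }
}

int main() {
    const int W = 64, H = 64;
    float* z; unsigned* c;
    cudaMallocManaged(&z, W * H * sizeof(float));
    cudaMallocManaged(&c, W * H * sizeof(unsigned));
    for (int i = 0; i < W * H; ++i) { z[i] = 1.0f; c[i] = 0; }   // clear buffers

    Vec2 v0{8, 8}, v1{56, 16}, v2{24, 56};           // example triangle (CCW)
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    rasterize<<<grid, block>>>(v0, v1, v2, 0.5f, z, c, W, H, 0xFF00FF00u);
    cudaDeviceSynchronize();

    int covered = 0;
    for (int i = 0; i < W * H; ++i) covered += (c[i] != 0);
    printf("fragments written: %d\n", covered);
    cudaFree(z); cudaFree(c);
    return 0;
}
```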
Set C in the architecture of a GPU
We now have almost the complete GPU, or rather the GPU without its accelerators. It consists of the following components:
- Several Set Bs inside it.
- Shared global memory: a scratchpad, and therefore outside the cache hierarchy, used to communicate the different Set Bs with each other.
- Geometric unit: it can read pointers to RAM that point to the scene's geometry, which makes it possible to discard non-visible or superfluous geometry so that it is not rendered unnecessarily in the frame (see the culling sketch below).
- Command processors (graphics and compute).
- Last-level cache: all the elements of the GPU are clients of this cache, so it needs a huge communication ring; all the components of Set B have direct contact with the L2 cache, as do all the components of Set C.
The Last Level Cache (LLC) is important because it is the cache that gives coherence to all the elements of Set C with each other, obviously including Set B within it. Not only that, it also keeps the external memory controller from becoming saturated, since it is the LLC itself, together with the GPU's MMU unit(s), that is responsible for fetching instructions and data from RAM. Think of the Last Level Cache as a sort of logistics warehouse to which all the elements of Set C send, and from which they receive, their packages, with the logistics controlled by the MMU, the unit in charge of doing so.
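As a rough illustration of the culling mentioned for the geometric unit, the following CUDA sketch marks which triangles of a small scene survive two trivial tests, back-face and off-screen, so that only those would be passed on for rendering. The triangle data, data layout and viewport size are invented example values; the real unit works on the scene geometry it reads from RAM via pointers.

```cuda
// Minimal culling sketch: discard back-facing and off-screen triangles
// before they reach the rasterizer. Data layout and values are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

struct Tri { float x0, y0, x1, y1, x2, y2; };   // 2D screen-space triangle

__global__ void cullTriangles(const Tri* tris, int* visible, int n,
                              float width, float height) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Tri t = tris[i];

    // Back-face test: a non-positive signed area means the triangle faces away.
    float area = (t.x1 - t.x0) * (t.y2 - t.y0) - (t.y1 - t.y0) * (t.x2 - t.x0);
    bool backFacing = area <= 0.0f;

    // Trivial off-screen test against the viewport bounds.
    bool offScreen =
        (t.x0 < 0 && t.x1 < 0 && t.x2 < 0) ||
        (t.y0 < 0 && t.y1 < 0 && t.y2 < 0) ||
        (t.x0 > width  && t.x1 > width  && t.x2 > width) ||
        (t.y0 > height && t.y1 > height && t.y2 > height);

    visible[i] = (!backFacing && !offScreen) ? 1 : 0;
}

int main() {
    const int n = 3;
    Tri* tris; int* visible;
    cudaMallocManaged(&tris, n * sizeof(Tri));
    cudaMallocManaged(&visible, n * sizeof(int));

    tris[0] = {10, 10, 100, 20, 50, 90};             // on screen, front-facing -> kept
    tris[1] = {10, 10, 50, 90, 100, 20};             // reversed winding -> culled
    tris[2] = {-300, -300, -200, -250, -250, -100};  // entirely off-screen -> culled

    cullTriangles<<<1, 64>>>(tris, visible, n, 1920.0f, 1080.0f);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i)
        printf("triangle %d: %s\n", i, visible[i] ? "rendered" : "culled");

    cudaFree(tris); cudaFree(visible);
    return 0;
}
```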
Final set: the full GPU
With all this we already have the full GPU: Set D includes the main unit, the GPU in charge of rendering the graphics of our favorite games, but it is not the highest level of a GPU, because we are still missing a series of support coprocessors. These do not render graphics directly, but without them the GPU would not be able to function. These elements are generally:
- The GFX unit, including its first level cache
- The North Bridge or Northbridge of the GPU: if it is a heterogeneous SoC (with a CPU) but with a shared memory pool, a common Northbridge is used. All the elements of Set D are connected to the Northbridge.
- Accelerators: video encoders and display adapters are connected to the Northbridge. In the case of the display adapter, it is the unit that sends the video signal to the DisplayPort or HDMI output.
- DMA units: if there are two RAM address spaces (even when they share the same physical memory), the DMA units allow data to be moved from one RAM space to the other. In the case of a dedicated GPU, the DMA units also serve as the communication channel with the CPU or with other GPUs (a sketch of such a transfer follows this list).
- Memory controller and interface: it communicates the elements of Set D with the external RAM. It is connected to the Northbridge, and this is the only path to the external RAM.
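For the DMA units, a concrete way to picture them on a dedicated GPU is the transfer of a buffer between the CPU's address space (host RAM) and the GPU's address space (VRAM). The following CUDA sketch performs that round trip with asynchronous copies, which on real hardware are serviced by the GPU's copy/DMA engines rather than by the shader units; the buffer size and contents are arbitrary example values.

```cuda
// Sketch of a DMA-serviced transfer: host RAM -> VRAM -> host RAM.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* hostBuf;
    float* deviceBuf;
    cudaMallocHost(&hostBuf, bytes);     // pinned host memory, DMA-friendly
    cudaMalloc(&deviceBuf, bytes);       // allocation in the GPU's address space
    for (size_t i = 0; i < n; ++i) hostBuf[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Host RAM -> VRAM: handled by a DMA/copy engine, not by the shader units.
    cudaMemcpyAsync(deviceBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    // ... kernels that consume deviceBuf would be launched here ...
    // VRAM -> host RAM: the reverse transfer, again via DMA.
    cudaMemcpyAsync(hostBuf, deviceBuf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);       // wait until both transfers complete

    printf("last element after round trip: %.0f\n", hostBuf[n - 1]);

    cudaStreamDestroy(stream);
    cudaFreeHost(hostBuf);
    cudaFree(deviceBuf);
    return 0;
}
```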
With all this, you now have the complete organization of a GPU, with which you will be able to read a GPU diagram much better and understand how it is organized internally.