Intel has always been the third contender when it comes to GPUs; after all, graphics are not its core business, which remains CPUs. In recent years, however, the company has increased its resources and has a series of gaming GPUs on the launch ramp. Its architecture nevertheless presents a series of differentiating points compared to its competitors.
The Execution Unit, the basis of Intel’s GPUs
To understand how the organization or architecture of Intel GPUs differs from the rest, you have to understand that where the minimum unit in an NVIDIA or AMD GPU is the shader unit, in Intel's case it is the Execution Unit. What does it consist of? Each Execution Unit is a processor designed for full thread-level parallelism (TLP). It therefore has a control unit, registers and the corresponding execution units: two SIMD units, one with 4 32-bit floating-point ALUs and the other with 4 integer ALUs, which can be switched between and which support SIMD on register.
Thanks to SIMD on register, by subdividing the ALUs and their associated registers, they can work with twice as many operands per clock cycle for each halving of precision. They can thus perform twice as many 16-bit floating-point operations as 32-bit ones, and four times as many at 8-bit. As for their functionality, the execution units are responsible for running shader programs; after all, they are the equivalent of the SIMD units in NVIDIA and AMD GPUs, so their job is the same.
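To make the SIMD-on-register idea concrete, here is a minimal C++ sketch that splits one 32-bit integer lane into two 16-bit sub-lanes, doubling the operands processed per step; `add_packed16` is an illustrative name invented for the example, not an actual Intel instruction.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of SIMD on register: one 32-bit lane is subdivided into
// two 16-bit sub-lanes, so each step processes twice as many operands.
// The function name and data layout are assumptions made for illustration.
uint32_t add_packed16(uint32_t a, uint32_t b) {
    uint32_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));         // sub-lane 0
    uint32_t hi = (uint32_t)(uint16_t)((a >> 16) + (b >> 16)) << 16; // sub-lane 1
    return hi | lo; // two 16-bit results in the space of one 32-bit result
}

int main() {
    // Pack the operand pairs (1, 3) and (3, 4) into two 32-bit "registers".
    uint32_t a = (3u << 16) | 1u;
    uint32_t b = (4u << 16) | 3u;
    uint32_t r = add_packed16(a, b);
    printf("sub-lane 0: %u, sub-lane 1: %u\n",
           (unsigned)(r & 0xFFFFu), (unsigned)(r >> 16)); // prints 4 and 7
}
```

Applying the same subdivision again, down to four 8-bit sub-lanes, is what yields the quadrupled throughput mentioned above.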
In Intel Xe, Raja Koduri's team made an important change to the control unit: two execution units now share the same control unit. The change is very reminiscent of the one AMD made in its RDNA architecture, where two Compute Units are grouped into a single Workgroup Processor. Something that shouldn't surprise us given the brain drain from AMD to Intel. This change means the control unit has been redesigned, which likely represents a complete overhaul of the internal ISA of Intel GPUs toward a much more efficient one.
Sub-Slice, the Shader unit
As for the equivalent of the shader units found in NVIDIA and AMD GPUs, we have already seen that it is not the execution units but the sub-slices, inside which the execution units are grouped. Each Execution Unit is a subset of a sub-slice, and the slice is the superset of the sub-slice; we will get to the latter later. Each sub-slice contains 16 Execution Units, which translates into 64 FP32 ALUs and 64 integer ALUs in total. A number that makes these units equivalent in raw computing power to their AMD counterparts, the Compute Units.
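As a quick sanity check on those numbers, here is a back-of-the-envelope C++ snippet; the fused multiply-add factor of 2 is a common convention for peak-operation counts and an assumption here, not a figure from the text.

```cpp
#include <cstdio>

// Per-sub-slice totals from the counts given in the text:
// 16 Execution Units, each with 4 FP32 ALUs and 4 integer ALUs.
int main() {
    const int eus_per_subslice = 16;
    const int fp32_alus_per_eu = 4;
    const int int_alus_per_eu  = 4;

    int fp32_total = eus_per_subslice * fp32_alus_per_eu; // 64 FP32 ALUs
    int int_total  = eus_per_subslice * int_alus_per_eu;  // 64 integer ALUs

    // Counting a fused multiply-add as 2 operations per ALU per clock:
    int fp32_ops_per_clock = fp32_total * 2;              // 128 ops/clock

    printf("FP32 ALUs: %d, integer ALUs: %d, peak FP32 ops/clock: %d\n",
           fp32_total, int_total, fp32_ops_per_clock);
}
```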
As for the rest of the elements found inside the sub-slice, they are the classics of a unit of this type, although Intel uses a different nomenclature than usual. Such is the case with what it calls the 3D Sampler, which is none other than the classic unit for processing and filtering textures; Intel has simply given another name to a fixed-function unit found in every 3D graphics processor since their inception.
However, the Media Sampler is a much more interesting piece because it is unique to Intel GPUs. It is made up of a series of fixed-function units, which are as follows:
- The Video Motion Engine provides an estimate of pixel motion between frames, which is essential for video encoders; a sketch of this kind of search follows this list.
- The Adaptive Video Scaler is a unit that performs image scaling and smoothing filters.
- De-Noise / De-Interlace is a unit responsible, on the one hand, for reducing noise in an image and, on the other, for converting video from interlaced to progressive mode.
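As a rough idea of the work the Video Motion Engine offloads, here is a minimal C++ sketch of block-matching motion estimation: for an 8x8 block of the current frame, it searches the previous frame for the offset with the lowest sum of absolute differences. The frame layout, block size and search range are assumptions for the example, not details of Intel's hardware.

```cpp
#include <cstdint>
#include <cstdlib>

struct MotionVector { int dx, dy; };

// Find the offset in the previous frame whose 8x8 block best matches the
// current block at (bx, by), using the sum of absolute differences (SAD).
MotionVector estimate_motion(const uint8_t* cur, const uint8_t* prev,
                             int stride, int bx, int by, int range) {
    MotionVector best{0, 0};
    int best_sad = INT32_MAX;
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int sad = 0;
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x) {
                    int c = cur [(by + y)      * stride + (bx + x)];
                    int p = prev[(by + dy + y) * stride + (bx + dx + x)];
                    sad += abs(c - p);
                }
            if (sad < best_sad) { best_sad = sad; best = {dx, dy}; }
        }
    }
    return best; // the motion vector handed to the video encoder
}
```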
Starting with Intel Xe, the Media Sampler was removed from the sub-slice and became a full-fledged independent unit. It remains a piece that sets Intel apart from the NVIDIA and AMD designs.
The Slice, another common piece in GPUs
The Slice in the Intel GPU architecture is the equivalent of AMD's Shader Engine or NVIDIA's GPC: different names for the same kind of organization of units. Inside are the sub-slices and a series of fixed-function units, which are common to GPUs from other companies.
Here again the nomenclature can be confusing. In other architectures the raster unit and the unit that generates the depth buffer are generally unified, since both tasks take place in the raster stage; NVIDIA and AMD handle them in a common unit, but Intel does so in separate units.
The same goes for the Pixel Dispatch and the Pixel Back-End: functions of the ROP units that are performed here by two different elements, even though the task at hand is in both cases the same.
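To picture the split, here is an illustrative C++ sketch of the back-end half of the job: taking an already shaded fragment, applying the depth test and writing it to the framebuffer. The structures and the no-blending write are assumptions made for the example, not Intel's actual pipeline.

```cpp
#include <cstdint>

// A shaded fragment as the Pixel Dispatch stage might hand it over
// (illustrative layout, not Intel's).
struct Fragment { int x, y; float z; uint32_t color; };

// Pixel back-end duties: depth test, then color and depth writes.
void pixel_back_end(const Fragment& f, float* depth_buffer,
                    uint32_t* color_buffer, int width) {
    int idx = f.y * width + f.x;
    if (f.z < depth_buffer[idx]) {   // keep only the nearer fragment
        depth_buffer[idx] = f.z;     // update the depth buffer
        color_buffer[idx] = f.color; // plain overwrite; real ROPs also blend
    }
}
```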
The Intel GPU cache hierarchy
One of the points where the common architecture of Intel GPUs differs from AMD and NVIDIA is precisely how the cache hierarchy is organized. In AMD's case, the RX 6000 has a four-level hierarchy if you count the newly incorporated Infinity Cache. NVIDIA's cache hierarchy differs from both Intel's and AMD's, but since this article is not dedicated to Intel's competitors, we will not dwell on it here.
The diagram in this section shows the internal communication within the GPU, at both the sub-slice and the slice level. In the sub-slices we find the classic data cache and the shared local memory. But unlike NVIDIA and AMD GPUs, Intel has traditionally added an extra L2 cache accessible by both the 3D Sampler and the Media Sampler, which makes the L3 the GPU's top-level cache.
The differentiation between the L1 cache for data, and therefore for threads, and the L2 for textures has disappeared in Intel Xe, where the two have been combined into a single L1 cache for data and textures. Intel GPUs therefore now have a completely standard configuration compared to the competition.
Another change concerns the L3 or last-level cache. Contemporary GPUs support what is called Tiled Caching, which consists of rasterizing by tiles on the last-level cache. If a tile's data does not fit there, it spills to memory and the energy cost of recovering it skyrockets, which is why Intel increased the L3 from 3 MB to 16 MB.
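A quick back-of-the-envelope calculation shows why the capacity matters; the tile size and per-pixel formats below are assumptions chosen for illustration, not Intel's actual parameters.

```cpp
#include <cstdio>

// Working-set estimate for tiled caching: a tile's color and depth data
// must fit in the last-level cache or spill to memory.
int main() {
    const int tile_w = 512, tile_h = 512;        // assumed tile dimensions
    const int color_bytes = 4, depth_bytes = 4;  // RGBA8 color + 32-bit depth

    int bytes = tile_w * tile_h * (color_bytes + depth_bytes);
    printf("Tile working set: %d KiB\n", bytes / 1024); // 2048 KiB = 2 MiB

    // Around 3 MB barely holds one such tile alongside other traffic;
    // 16 MB leaves headroom for several tiles in flight.
    return 0;
}
```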