SWAR: how processors and GPUs accelerate AI and multimedia

The performance of a processor can be measured in two ways. On the one hand, there is how fast it executes serial instructions, which cannot be parallelized because each one works on a single piece of data. On the other hand, there is how fast it executes instructions that work on several pieces of data and can therefore be parallelized. The traditional way of handling the latter on CPUs and GPUs is the SIMD unit, and there is a subtype widely used in both: the SWAR unit.

ALUs and their complexity

Before talking about the SWAR concept, we need to keep in mind that ALUs are the units of a CPU responsible for performing arithmetic and logic calculations on numbers. They can become complex in two ways. The first is the complexity of the instruction to be executed: the internal circuit of an ALU that can calculate, for example, a square root is not the same as that of one that performs a simple addition.

The other is the precision with which they work, that is, the number of bits they manipulate at once. An ALU can always handle data whose width is equal to or less than the number of bits it was designed for. For example, we can't make a 16-bit ALU operate on a 32-bit number in a single step, but a 32-bit ALU can operate on a 16-bit number.

But what happens when we have multiple pieces of lower-precision data? Normally they are processed at the same speed as full-precision data, but there is a way to speed them up: SIMD within a register, which is also a way to save transistors in a processor.

What is the SWAR concept?

By now, many readers will know that this is a type of SIMD unit, but let's review the idea so that no one gets lost from the start. A SIMD unit is a type of ALU in which a single instruction manipulates multiple pieces of data at the same time. It contains multiple ALUs that share the common part, namely the instruction itself and its decoding, while each one processes different data.

SIMD units are usually made up of multiple ALUs, but there are cases where a single ALU is subdivided into simpler ones, along with the accumulator register where it temporarily stores the data it operates on. This is called SIMD within a register, or SWAR for short.

This type of SIMD unit is widely used and allows an n-bit ALU to execute the same instruction on data of lower precision, usually half or a quarter of n. For example, we can make one 64-bit ALU act as two 32-bit ALUs, or as four 16-bit ALUs, executing the same instruction in parallel.
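The idea of one wide ALU acting as several narrow ones can be imitated in software. The following is a minimal sketch in C, not a description of any particular hardware: it treats a 64-bit integer as four 16-bit lanes and adds all four with one 64-bit addition, using a mask to stop carries from crossing lane boundaries. The function names `pack16x4`, `lane16` and `swar_add16x4` are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit values into one 64-bit word, lane 0 in the low bits. */
uint64_t pack16x4(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint64_t)a | ((uint64_t)b << 16) |
           ((uint64_t)c << 32) | ((uint64_t)d << 48);
}

/* Read lane i (0..3) back out of a packed word. */
uint16_t lane16(uint64_t v, unsigned i)
{
    assert(i < 4);
    return (uint16_t)(v >> (16 * i));
}

/* Add four 16-bit lanes at once; each lane wraps around independently. */
uint64_t swar_add16x4(uint64_t x, uint64_t y)
{
    const uint64_t HIGH = 0x8000800080008000ULL; /* top bit of every lane */

    /* Add only the low 15 bits of each lane, so no carry leaves a lane. */
    uint64_t low = (x & ~HIGH) + (y & ~HIGH);

    /* Fold the top bits back in with XOR (addition without carry-out). */
    return low ^ ((x ^ y) & HIGH);
}
```

For instance, adding the packed lanes (1, 65535, 32768, 100) and (2, 1, 32768, 200) gives (3, 0, 0, 300): the second and third lanes wrap around without disturbing their neighbours. A hardware SWAR ALU gets the same effect by breaking its carry chain at the lane boundaries, so it does not need the masking steps.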

More about the SWAR concept

This concept is decades old, but it first appeared on PCs in the late 90s, when SIMD units were added to the various processors of the time. Veteran readers will remember names like MMX, AMD 3DNow!, SSE and others, which were SIMD units built on the SWAR concept.

Suppose we want to build a 128-bit SIMD unit.

In a conventional SIMD unit, several ALUs operate in parallel and each has its own register or data accumulator. Thus, a 128-bit SIMD unit can be made up of four 32-bit ALUs and four 32-bit registers.
Instead, a SWAR unit is a single wide ALU, together with its accumulator register, that can be split into narrower lanes. This lets us build the same SIMD unit from a single 128-bit ALU with SWAR support.

The advantage of a SWAR unit over a scalar unit is simple to understand. If an ALU lacks the SWAR mechanism that lets it work as a SIMD unit on lower-precision data, it processes that data at the same speed as data of its full precision. What does this mean? A 32-bit unit without SWAR support that needs to execute an instruction on 16-bit data will do so at the same rate as on 32-bit data. If the ALU supports SWAR, on the other hand, it can execute two 16-bit operations in the same cycle, provided both are the same operation and come one after the other.
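The difference in throughput can be sketched by counting steps. The C code below is an illustration under stated assumptions, not a benchmark: `add_scalar` and `add_swar` are hypothetical names, each loop iteration stands in for one ALU cycle, and the packed version assumes an even number of elements. In software the packed step costs a few extra mask operations; the step count models a hardware ALU where one packed addition takes one cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 16-bit addition per step. Returns the number of steps issued. */
size_t add_scalar(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)(a[i] + b[i]);
    return n;
}

/* Two 16-bit additions per 32-bit step. Returns the number of steps issued. */
size_t add_swar(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
{
    const uint32_t HIGH = 0x80008000u; /* top bit of each 16-bit lane */
    size_t steps = 0;

    assert(n % 2 == 0); /* assumption of this sketch: no leftover element */

    for (size_t i = 0; i < n; i += 2) {
        uint32_t x, y, r;
        memcpy(&x, &a[i], sizeof x); /* load two adjacent 16-bit values */
        memcpy(&y, &b[i], sizeof y);
        /* Lane-wise add: carries are kept from crossing the lane boundary. */
        r = ((x & ~HIGH) + (y & ~HIGH)) ^ ((x ^ y) & HIGH);
        memcpy(&dst[i], &r, sizeof r);
        steps++;
    }
    return steps;
}
```

On four elements, `add_scalar` issues four steps and `add_swar` issues two, and both produce the same wrapped 16-bit sums.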

SWAR as a patch for AI

Artificial intelligence algorithms have a peculiarity: they tend to work with very low-precision data, while most ALUs today work with 32-bit precision. Speeding up these algorithms would mean adding 16-, 8- and even 4-bit ALUs to a processor, which complicates the design. Engineers avoided that mistake and started using SIMD within a register in a particular way, especially on GPUs.

Is it possible to combine a conventional multi-ALU SIMD unit with a SWAR design? Yes, and this is what AMD does, for example, in its RDNA GPUs: each of the 32-bit ALUs that make up their SIMD units supports SIMD within a register and can therefore be subdivided into two 16-bit, four 8-bit or eight 4-bit lanes.
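To make the subdivision into 8-bit lanes concrete, here is a hedged sketch in C of a four-lane dot product with accumulation, a pattern common in low-precision AI workloads. It is a plain software emulation and is not claimed to match any AMD instruction; the names `pack8x4`, `lane8` and `dot8x4` are hypothetical, and choosing a dot product as the example operation is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four signed 8-bit values into one 32-bit word, lane 0 in the low bits. */
uint32_t pack8x4(int8_t a, int8_t b, int8_t c, int8_t d)
{
    return (uint32_t)(uint8_t)a | ((uint32_t)(uint8_t)b << 8) |
           ((uint32_t)(uint8_t)c << 16) | ((uint32_t)(uint8_t)d << 24);
}

/* Extract lane i (0..3) and sign-extend it to 32 bits. */
int32_t lane8(uint32_t v, unsigned i)
{
    assert(i < 4);
    int32_t byte = (int32_t)((v >> (8 * i)) & 0xFFu); /* 0..255 */
    return (byte ^ 0x80) - 0x80;                      /* -128..127 */
}

/* acc + sum of the four lane-wise products of x and y. */
int32_t dot8x4(uint32_t x, uint32_t y, int32_t acc)
{
    for (unsigned i = 0; i < 4; i++)
        acc += lane8(x, i) * lane8(y, i);
    return acc;
}
```

With the lanes (1, -2, 3, -4) and (5, 6, -7, 8) and a starting accumulator of 10, `dot8x4` returns -50. The point of the sketch is that a single 32-bit register carries four operands, so one 32-bit datapath does the work of four 8-bit ones.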

In the case of NVIDIA, the job of speeding up AI algorithms falls to the Tensor Cores, which are systolic arrays made up of 16-bit floating-point ALUs interconnected in a three-axis matrix, hence the name Tensor. They are not SIMD units, but each of their ALUs supports SIMD within a register, performing twice as many operations at 8-bit precision and four times as many at 4-bit precision. Either way, Tensor units are important because they are designed to perform matrix-to-matrix operations much faster than a SIMD unit can.
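As a rough picture of the matrix-to-matrix work involved, the sketch below computes a multiply-accumulate, d = a × b + c, on one small tile in plain C. It only illustrates the arithmetic and says nothing about how the hardware is wired: the function name `mma_tile`, the 4×4 tile size and the use of `float` instead of 16-bit values are all assumptions made for readability.

```c
#include <assert.h>

#define TILE 4 /* illustrative tile size, an assumption of this sketch */

/* Compute d = a * b + c on TILE x TILE matrices stored row by row. */
void mma_tile(float *d, const float *a, const float *b, const float *c)
{
    assert(d != a && d != b); /* the output must not overwrite the inputs */

    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            float acc = c[i * TILE + j];
            /* One row of a times one column of b, added to c. */
            for (int k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[k * TILE + j];
            d[i * TILE + j] = acc;
        }
    }
}
```

A scalar or SIMD unit works through these multiply-adds a few at a time, whereas a unit built for matrices is meant to process the whole tile as one operation.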