One of the most studied advances in recent processor design concerns the movement of data within an architecture. At first glance it may seem like a problem that was solved long ago, but in recent years it has become a crucial factor in increasing processor performance.
There are two reasons why data movement has become an obsession for architects designing new hardware. The first is energy consumption; the second is latency, the time it takes to perform a memory operation. On this second point we need to be very clear about the relationship between latency and bandwidth.
Is bandwidth the same as latency?
No, they are not the same. Latency is the time, in clock cycles, it takes to resolve a memory request, and it involves a series of steps that must always be performed. The problem is that even if the memory interface is very fast, the memory controller may not be, and sometimes it, or the CPU's MMU, saturates and ends up delaying all memory requests.
Regardless of the speed of the interface, if one memory request is blocked, the rest of the queue is blocked and no data is transmitted. This can happen if we saturate the controller with a large number of RAM requests. Worse still, it can leave the CPU waiting a long time for the data needed by the next instruction to be executed.
Bandwidth, on the other hand, is simply the transfer rate. For example, you can have 100 requests at 1 Gb/s or 1 request at 100 Gb/s; the aggregate bandwidth is the same, but the processor's memory controller, which is responsible for managing the accesses, will have more difficulty with the first case than with the second.
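The difference can be made concrete with a little arithmetic. This sketch uses hypothetical numbers (a fixed per-request latency of 0.5 µs and a 10 GB/s link) purely to show how the same total bandwidth behaves differently depending on the request pattern:

```python
# Illustrative sketch: same aggregate data moved, different request patterns.
# The latency and bandwidth figures are hypothetical, chosen only to show
# how per-request overhead accumulates with many small requests.

PER_REQUEST_LATENCY_US = 0.5   # fixed overhead to resolve each memory request
BANDWIDTH_GB_PER_S = 10        # transfer rate once a request is streaming
TOTAL_MB = 100                 # total data to move

def transfer_time_us(num_requests: int) -> float:
    """Total time = per-request overhead + pure transfer time."""
    transfer_us = (TOTAL_MB / 1000) / BANDWIDTH_GB_PER_S * 1_000_000
    return num_requests * PER_REQUEST_LATENCY_US + transfer_us

many_small = transfer_time_us(100)  # 100 requests of 1 MB each
one_big = transfer_time_us(1)       # 1 request of 100 MB

print(f"100 small requests: {many_small:.1f} us")
print(f"1 large request:    {one_big:.1f} us")
```

With these numbers both patterns spend the same 10 ms actually transferring data, but the 100-request version pays the per-request latency 100 times over, so it always finishes later.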
Data movement units
Take any ISA, no matter which, and look through it: you will see instructions that perform no arithmetic-logic operation and are not tasked with performing a jump or a branch, but are responsible for data movements to and from memory.
Many of these instructions have been around for a long time and each carries a fixed latency in clock cycles. What if we added a support processor to act as a messenger, able to resolve these requests to RAM, or to any memory within the same address space, with lower latency? The processor's performance would increase, since it could spend on new instructions the clock cycles it would otherwise waste waiting.
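A minimal software analogy of that "messenger" idea: hand the data movement off to a helper and let the main flow keep executing useful work instead of stalling. This models only the concept; real offload hardware uses work queues and completion records, not Python threads:

```python
import threading

def copy_worker(src: bytearray, dst: bytearray, done: threading.Event) -> None:
    """Helper 'messenger': performs the data movement on behalf of the CPU."""
    dst[:] = src          # the actual move
    done.set()            # signal completion back to the submitter

src = bytearray(b"payload " * 1000)
dst = bytearray(len(src))
done = threading.Event()

# Offload the copy, then keep doing independent work in the meantime.
threading.Thread(target=copy_worker, args=(src, dst, done)).start()
useful_work = sum(range(10_000))   # stands in for independent instructions

done.wait()                        # synchronize only when the data is needed
assert dst == src
```

The point is the overlap: the submitter only blocks at `done.wait()`, at the moment it actually needs the moved data.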
The Intel Data Streaming Accelerator is based on this principle, and it is one of the keys to improving the performance of different processors.
The Intel Data Streaming Accelerator
As the name suggests, it is an accelerator, that is, a unit that performs a specific task, which in this case is moving data in less time than the CPU. The peculiarity of the DSA is that it is designed around one of the features that Compute Express Link brings on top of PCI Express 5.0: granting coherent access to RAM to all devices connected to the PCI Express port, meaning they all share the same memory addresses.
It is therefore used to perform the following operations:
- It can move data from the CPU to RAM and vice versa.
- To access non-coherent memory spaces with different memory addressing, it can perform the address conversion automatically, so technically we are dealing with an updated DMA unit.
- It also has access to persistent or non-volatile memory, so it can reach NVMe SSDs, Intel Optane modules, NVDIMMs, etc.
- Via NTB (Non-Transparent Bridging), in a server environment, it gives access to the RAM or non-volatile memory of another card in the data center or server.
- It has built-in functions to apply the above points to virtual machines.
As many of you will have deduced, this is a type of unit designed specifically for server processors, although it is not a fixed-function unit that operates automatically.
Intel DSA Instructions
The Data Streaming Accelerator is not a fixed-function unit, since it does not always apply the same operation to the data entering it; rather, it supports a series of instructions, which is why we call it a domain-specific processor. Some of the things it can do are:
- Move: the classic x86 data transfer instructions, familiar to anyone who has written assembler. If the processor has one or more Data Streaming Accelerators, these transfers are executed by them and not by the processor cores.
- DIF (Data Integrity Field): responsible for verifying the integrity of data in memory.
- CRC Generation: generates the CRC checksum on the transmitted data.
- Fill: fills a section of memory with a specific value repeatedly. It is ideal for erasing the contents of a region, since it allows us to set all the bits to 0.
- Compare: It allows you to compare two memory blocks and check whether they are identical.
- Delta record creation: compares two data streams and generates a new stream containing the differences between the two.
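Software equivalents of several of these operations make their semantics concrete. This is only an illustrative sketch: `zlib.crc32` stands in for whatever CRC the hardware computes, and the `(offset, old, new)` tuples are an invented delta format, not the DSA's actual record layout:

```python
import zlib

# Fill: write a repeating value into a region (here, zeroing it out).
buf = bytearray(b"old contents")
buf[:] = bytes(len(buf))          # every byte set to 0

# Compare: check whether two memory blocks are identical.
a, b = b"block-A", b"block-B"
identical = (a == b)

# CRC generation: checksum over the transmitted data.
checksum = zlib.crc32(b"transmitted data")

# Delta record: walk both blocks and keep only the positions that differ.
delta = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

print(identical, hex(checksum), delta)
```

Here `a` and `b` differ only in their last byte, so the delta record holds a single entry, which is exactly the appeal of the operation: transmitting the differences instead of the whole block.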
The Data Streaming Accelerator can also control multiple storage devices at the same time:
- Enable/Disable: connect or disconnect a memory device, whether RAM or non-volatile storage.
- Abort: cancels all pending requests to RAM or another memory device.
- Drain: requests that all outstanding operations to a memory device be completed.
The instruction list is much longer, but this should give you a rough idea of how this new unit that Intel has built into its processors works. The benefits are clear and should improve further with Sapphire Rapids.