Hardware error correction in RAM, HDD and SSD, what is it?

Error correction matters wherever data is transmitted: from one part of a processor to another, between different processors, and between processor and memory. Although it may seem surprising, physical effects such as electrical interference can distort a signal on its way from one place to another, so that the 0s of the binary code become 1s or vice versa.
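To make the idea of a flipped bit concrete, here is a minimal Python sketch. It only models the effect in software; the helper name `flip_bit` and the sample byte are illustrative assumptions, not part of any real hardware interface.

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted (0 = least significant bit)."""
    return value ^ (1 << position)


sent = 0b01000001             # the byte as transmitted (ASCII "A")
received = flip_bit(sent, 1)  # bit 1 is disturbed on the way

print(f"{sent:08b} -> {received:08b}")  # 01000001 -> 01000011
print(chr(sent), "->", chr(received))   # A -> C
```

A single inverted bit is enough to turn one character into another, which is why even rare disturbances matter.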

The resulting changes can be as minor as a slightly altered data value or as serious as a different instruction being executed. This is why, during the physical implementation of a processor or a memory, the certification team must check point by point that nothing in the design causes such variations when data is sent, and why error correction mechanisms are implemented in hardware.

Data storage

The first place where hardware errors during data transmission must be avoided is the storage media, since they hold our information and we want 100% accuracy when storing data in binary format.

The resulting binary is not the same on every architecture, however: the register and instruction set, the disk format and the other standards in use mean that it differs from system to system. In any case, we must start from the fact that every file we store on a hard disk or SSD, whatever its nature, is always encoded in binary.

One way to fix errors is to use data redundancy, i.e. to keep multiple copies of each piece of data so that if one of them is altered, the change can at least be detected automatically. With three copies, however, this is inefficient, because it requires three times the bandwidth and storage. An alternative is therefore needed, and the best known is to add a series of additional bits at the end of each block of data to certify that no error has occurred. These extra bits are used by a series of algorithms to verify the integrity of the information.
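The redundancy approach can be sketched as a bitwise majority vote over three copies. This is an illustrative Python model, assuming exactly three copies and that at most one copy is corrupted in any given bit position; the name `majority_vote` is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


original = 0b10110010
copies = [original, original, original]
copies[1] ^= 0b00001000  # one of the three copies is corrupted in a single bit

recovered = majority_vote(*copies)
print(recovered == original)  # True
```

The vote recovers the original value, but only at the cost of storing and transmitting every byte three times, which is the inefficiency that check bits are meant to avoid.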

The hardware implementation

Hardware error correction systems are invisible not only to the user but also to the application and even to the operating system, because everything happens at the hardware level, inside the communication interfaces. These contain a series of mechanisms, usually fixed-function, that apply the corresponding algorithms to ensure that data travelling from one point to another remains unchanged at all times.

Every error-correcting system needs two participants: the one that sends the data and the one that receives it. Let's not forget that the act of sending the data is itself part of the mechanism. The transmitter is responsible for sending the message together with a series of special bits used to correct errors, so that the information that was transmitted can be compared with the information that was received and the integrity of the binary data is maintained at all times.

No software element can access the hardware responsible for this task, since it sits in the data transfer interfaces, performing the same operation over and over again. This does not require complex processors, only units that are as simple as they are efficient.

Parity bit

The easiest way to do this is the parity bit: an extra bit is added to each byte and set to 0 or 1 depending on whether the number of 1s in the rest of the byte is even or odd. It doesn't fix errors; it only tells us that the data has been changed. We don't know how many bits changed, and an even number of flipped bits goes unnoticed altogether, so on its own it is not enough to correct errors, because for that we need to locate where the error is. However, there are ways to perform error correction with parity bits, and looking at them will also help us understand how the mechanism works.
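As a rough illustration, the parity check can be written in a few lines of Python. The sketch assumes the even-parity convention (the parity bit is 1 when the byte holds an odd number of 1s); the function names `parity_bit` and `check` are made up for this example.

```python
def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte holds an odd number of 1s, otherwise 0."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte: int, parity: int) -> bool:
    """Return True if the received byte still matches the parity bit sent with it."""
    return parity_bit(byte) == parity


data = 0b10110010             # four 1s, so the parity bit is 0
p = parity_bit(data)

one_flip = data ^ 0b00000100  # a single bit changed in transit
two_flips = data ^ 0b00000110 # two bits changed in transit

print(check(data, p))       # True:  nothing changed
print(check(one_flip, p))   # False: the error is detected, but not located
print(check(two_flips, p))  # True:  two flips cancel out and go unnoticed
```

The last line shows the limitation described above: the check says whether the count of 1s still matches, not which bit changed or how many did.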

Error correction, practical example

A more elegant approach is to use multiple parity bits in each data block and organize the information as an array. For example, we can take a 16-bit (2-byte) code and represent it as a 4 x 4 array. At the beginning of each row we add a parity bit that indicates whether the number of 1s in that row is even or odd. At the end an additional row is added, which does the same count vertically, column by column, instead of horizontally.

The receiving system therefore receives a 16-bit block of data, of which 12 bits carry information and 4 are parity bits. It compares the code that was sent with the one that was received, checking first the first and third columns, then the first and second, the second and fourth, and finally the third and fourth. A similar process is then performed with the rows.

The purpose of this procedure is to find the exact location of the bit that changed, by identifying the column and the row in which the error lies. Once the receiving system knows which position has changed, it only needs to flip that bit back to its correct value. Today, more efficient error correction mechanisms than the one explained here are used (mainly for performance reasons), but this example shows in a simple way how hardware error correction works.
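The row-and-column idea can be sketched in Python as follows. This is a simplified model rather than the exact layout described above: it assumes a 4 x 4 block of data bits, one even-parity bit per row and one per column, parity bits that arrive intact, and at most one flipped data bit. All function names are illustrative.

```python
from typing import List, Optional, Tuple

Block = List[List[int]]  # 4 rows x 4 columns of data bits


def row_parities(block: Block) -> List[int]:
    """Even parity of each row."""
    return [sum(row) % 2 for row in block]


def col_parities(block: Block) -> List[int]:
    """Even parity of each column."""
    return [sum(col) % 2 for col in zip(*block)]


def locate_error(block: Block, sent_rows: List[int],
                 sent_cols: List[int]) -> Optional[Tuple[int, int]]:
    """Return (row, column) of a single flipped bit, or None if none can be pinpointed."""
    bad_rows = [i for i, (got, sent) in enumerate(zip(row_parities(block), sent_rows))
                if got != sent]
    bad_cols = [j for j, (got, sent) in enumerate(zip(col_parities(block), sent_cols))
                if got != sent]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct(block: Block, sent_rows: List[int], sent_cols: List[int]) -> Block:
    """Return a copy of the block with the located bit, if any, flipped back."""
    fixed = [row[:] for row in block]
    position = locate_error(fixed, sent_rows, sent_cols)
    if position is not None:
        r, c = position
        fixed[r][c] ^= 1
    return fixed


data = [[1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1]]
rows, cols = row_parities(data), col_parities(data)  # computed by the sender

received = [row[:] for row in data]
received[2][1] ^= 1                                  # one bit flips in transit

print(locate_error(received, rows, cols))            # (2, 1)
print(correct(received, rows, cols) == data)         # True
```

The failing row parity and the failing column parity intersect at exactly one cell, which is how the receiver knows which bit to flip back.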