For years there has been an ecosystem of applications and tools built around CUDA, focused on science and engineering and their many branches, from medicine to automotive design. This has allowed NVIDIA to expand beyond PC gaming hardware and grow its potential market share.
CUDA is best understood as a programming model for writing algorithms that run on an NVIDIA GPU, although there are also ways to run such code on a central processor and even on a competitor's chip. Currently, several programming languages have corresponding CUDA extensions or bindings, including C, C++, Fortran, Python and MATLAB.
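As an illustration, here is a minimal sketch of what the CUDA extension to C++ looks like: a kernel function marked `__global__` plus a host-side launch. The kernel name, sizes and the vector-add task are arbitrary examples, not taken from any particular application.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Minimal CUDA C++ kernel: each GPU thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory keeps the sketch short; real code may prefer
    // explicit cudaMalloc/cudaMemcpy.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough blocks of 256 threads to cover all n elements.
    vectorAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```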
What are CUDA cores?
In the hardware world we use the word core as a synonym for processor, and this is where the term CUDA cores clashes with the usual meaning. Imagine for a moment that a car engine manufacturer sells you a 16-valve engine and markets it as a "16-engine". Well, NVIDIA calls the units responsible for performing mathematical calculations "cores": what on any processor are called arithmetic logic units, or ALUs, are what CUDA cores are on an NVIDIA GPU. Specifically, only the units capable of working with 32-bit floating point numbers are usually counted.
In the case of NVIDIA cards, the real equivalent of a core or processor is called an SM (Streaming Multiprocessor). So, for example, an RTX 3090 Ti, despite having 10,752 CUDA cores, actually has 84 real cores, since that is its number of SMs. Bear in mind that a processor must be able to execute the entire instruction cycle by itself and not just one part of it, as is the case with the so-called "CUDA cores".
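A quick way to see this distinction in practice is to query the device properties: the CUDA runtime reports the number of SMs (the real cores), while the CUDA core count is a marketing-level figure derived from the architecture. A minimal sketch, assuming a single CUDA-capable device at index 0:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query the first GPU

    // multiProcessorCount is the number of SMs, i.e. the "real" cores.
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```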
NVIDIA CUDA vs AMD stream processors, how are they different?
They are not different at all. Since AMD cannot use the CUDA brand, which is owned by its rival, it uses the Stream Processors brand instead, a name that is just as technically inaccurate, and for exactly the same reasons.
By the way, a stream processor, in its correct definition, is any processor whose performance depends directly on the bandwidth of its associated RAM rather than on its latency. A graphics chip or GPU fits that definition, while a CPU, which is more latency dependent, does not. On the other hand, since NVIDIA and AMD chips understand different binaries, it is impossible to run a CUDA program on a non-NVIDIA GPU.
| | NVIDIA CUDA cores | AMD stream processors |
|---|---|---|
| What are they? | ALU units | ALU units |
| Where are they? | On NVIDIA GPUs | On AMD GPUs |
| Can they run CUDA programs? | Yes | No |
Are there different types of NVIDIA CUDA cores?
We generally refer to 32-bit precision floating-point units as CUDA cores, but other types of units are also included in the definition, namely:
- ALU units with the ability to work with double-precision, i.e. 64-bit, floating point numbers.
- 32-bit integer units.
Because GPUs do not exploit instruction-level parallelism the way CPUs do, they rely on concurrent execution instead, where a unit of one type can stand in for another type when executing an instruction. NVIDIA chips have had this capability since the Volta architecture and, in the case of desktop systems, since the RTX 20 series.
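A hedged sketch of why this matters in practice: the kernel below mixes integer address arithmetic with floating-point math, the kind of workload where the separate INT32 and FP32 pipelines of Volta/RTX 20 and later SMs can issue work concurrently. The kernel name and the specific operations are illustrative, not taken from any real codebase.

```cpp
// Illustrative kernel mixing INT32 and FP32 work. On Volta/RTX 20 and later,
// the integer and floating-point pipelines in each SM can run concurrently.
__global__ void mixedMath(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT32: index arithmetic
    if (i < n) {
        int src = (i * stride) % n;                  // more INT32 work
        out[i] = in[src] * 0.5f + 1.0f;              // FP32 work
    }
}
```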
On the other hand, the following units inside the GPU's SMs are NOT CUDA cores:
- Tensor Cores, which are responsible for performing operations on matrices. They are generally used for AI and, in graphics, power features such as DLSS that increase the rendering resolution automatically (see the sketch after this list).
- RT Cores, which calculate the intersections of rays throughout the scene during Ray Tracing.
- SFUs, which have the ability to execute complex mathematical instructions faster than conventional ALUs. Supported instructions include trigonometric operations, square roots, powers, logarithms, etc.
All of these, although they are also arithmetic logic units, are not counted by NVIDIA as CUDA cores.
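To make the distinction concrete, Tensor Cores are even programmed through a different interface than ordinary CUDA core arithmetic. The sketch below uses CUDA's WMMA API (`mma.h`) and assumes a GPU with compute capability 7.0 or later; the 16×16×16 tile size and the row/column layouts are just one of the supported configurations, not a recommendation.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a pair of 16x16 half-precision tiles on the Tensor Cores
// and accumulates the result in 32-bit floats.
__global__ void tensorCoreTile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}
```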
How do CUDA cores work?
Generally speaking, CUDA cores work the same as any other ALU. To be more specific, though, we need to understand what a thread of execution is for a contemporary graphics chip and conceptually separate it from the same concept in a central processor.
In a program running on a CPU, a thread of execution is a sequence of instructions that performs a specific task. On a GPU, by contrast, each piece of data has its own thread of execution. This means that each vertex, polygon, particle, pixel, fragment, texel or any other graphics primitive has its own thread running on one of the CUDA cores.
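A minimal sketch of this data-per-thread model, assuming a made-up image-inversion task: each GPU thread handles exactly one pixel, and the 2D grid of threads maps directly onto the image.

```cpp
// One thread per pixel: each thread reads, inverts and writes a single pixel.
__global__ void invertImage(const uchar4 *src, uchar4 *dst, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        uchar4 p = src[y * width + x];
        dst[y * width + x] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
    }
}
```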
How are instructions executed in the CUDA core?
The way threads are executed, on all GPUs in general, is with a variation of the Round-Robin algorithm, which works as follows (a toy illustration follows the list):
- Instructions are categorized into groups based on the number of clock cycles they need to execute on each of the ALUs/stream processors/CUDA cores.
- If a thread's instruction has not finished in the allotted time, it is moved back to the queue and the next instruction in the list is executed, which does not have to belong to the same thread of execution as the first.
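The host-side sketch below is a deliberately simplified, made-up simulation of that idea, not a model of the real SM scheduler: work that does not finish within its slot simply goes back to the end of the queue.

```cpp
#include <cstdio>
#include <queue>

// Toy round-robin: each "thread" needs some cycles; unfinished work re-queues.
struct ToyThread { int id; int cyclesLeft; };

int main()
{
    std::queue<ToyThread> ready;
    for (int i = 0; i < 4; ++i)
        ready.push({i, 2 + i});          // arbitrary per-thread workloads

    const int slot = 2;                  // cycles granted per turn
    while (!ready.empty()) {
        ToyThread t = ready.front(); ready.pop();
        t.cyclesLeft -= (t.cyclesLeft < slot) ? t.cyclesLeft : slot;
        if (t.cyclesLeft > 0)
            ready.push(t);               // not finished: back to the queue
        else
            printf("thread %d finished\n", t.id);
    }
    return 0;
}
```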
Keep in mind that today's complex 3D scenes are made up of millions of visual elements, and they have to be rendered fast enough. Therefore, CUDA cores are the basis for processing all of these elements in parallel and at high speed.
Their great advantage is that they operate directly on data located in registers, and therefore in the internal memory of each SM; they have no instructions for direct VRAM access. Instead, the whole design pushes data from memory towards each of the GPU cores, which avoids bottlenecks in memory access and marks a change from the traditional memory access model. Beyond that, each of the SMs where the CUDA cores sit is much simpler in many respects than a CPU core.
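A minimal sketch of this "bring the data close to the cores" pattern, with made-up names and a fixed block size of 256 threads: threads in a block stage their data from VRAM into the SM's on-chip shared memory before the ALUs work on it.

```cpp
// Data is staged from VRAM into on-chip shared memory, so the ALUs (CUDA cores)
// work on operands that already sit inside the SM. Assumes blockDim.x == 256.
__global__ void scaleWithStaging(const float *in, float *out, int n, float k)
{
    __shared__ float tile[256];                      // on-chip, per-block storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];            // stage from VRAM
    __syncthreads();                                 // wait for the whole tile
    if (i < n) out[i] = tile[threadIdx.x] * k;       // compute from on-chip data
}
```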
CUDA cores cannot run conventional programs
The execution threads that run on the CUDA cores are created and managed by the system's central processor, which groups them through the API when it sends graphics or compute command lists. When the graphics chip's command processor reads those command lists, the threads are organized into blocks, each of which is dispatched to a different SM or real core. From there, the internal scheduler breaks the threads down according to the type of instruction and groups them for execution.
Because of this particular way of working, the cores in a graphics card's GPU cannot run conventional programs; their very nature prevents it. That is why you cannot install an operating system on a GPU or run any ordinary program on it.