One of the reasons for adopting newer versions of programs over time is that they are designed to take better advantage of processors with higher core counts, and let's not forget that core counts keep increasing with each processor generation. Why, then, does program performance not increase at the same pace?
Programs do not scale with the number of cores on their own
It is important to understand that running programs cannot, by themselves, divide their active processes or tasks according to the number of execution threads available in the CPU. That division is explicit in the program's code: it is the product of the programmer's skill and the application's design.
In fact, what matters when coding a program is not optimizing it to use as many cores as possible, but optimizing it for latency: the time it takes the processor to complete a task, measured in units of time. The performance of a CPU consists of completing the most tasks in the least amount of time, which depends first on its architecture and then on its clock speed.
However, what interests us about latency is how many tasks the processor can complete in a given period, i.e. its workload, and that depends on the situation and on how the programs were written. In other words, performance depends not only on the hardware, but also on how well or poorly the software is written.
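To make the distinction between latency and workload concrete, here is a minimal C++ sketch that measures both for an arbitrary task; `do_work` is a hypothetical placeholder for any unit of work:

```cpp
#include <chrono>
#include <iostream>

// Hypothetical stand-in for any unit of work the program must perform.
void do_work() {
    volatile long sum = 0;
    for (long i = 0; i < 10'000'000; ++i) sum = sum + i;
}

int main() {
    using clock = std::chrono::steady_clock;

    // Latency: the time it takes to complete a single task.
    auto t0 = clock::now();
    do_work();
    auto t1 = clock::now();
    double latency_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::cout << "Latency of one task: " << latency_ms << " ms\n";

    // Workload: how many tasks complete within a fixed period (one second here).
    int completed = 0;
    auto deadline = clock::now() + std::chrono::seconds(1);
    while (clock::now() < deadline) {
        do_work();
        ++completed;
    }
    std::cout << "Workload: " << completed << " tasks per second\n";
}
```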
Division of labor into several cores
Now, if we increase the number of cores in a system, it becomes possible to break the work into chunks and complete it far more easily. This is where the T/N formula comes in, where T is the number of tasks to perform and N is the number of execution threads the system can run. Obviously, we could load the maximum number of jobs onto a few cores and push them through by brute force, but that approach is counterproductive: it only favors the most modern CPUs, whose individual cores offer higher performance.
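As a rough illustration of that T/N split, this C++ sketch divides T tasks evenly among N hardware threads; the task body is a placeholder:

```cpp
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const unsigned T = 1000;                           // T: tasks to perform
    unsigned N = std::thread::hardware_concurrency();  // N: execution threads
    if (N == 0) N = 4;                                 // fallback if the count is unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < N; ++i) {
        workers.emplace_back([=] {
            // Each thread takes a contiguous chunk of roughly T / N tasks.
            unsigned begin = T * i / N;
            unsigned end   = T * (i + 1) / N;
            for (unsigned task = begin; task < end; ++task) {
                // ... process one task (placeholder) ...
            }
        });
    }
    for (auto& w : workers) w.join();
    std::cout << T << " tasks divided among " << N << " threads\n";
}
```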
However, dividing the work between the different cores is itself additional work, which is usually handed to one core that acts as a conductor and must perform the following tasks (a rough sketch of the pattern follows the list):
- It must create processes and to-do lists and keep tight control of them at all times.
- It must be able to predict at any moment when a task starts and ends, including how long it takes to complete one and start the next.
- The other cores must be able to signal the conductor core so that it knows when each process starts and ends.
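A minimal C++ sketch of that conductor pattern, assuming a shared to-do queue owned by the conductor and workers that signal back as each task finishes (the `Conductor` class and its methods are illustrative names, not any real API):

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The conductor owns the to-do list; workers pull tasks from it
// and report back when each one completes.
class Conductor {
    std::queue<std::function<void()>> todo;  // the to-do list
    std::mutex m;
    std::condition_variable cv;
    bool closed = false;
public:
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m); todo.push(std::move(task)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    // Workers call this; an empty result means "no more work".
    std::function<void()> next() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return closed || !todo.empty(); });
        if (todo.empty()) return {};
        auto task = std::move(todo.front());
        todo.pop();
        return task;
    }
};

int main() {
    Conductor boss;
    std::atomic<int> finished{0};

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&] {
            // Each completed task is "signaled" back via the counter.
            while (auto task = boss.next()) { task(); ++finished; }
        });

    for (int i = 0; i < 20; ++i)
        boss.submit([] { /* placeholder unit of work */ });
    boss.close();

    for (auto& w : workers) w.join();
    std::cout << finished << " tasks completed\n";
}
```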
This solution was adopted by SONY, Toshiba and IBM in the Cell Broadband Engine, the central processor of the PS3, where a master core was in charge of directing the rest. Much earlier, the Atari Jaguar had used the same approach. SONY did not repeat this model in the PS4, and no one has implemented it on PC because it is a nightmare to program for; even so, it is the most efficient way to divide the work.
Not everything can run on multiple cores
If we are wondering whether any task can be divided into subtasks and distributed indefinitely across a larger number of cores, the answer is no. Specifically, we need to classify tasks into three different types:
- Those that can be fully parallelized and therefore distributed between the different cores available to the central processor.
- Tasks that can be executed partially in parallel.
- Parts of code that cannot be executed in parallel.
In the first case, T/N applies at 100%. In the second, we enter the territory of Amdahl's law, where the speedup from adding cores is only partial. In the third, we simply need all the power of a single core for that task.
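As a quick illustration of Amdahl's law, the sketch below computes the theoretical speedup 1 / ((1 - p) + p / N) for a program whose parallel fraction p is assumed, for the example, to be 90%:

```cpp
#include <iostream>

// Amdahl's law: with a fraction p of the work parallelizable and N cores,
// the overall speedup is 1 / ((1 - p) + p / N).
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double p = 0.9;  // assumed: 90% of the program runs in parallel
    for (int n : {1, 2, 4, 8, 16, 64})
        std::cout << n << " cores -> speedup x" << amdahl_speedup(p, n) << "\n";
}
```

Note how the curve flattens: with p = 0.9, even 64 cores yield less than a 9x speedup, because the sequential 10% dominates.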
What differentiates the CPU from the GPU in multithreading
Here we come to a key difference: every GPU or graphics chip has a control unit responsible for reading the command lists and distributing the work among the different GPU cores, and even among the different units within them. This is a hardware-level implementation of the previous case, and it works well in any scenario where the goal is to saturate the chip, keeping as many cores busy as possible for as long as there is work. However, it should be understood that a thread of execution on a GPU always corresponds to a piece of data paired with its instruction list, that is, a pixel, a vertex or any other datum.
This makes them easy to parallelize. If we wanted to fry an egg, the CPU's process would be to fry the egg, which is entirely sequential. On the graphics chip, by contrast, a task would simply be to heat some oil or crack an egg into the pan. None of this would speed up the cooking of one egg, but it would speed up cooking several, which is why GPUs are better for tasks like calculating millions of polygons or pixels at once, but not for sequential tasks.
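To ground the analogy, here is a minimal C++ sketch emulating the GPU model on CPU threads: the same tiny program (`shade`, a hypothetical per-pixel operation) is paired with every datum, and the pixels are split across hardware threads:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// One "GPU-style" thread of execution: the same instruction list
// applied to one piece of data (here, brightening a single pixel).
void shade(uint8_t& pixel) { pixel = pixel < 205 ? pixel + 50 : 255; }

int main() {
    std::vector<uint8_t> image(1920 * 1080, 100);  // hypothetical grayscale frame

    // CPU view: one task after another, fully sequential:
    //   for (auto& px : image) shade(px);

    // GPU view: millions of pixels, each paired with the same tiny program,
    // emulated here by dividing the pixels across hardware threads.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> pool;
    size_t chunk = image.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        size_t begin = i * chunk;
        size_t end = (i + 1 == n) ? image.size() : begin + chunk;
        pool.emplace_back([&, begin, end] {
            for (size_t j = begin; j < end; ++j) shade(image[j]);
        });
    }
    for (auto& t : pool) t.join();
}
```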