A processor’s instructions can be broken down into smaller operations that we call micro-operations. Why decompose them at all? Because an instruction, even when segmented into multiple stages for execution, takes several clock cycles to resolve, whereas a micro-operation takes a single clock cycle.
One way to get the most out of MHz or GHz is pipelining, where each instruction is executed in several stages, each lasting one clock cycle. Since frequency is the inverse of time, raising the frequency means shortening the clock period. The problem is that a point is reached where an instruction can no longer be subdivided: the pipeline stays short and, therefore, the achievable clock speed stays low.
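The relationship described above can be sketched with a toy model: split an instruction's work into more stages and the clock period, which can be no shorter than one stage, shrinks. The latency figure and stage counts are illustrative assumptions, not measurements of any real CPU.

```python
# A toy model of the relationship between pipeline depth and clock speed.
# The 5 ns instruction latency below is an invented, illustrative figure.

def max_frequency_ghz(instruction_latency_ns: float, stages: int) -> float:
    """The clock period can be no shorter than the slowest stage.

    Assuming the instruction's work is split evenly across `stages`
    pipeline stages, each stage takes instruction_latency_ns / stages,
    and frequency is the inverse of that time.
    """
    stage_time_ns = instruction_latency_ns / stages
    return 1.0 / stage_time_ns  # a 1 ns period equals 1 GHz

# Splitting the same hypothetical instruction into more stages raises the clock:
print(max_frequency_ghz(5.0, 5))   # 5 stages  -> 1.0 GHz
print(max_frequency_ghz(5.0, 20))  # 20 stages -> 4.0 GHz
```

This is also why the subdivision limit matters: once a stage is down to a single micro-operation, it cannot be split further and the frequency stops scaling.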
In fact, micro-operations originated with the out-of-order execution of Intel's P6 architecture and its derivative processors, such as the Pentium II and III. The reason is that the pipeline of the P5, the original Pentium, only allowed it to reach a little over 200 MHz. By using micro-operations to further extend the number of stages per instruction, Intel broke the GHz barrier with the Pentium III and reached clock speeds 16 times higher with the Pentium 4. Since then, micro-operations have been used in every processor with out-of-order execution, regardless of brand or instruction set.
Your CPU doesn’t execute x86, RISC-V or ARM
In current CPUs, when instructions arrive at the control unit to be decoded, they are first broken down into several micro-operations. This means that each instruction executed by the processor is composed of a series of basic micro-operations, and an ordered flow of them is called microcode.
The decomposition of instructions into micro-operations, and with it the transformation of the programs stored in RAM into microcode, happens in every processor today. So when your phone’s ARM processor or your PC’s x86 processor executes a program, its execution units do not resolve the ISA’s instructions directly; they only ever see micro-operations.
This process not only has the advantages explained in the previous section; we can also find instructions that, even within the same architecture and the same instruction set, are decomposed differently while the programs remain fully compatible. The goal is often to reduce the number of clock cycles required, but most of the time it is to avoid the contention that occurs when several requests target the same resource inside the processor.
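The decomposition described above can be sketched as a simple mapping from instructions to ordered lists of micro-ops. The mnemonics and the exact decompositions below are illustrative assumptions, not any vendor's real microcode.

```python
# A highly simplified sketch of how a decoder might break one ISA
# instruction into micro-operations. The instruction names and their
# decompositions are invented for illustration, not taken from a real CPU.

def decode(instruction: str) -> list[str]:
    """Map an instruction mnemonic to an ordered list of micro-ops."""
    micro_ops = {
        # A register-register add is already simple: one micro-op.
        "ADD reg, reg": ["uop_add"],
        # A memory-operand add needs a load, the add itself and a store,
        # each of which resolves in a single clock cycle.
        "ADD [mem], reg": ["uop_load", "uop_add", "uop_store"],
    }
    return micro_ops[instruction]

# The ordered flow of micro-ops for a whole program is the microcode.
program = ["ADD reg, reg", "ADD [mem], reg"]
microcode = [uop for ins in program for uop in decode(ins)]
print(microcode)  # ['uop_add', 'uop_load', 'uop_add', 'uop_store']
```

A different CPU generation could map the same `ADD [mem], reg` to another sequence of micro-ops, which is how decompositions change under the hood while programs stay compatible.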
What is the micro-op cache?
The other important element for achieving the highest possible performance is the micro-op cache, which is more recent than micro-operations themselves. Its origin is the trace cache that Intel implemented in the Pentium 4: an extension of the first-level instruction cache that stores the correlation between instructions and the micro-operations into which the control unit has previously decoded them.
However, the x86 ISA has always had a problem compared to RISC ISAs: while the latter have a fixed instruction length, each x86 instruction can measure anywhere between 1 and 15 bytes. Keep in mind that every instruction has to be fetched and decoded into several micro-operations, which requires a very complex control unit that, without the necessary optimizations, can consume up to a third of the processor’s power.
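The cost of variable-length encoding can be seen in a toy sketch: with fixed 4-byte instructions, instruction *i* always starts at offset *i* × 4, but with variable lengths each boundary depends on having decoded everything before it. The opcode-to-length table below is an invented toy encoding, not real x86.

```python
# Why variable-length encoding complicates the decoder: each instruction
# boundary is only known after partially decoding the previous instruction.
# TOY_LENGTHS is a made-up encoding (1 to 15 bytes, like x86's range).

TOY_LENGTHS = {0x01: 1, 0x02: 3, 0x03: 15}  # opcode byte -> total length

def instruction_boundaries(code: bytes) -> list[int]:
    """Sequentially find where each variable-length instruction starts."""
    offsets, pos = [], 0
    while pos < len(code):
        offsets.append(pos)
        pos += TOY_LENGTHS[code[pos]]  # must decode to know where the next starts
    return offsets

stream = bytes([0x01, 0x02, 0x00, 0x00, 0x01])
print(instruction_boundaries(stream))  # [0, 1, 4]
```

A fixed-length RISC decoder can compute all these offsets in parallel with no table lookups at all, which is part of why its control unit is so much simpler.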
The micro-op cache is therefore an evolution of the trace cache, but it is not part of the instruction cache; it is an independent piece of hardware. In a micro-op cache, each entry has a fixed size in bytes, which allows a CPU with an x86 ISA to operate closer to a RISC design, reducing the complexity of the control unit and, with it, power consumption. The difference from the Pentium 4’s trace cache is that the current micro-op cache stores all the micro-operations belonging to an instruction in a single line.
How does it work?
What the micro-op cache does is avoid the work of decoding instructions: as soon as the decoder has performed that task, it stores the result of its work in the cache. Then, when the next instruction needs to be decoded, the processor first checks whether the micro-operations that compose it are already in the cache. The motivation is simply that consulting the cache takes less time than decomposing a complex instruction again.
It functions as a cache, however, so its contents are evicted over time as new instructions arrive. When a new instruction appears in the first-level instruction cache, the micro-op cache is searched to see whether it has already been decoded; if not, decoding proceeds as usual.
The most frequently decoded instructions therefore tend to stay in the micro-op cache, which makes them less likely to be evicted, while instructions whose use is sporadic are thrown out more often to make room for new ones. Ideally, the micro-op cache should be large enough to store them all, yet small enough that searching it does not hurt processor performance.