Pipelining and Performance
- Performance aspects of pipelining -
I received several questions from readers about pipelining, so it seemed time for an article about the basics of pipelining. I already explained some of it in my previous articles (AMD’s Battle and Cache and more cache), but apparently there was need for more info.
What is Pipelining
Pipelining is a way of executing more instructions in less time. Normally one instruction takes several cycles, and within each cycle a part of it is performed. Each time a part is finished, so after each cycle, the instruction moves one place further, creating space for a new instruction. Figure 1 shows a picture to make things clear. Every cycle one instruction can enter the pipeline and one can leave it. So once the pipeline is filled, the throughput is one instruction per cycle.
Therefore pipelining decreases the number of cycles needed for a given number of instructions.
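To put numbers on that gain, here is a minimal sketch (purely illustrative, not modeling any real CPU) comparing a 4-stage design with and without pipelining:

```python
def cycles_non_pipelined(n_instructions, n_stages):
    # Without pipelining, each instruction must finish all its
    # stages before the next one can start.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return n_stages + (n_instructions - 1)

print(cycles_non_pipelined(100, 4))  # 400 cycles
print(cycles_pipelined(100, 4))      # 103 cycles
```

For 100 instructions the pipelined version approaches the ideal of one instruction per cycle; the longer the run of instructions, the smaller the share of the one-time fill cost.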
Pipelining means higher MHz
Pipelining has a second advantage: it makes it easier to increase the clock speed of a chip. Let’s look at picture 2 to illustrate this. In (1) you see an example of an instruction which takes 4 cycles; within every cycle a part of the instruction is done. The most obvious way to improve the performance of this chip is simply to do more per cycle, which gives us (2). Of course this improves the performance of this instruction by a factor of 2, but it also gives us a problem.
If we want to run the chip at a specific MHz rate, we must be sure that every part has enough time to complete. The time available for a part (A, B, C or D) is less in situation (2): by doing more parts per cycle, there is less time per part. Therefore increasing the MHz of a specific design is easier with (1) than with (2).
If we pipeline (1), we get a performance increase of up to 4 times, without making it harder to raise the MHz of future versions and without radical changes to the chip design. That is why most high-performance chips use pipelining techniques.
Often chip producers even increase the number of parts needed for one instruction, for instance splitting A up into A1 and A2, so that less time is needed for each step and the chip can run at a higher MHz. This is also called deep pipelining. The length of the pipeline isn’t that important in this respect, because once it is filled the throughput will be one instruction per cycle no matter what the length is.
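A small sketch of why splitting a slow stage helps: the clock period can be no shorter than the slowest stage, so halving the bottleneck stage raises the attainable MHz. The stage delays below are made-up numbers, not taken from any real chip:

```python
def max_clock_mhz(stage_delays_ns):
    # The clock period must be at least as long as the slowest
    # pipeline stage, so the slowest stage sets the top clock rate.
    return 1000.0 / max(stage_delays_ns)  # 1000 ns/us -> MHz

four_stage = [5.0, 4.0, 4.0, 3.0]        # stage A (5 ns) is the bottleneck
five_stage = [2.5, 2.5, 4.0, 4.0, 3.0]   # A split into A1 and A2

print(max_clock_mhz(four_stage))  # 200.0 MHz
print(max_clock_mhz(five_stage))  # 250.0 MHz
```

Note that the deeper design runs faster even though each instruction now passes through more stages: throughput, once the pipeline is filled, is still one instruction per cycle.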
Keeping the pipeline filled: Cache
We have seen that pipelining increases performance. The problem now is to keep the pipelines filled to get the maximum performance possible. Let us look at the internal design of a PC in figure 3. I took a K6-III CPU, because it has both L2 and L3 cache. For the Pentium II(I) and Celeron, leave out the L3; for Pentium and K6(-2) CPUs, leave out the L2.
It might look a bit complicated to some of you, but the important part is that the memory is not directly accessible at full CPU-speed. The memory is relatively slow compared to the CPU.
Therefore, if the CPU needs to get information out of the memory, it has to wait for a long time. During all this time the pipelines are not being filled, which is an enormous waste of cycles. However, the L1, L2 and L3 caches are much faster than the memory, so it’s much better to get the information out of the cache instead of directly out of the memory.
So we see that in order to make efficient use of pipelines there needs to be sufficient cache. In today’s systems over 90% of the information the CPU requires from memory actually comes from the cache. You can check this yourself by disabling the cache in the BIOS and seeing how slow the system gets.
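The effect of that hit rate can be sketched with the standard average-access-time formula. The 2 ns and 60 ns latencies below are illustrative assumptions, not measured values for any of the CPUs mentioned:

```python
def avg_access_time(hit_rate, cache_ns, memory_ns):
    # Average memory access time: hits are served at cache speed,
    # misses pay the full main-memory latency.
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

# With a 90% hit rate the average access is close to cache speed...
print(avg_access_time(0.90, 2.0, 60.0))  # about 7.8 ns
# ...while with the cache disabled every access pays the memory latency.
print(avg_access_time(0.00, 2.0, 60.0))  # 60.0 ns
```

This is why disabling the cache makes the machine crawl: every instruction fetch and data access suddenly runs at memory speed, and the pipelines sit empty in between.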
This also explains why the Rise mP6 does not perform as well as it could. It has multiple pipelines, but only 16kB of L1 cache and no L2, just the mainboard L3. The mP6 pipelines are simply not filled a lot of the time, preventing the CPU from reaching its true potential.
This also explains why the K7 will be equipped with 128kB of L1 cache. The K7 has multiple pipelines for integer, floating point, MMX and 3Dnow! instructions. In order to keep all these pipelines filled, an enormous amount of cache is required.
Branch Prediction
Of course there is also a downside to pipelining. If instructions depend on each other, it is impossible to keep the pipelines filled. Let’s look at an example.
If ( fDoA == TRUE ) DoA() ; Else DoB() ;

First we’ll put the if-statement into the pipeline. The problem is that we cannot continue with DoA() or DoB(), because we don’t know which one to do. We have a serious problem in this case, because we must wait until the if-statement is finished, and during all that time the pipeline is empty.
Because these kinds of statements are very common, a way had to be found to solve this problem. The solution is branch prediction: the CPU tries to predict what it has to do, DoA() or DoB(), and then inserts that statement into the pipeline right after the if-statement.
Let’s say the CPU predicts that DoA() needs to be done. If it predicts right, DoA() finishes right after the if-statement, saving valuable cycles. However, if it predicts wrong, it has to abandon DoA() and start over at DoB(). Therefore it is very important that the CPU predicts right.
The CPU holds a table with the branches it recently had to resolve and the last two outcomes of each. When it gets to the same statement again, it knows what it had to do the last two times and uses this information to make a choice. For instance, if the previous two times it had to do DoB(), it is more likely that it needs to do DoB() again this time than DoA(). Therefore the CPU predicts it should do DoB().
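A common hardware scheme that behaves much like the table described above is a 2-bit saturating counter per branch: two wrong guesses in a row are needed before the prediction flips. This Python sketch is an illustration of that general scheme, not the exact mechanism of any particular CPU, and the branch address 0x400 is just a made-up example:

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0-1 predict
    'not taken', states 2-3 predict 'taken'."""
    def __init__(self):
        self.table = {}  # branch address -> counter state (0..3)

    def predict(self, addr):
        # Unknown branches start in state 1 (weakly 'not taken').
        return self.table.get(addr, 1) >= 2  # True = predict taken

    def update(self, addr, taken):
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        state = self.table.get(addr, 1)
        state = min(3, state + 1) if taken else max(0, state - 1)
        self.table[addr] = state

p = TwoBitPredictor()
outcomes = [True, True, True, True, False, True, True]  # one loop exit
hits = 0
for taken in outcomes:
    if p.predict(0x400) == taken:
        hits += 1
    p.update(0x400, taken)
print(hits, "of", len(outcomes), "predicted correctly")  # prints "5 of 7 predicted correctly"
```

Notice that the single mispredicted iteration does not flip the prediction: the counter only drops from 3 to 2, so the predictor keeps guessing "taken" and is right again on the very next iteration.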
This technique works very well and all modern CPUs use it. The K7, for instance, will have a table of the last 2048 branches, all to prevent the CPU from making a wrong choice and wasting valuable cycles.
Out-Of-Order Execution
This is another technique to keep the pipelines filled. All the branch prediction and cache in the world cannot prevent a pipeline from running dry in some cases. Wouldn’t it be nice if we could look into the future at that moment and see what needs to be calculated there…
Well, that is exactly what Out-Of-Order (OOO) execution means. The CPU looks ahead for instructions that are independent of what is being calculated now and uses those instructions to fill the empty spaces in the pipelines.
Because this is a rather complicated process, only a few OOO tricks are implemented in AMD’s and Intel’s 6th generation chips. The K7 will have more of these OOO techniques, one of the things that makes it a 7th generation CPU.
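A very simplified sketch of the idea: an instruction may issue as soon as the registers it reads are ready, regardless of program order. The four-instruction program and register names below are invented for illustration; real OOO hardware (register renaming, reservation stations) is far more involved:

```python
# Each instruction: (name, set of registers it reads, register it writes).
program = [
    ("i1", set(),        "r1"),   # e.g. a load into r1
    ("i2", {"r1"},       "r2"),   # depends on i1
    ("i3", set(),        "r3"),   # independent: can run early, out of order
    ("i4", {"r2", "r3"}, "r4"),   # depends on i2 and i3
]

def schedule(program):
    ready_regs = set()   # registers whose values have been produced
    order = []           # list of cycles, each a list of issued instructions
    pending = list(program)
    while pending:
        # Issue every instruction whose inputs are ready this cycle.
        issued = [ins for ins in pending if ins[1] <= ready_regs]
        order.append([name for name, _, _ in issued])
        for _, _, dst in issued:
            ready_regs.add(dst)
        pending = [ins for ins in pending if ins not in issued]
    return order

print(schedule(program))  # [['i1', 'i3'], ['i2'], ['i4']]
```

In program order, i3 would sit waiting behind i2; the scheduler instead issues it alongside i1, filling a slot that would otherwise be an empty space in the pipeline.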
Conclusion
Pipelining increases performance by making it easier to increase the MHz and by raising the throughput to 1 instruction per cycle. However, fast and large amounts of cache are needed to keep the pipelines filled. Besides cache, the CPU also uses branch prediction and OOO execution to minimize the empty spaces in the pipelines.