AMD's K6-III: The Ultimate Socket 7 Processor
Only a few years ago, when Intel introduced the P6 (aka Pentium Pro), the socketed Pentium architecture seemed destined for a phased "death". Little did Intel realize that this death was premature, and that its early departure from the architecture would cost it major market share that AMD and Cyrix would be quick to claim. Intel raised the standards for the performance race, but AMD proved up to the challenge. AMD's K6-2 provided the breakthrough that cracked Intel's monopoly of the X86 architecture, and the K6-III would be AMD's ultimate expression of the design.
A Short History of the K6 Architectures
The year was 1995, and AMD's license to the Intel microcode would not last through the Pentium era. AMD had to come up with a totally new design that was compatible with the Pentium. Intel was Superman, and to beat Superman, you need the one weapon that can kill him. K is for Kryptonite, and the letter gave its name to AMD's new line of processors. Against the Pentium, aka P5, AMD positioned the K5. Although it processed instructions more efficiently than a Pentium clock for clock, the K5 couldn't clock high enough to compete with much higher-clocked Pentiums. Sales sagged so much that AMD even fell behind rival X86 cloner Cyrix.
Meanwhile, a new outfit named NexGen promised to be the next great fabless design company after Cyrix. Its product was the Nx586. Inside was a core that borrowed heavily from RISC concepts; for compatibility, a decoder translated X86 opcodes to an internal RISC format. The RISC revolution was at its height, and RISC-like technologies laid the foundation for the next generation of processing performance. While the NexGen chip showed promise, it required its own proprietary motherboard and chipset. The specialized chipset also served as its cache controller. The solution was elegant, but without direct compatibility with Pentium motherboards, the Nx586 faced immense marketing challenges.
AMD needed a new design, and quickly. It realized that NexGen could probably not survive alone. The deal was far from a high profile media event, but compared to other, more publicized acquisitions, AMD's acquisition of NexGen probably had the most profound effect on the history of the PC over the next few years. With this acquisition, AMD obtained a vital new central processor design, and the K5 as well as the original K6 project were scrapped. What was going to be the next generation NexGen chip would now become the new K6.
The K6 had its virtues, but it was not without problems. The original NexGen chip enjoyed the convenience of having its own bus and cache architecture. The K6 had to be pin-compatible with the dual voltage Socket 7 Pentium MMX architecture. It had to work with the Socket 7 chipset cache controller, which was "foreign" to the K6. The chip was also quite sophisticated for AMD's manufacturing resources, requiring a rather complicated 6 metal process, and the resulting production snafus kept AMD from producing the K6 quickly enough at high enough clock frequencies.
Worse yet, Intel had introduced the brilliant P6 architecture. Like the K6, the P6 relied on an internal RISC-like core. Although the P6 wasn't as efficient as the P5 at processing instructions clock cycle for clock cycle, the triumph of the P6 lay in its foresight: a streamlined, simple design with great potential for raising clock speeds. The design included a built-in cache controller driving a backside cache whose clock was derived from the CPU clock, not the motherboard clock. This enabled the cache speed to be equal to that of the CPU, preserving performance scalability as the speed of the CPU increased. The design also included a twelve-stage pipeline that allowed new instructions to be decoded while older ones were still being executed. Combined with multiple decoders and execution units, this meant the processor could execute more than one instruction per clock cycle: it was a superscalar processor. The floating point unit was also pipelined. Intel was shooting for the stars with this design; the intention was not merely to see off alternative X86 processors, but to compete with RISC designs like the PowerPC, the Alpha and the MIPS.
It was remarkable that AMD was even able to pull off the K6. The K6 core proved to be, arguably, the most technically elegant of the X86 architectures, so much so that it was used as the basis of a textbook. From an execution standpoint, the K6 could be faster than a P6 at the same clock speed. The advantage of the P6 was that, because of its longer pipeline, it could decode instructions faster. While its theoretical throughput was 3 instructions per cycle, practical limitations held the peak rate to about 2.1 instructions per cycle. In comparison, the K6, with its shorter pipeline and two decoding units, peaked at about 1.9 instructions per cycle. However, the longer P6 pipeline also meant that a mispredicted jump cost far more than in the stubbier K6 pipeline: branch misprediction penalties ran as high as 11 cycles on the P6, while the K6 could suffer a penalty as low as 4 cycles. In addition, a mispredicted jump was more likely on a P6 than on a K6, because the latter had larger branch prediction tables.
With back end execution units capable of outrunning the front end decoders, the K6 appeared to be "starved," needing its short pipeline kept full at all times to maintain peak performance. For this reason, AMD gave it a level 1 cache which, at 64K, was twice as big as that of the Pentium II. But even 64K of L1 cache might not be enough for the K6, especially considering that its L2 cache was much slower than that of the P6, running only at the motherboard's clock speed (a front side cache). This was a limitation of the Socket 7 architecture.
At the risk of being simplistic, what did this mean? It meant that the P6 might be faster under certain circumstances, while the K6 might be faster in others. When code was compiled and ordered for the P6's pipeline, the P6 could be expected to shine. The P6 would probably also excel in synthetic benchmarks and scientific applications, where loops and jumps are regular and predictable. (For that reason we could expect it to excel on SPEC benchmarks.) The P6 would also do well when an application's running code and data fit within its backside cache.
The K6, meanwhile, would excel in an environment that had plenty of unpredictable branches and jumps. This is far more like the real world of applications, especially when running Windows. Remarkably, even with vastly differing approaches, the general processing performance of a K6 ended up roughly equivalent to that of a P6. Quite interestingly, 16-bit code tends to exact a heavy context change penalty on the P6 design, the greatest being on the Pentium Pro, and, to a lesser extent, on the Pentium II.
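To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It takes the peak rates and misprediction penalties quoted above (2.1 instructions per cycle and 11 cycles for the P6, 1.9 and 4 for the K6) and charges the full penalty for every mispredicted branch. The function name, the assumption that one instruction in five is a branch, and the misprediction rates are all illustrative assumptions, not measurements. The model also ignores cache effects and, for simplicity, gives both chips the same misprediction rate.

```python
def effective_ipc(peak_ipc, penalty_cycles, branch_fraction, mispredict_rate):
    """Rough instructions-per-cycle estimate once misprediction stalls are charged.

    Hypothetical helper: cycles per instruction = 1/peak_ipc plus the average
    stall cycles that mispredicted branches add to each instruction.
    """
    base_cpi = 1.0 / peak_ipc
    stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


# Peak rates and penalties as quoted in the article (worst case for the P6,
# best case for the K6).
P6 = {"peak_ipc": 2.1, "penalty_cycles": 11}
K6 = {"peak_ipc": 1.9, "penalty_cycles": 4}

# Assumption: one instruction in five is a branch.
BRANCH_FRACTION = 0.2

if __name__ == "__main__":
    # Assumed misprediction rates, from perfectly predictable to branch-heavy code.
    for rate in (0.0, 0.02, 0.05, 0.10):
        p6 = effective_ipc(branch_fraction=BRANCH_FRACTION, mispredict_rate=rate, **P6)
        k6 = effective_ipc(branch_fraction=BRANCH_FRACTION, mispredict_rate=rate, **K6)
        print(f"mispredict rate {rate:.0%}: P6 {p6:.2f} IPC, K6 {k6:.2f} IPC")
```

Under these assumptions the P6 stays ahead while branches are well predicted, and the K6 pulls ahead once the misprediction rate climbs past a few percent. Giving the K6 a lower misprediction rate, as its larger prediction tables suggest, would only widen that gap.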
The disadvantage of the P6 design was that the manufacturing and packaging of the backside cache kept the cost of the processors high. The great price disparity between the Pentium Pro and the Pentium MMX provided the niche the K6 could enter, since the K6 offered performance similar to a Pentium Pro at a price below that of the Pentium. The Pentium II was more affordable than its predecessor, but not by enough to erase the K6's price advantage. By the time Intel finally found the solution in the Celeron, the K6-2 had already acquired a lot of momentum and market share.
Even when placed against the Celeron, it was quite remarkable that the K6 core kept up its general application performance without a fast backside cache. This leads us to wonder what would happen if the K6-2 did have a backside cache. When the Super Socket 7 platform was introduced, the move from a 66MHz front side L2 cache to one running at 100MHz provided a temporary boost that helped the K6-2 hang on against Pentium IIs in the 300MHz to 400MHz range. The more technically oriented K6 users learned to tweak more performance from their chips by raising the front side cache speed to 112MHz or 115MHz (where the motherboard's clocking options allowed). They also realized that going down to 95MHz costs performance, even though the overall CPU speed could be raised higher with a higher multiplier. This is why a K6-2 at 333MHz (95MHz x 3.5) does not perform any better than a 300MHz part that gets its 300MHz from a 100MHz x 3 setting. Similarly, a K6-2 running at 336MHz (112MHz x 3) may often perform better than one running at 350MHz (100MHz x 3.5).
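The clock arithmetic behind these settings is simple enough to sketch in a few lines of Python. The function name is a hypothetical helper; the bus speeds and multipliers are the ones quoted above. The sketch only computes clocks and says nothing about measured performance.

```python
def core_clock_mhz(bus_mhz, multiplier):
    """Core clock is the front side bus clock times the multiplier."""
    return bus_mhz * multiplier


# (bus MHz, multiplier) pairs taken from the settings discussed in the text.
SETTINGS = [(95, 3.5), (100, 3.0), (112, 3.0), (100, 3.5)]

if __name__ == "__main__":
    for bus_mhz, multiplier in SETTINGS:
        core = core_clock_mhz(bus_mhz, multiplier)
        # On Socket 7 the L2 cache sits on the front side bus, so it runs
        # at the bus clock regardless of the multiplier.
        print(f"{bus_mhz}MHz x {multiplier}: core {core:.1f}MHz, L2 cache {bus_mhz}MHz")
```

Note that 95MHz x 3.5 works out to 332.5MHz, which is the setting sold as "333". In each pair compared above, the setting with the faster bus also has the faster L2 cache, which is what the tweakers were exploiting.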
But as Pentium IIs and Celerons raised their clocks past 400MHz, the static 100MHz front side bus of the Super 7 architecture started to limit the scalability of the K6-2. A backside cache became more and more imperative as clock speeds went higher. Ironically, the Celeron gave us the clue to solving this problem: its introduction showed that producing a lower-cost processor with an on-chip L2 cache was more than possible.
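The scaling problem can be put in numbers with another small Python sketch. With the front side cache pinned at 100MHz, its speed relative to the core shrinks as the core clock rises, while a backside cache clocked with the CPU keeps pace. The function name and the list of core clocks are illustrative assumptions.

```python
def l2_share_of_core_clock(core_mhz, l2_mhz):
    """Fraction of the core clock at which the L2 cache runs."""
    return l2_mhz / core_mhz


FRONT_SIDE_L2_MHZ = 100                      # Super 7 front side bus, per the text
CORE_CLOCKS_MHZ = (300, 350, 400, 450, 500)  # illustrative core speeds

if __name__ == "__main__":
    for core in CORE_CLOCKS_MHZ:
        front = l2_share_of_core_clock(core, FRONT_SIDE_L2_MHZ)
        # A backside cache clocked at core speed always runs at 100% of it.
        back = l2_share_of_core_clock(core, core)
        print(f"{core}MHz core: front side L2 at {front:.0%}, backside L2 at {back:.0%}")
```

At 300MHz the front side cache runs at a third of the core clock; by 500MHz it is down to a fifth.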