50 (or 60) years of processor development ... for this?

"Dennard's scaling law and Moore's law are dead, what now?" - a play in four acts by David Patterson

“We cross our bridges when we come to them and burn them behind us, with nothing to show for our progress except a memory of the smell of smoke, and a presumption that once our eyes watered” - “Rosencrantz and Guildenstern Are Dead”, an absurdist play by Tom Stoppard

On March 15, Dr. David Patterson spoke to an audience of about 200 pizza-fed engineers. From the podium of the large conference hall in Building E on the Texas Instruments campus in Santa Clara, he briefly recounted the half-century history of computer architecture in an IEEE lecture entitled “50 Years of Computer Architecture: From Mainframe CPUs to DNN TPUs and Open RISC-V”. It is a story of accidental ups and downs, of dead ends, and of black holes that swallowed entire architectures.

Patterson began in the 1960s with the groundbreaking IBM System/360 project, which built on Maurice Wilkes's early microprogramming work from 1951. By IT standards, that was a very long time ago ... Toward the end of the talk, Patterson showed a remarkable chart. It demonstrates exactly how the death of Dennard scaling, followed by the death of Moore's law, changed the way computer systems are designed. He closed by explaining the technological aftermath of these upheavals.

It is always a pleasure to watch a true master practice his favorite craft, and Patterson really is an authority on computer architecture and the forces that shape it. He has taught the subject since 1976 and co-authored the best-selling book “Computer Architecture: A Quantitative Approach” with Dr. John Hennessy; the book recently went through its sixth edition. So Patterson has exceeded Malcolm Gladwell's 10,000-hour threshold for mastery of a subject by an order of magnitude. And it shows.

Patterson held the audience's attention for 75 minutes, dividing the talk into four acts. As in Tom Stoppard's absurdist play “Rosencrantz and Guildenstern Are Dead”, it seems that nothing in this story - nothing at all - goes as planned.


Dr. David Patterson at the IEEE Santa Clara Valley Section conference on March 15, 2018, before being presented with the 2017 ACM Turing Award. Source: Steve Leibson

Act 1: IBM System/360, DEC VAX, and a prelude to CISC


In the 1950s and 1960s, mainframe designers ran grandiose experiments with instruction set architectures (ISAs) - at the time, almost nothing but mainframes was being designed. Nearly every mainframe had a “new and improved” ISA. By the early 1960s, IBM alone had shipped four lines of computers - the 650, 701, 702, and 1401 - aimed at business, scientific, and real-time applications, all with mutually incompatible instruction set architectures. Four incompatible ISAs meant that IBM had to develop and maintain four completely independent sets of peripherals (tape drives, disk/drum drives, and printers) as well as four sets of software development tools (assemblers, compilers, operating systems, and so on).

The situation clearly was not sustainable, so IBM made a daring, large-scale bet. It decided to develop a single binary-compatible instruction set for all of its machines - one hardware-independent instruction set for everything. Chief architect Gene Amdahl and his team designed the System/360 architecture to be implemented across a range of models, from low-end to high-end, with 8-, 16-, 32-, and 64-bit data buses.

To simplify processor development for the IBM System/360, the team decided to use microcode for the hard-to-design control logic. Maurice Wilkes had invented microcode in 1951, and it was first used in the EDSAC 2 computer in 1958, so in a sense microcode was already proven technology by the time the System/360 project began. It proved its worth once again.

Processor microcode quickly became a mainstay of mainframe design, especially as semiconductor memory chips began riding Moore's law. Perhaps the grandest example of massive microcode use was the DEC VAX, introduced in 1977. The VAX 11/780, an innovative minicomputer built from TTL logic and memory chips, became the industry's performance benchmark for the rest of the century.

DEC's engineers created the VAX ISA at a time when assembly-language programming still dominated, partly out of engineering inertia (“we've always done it this way”) and partly because the rudimentary high-level-language compilers of the day generated machine code that lost out to tight, hand-written assembly. The VAX ISA supported a huge number of programmer-friendly addressing modes and included individual machine instructions that performed complex operations such as queue insertion/deletion and polynomial evaluation. The VAX engineers delighted in designing hardware that made programmers' lives easier. Microcode made it easy to keep adding instructions to the ISA - and the VAX's 99-bit-wide microcode control store swelled to 4,096 words.

This focus on continually expanding the instruction set to make the assembly programmer's life easier became a real competitive advantage for DEC's VAX. Programmers loved computers that made their work easier. For many computer historians, the VAX 11/780 marks the birth of the CISC (complex instruction set computer) architecture.

Act 2: Accidental successes and big failures


The DEC VAX 11/780 minicomputer was at the peak of its popularity when the microprocessor boom began. Almost all of the first microprocessors were CISC machines, because easing the programmer's burden remained a competitive advantage even after the computer had shrunk to a single chip. Gordon Moore of Intel, who had formulated Moore's Law back at Fairchild, set out to develop the next ISA to replace the accidentally popular ISA of the 8-bit Intel 8080/8085 (and the Z80). Taking one cue from the enormously successful IBM System/360 project (one ISA for everything) and another from DEC's line of CISC minicomputers, Intel aimed for a universal instruction set architecture - a single Intel ISA that would last until the end of time.

At that time, 8-bit microprocessors worked within a 16-bit address space, whereas Intel's new ultimate ISA would have a 32-bit address space and built-in memory protection. It supported instructions of any length, starting from a single bit. And it was to be programmed in the newest and greatest high-level language: Ada.

This ISA was to be implemented in the Intel iAPX 432 processor - a very large, very ambitious project for Intel.

If you look into the history of the “revolutionary” iAPX 432, you will find that it ended in abject failure. The hardware required by the iAPX 432 architecture was extraordinarily complex, so the chip shipped very late. (It took a six-year development cycle and appeared only in 1981.) And when the microprocessor finally arrived, it turned out to be extremely slow.

Moore realized early in the project that developing the iAPX 432 would take a long time, so in 1976, as insurance, he launched a parallel project to build a much less ambitious 16-bit microprocessor based on extending the successful 8-bit 8080 ISA, with source-level compatibility. The developers had only a year to ship the chip, so they were given just three weeks to design the ISA. The result was the 8086 processor - and another “universal” ISA, at least for the next few decades.

There was just one problem: by the account of Intel's own insiders, the 8086 turned out to be very weak.

The Intel 8086's performance lagged behind that of its closest competitors: the elegant Motorola 68000 (a 32-bit processor in 16-bit clothing) and the 16-bit Zilog Z8000. Despite its weak performance, IBM chose the Intel 8086 for its IBM PC project, because Intel's engineers in Israel had developed the 8088 - an 8086 with an 8-bit bus. The 8088 ran slightly slower than the 8086, but its 8-bit bus fit better with existing peripheral chips and lowered the cost of the PC motherboard.

IBM forecast sales of about 250,000 IBM PCs. Instead, sales exceeded 100 million, and the Intel 8088 became an accidental but absolute hit.

Act 3: The birth of RISC and VLIW, and the sinking of the “Itanic”


In 1974, just after the first commercial microprocessors appeared, John Cocke of IBM set out to design a control processor for an electronic telephone switch. He calculated that the control processor would need to execute about 10 million instructions per second (MIPS) to meet the application's requirements. Microprocessors of the day were an order of magnitude slower, and even the IBM System/370 mainframe was not up to the task: it delivered roughly 2 MIPS.

So, within Project 801, Cocke's team designed a radically streamlined processor architecture with a pipelined design and fast, microcode-free control logic - made possible by cutting the instruction set to a minimum in order to simplify control. (The machine was called the IBM 801 because it was developed in Building 801 of IBM's Thomas J. Watson Research Center.) The IBM 801 project was the first implementation of a RISC (reduced instruction set computer) architecture.

The 801 prototype was built from small-scale Motorola MECL 10K chips and, taken together, delivered an unprecedented 15 MIPS, comfortably meeting the requirement. Because a reduced instruction set is less convenient for programmers than a CISC instruction set, Cocke's team had to develop optimizing compilers, which took on the extra burden of producing efficient machine code from complex algorithms written in high-level languages.

Cocke went on to become known as the “father of RISC”. IBM never shipped the telephone switch, but the 801 processor evolved and eventually became the basis of a large family of IBM RISC processors widely used in its mainframes and servers.

Later, several DEC engineers discovered that roughly 20% of the VAX's CISC instructions occupied about 80% of the microcode yet accounted for only 0.2% of total program execution time. What a waste! Taken together with the results of the IBM 801 project, the DEC engineers' findings suggested that CISC architecture might not be so great after all.

The suspicion was confirmed.

In 1984, Stanford professor John Hennessy published a landmark article in IEEE Transactions on Computers entitled “VLSI Processor Architecture”, in which he argued for the superiority of RISC architectures and ISAs for VLSI processor implementations. Patterson summarized Hennessy's argument in his talk: RISC is by definition faster because CISC machines need about six times as many clock cycles per instruction as RISC machines. Even though a CISC machine has to execute roughly half as many instructions for the same task, the RISC computer ends up about three times faster than the CISC one.
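
To put numbers behind that claim, the same reasoning can be written out with the classic processor-performance equation, using the rough 6x and 2x factors Patterson cited (approximations, not exact measurements):

```latex
% Execution time = instruction count x cycles per instruction x clock period
T = N_{\mathrm{instr}} \cdot \mathrm{CPI} \cdot t_{\mathrm{clk}}

% The CISC machine executes about half as many instructions but needs about
% six times the cycles per instruction; with the same clock period:
\frac{T_{\mathrm{CISC}}}{T_{\mathrm{RISC}}}
  = \frac{(N/2)\,(6\,\mathrm{CPI}_{\mathrm{RISC}})}{N\,\mathrm{CPI}_{\mathrm{RISC}}}
  = 3
```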

That is why the x86 processors in modern computers only appear to execute their compatible CISC instructions: as soon as those instructions arrive from external RAM, the processor immediately chops them into simpler “micro-ops” (Intel's name for its internal RISC-like instructions), which are then queued and executed in multiple RISC pipelines. Today's x86 processors became faster by turning into RISC machines on the inside.
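
As a purely schematic illustration of that translation step - the mnemonics and the cracking rule below are invented for this sketch, not Intel's actual decoder behavior - a memory-operand CISC instruction can be thought of as splitting into a load micro-op plus a simple register operation:

```python
# Schematic sketch of CISC-to-micro-op "cracking" (illustrative only; this is
# not a real x86 decoder and the micro-op names are made up).
def crack(cisc_instruction):
    op, dst, src = cisc_instruction
    micro_ops = []
    if src.startswith("["):                     # memory source operand
        micro_ops.append(("load", "tmp", src))  # fetch the operand into a temporary register
        src = "tmp"
    micro_ops.append((op, dst, src))            # the simple register-to-register operation
    return micro_ops

# One CISC-style "add register, memory" becomes two RISC-like micro-ops.
print(crack(("add", "eax", "[mem]")))
# -> [('load', 'tmp', '[mem]'), ('add', 'eax', 'tmp')]
```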

Some processor architects decided to develop an ISA that would be much better than either RISC or CISC. With very long machine instructions it became possible to pack many parallel operations into a single huge instruction word; the architects dubbed this style of ISA VLIW (Very Long Instruction Word). VLIW machines borrow one of RISC's principles and push onto the compiler the job of scheduling and packing the VLIW instructions generated from high-level source code into machine code.
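
To make the idea concrete, here is a minimal Python sketch of VLIW-style execution; the bundle format, operation set, and register names are invented for this illustration and do not correspond to any real VLIW ISA:

```python
# Toy model of a VLIW machine: each "instruction" is a bundle of independent
# operations that the compiler packed together, and the hardware issues every
# slot of a bundle in the same cycle (illustrative sketch only).
def run_vliw(program, regs):
    for bundle in program:
        # All slots read their sources before any slot writes its destination,
        # modeling lock-step parallel issue within one bundle.
        results = []
        for op, dst, a, b in bundle:
            if op == "add":
                results.append((dst, regs[a] + regs[b]))
            elif op == "mul":
                results.append((dst, regs[a] * regs[b]))
        for dst, value in results:
            regs[dst] = value
    return regs

# One bundle packs two independent operations found by the compiler.
program = [[("add", "r3", "r1", "r2"), ("mul", "r4", "r1", "r2")]]
print(run_vliw(program, {"r1": 3, "r2": 4, "r3": 0, "r4": 0}))
# -> {'r1': 3, 'r2': 4, 'r3': 7, 'r4': 12}
```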

Intel decided that the VLIW architecture looked very attractive and began developing a VLIW processor that would be its entry into the inevitably approaching world of 64-bit processors. Intel called its VLIW ISA IA-64. As usual, Intel invented its own nomenclature and its own names for familiar terms: in Intel jargon, VLIW became EPIC (Explicitly Parallel Instruction Computing). The EPIC architecture was deliberately not based on the x86 instruction set, in part to keep AMD from copying it.

Later, HP's PA-RISC engineers also decided that RISC's development potential was nearly exhausted - and they, too, caught the VLIW bug. In 1994, HP teamed up with Intel to develop a joint 64-bit VLIW/EPIC architecture, which would be called Itanium. The announced goal was to ship the first Itanium processor in 1998.

It soon became clear, however, that designing VLIW processors and compilers would be hard. Intel did not announce the Itanium name until 1999 (Usenet wits immediately dubbed the processor “Itanic”), and the first working processor shipped only in 2001. The Itanic finally sank for good in 2017, when Intel announced the end of IA-64 development. (See “Intel has sunk Itanium: perhaps the world's most expensive unsuccessful processor design.”)

The EPIC architecture became an epic failure - the microprocessor equivalent of Jar Jar Binks from Star Wars. Though at the time it had seemed like a good idea.

Itanium, EPIC and VLIW processors died for several reasons, says Patterson:


Perhaps the world's best-known specialist in computer algorithms, Donald Knuth, remarked: “The Itanium approach ... was supposed to be so terrific - until it turned out that the wished-for compilers were basically impossible to write.”

It seems that compilers cope better with simple architectures like RISC.

VLIW never yielded successful general-purpose microprocessors. But the architecture later found its true calling, which brings us to the fourth act of the play.

Act 4: Dennard scaling and Moore's law are dead, but DSAs, TPUs, and open RISC-V are alive


In Tom Stoppard's play “Rosencrantz and Guildenstern Are Dead”, two minor characters plucked from Shakespeare's “Hamlet” finally realize at the end of the last act that they have been dead throughout the play. In the final act of Patterson's processor history, it is Dennard scaling and Moore's law that turn out to be dead. Here is a figure from the latest edition of Hennessy and Patterson's book that tells the whole story graphically:


Source: John Hennessy and David Patterson, “Computer Architecture: A Quantitative Approach”, 6th ed., 2018

The graph shows that RISC microprocessors delivered nearly twenty years of rapid performance growth, from 1986 to 2004, as they rode Moore's law (twice as many transistors with each new process node) and Dennard scaling (power per transistor roughly halving with each node, so power density stayed constant). Then Dennard scaling died, and individual processors stopped getting faster: transistor power consumption no longer halved with each process step.
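
The statement in parentheses follows from the classic, idealized Dennard scaling rules; the derivation below is a textbook sketch rather than figures from Patterson's slide:

```latex
% Shrink linear dimensions and supply voltage by a factor k > 1:
% capacitance C scales as 1/k, voltage V as 1/k, frequency f as k.
P_{\mathrm{transistor}} \propto C\,V^{2}\,f
  \;\propto\; \tfrac{1}{k}\cdot\tfrac{1}{k^{2}}\cdot k \;=\; \tfrac{1}{k^{2}}

% Transistor density grows as k^2, so power per unit area stays constant:
\frac{P}{A} \;\propto\; k^{2}\cdot\frac{1}{k^{2}} \;=\; \mathrm{const}
```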

The industry compensated by leaning entirely on Moore's-law transistor doubling - rapidly increasing the number of processor cores per chip and ushering in the multicore era. The performance doubling interval stretched from 1.5 to 3.5 years during this era, which lasted less than a decade before Amdahl's law took hold (paraphrased as "the exploitable parallelism in any given application is limited"). Few applications can keep dozens of cores busy.
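
Amdahl's law makes that limit concrete; the 95% parallel fraction below is an illustrative value, not a number from the talk:

```latex
% Speedup when a fraction p of a program runs in parallel on n cores:
S(n) = \frac{1}{(1-p) + p/n}

% Example: a highly parallel program with p = 0.95 on 64 cores gains only
% S(64) = 1 / (0.05 + 0.95/64) \approx 15,
% and no number of cores can push it beyond 1/(1-p) = 20.
```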

Then Moore's law passed away.

The result, according to Patterson, is that since 2015 processor performance has been growing by a paltry 3% per year. Performance used to double every 1.5 years, then every 3.5 years; now a doubling takes about twenty years.
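
A quick calculation shows where that figure comes from: at 3% annual improvement, the doubling time is

```latex
t_{2\times} = \frac{\ln 2}{\ln 1.03} \approx 23 \ \text{years}
```

which is roughly the two decades Patterson refers to.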

Game over? “No,” says Patterson. There are still some interesting things to try in processor architecture.

One example is domain-specific architectures (DSAs): purpose-built processors that aim to accelerate a small set of tasks for specific applications. VLIW architectures are unsuitable for general-purpose processors, but they make perfect sense for DSP applications, which have far fewer branches. Another example is Google's TPU (Tensor Processing Unit), which accelerates deep neural networks (DNNs) using a block of 65,536 multiply-accumulate (MAC) units on a single chip.

It turns out that reduced-precision matrix arithmetic is the key to really fast DNNs. The 65,536 eight-bit MAC units in the Google TPU run at 700 MHz and deliver 92 TOPS (teraoperations per second) - about 30 times faster than a server CPU and 15 times faster than a GPU. Factor in that the 28-nanometer TPU draws roughly half the power of a server CPU or GPU, and you get a performance-per-watt advantage of about 60x and 30x, respectively.
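
The quoted throughput follows directly from the MAC count and the clock rate, counting each multiply-accumulate as two operations (a multiply and an add):

```latex
65{,}536\ \text{MACs} \times 700\ \text{MHz} \times 2\ \tfrac{\text{ops}}{\text{MAC}}
  \approx 9.2\times 10^{13}\ \tfrac{\text{ops}}{\text{s}} \approx 92\ \text{TOPS}
```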

By a strange coincidence, Professor David Patterson recently retired from the University of California, Berkeley, after teaching and working there for 40 years. He now holds a Distinguished Engineer position at Google, working on the TPU project.

Another interesting development, says Patterson, is the creation of open-source ISAs. Earlier attempts, including OpenRISC and OpenSPARC, never took off, but Patterson described a brand-new open-source ISA - RISC-V - which he helped develop at Berkeley. Look inside an SoC, says Patterson, and you will see many processors with many different ISAs. “Why?” he asks.

Why have a general-purpose ISA, another ISA for image processing, yet another for video processing, one more for audio, and a DSP ISA, all on a single chip? Why not have just one simple ISA, or a few (and one set of software development tools), that can be reused across specific applications? And why not make that ISA open source, so that anyone can use the architecture for free and improve it? Patterson's answer to all of these questions is the RISC-V ISA.

The recently formed RISC-V Foundation, similar in concept to the successful Linux Foundation, has already attracted more than 100 member companies and has taken over standardization of the RISC-V ISA. The Foundation's mission is to promote adoption of the RISC-V ISA and guide its future development. Coincidentally, the “retired” Dr. David Patterson serves as vice-chair of the RISC-V Foundation.

Like Rosencrantz and Guildenstern, Dennard scaling and Moore's law are dead by the end of Patterson's historical play, but the interesting developments in computer architecture are only beginning.

"There is nothing more unconvincing than an unconvincing death," - as stated in the play.

Epilogue: On March 21, just a week after the IEEE talk, the Association for Computing Machinery (ACM) recognized Patterson's and Hennessy's contributions to computer architecture by awarding them the 2017 ACM Turing Award "for pioneering a systematic, quantitative approach to the design and evaluation of computer architectures with enduring impact on the microprocessor industry".

Source: https://habr.com/ru/post/411989/

