CPU architecture after Moore’s Law: What’s next?
When considering the future of CPU architecture, some industry watchers predict excitement, and some predict boredom. But no one predicts a return to the old days, when speed doubled at least every other year.
The upbeat prognosticators include David Patterson, a professor at the University of California, Berkeley, who literally wrote the textbook (with John Hennessy) on computer architecture. “This will be a renaissance era for computer architecture — these will be exciting times,” he says.
Not so much, says microprocessor consultant Jim Turley, founder of Silicon Insider. “In five years we will be 10% ahead of where we are now,” he predicts. “Every few years there is a university research project that thinks they are about to overturn the tried-and-true architecture that John von Neumann and Alan Turing would recognize — and unicorns will dance and butterflies will sing. It never really happens, and we just make the same computers go faster and everyone is satisfied. In terms of commercial value, steady, incremental improvement is the way to go.”
They are both reacting to the same thing: the increasing irrelevance of Moore’s Law, which observed that the number of transistors that could be put on a chip at the same price doubled every 18 to 24 months. For more to fit they had to get smaller, which let them run faster, albeit hotter, so performance rose over the years — but so did expectations. Today, those expectations remain, but processor performance has plateaued.
The plateau and beyond
“Power dissipation is the whole deal,” says Tom Conte, a professor at the Georgia Institute of Technology and past president of the IEEE Computer Society. “Removing 150 watts per square centimeter is the best we can do without resorting to exotic cooling, which costs more. Since power is related to frequency, we can’t increase the frequency, as the chip would get hotter. So we put in more cores and clock them at about the same speed. They can accelerate your computer when it has multiple programs running, but no one has more than a few trying to run at the same time.”
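Conte's frequency point follows from the way dynamic power in CMOS logic scales, roughly as capacitance times voltage squared times frequency. A minimal sketch (the capacitance and voltage figures below are placeholders, not numbers for any real chip):

```python
# Rough CMOS dynamic-power relation: P ≈ C * V^2 * f.
# The capacitance and voltage values are illustrative placeholders only.
def dynamic_power(capacitance_farads, voltage_volts, frequency_hz):
    return capacitance_farads * voltage_volts ** 2 * frequency_hz

base = dynamic_power(1e-9, 1.0, 3e9)   # a notional chip clocked at 3 GHz
fast = dynamic_power(1e-9, 1.0, 6e9)   # the same design pushed to 6 GHz
print(fast / base)                     # 2.0: twice the heat to remove, and
                                       # higher clocks usually also demand
                                       # higher voltage, which makes it worse
```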
The approach reaches the point of diminishing returns at about eight cores, says Linley Gwennap, an analyst at The Linley Group. “Eight things in parallel is about the limit, and hardly any programs use more than three or four cores. So we have run into a wall on getting speed from cores. The cores themselves are not getting much wider than 64 bits. Intel-style cores can do about five instructions at a time, and ARM cores are up to three, but beyond five is the point of diminishing returns, and we need new architecture to get beyond that. The bottom line is traditional software will not get much faster.”
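Gwennap's eight-core ceiling is, in effect, Amdahl's law: whatever fraction of a program cannot be parallelized puts a hard cap on the speedup extra cores can deliver. A back-of-the-envelope sketch, assuming (purely for illustration) that a quarter of the work is serial:

```python
# Amdahl's law: speedup = 1 / (serial_fraction + (1 - serial_fraction) / cores).
# The 25% serial fraction is an assumption for illustration only.
def speedup(cores, serial_fraction=0.25):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (1, 2, 4, 8, 16, 64):
    print(cores, round(speedup(cores), 2))
# 1 -> 1.0, 2 -> 1.6, 4 -> 2.29, 8 -> 2.91, 16 -> 3.37, 64 -> 3.82
# Past eight cores the curve has already flattened out.
```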
“Actually, we hit the wall back in the ’90s,” Conte adds. “Even though transistors were getting faster, CPU circuits were getting slower as wire length dominated the computation. We hid that fact using superscalar architecture [i.e., internal parallelism]. That gave us a speedup of 2x or 3x. Then we hit the power wall and had to stop playing that game.”
Since then, “We have done the trick of moving from one big, inefficient core to many small, efficient ones,” Patterson adds. “For general-purpose applications, we have run out of ideas for making them faster. The path forward is domain-specific architecture.”
In other words, device makers will be adding processors that do specific, narrow tasks, but do them better than standard microprocessors, he believes.
The trend is already well under way in the smartphone field, led by ARM Holdings, the leading smartphone CPU source. Ian Smythe, senior director at ARM, estimates that the typical smartphone may have 10 to 20 processors, including cores running the operating system and apps.
Under ARM’s so-called big.LITTLE architecture, half the processor cores — the big ones — are designed to run foreground apps at maximum speed and have extra circuits for features such as branch prediction and out-of-order execution. The LITTLE cores run background apps with maximum power efficiency. The other processors would handle things such as power management, sensors, video, audio, wireless connectivity, touchscreen management and fingerprint recognition, he says.
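On a Linux-based system such as Android, one visible consequence of big.LITTLE is that work can be steered toward one cluster or the other. A minimal sketch, assuming a hypothetical layout in which cores 0 through 3 are LITTLE and 4 through 7 are big (real core numbering varies from chip to chip, and in practice the kernel's scheduler makes these decisions on its own):

```python
import os

# Hypothetical layout for illustration: cores 0-3 are LITTLE, 4-7 are big.
LITTLE_CORES = {0, 1, 2, 3}
BIG_CORES = {4, 5, 6, 7}

def run_power_efficiently():
    # Pin the current process to the efficiency cores (Linux only).
    os.sched_setaffinity(0, LITTLE_CORES)

def run_at_full_speed():
    # Pin the current process to the fast cores for interactive work.
    os.sched_setaffinity(0, BIG_CORES)
```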
“The performance boost should be significant to justify the overhead of moving a workload off the CPU to an accelerator,” says Smythe. “With graphics and video they were justified and, moving forward, neural networking appears amenable to these solutions.”
Indeed, sources agree that neural networking processors will become increasingly popular, since they can expedite machine learning, which is used for machine vision, natural-language recognition and other forms of artificial intelligence.
An example might be Wave Computing’s latest processor, which has 16,000 cores. The idea, explains CTO Chris Nicol, is to avoid the expenditure of energy needed to move data in and out of a standard, sequential CPU.
“The cores are designed so that the result of one can be used as the operand in another in the very next cycle,” says Nicol. He calls the result “spatial computing” and says the process can be conceptualized as a two-dimensional grid, with time as the third dimension. The acceleration factor over conventional processing is about 600, he adds. The end-user version, shipping by the end of the year, will be used as a neural networking appliance rather than a general-purpose computer, he says.
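The contrast Nicol draws, results flowing straight from one processing element into the next rather than being parked in memory and fetched back, can be sketched loosely in software as a dataflow pipeline. This is a conceptual analogy only, not a description of Wave's hardware:

```python
# Conceptual analogy only; this is not Wave Computing's architecture.

# Sequential style: every intermediate result is stored, then reloaded.
def sequential(xs):
    memory = {}
    memory["doubled"] = [x * 2 for x in xs]
    memory["plus_one"] = [d + 1 for d in memory["doubled"]]
    return sum(memory["plus_one"])

# Dataflow style: each stage consumes its neighbor's output directly,
# the way one core's result becomes another core's operand next cycle.
def dataflow(xs):
    doubled = (x * 2 for x in xs)
    plus_one = (d + 1 for d in doubled)
    return sum(plus_one)

print(sequential(range(4)), dataflow(range(4)))   # both print 16
```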
“Once trained, a neural network can run on your phone using a smaller, domain-specific processor,” explains Patterson.
On the downside, specialized processors run specialized software that requires its own tools and compilers. “The proliferation of CPUs is really horrible from a software perspective, and people held out against it for as long as possible, but it is the only way now,” notes Krste Asanović, a professor at UC Berkeley.
Specialized programming frameworks have arisen to combat the chaos, says Steve Roddy, director at Cadence Design Systems, which licenses digital signal processor (DSP) designs. “They enable programmers to write code that can be platform-independent and lets the box makers apply different levels of resources. Each chipmaker has to decide how to support the framework on that chip — the high end might run on a DSP, while at the low end the same code might run on the CPU. That way the developer does not need to know every detail in order to write Angry Birds on multiple phones.”
Several frameworks have arisen just for neural networking, notes Patterson, including TensorFlow from Google; the Microsoft Cognitive Toolkit, or CNTK; and Amazon’s MXNet.
“They are also a target for the hardware and make it easier for the architect to get the software to run on the hardware,” he notes.
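That portability is visible in the frameworks themselves: the same model code runs whether or not an accelerator is present, because the runtime, not the application developer, decides where each operation executes. A minimal TensorFlow sketch:

```python
import tensorflow as tf

# The framework reports whatever hardware it found; the code below is
# identical whether that list contains only a CPU or also an accelerator.
print(tf.config.list_physical_devices())

a = tf.random.normal([512, 512])
b = tf.random.normal([512, 512])
c = tf.matmul(a, b)    # dispatched to an accelerator kernel if one exists
print(c.shape)         # (512, 512)
```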
Direct language execution
Frameworks would be unnecessary, and the software would presumably run faster, if a processor could directly execute the commands of a higher-level language, such as Java. But so far the idea has not proved practical.
“People try from time to time, but they have failed miserably in every case,” explains Turley. “The first 50% or 75% of the chip goes really well, investors put in more money, and there are beer bashes every Friday. But then comes the last 25%, and it is really hard, and the results are not that much faster. So they have failed one by one.”
“Existing processors are optimized to run Linux and Windows about as fast as they can, and there’s no obvious way to make them run faster that is not already built in,” notes Gwennap. “Linux and Windows are a big pile of code. Trying to figure out how to accelerate the code is complicated, so the alternative is just to try to do everything reasonably well.”
“It turns out that compilers are a good idea and offer many advantages,” Patterson adds.
The quantum route
In the quest for speed, the best direction would appear to be quantum computing, which avoids the physical limitations of the classical world around us by relying on the quantum physics of the exotic subatomic world. Academics have proposed ways to make logic gates based on quantum mechanics and thereby construct general-purpose quantum processors. But the only quantum computing vendor on the market, D-Wave Systems, is not doing that — yet.
Its latest model has 2,048 quantum bits, called qubits, but processing is not based on binary logic gates as with conventional machines, explains Jeremy Hilton, D-Wave’s senior vice president. “If you can create entangled states between the bits, you can perform logic operations on all of them simultaneously,” he notes, adding that the process takes five microseconds.
“Within five years we see it as fairly ubiquitous in the cloud and accessible for developers,” he adds.
“The technology uses quantum superposition and entanglement to measure the probability of bit combinations,” adds Tom Hackenberg, an analyst at IHS Markit. “When it becomes commercially viable it will not address traditional computing that relies on definitive binary transactions, but will be greatly sought after in the field of deep learning and neural network computing.”
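Hackenberg's phrase, measuring the probability of bit combinations, can be made concrete with a toy two-qubit example. The sketch below uses the gate-model formalism rather than D-Wave's quantum annealing, so it illustrates superposition and entanglement in general, not D-Wave's machine:

```python
import numpy as np

# Build the two-qubit Bell state (|00> + |11>) / sqrt(2) and read off
# the probability of each bit combination. Gate-model illustration only.
ket00 = np.array([1, 0, 0, 0], dtype=complex)            # start in |00>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)             # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
state = CNOT @ np.kron(H, np.eye(2)) @ ket00             # entangle the qubits
for bits, amplitude in zip(["00", "01", "10", "11"], state):
    print(bits, round(abs(amplitude) ** 2, 3))           # 00 and 11 each 0.5
```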
“The technology could be extended to general-purpose computing, and we are extending it, but that is not our major focus,” Hilton adds.
Open-source processors
The sheer momentum of the processor market, dominated by Intel, AMD and (in smartphones) ARM, works against radical innovation, sources complain.
“The problem is that firms like Intel and ARM become wedded to their architectures, and there is so much software out there that it is difficult to innovate very much at the architecture level,” says Gwennap.
Development expenses are also a barrier. “When they talk about increasingly small chip geometries, such as 16, 14, 10, 7 and 5nm, the joke is that what the number really means is the number of customers you’ll have,” says Asanović. “The fabs are too expensive. The patterns are too exotic to print easily. Design costs are $500 million, and you need massive volume to justify that.”
But the emergence of open-source processor hardware may open the door to garage-style innovators.
Sources point to RISC-V (pronounced risk-five), an open-source reduced instruction set computing (RISC) architecture, in 32, 64 or 128 bits, promoted by the RISC-V Foundation. Rick O’Connor, head of the foundation, explains that the fifth-generation RISC processor is free for any vendor to use, but they must pay for certification (if they desire that) and to license the trademark. The base architecture has fewer than 50 hardware instructions but can be expanded modularly to the server level, he adds.
“We had 30 years of hindsight to make an efficient instruction set,” explains O’Connor.
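That compactness shows up in the encoding itself: every instruction in the RV32I base set is a fixed 32-bit word with a small number of regular fields. A toy sketch decoding one such word, 0x00500093, the standard encoding of addi x1, x0, 5 (an I-type instruction; other formats lay out their upper bits differently):

```python
# Decode the I-type fields of a 32-bit RISC-V RV32I instruction word.
def decode_itype(word):
    return {
        "opcode": word & 0x7F,          # 0x13 = the OP-IMM group (addi, ...)
        "rd":     (word >> 7) & 0x1F,   # destination register
        "funct3": (word >> 12) & 0x7,   # 0 selects addi within OP-IMM
        "rs1":    (word >> 15) & 0x1F,  # source register
        "imm":    word >> 20,           # 12-bit immediate (sign extension omitted)
    }

print(decode_itype(0x00500093))
# {'opcode': 19, 'rd': 1, 'funct3': 0, 'rs1': 0, 'imm': 5}  ->  addi x1, x0, 5
```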
The first vendor offering RISC-V silicon is apparently SiFive in San Francisco. Jack Kang, vice president at SiFive, says that by using the RISC-V open-source processor chip and various forms of design automation, he can deliver 32-bit customer samples for under $100,000.
“That’s an order of magnitude less than conventional methods. We’re trying to democratize access to silicon. People are attracted to RISC-V because it’s free and open, and it’s actually pretty good,” he says.
The adoption of open-source hardware could counter runaway costs, Asanović predicts. “As for using open source for common infrastructure components that are not competitive differentiators, the software industry did that a long time ago. But in the hardware industry, they just keep slogging away building the same controller in one company after another. Getting them to use open source will get more of them to develop their own chips. The chip industry would otherwise be in a death spiral — fewer customers means you have to raise the price, and higher prices mean fewer customers.”
As Patterson puts it, “If the user can’t see the software, as in the internet of things, the [identity of the] processor doesn’t matter, so why pay money for one when you can get another just as good for free?”
Von Neumann, now and forever
What future innovators are unlikely to build is a processor that departs from the so-called von Neumann architecture. The name derives from a description written in 1945 under the byline of computer pioneer John von Neumann, detailing the basic functions that a general-purpose stored-program digital computer would need to include.
“Instructions, program counters, branch instructions, fetching operations, arithmetic operations — those have been around since the beginning, but unless someone is going to declare that [neural networking] matrix operations amount to a new architecture, I don’t see anything displacing von Neumann’s ideas,” Patterson says.
If something did, the field of software engineering might be in jeopardy, warns Conte. “People want reliable, certified, tested, bug-free software, but the science of how we do that is really brittle. Even the algorithms would have to change — it would be a total disruption of the computing stack. We can’t say, ‘Traditional computing stops now.’ That’s not practical.”
Consequently, “Ten years from today your desktop or tablet pretty likely will have a processor similar to the one we use today for the operating system, but will have many special-purpose processors, for vision or neural networking or God knows what,” Patterson adds.
“What we see in development for production in 2022 calls for more of the same: greater specialization and a greater number of specialized subsystems,” says Roddy.
“The assumption we continue to make is that the CPU will remain the conductor of the orchestra [of accelerators] since it is the host of the operating system,” Smythe at ARM says. “Von Neumann is where we are, and there is not much we can do to change that dramatically.”