Challenges and trends in processor design


Transcript of Challenges and trends in processor design

Page 1: Challenges and trends in processor design

Janet Wilson Computer

Although the problems are manifold, chip architects from Sun, Cyrix, Motorola, Mips, Intel, and Digital see challenges rather than walls.

This year is near an inflection point in the three-year cycle for designing new microprocessors. For this reason, several companies will produce or introduce new projects in 1998:

Digital's 21264 and IBM's Power 3, both 64-bit processors, are projected to reach volume production. Sun's UltraSparc III will sample. Intel plans to debut its latest Pentium II, code-named Deschutes. Sun plans to release details about UltraJava, its newest high-performance Java/media/3D processor. Cyrix's first foray into the high-performance CPU core arena, Cayenne, will debut. AMD expects to release a 3D-enhanced version of the K6, and Silicon Graphics plans to announce the R12000, its next Mips processor. Toward the end of 1998, we can also expect more news about Merced, Intel's next-generation, 64-bit processor. (For a related discussion, see "Introduction to Predicated Execution," pp. 49-50.)

Microprocessor design is often spotlighted for technological innovation, but trends here also have tremendous economic ramifications. Worldwide sales for microprocessors reached $23.6 billion in 1997, according to the World Semiconductor Trade Statistics organization (WSTS Semiconductor Forecast, Semiconductor Industry Assoc., San Jose, Calif., 1997). This group, which represents 70 semiconductor companies, also reports that microprocessors outsold DRAMs for the first time in 1997, recording a 27.6 percent increase over 1996 sales. The WSTS estimates the 1998 market for microprocessors will hit $28.4 billion.

VIRTUAL ROUNDTABLE

As part of this outlook issue, Computer invited six computer architects to participate in a virtual roundtable. Each participant responded to the following list of questions posed by the Computer staff:

* What are the major roadblocks chip architects will have to overcome in the short term (five years)? Which (if any) are fundamental problems of physics?
* What will be the foremost obstacle to continued microprocessor performance improvements in the next five years?
* Are increasing costs associated with validation and testing a looming bottleneck? If so, what ways do you see around this problem?
* Will slow bus speeds prove a significant problem to increased chip speed? Why or why not? Will slow memory access be a major problem?
* What types of applications will drive microprocessor design in the next five years? Many point to multimedia; are there others?
* What are the trends surrounding microprocessor design itself: team sizes, schedules, targets, tools? Are these trends acceptable, or do any also constitute a threat to the business?
* Will standards evolve to support modular systems on a chip? Do you see the major companies working together?
* What functionality may migrate to software?

With so much at stake in this competitive field, we feared that participants would find it difficult, if not impossible, to share their insights.

As one participant so eloquently put it, "I'm an academic at heart and like nothing better than all-out discussions of interesting problems. On the other hand, I'm paid with checks that have a company's name at the bottom, and I must zealously guard their interests."

Despite this, these six architects shared several insights of interest to those of us not intimately connected with processor design. We thank them for their candor and for giving generously of their time.

Janet Wilson is an associate editor for Computer. Contact her at [email protected].

January 1998. 0018-9162/98/$10.00 © 1998 IEEE

Page 2: Challenges and trends in processor design

Increasing Work, Pushing the Clock

Marc Tremblay Sun Microsystems

When designers create a new generation of processors, improving performance is often the key goal. Three main factors affect performance:

* how fast you can crank up the clock,
* how much work you can do per cycle, and
* how many instructions you need to perform a task.

Designers optimize these factors through microarchitecture techniques, compiler optimizations, and instruction set architecture innovations.

People in industry talk about these three factors, but I haven't seen too much, even from academia, that really improves how much work a processor does per cycle and also pushes the clock rate. Most ideas usually improve one factor to the detriment of the other. This can be fine, because if you improve one factor by 50 percent and decrease the other by 10 percent then, by the multiplication of factors, you're still better off.
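Stated as the usual multiplicative identity (a textbook framing, not one spelled out in the article), the three factors and the 50 percent / 10 percent example combine like this:

```latex
\text{execution time} \;=\; \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}},
\qquad
\text{net speedup} \;=\; 1.50 \times 0.90 \;=\; 1.35 .
```

A 50 percent gain on one factor combined with a 10 percent loss on another still yields a net 1.35x improvement, which is exactly the trade-off Tremblay describes.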

Today, most design houses are merely extending what already exists, designing microprocessors capable of issuing more instructions per cycle, for example. New machines are capable of more out-of-order execution, can access memory faster, or can perform two memory operations in parallel. Yet this evolution may present problems because it conflicts with the goal of keeping the cycle time short. Another challenge for architects is to develop strategies that not only improve performance but facilitate the physical design.

Tailoring the processor. One way to branch out of this simple evolution is to design processors that are much more tailored to what users want to run. Now companies are designing microprocessors that end up being used in servers, powerful desktops, and even (sometimes) network computers. Architects are trying to design a processor that will run, say, huge database applications or huge EDA applications, like CAD/CAM software. Then the same processor must also run a word processor, visualize 3D models, play a video clip, and so on. I foresee the day that designers will partition a microprocessor family into, say, client chips and server chips.

Client chips would focus on user interaction, which is what 99 percent of users care about. By focusing on user interactions (multimedia, voice recognition, speech processing, video, audio, and especially 3D graphics), we can optimize the chip, because all these applications have very predictable data accesses. For example, playing video on a client is easy; it's just streaming data coming in and being decompressed and then displayed on the screen. In this example, the processor doesn't need to be close to memory, so there's no need to integrate memory with the processor on a single chip. That strategy may apply to other applications for which it's hard to predict what data the chip needs to access from memory. But for multimedia, it's fairly well understood what needs to be brought into the microprocessor, and this data can be accessed speculatively and ahead of time. So the industry will specialize microprocessors not only by user, but by application categories.
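To make the "accessed speculatively and ahead of time" point concrete, here is a minimal sketch of software prefetching in a streaming decode loop. The decode_block routine, block size, and prefetch distance are hypothetical stand-ins, and __builtin_prefetch is the GCC/Clang hint; the article itself does not prescribe this mechanism.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK          64      /* bytes consumed per decode step (assumed) */
#define PREFETCH_AHEAD 8       /* how many blocks ahead to fetch (tunable) */

/* Stand-in for real decompression of one block of a media stream. */
static void decode_block(const uint8_t *src, uint8_t *dst)
{
    for (size_t j = 0; j < BLOCK; j++)
        dst[j] = src[j];
}

static void decode_stream(const uint8_t *src, uint8_t *dst, size_t nblocks)
{
    for (size_t i = 0; i < nblocks; i++) {
        /* The access pattern is perfectly predictable, so hint the
         * hardware to start fetching data we will need shortly. */
        if (i + PREFETCH_AHEAD < nblocks)
            __builtin_prefetch(src + (i + PREFETCH_AHEAD) * BLOCK, 0, 0);

        decode_block(src + i * BLOCK, dst + i * BLOCK);
    }
}

int main(void)
{
    static uint8_t src[16 * BLOCK], dst[16 * BLOCK];
    decode_stream(src, dst, 16);
    return 0;
}
```

Because the stream is consumed strictly in order, the prefetch distance can be tuned once and the data is already on chip by the time decode_block needs it.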

Design drivers. Besides multimedia, other important applications for the next five years are personal productivity applications (like word processors, spreadsheets, and presentation software), 3D browsing, and shared whiteboards. There may also be more industrial applications, in which users visualize and order parts.

Microprocessors for these applications are completely different from those that sit on a server and run SAP, Baan, and Oracle software. So it would be much more economical if we could partition a microprocessor family to run a specific set of applications. We could then make trade-offs that deliver better performance and permit a chip to excel at its particular task.

Leveraging design teams. One trend is that very high-end processors keep getting more complicated. This means that team size has remained large. One way companies will continue to work around this is by overlapping work between processor generations. Although there are two independent design teams for, say, an UltraSparc III and an UltraSparc IV, there's a lot of technology and design that can be leveraged from one to the other, even though they're completely different architectures. For example, the people designing the on-chip memory (the instruction and the data caches) can leverage a lot of their design. Actually, what we use are small SWAT teams, which focus on parts of the design that can be highly leveraged across chips.

Page 3: Challenges and trends in processor design

Another thing that could be leveraged, if managed properly, is a global verification methodology. We could also leverage some CAD tools from one generation to the next. This can help keep teams smaller, so that designing two processors in an overlapping fashion does not require twice the size of the original team.

More parallelism. To take microprocessors to the next level, we need to look at parallelism at higher than the instruction level. We need to run more than one execution thread in parallel. A word processor, for instance, reformats pages sequentially. The program starts with paragraph one, then goes to two, three, and so on. We could rewrite code to reformat all the paragraphs in parallel so that one part of the machine could do the first 10 paragraphs, another part the next 10, and eventually you could have several such threads format a document in parallel. There's a lot of inherent parallelism in most applications; it's just that we haven't approached it like that.

Focusing on this higher level parallelism would allow smaller computational units to work in parallel on these threads, communicating only when necessary. This would lead to a much more modular approach to processor design.
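As a rough illustration of the thread-level approach Tremblay describes, the sketch below splits a document's paragraphs across a few POSIX threads. The reformat_paragraph routine and the counts are hypothetical; the point is only that the paragraphs are independent units of work.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_PARAGRAPHS 40
#define NUM_THREADS     4

/* Hypothetical per-paragraph work; a real word processor would
 * recompute line breaks, hyphenation, and layout here. */
static void reformat_paragraph(int p) { (void)p; /* ... */ }

typedef struct { int first, last; } range_t;

static void *worker(void *arg)
{
    range_t *r = arg;
    for (int p = r->first; p < r->last; p++)
        reformat_paragraph(p);          /* paragraphs are independent */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    range_t   part[NUM_THREADS];
    int chunk = NUM_PARAGRAPHS / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {   /* split the document into slices */
        part[t].first = t * chunk;
        part[t].last  = (t == NUM_THREADS - 1) ? NUM_PARAGRAPHS
                                               : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &part[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("reformatted %d paragraphs with %d threads\n",
           NUM_PARAGRAPHS, NUM_THREADS);
    return 0;
}
```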

Part of the problem with doing this, though, is that we've been tied to the same ISA for 20 years: Binaries created in the late 1970s for the x86 still need to run unmodified on x86-compatible processors. But if we had a layer, basically a virtual machine, we could avoid these problems. This problem is the trigger for run-time ISAs (as described by Josh Fisher in "Walk-Time Techniques: Catalyst for Architectural Change," Computer, Sept. 1997, pp. 40-42), which are, of course, provided by Java. In Java, we no longer have to run existing binaries or deal with legacy ISAs, so now we can tailor the ISA to what users intend to run.

Marc Tremblay is a distinguished engineer involved in the architecture of high-performance processors at Sun Microsystems. Prior to his current work on the architecture for UltraJava and picoJava, he was coarchitect for the UltraSparc I and II processors. Tremblay received an MS and a PhD in computer science from UCLA and a BS in physics engineering from Laval University, Canada. He is a member of the IEEE Computer Society.

Reining in Complexity Greg Grohoski Cyrix

Over the last few years we've seen a trend toward increasing complexity, in an attempt to execute more instructions per cycle. The pendulum will swing the other direction now: Designers have gotten fed up with the amount of complexity that's in microprocessors. In the final analysis, complexity often fails to yield the expected IPC and winds up costing a lot in die size, clock cycle, and schedule. As a result, designers will simplify designs and take a hard look at why we're including certain architectural features. Why, for instance, is there a high degree of out-of-order execution? How much can be done by the compiler with ISA extensions? Can we use various prescheduling techniques? Can we get the performance through an efficient pipeline with a higher clock rate?

In reducing complexity, RISC chips or other processors with new instruction sets have an advantage because they're able to retarget the compiler. Many applications for such systems are also developed by end users rather than purchased from third-party vendors. End users are usually willing to recompile to get a reasonable performance increase, typically at least 20 percent. The x86, however, is in a whole different arena; we've got basically no flexibility whatsoever to recompile. So it's a much bigger challenge to get more and more performance. Historically, that's led to immense complexity.

Memory access is also a major problem. Most of the high-frequency chips coming to market today spend a large fraction of time waiting for data from the off-chip cache or from memory. In the future, we can mitigate some of these intraprocessor communication problems through careful integration and system design.


Page 4: Challenges and trends in processor design

My personal favorite solution is putting the CPU on a DRAM. Although we may run a high-speed bus between the memory and CPU on today's split-chip designs, we can't fundamentally get the latency and bandwidth possible if the two were on the same chip. CPU/memory integration will make sense for a wide variety of systems. Let's say you have a system that can get by with 16 to 32 Mbytes of memory. On one 256-Mbit memory chip, using techniques like memory compression, you can afford to devote some reasonable percentage of that die to a "simple" CPU. I think this type of design can be far more effective in terms of dollars per MIPS than today's split-chip designs.
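The die-budget arithmetic behind that claim is straightforward. A 256-Mbit DRAM holds

```latex
\frac{256\ \text{Mbit}}{8\ \text{bits/byte}} \;=\; 32\ \text{Mbytes},
```

exactly the top of the 16 to 32 Mbytes such a system needs. If a memory-compression scheme achieves, say, 2:1 (an assumed ratio, not a figure Grohoski gives), half the array covers the full 32 Mbytes, and the freed portion of the die is what gets spent on the "simple" CPU.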

The next step will be getting data from the network or from a disk. Interconnects like Universal Serial Bus and Firewire will help, but they’re still too slow to keep up with future microprocessor speeds. We’ll need other strategies to attack those problems.

Design drivers. Processors will continue to change because of the applications that drive the design. I see large data-mining applications driving microprocessor design. Visual computing is also on the horizon; many users would like outstanding 3D graphics and multimedia. Clearly we've already seen this with widespread adoption of MMX or MMX-like technology, not only in Intel-compatible CPUs but also in RISC designs. At the low end, cost is a driving factor. When we can really sell powerful CPUs with memory and graphics for, say, $20 or $30 apiece, vendors can design PCs that sell for a few hundred dollars. This results in a different class of PCs, for which engineers have to figure out optimal cost-performance trade-offs. At the high end, where die size is less of an issue, we tend to throw the kitchen sink at a problem: what's another couple hundred thousand transistors? It's a lot more challenging to get world-class performance when your die-size target is 80 to 100 mm2.

So there's no question that we're tailoring products to a particular market. At Cyrix, one idea we're pursuing is a CPU core that works like a large ASIC block. System developers can instantiate various devices on a chip with this CPU core. For a group of companies to cooperate on a system on a chip, though, will take a product that has the potential to sell in quantities of 10 to 50 million units. When that opportunity arises, barriers to industry cooperation will fall. Of course the first company that attacks that problem successfully is likely to drive the standards and make the money.

Yet Cyrix's interest in cores doesn't mean we're giving up on high-performance CPUs. What we're after now is to deliver a compact, efficient core useful in a wide variety of applications.

Tool problems. Although I don't want to offend anyone, the state of industry VLSI design tools is disappointing. Most EDA vendors focus on the ASIC market, and rightly so, because they sell a lot more seats that way. But I don't know any microprocessor designer who's happy with their set of tools. Schematic entry, for instance, is painful. In industry, cycle simulation is still in its infancy. Better tools would enable better schedules.

This gets back to managing the design. We have to come up with designs that don't require 250 engineers and billions of cycles for verification. We really have to cut back and simplify to deliver competitive performance without all the complexity and risk that's inherent in unbounded designs.

Schedules and team sizes. Cyrix prides itself on accomplishing a lot with very few people. We have been pretty successful maintaining two- to three-year schedules with small, focused teams. There's this desire (I call it a fantasy) that we can design a microprocessor faster than before. But in my experience, new designs always take around two and a half to three years. It takes time to staff a new team, to get them working with each other, and to adapt a new set of tools that are always deficient in one way or another. Yet the real problem is that the useful lifetime of microprocessors is measured in maybe a year and a half, so you've got to work on several different designs at once. This puts tremendous pressure on the engineers.

To get to design teams of less than a few hundred engineers, we really have to cut back and simplify the cores. When 200 people are working on a design, you spend a lot of time coordinating. There are also very few people who understand how the whole design works. So when you find a bug, a half dozen people have to sit in a room to devise a fix and also ensure they've covered all the corner cases. With a simple design, it's much easier for one or maybe two engineers to work out a solution and prototype it quickly.

Fundamentally, we have to create less complexity so we spend less time debugging the complexity. Then we can begin the transistor-level design sooner. Since we won't spend so much time developing the behavioral model, we'll be able to spend more time with the transistor-level design. This would allow us to at least maintain these three-year schedules and get to high clock frequencies without having 200 people on a design team.

Greg Grohoski is a project manager at Cyrix and is currently working on the Jalapeno core. He previously worked at IBM Research as the lead architect for the Power 1 (IBM's first superscalar RISC processor) and also worked on the microarchitecture for Power 2. Grohoski received an MS in electrical engineering from the University of Illinois at Urbana-Champaign.


Page 5: Challenges and trends in processor design

A Way of Life Brad Burgess Motorola

With respect to roadblocks, many challenges lie ahead. There are several problems related to memory, the foremost of which is latency. Memory is a problem because the faster the processor runs, the more difficult it is to hide latency to memory. Caches are fairly effective at addressing memory latency, depending on the application, though I do expect software writers and compilers to become better at tuning applications for cacheability.

In terms of processor design, one of the biggest headaches we're seeing is wiring delay. At increasingly higher clock frequencies with sub-quarter-micron designs, metal is becoming a significant portion of the delay time in circuits. In the past, designers could focus primarily on gate delays and lay transistors down in a fairly straightforward fashion. In the future, the architecture, floor planning, and circuit design will be much more tightly interwoven.

How do we sustain continued performance improvements? As a hardware guy I hate to say it, but we need to improve the software. For large gains, we must find better ways of expressing and exposing parallelism to the processor. In recent years, we've seen aggressive superscalar and very long instruction word compilers, parallelizing Fortran compilers, and languages supporting explicit multithreading. Those are the beginnings of what programmers and compilers will need to do to expose more parallelism and improve performance.

With respect to roadblocks and physics, there are physical limits to how far one can shrink a CMOS transistor. In a Hot Chips presentation, for example, James Meindl of Georgia Tech pointed out that if we keep shrinking transistors as in the past, we will run into the limits of physics within the next 25 years. How do you shrink a transistor smaller than a single atom? You don't; you find other ways to improve performance.

What applications will drive microprocessor designs? There are different drivers for different markets. Unfortunately, many people tend to think of a processor only in terms of their desktop PC; I view processors much more broadly. Although multimedia will likely push desktop PCs, there are other applications such as transaction processing, networking, signal processing, and real-time control that will significantly push high-end designs. Often each application space brings with it a unique set of requirements. A good general-purpose processor may offer high SPECmarks but it doesn't necessarily make a good transaction processor or real-time control processor. In addition, there are many market segments that require performance/power/cost trade-offs. Our PowerPC 750 processor, for example, has very high-end performance at significantly lower power dissipation than the competition. This gives us a significant advantage in the laptop PC and certain embedded systems market segments.

Will standards evolve to support modular systems on a chip? Again, it depends on the market. In the embedded space we've been doing modular systems-on-a-chip for several years. A number of customers have designed modules based on our internal bus standards. Whether or not we line up with other vendors is a business decision; the technology is there.

What about the size of design teams? There are several philosophies about how big design teams should be. I believe that smaller teams have an advantage in that communication is much more open, design details can be worked out, and trade-offs can be made quickly. At the same time, you must have enough people to get the job accomplished. In large teams, you often have to build a bureaucracy just to manage communication and decision-making. The bureaucracy not only limits communication, but the decision-making is often too far removed from the problem. At Motorola, we've avoided the 300- to 400-person teams.

What about better tools? In our work, we can always ask for better tools, and we push forward the state of the art. Our philosophy is to look at what's on the market and use the best tools we can. At Somerset, we use a lot of our own tools because they are better than anything we can buy. We also use vendor tools, for example, layout and schematic capture tools. If you can find a comparable vendor tool, then by all means, use it. Support and service from outside keeps you from weighing down your own tools organization.

Brad Burgess is chief architect for the PowerPC 750/740 microprocessors (G3) and next-generation PowerPC microprocessor (G4) projects at Motorola and IBM's Somerset Microprocessor Design Center in Austin, Texas. He holds a BSEE and an MSEE from Texas A&M University.


Page 6: Challenges and trends in processor design

Challenges, Not Roadblocks

Earl Killian Mips Technologies

In the short term (the next five years) there are no major roadblocks. In microprocessor design, we're always looking out about five years, and so we understand what the solutions are in that time frame. Just beyond that time frame several issues arise. Many have suggested, and continue to suggest, major roadblocks for electronics technology. Yet technologists have successfully gone around those roadblocks in the past, and there is no reason to think this won't continue for the next decade or two. Naysayers do serve a useful role: They point out the places where innovation is required! In that context, there are some more detailed issues that the industry will need to address:

* Optical lithography will need to be replaced with deep-UV, X-ray, or electron beam techniques to permit feature sizes below 0.15 to 0.12 micron. This will require major investments by semiconductor and stepper companies, which will result in tremendous benefits to companies that pick the correct technology. This will benefit fabless companies, as opposed to those with fabs, because they will have access to whichever fabs best make the transition.
* Below 0.1 micron, quantum effects begin to play a larger role in the operation of transistors. There are both potential pitfalls and opportunities in this area. We may also still have a few generations to go before this becomes a problem; IBM has built 0.08 micron devices in its labs that operate successfully.
* Interconnections on chip will be of increasing concern, but not in terms of local interconnects; global interconnect will require increasing care. Copper metalization is worth a 20- to 30-percent reduction in wire delay, which is helpful, but it is not a fundamental change (it only delays the problems by a generation or so). A trend I see is that chip designers will need to begin thinking about a chip more like a PC board. You'll have components on this large expanse of silicon, and you have to think of some as far away and others as much more local.
* Power consumption will be an increasing problem (see the expression after this list). Power has increased from generation to generation because transistor count and frequency scaling have exceeded the rate of voltage scaling. In addition, it is uncertain whether voltage scaling can continue to very low voltages.
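For reference, the standard first-order relation for dynamic CMOS power (a textbook approximation, not a formula given in the roundtable) shows why frequency and transistor-count growth outpacing voltage scaling drives power up:

```latex
P_{\text{dynamic}} \;\approx\; \alpha \, C \, V_{dd}^{2} \, f
```

Here alpha is the switching activity, C the total switched capacitance (which grows with transistor count), V_dd the supply voltage, and f the clock frequency; unless the square of V_dd falls as fast as the product of C and f rises, power climbs each generation.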

Obstacles and driving applications. There are no major obstacles that we can see in the next five years. Beyond that it gets a little murkier, but then it has always been a little murky that far out.

The widening gap between processors and memory is an issue, but it has been increasing for decades. We're fairly adept now at dealing with it. In general, we look for ways to convert bandwidth into latency.

Silicon Graphics' approach to microprocessor design is to focus on particular market segments, and not to try to design an "all things to all people" sort of processor. Even in that context, our target applications are fairly broad, but are essentially all "big data" problems: manipulating and organizing large sets of data. We believe the action in microprocessor design will increasingly be how well you handle big data and how well you do multiprocessing. That's where we are focusing.

Video and audio are currently minor challenges for microprocessors and will become less so over time because there is a limit to what the human visual and audio systems can perceive. Thus, these areas are inherently self-limiting and won't be long-term issues. They illustrate a general phenomenon. Microprocessors begin by broaching a performance level that makes some fixed application barely possible, and designers struggle with that application for a while. A generation or two later, microprocessors can handle that application easily, and microprocessor designers have moved on to the next challenging application area.

Applications that can scale indefinitely are more interesting; 3D graphics may be in this category, at least for a while. The issue here is not the graphics itself, but the fact that 3D performance is growing at a faster rate than general-purpose processor performance. This will ultimately lead to a situation in which a single processor will be unable to feed the graphics pipelines that we are able to build. The advantage here will go to companies that can apply multiprocessing to feeding 3D graphics hardware. For a similar reason, attempts to put part of the 3D graphics pipeline into the processor are misguided.

Page 7: Challenges and trends in processor design

Image, handwriting, and speech recognition will be other major challenges. It is arguable whether these are fixed hurdles, like video and audio processing, or whether they can scale indefinitely.

Other areas that show the ability to challenge processor designers for some time are network and disk technology. Both had been increasing at fairly slow rates, but have recently picked up their pace. Applications that involve massive data storage and communication will continue to tax processors for the foreseeable future. Increasingly, this involves data mining and information retrieval. We've got almost too much information, even today, so the question becomes how well can machines organize it for us? Can machines help turn information into knowledge?

To help with the performance challenges, we should not move processor functionality back into the compiler. Rather, the compiler could better help with the real solution to the big-data challenge: multiprocessing. We should expect significant improvements in the area of automatic parallelization, for example.

Slow buses. One thing that is not a performance bottleneck for SGI is bus speeds. For years, slow bus speeds have been a concern of low-end chip vendors; they've responded to the challenges by standing still (PCs are stuck at 60 to 66 MHz). On the other hand, SGI is way out in front of this one; we've been shipping products with 400-MHz chip-to-chip communication by not using buses. In their place, we generally use point-to-point communication between chips, usually unidirectional. We can actually communicate at 400 MHz over five meters of cable, and can do significantly better on a PC board over shorter distances.

Testing, validation, and design. Validation and testing is a big area with many subcomponents. There are challenges in design validation, but no looming bottleneck. Chip debug in the lab is an increasing concern, because the inability to probe makes debugging difficult. The answer here is more design validation, with increasing emphasis on circuit validation via tools that raise it to the level of logic validation. For example, SGI has been writing its own tools for circuit validation to fill a gap left by the CAD industry.

Team sizes and budgets are slowly increasing over time. The growth is roughly linear, but certainly not exponential. It would be disastrous if it were tied to transistor count (that is, exponential), but fortunately we're able to design things once and replicate them to prevent that.

Systems on a chip. We’ll see modular systems on a chip, but probably not as the result of standards, and not from major companies working together, which they almost never do.

You will see more of the system moving onto the chip at SGI. One way we can increase the value of Mips microprocessors in the future is to take SGI's developed systems technology, improve it, and integrate it into our processors.

Earl Killian is director of architecture for Mips Technologies. Previously, he cofounded Quantum Effect Design and participated in the development of their R4600, R4650, and R4700 RISC processors. Prior to that he was involved in the Mips R3000 and the first 64-bit microprocessor, the R4000. He has a BSEE from MIT.

Maintaining a Leading Position Robert Colwell Intel Corp.

Assuming that nothing basic in manufacturing or physics breaks in the next couple of years, there is no reason the historical trend in microprocessor performance will slow. For the short term, industry has modeled and understood the physics that will affect processor design and manufacture. Changes to accommodate those effects are in the production planning phase.

Page 8: Challenges and trends in processor design

One force that could, however, interrupt the microprocessor performance trend is if consumers begin to prefer cheaper computers instead of faster ones. We can make them cheaper or faster, but it's extremely difficult to do both at once.

Increasing costs associated with validation and testing are a problem, but not the real threat. The real threat is shipping 30 million CPUs only to discover they're imperfect in some way that causes a recall. Such an expense could easily bankrupt a company. The convergence of three factors may make it more likely for a problem to rise to recall severity:

* The number of transistors per CPU is increasing rapidly per Moore's law.
* Designs are becoming more complicated to drive microarchitecture performance higher.
* The technological sophistication of consumers purchasing desktop systems is decreasing.

There is every reason to suspect that this threat will grow, potentially to levels that require changes in the way companies conduct business.

Thus, chip manufacturers are pumping ever higher levels of validation into design efforts. More than 20 percent of the Pentium Pro design effort was associated with validation. As design teams grow 30 percent or so per major product generation, validation teams are now larger than the entire design team of only six years ago.

We are now much more rigorous at tracking both pre- and post-silicon "sightings," anomalies in expected results that a human discovers. And although our validation tools have also improved enormously, we will still rely on humans writing test code in assembly language.

Over the last few years, random-instruction testing has also become much more useful and prevalent. Though consuming time and requiring huge amounts of code, RIT sequences are notorious for testing corner cases of a machine's architecture that no human would ever think of. Both RIT and assembly language test code are therefore the current mainstays of a chip development effort.
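The kernel of the RIT idea is easy to sketch, even though production flows are vastly more elaborate. The toy below invents a two-instruction "ISA," feeds the same random stream to a reference model and to a model standing in for the design under test, and reports any divergence (a "sighting"). Everything here, from the opcode set to the seed, is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint8_t opcode, rd, rs1, rs2; } insn_t;
typedef struct { int32_t r[8]; } state_t;

/* Reference (golden) semantics for the toy ISA. */
static void ref_step(state_t *s, insn_t i)
{
    switch (i.opcode & 1) {
    case 0: s->r[i.rd] = s->r[i.rs1] + s->r[i.rs2]; break;
    case 1: s->r[i.rd] = s->r[i.rs1] - s->r[i.rs2]; break;
    }
}

/* "Design under test": identical here, so the test passes; a real RIT
 * run would drive an RTL or microarchitectural model instead. */
static void dut_step(state_t *s, insn_t i) { ref_step(s, i); }

int main(void)
{
    state_t ref = {{0}}, dut = {{0}};
    srand(1998);

    for (int n = 0; n < 100000; n++) {
        insn_t i = { rand() & 1, rand() & 7, rand() & 7, rand() & 7 };
        ref_step(&ref, i);
        dut_step(&dut, i);
        for (int r = 0; r < 8; r++)
            if (ref.r[r] != dut.r[r]) {       /* a "sighting" */
                printf("mismatch at insn %d, reg %d\n", n, r);
                return 1;
            }
    }
    printf("no mismatches in 100000 random instructions\n");
    return 0;
}
```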

Formal methods are now useful in some cases. For example, the Pentium recall was due to a design flaw in the chip's floating-point divider, and formal methods are now mature enough that they can identify such errors. However, as currently practiced and envisioned, these techniques have serious practical limitations. For instance, they can be directed only to specific, well-defined, and contained areas of the design because they "blow up" quickly if we incorporate too much complexity. (A program that blows up begins to consume too much memory and time and is not likely to return useful results.) Also, like other validation methods, they are incapable of proving that a design is correct, a fact commonly forgotten.

Working around slow buses. Future CPUs will run in the gigahertz frequencies, and front-side bus electronics will not support anything close. Platform solutions will arise, such as bringing the L2 cache onto the CPU die and integrating the memory controller onto the CPU. In these approaches, memory traffic does not traverse a slow bus.

Memory. Memory will not keep pace with CPU performance in terms of latency; it is slow and, relative to CPU speed, becoming slower. Industry will continue to find ways to ameliorate (but not solve) this problem, including larger caches, more elaborate cache hierarchies, prefetch hints, compiler tricks, streaming buffers, intelligent memory controllers, and bank interleaving. The impending introduction of Rambus DRAMs into volume PCs will also help.
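As one small illustration of the last item in that list, bank interleaving simply spreads consecutive cache lines across DRAM banks so their accesses can overlap. The line size and bank count below are arbitrary assumptions for the sketch:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 32u   /* cache-line size (assumed)      */
#define NUM_BANKS   4u   /* number of DRAM banks (assumed) */

/* Map a physical address to a bank so that consecutive cache lines land
 * in different banks and their accesses can proceed in parallel. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_BANKS;
}

int main(void)
{
    for (uint32_t addr = 0; addr < 8 * LINE_BYTES; addr += LINE_BYTES)
        printf("line at 0x%05x -> bank %u\n", (unsigned)addr, bank_of(addr));
    return 0;
}
```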

Communications focus. Within the next five years, computers will become primarily communication devices, as opposed to performing number-crunching tasks. We hadn't noticed this emphasis until very recently because machines were not fast enough to execute the required workloads at human real-time speeds, nor was the I/O available to feed those machines. In playing games, for instance, most people (with Garry Kasparov a possible exception) report that a human is by far a more interesting and challenging opponent than a machine. Likewise, in virtual reality worlds, new explorers commonly wander about the space until a seemingly intelligent agent (such as a fellow human player) appears. Their attention transfers to the new agent as if by magic. I think this same human instinct is why people are using computers more as communication enhancement devices.

Beyond these multimedia-related communication functions, users will require better dependability and security. Antilock brakes that "mostly worked" or "hardly ever crashed" wouldn't be acceptable, but that describes general-purpose computing today. This situation arises because we do not design hardware in conjunction with software, application developers don't design software with the OS, and companies place less emphasis on the overall hardware-software system reliability than on getting to market quickly. This does not indict the industry; these are trends that the world has rewarded. Yet ultimately the lack of system dependability could well become industry's concern because it will become society's burden. Computer viruses are a plague, and system design has thus far downplayed hacker attacks, rather than guarded against them. There is a place for more secure computers and better encryption and decryption.


Page 9: Challenges and trends in processor design

Trends in the design process. More transistors and higher performance beget larger design teams. Schedules are not getting any longer; in fact they're constantly shorter. This means we do much more work presilicon, which in some cases is quicker overall but less efficient in terms of designer time than performing the same task post-silicon.

Complexity is high and increasing because the quest for higher performance leads in that direction. More complexity in less time produces a much higher risk of functional bugs. The need for more thorough validation in those shortened pre- and post-silicon time periods increases dramatically.

The software base for Intel architecture processors is mushrooming, and every new CPU must be compatible with the entire software base. The more software a new design must be compatible with, the harder the job of ensuring that it is.

No one of these issues seems to be a fundamental threat to the business, but, taken together, they definitely are.

System on a chip. It would not be surprising to see CPUs with integrated versions of emerging standards such as Intel's Accelerated Graphics Port (AGP) on-die. For companies trying to break into the business quickly, buying on-die units (phase-locked loops, floating-point units, or multiport register files) may make sense. For Intel, with our own fabs and process technology, a huge premium on tight integration and small die sizes, and the expectation of very large unit volume shipments, it makes less sense to contract out parts of the design.

Software functionality. In cost-sensitive market segments, as much functionality as is possible will shift to software because that saves money. Sound-card functions, modems, network controllers, and 3D operations may appear in software very soon. However, the trick is not to emulate a function in software, but to emulate several functions simultaneously, especially considering that many are real-time. A machine cannot fall behind the combined workload of all the real-time applications taken together. Today's most popular operating systems cannot support such functions in software, so I expect an industry learning experience to accompany this shift.

Robert Colwell is an Intel Fellow and director of 32-bit microprocessor development at Intel. He also worked on the Multiflow Trace computer. Colwell has a BS in electrical engineering from the University of Pittsburgh, and an MS and a PhD in electrical engineering from Carnegie Mellon University. He is a member of the IEEE Computer Society.

Managing Speed Paul I. Rubinfeld Digital Semiconductor

Five issues will be important to microprocessor design over the next five years; I list them in no particular order. One is power; as processors become faster, managing the power dissipation becomes a significant issue. More transistors switching in parallel on one die dissipate more heat, which presents a whole bunch of electrical and thermal-management issues.

Second, at the same time the chips are running faster, they're becoming more complex. So what I would broadly call on-chip signal integrity (maintaining the integrity of the signal as it moves from one end of the chip to the other) becomes more difficult. Some architects call this the interconnect nightmare. You essentially have lots of tightly packed wires that communicate parasitically to each other, a difficult problem that creates all types of circuit issues.

A third design issue is the idea of soft-error upsets: Cosmic rays or gamma radiation cause storage nodes to lose charge, which causes failures. More advanced work says this will imply future designs that are fault tolerant. We know how to provide fault tolerance for memory with error-correcting code, but the way circuits are shrinking, any storage node could become susceptible to radiation. So beyond just memory structures, we may have to deal with fault tolerance throughout the chip.
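The "error-correcting code" mentioned here is worth a concrete miniature. The sketch below implements a Hamming(7,4) code, the textbook single-error-correcting scheme; real memory subsystems use wider SECDED codes over 64-bit words, so treat this purely as an illustration of the principle, not of any Alpha design.

```c
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits into a 7-bit codeword (bit 0 = position 1). */
static uint8_t ham_encode(uint8_t d)          /* d: data bits d1..d4 in bits 0..3 */
{
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;                /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;                /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;                /* covers positions 4,5,6,7 */
    return p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

/* Correct any single flipped bit and return the 4 data bits. */
static uint8_t ham_decode(uint8_t cw)
{
    /* Recompute each parity over the positions it covers; the syndrome
     * is the 1-based position of a single flipped bit (0 = no error). */
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = s1 | (s2 << 1) | (s3 << 2);
    if (syndrome)
        cw ^= 1u << (syndrome - 1);           /* flip the bad bit back */
    return ((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
           (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3);
}

int main(void)
{
    for (uint8_t d = 0; d < 16; d++)
        for (int flip = 0; flip < 7; flip++) {
            uint8_t cw = ham_encode(d) ^ (1u << flip);   /* inject a soft error */
            if (ham_decode(cw) != d)
                printf("correction failed for d=%u flip=%d\n", (unsigned)d, flip);
        }
    printf("all single-bit upsets corrected\n");
    return 0;
}
```

Every one of the 16 data values survives any single injected bit flip, which is the property that lets a memory array ride out an occasional soft-error upset.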

Recent research concentrated on soft-error rates due to alpha particles. Yet it now appears that gamma radiation is an order of magnitude more significant. In the future, as geometries get even smaller and as stored charge gets smaller, designers may have to assume that soft errors happen fairly frequently. We'll have to deal with that at the logic level.

Page 10: Challenges and trends in processor design

A fourth issue of concern is on-chip process variation. With smaller geometries and larger chips, we’re seeing process variation on a single die, which adds to signal skews between one part of the chip and another. This adds to the uncertainty, which basically limits the ability to design high-speed circuits. So that’s yet another set of analyses to do, another set of effects to design for.


The fifth issue concerns the human aspects of these projects: managing large teams isn't easy. Clearly, large engineering teams are not unusual outside chip design, but inside it's somewhat new.

Limits of parallelism. Perhaps one obstacle I see is that there may be fundamental limits to parallelism. How much parallelism can we extract from software? In part, the way software is written today and the language structures it uses are factors that limit the amount of extractable parallelism. Of course, it takes a while for software to migrate to new programming languages and models; software isn't that soft anymore. But we're already working on diminishing returns, doing twice as much development work to extract less and less benefit. Parallelism is a rock we're squeezing awfully hard.

Continuous profiling. Another trend we'll see for putting performance improvements in software is this idea of continuous application profiling. Continuously executing an application and profiling the execution gives us information to use in post-optimizations of the order of the code. That's a technique Digital's using in several of our tools, and it pays big dividends.
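A stripped-down model of that feedback loop appears below: count how often each routine runs, then emit a hot-to-cold ordering that a later build step could use to lay out code. Digital's actual continuous-profiling tools sample the running system rather than instrumenting it, and the routine names and counts here are invented for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

#define NROUTINES 3
static const char *name[NROUTINES] = { "parse", "layout", "render" };
static unsigned long count[NROUTINES];

static void profiled_call(int id) { count[id]++; /* real work elided */ }

/* Comparator: sort routine indices by descending call count. */
static int hotter(const void *a, const void *b)
{
    unsigned long ca = count[*(const int *)a], cb = count[*(const int *)b];
    return (cb > ca) - (cb < ca);
}

int main(void)
{
    /* Pretend workload in which "layout" dominates. */
    for (int i = 0; i < 1000; i++) profiled_call(1);
    for (int i = 0; i < 200;  i++) profiled_call(2);
    for (int i = 0; i < 50;   i++) profiled_call(0);

    int order[NROUTINES] = { 0, 1, 2 };
    qsort(order, NROUTINES, sizeof order[0], hotter);

    puts("suggested code layout (hot first):");
    for (int i = 0; i < NROUTINES; i++)
        printf("  %s (%lu calls)\n", name[order[i]], count[order[i]]);
    return 0;
}
```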

Validation, buses, and memory. As for the validation problem, we seem to have that under control. We were able to functionally verify the Alpha 21264 efficiently, booting all the operating systems and running the application software on the first pass.

Bus speeds won't necessarily be slow; they will probably scale. Bandwidth will also improve dramatically because area-array flip-chip packaging will let us implement much wider buses. There are also tricks (some have been used, some haven't) to hide latency. Cache hierarchy and cache efficiency improvements will also mitigate the relative effect of slow buses. In the future, multimegahertz buses will become common.

A similar set of arguments could be made for memory. For the first time in many years, new technologies (Rambus, SyncLink, and DDR RAM) are improving memory speeds. This, coupled with improvements in cache hierarchy and latency hiding, will mitigate the effects of the differences in speed between the internal processor and external memory.


Design drivers. At least three applications will probably drive the computer industry. The first is a generic category I call "modeling reality." Virtual reality is an example, as are games. The computational power required by these new games is quite remarkable; they and the whole entertainment field will drive a lot of computational needs.

Similarly, this whole idea of natural interfaces with computers (speech recognition, and so on) can consume lots of computational power and make computers more usable.

We also have to acknowledge that data sets and databases are growing large. There’s just far more data, and we want to manipulate it. So transaction processing will drive some development.

Design process issues. When it comes to schedules, I see two types of products: new designs and leveraged designs. I'm not sure there's been a significant improvement in the design time around major new redesigns of a machine; more of the products seem to be leveraged designs, such as those that move to a new process generation. Good engineering says you ought to leverage your design for everything you can, but the full-blown, soup-to-nuts CPU still takes time to design. I'm not sure we've improved in the last five years.

One potential threat to the business is important to mention: There's a shortage of qualified design engineers in the custom design space. Several companies appear to be fighting tooth and nail for those qualified designers. Few colleges really teach people how to design high-speed microprocessors. Squeezing the last 50 percent out of the silicon requires circuit expertise and attention to physical details. You don't find too many people who can deal with on-chip process variation and on-chip signal integrity. There are also reliability issues: how do we design fast circuits that are immune to various failure mechanisms?

Another problem is that most microprocessor companies really have to design their own analysis tools and essentially their own CAD packages to accommodate leading-edge problems. Maybe that's a good indicator of whether you're really designing a high-speed microprocessor: If you use standard tools, you're not.

It's important to realize that none of these are fundamental limits to processor performance. As we move from generation to generation, the problems just get harder, and that makes our business fun.

Paul I. Rubinfeld is a senior engineering manager for microprocessor development at Digital Semiconductor, where he manages development of the Alpha 21164 microprocessors as well as chipset development for the entire Alpha family. He has worked on the VAX and PDP-11 CPU development projects. Rubinfeld received a BS and an MS in electrical engineering from Carnegie Mellon University.

Page 11: Challenges and trends in processor design

Introduction to Predicated Execution Wen-mei Hwu University of Illinois, Urbana-Champaign

The story of Merced, Intel's first processor based on its next-generation 64-bit architecture, will continue to unfold in 1998. Intel expects this product of its collaboration with Hewlett-Packard to reach volume production in 1999. To date, however, the two companies have released few details about Intel Architecture 64 (IA-64). One significant change they did admit to at the October 1997 Microprocessor Forum was the switch to full predicated execution, a technique that no other commercial general-purpose processor employs.

Computer wanted to give its readers advance notice of this promising technique. We invited Wen-mei Hwu, a prominent researcher in this area, to explain predication, a topic you may be hearing more about in 1998. --Janet Wilson

Predicated execution is a mechanism that supports the conditional execution of individual operations.1 Compared to a conventional instruction set, an operation in a predicated-execution architecture has an additional input operand, a predicate, that can assume a value of true or false. During runtime, a predicated-execution processor fetches operations regardless of their predicate value. The processor executes operations with true predicates normally; it nullifies operations with false predicates and prevents them from modifying the processor state.
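The execution rule is easy to model in a few lines of C. The register counts, the single ADD-style operation, and the encoding below are invented for illustration; they are not the IA-64 or PlayDoh definitions, only the fetch, compute, and commit-or-nullify behavior the paragraph above describes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_REGS  8
#define NUM_PREDS 4

static int32_t regs[NUM_REGS];
static bool    preds[NUM_PREDS];

typedef struct { int dest, src1, src2, pred; } add_op;

static void execute(add_op op)
{
    int32_t result = regs[op.src1] + regs[op.src2];  /* fetched and computed
                                                        regardless of predicate */
    if (preds[op.pred])
        regs[op.dest] = result;   /* true predicate: commit normally            */
    /* false predicate: operation is nullified, architectural state untouched   */
}

int main(void)
{
    regs[1] = 10; regs[2] = 32;
    preds[0] = true; preds[1] = false;

    execute((add_op){ .dest = 3, .src1 = 1, .src2 = 2, .pred = 0 });
    execute((add_op){ .dest = 4, .src1 = 1, .src2 = 2, .pred = 1 });

    printf("r3 = %d (committed), r4 = %d (nullified)\n",
           (int)regs[3], (int)regs[4]);
    return 0;
}
```

Running it prints r3 = 42 (committed), r4 = 0 (nullified): both operations were fetched and computed, but only the one guarded by a true predicate touched architectural state.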

Using predication inherently changes the representation of a program's control flow. A conventional instruction set requires all control flow to be explicitly represented in the form of branches, the only mechanism available to conditionally execute operations. An instruction set with predicated execution, however, can support conditional execution via either conventional branches or predicated operations.

INSTRUCTION SET ENHANCEMENTS

To effectively support predicated execution, an instruction set should provide

* the means to specify a predicate operand to individual operations,
* a predicate register file,
* a nullification mechanism, and
* a set of predicate-defining operations.

Adding a predicate operand to conventional instruction formats requires extra bits in the instruction encoding. Instruction sets that must stay within a 32-bit encoding are typically limited to adding predicate operands to only operations that take fewer than 32 bits to encode, such as MOV operations. This is referred to as partial predication. Adding a predicate operand to an operation also increases the number of operand values the processor must read and write from architectural registers during runtime. Using integer registers to hold predicates would require more registers as well as more ports to the register file. An efficient solution is to specify a separate predicate register file. A study by Scott Mahlke and colleagues showed that partial and full predication can both result in significant performance improvement to conventional instruction sets.2

Special predicate-defining operations are used to compute predicates. Such operations replace branch operations as the compiler generates code. Each branch thus replaced results in two predicates, one for operations along the branch's taken path and another for those on the fall-through path. Thus, each predicate-defining operation should simultaneously define two predicates. Figure 1 (on the next page) illustrates this. In traditional code with branches, shown in Figure 1a, two branch operations jointly choose one of three possible execution paths. Figure 1b shows the predicates assigned to each branch. In Figure 1c, the code for predicated execution uses two predicate-defining operations. The first operation defines two predicates according to a set of U-type rules: Predicate p1 is true if the original branch condition is false; predicate p2 assumes the complementary value of p1. In the second predicate-defining operation, the value of p1 indicates whether the program should reach the corresponding branch in the conventional code. If not, then neither of the two possible paths (controlled by p3 and p4) should be activated. Thus, the second predicate instruction may define both p3 and p4 as false values. Figure 1d summarizes the U-type rules for determining such destination predicates of a predicate-defining operation.

These are simple examples; Hewlett-Packard's PlayDoh architecture specification discusses predicate-defining operations based on more advanced rules.3 Advanced rules allow the compiler to generate predicated code for more sophisticated program control structures. They also allow the compiler to perform implicit control flow optimizations on predicated code.

Page 12: Challenges and trends in processor design

Figure 1. Example of branching code: (a) C source code, (b) assembly code control flow diagram, (c) predicated code, and (d) U-type rules for predicate-defining operations.

COMPILER SUPPORT

Predication is most commonly utilized in a compiler by employing if-conversion. This technique converts conditional branches in an acyclic region of the control flow into predicate-defining operations. With if-conversion, a straight-line sequence of predicated code can replace complex nets of branching code. As Figure 1 shows, the execution of the code after if-conversion does not involve any branch. This means the compiler can eliminate problematic branches and avoid their associated overhead. It also facilitates increased instruction-level parallelism by allowing the compiler to overlap the execution of separate control flow paths. This allows the processor to simultaneously execute multiple paths from a single thread of control. An important benefit of predication not illustrated in Figure 1 is that it allows overlapping and independent control flow constructs without expanding the code.
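Since Figure 1 itself is not reproduced in this transcript, the sketch below gives a generic before-and-after view of if-conversion in C. The guarded statements in the second function model predicated operations; on a real predicated ISA both would be issued and the one with a false predicate simply nullified, whereas plain C still has to express the guards as tiny branches.

```c
#include <stdio.h>

/* Original, branching form. */
static int with_branches(int a, int b, int x)
{
    if (a > b)
        x = x + a;      /* taken path        */
    else
        x = x - b;      /* fall-through path */
    return x;
}

/* If-converted form: p1/p2 play the role of a complementary predicate
 * pair defined by one predicate-defining operation. */
static int if_converted(int a, int b, int x)
{
    int p1 = (a > b);       /* predicate-defining "operation" */
    int p2 = !p1;           /* complementary predicate        */
    if (p1) x = x + a;      /* executes only under p1         */
    if (p2) x = x - b;      /* executes only under p2         */
    return x;
}

int main(void)
{
    for (int a = -1; a <= 1; a++)
        printf("a=%d: branchy=%d, predicated=%d\n",
               a, with_branches(a, 0, 100), if_converted(a, 0, 100));
    return 0;
}
```

Both functions compute the same result; the if-converted form just makes the two complementary predicates explicit, which is what lets the compiler schedule operations from both paths together.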

Providing compiler support for predicated execution is challenging. Current optimizing compilers rely on control flow representation as the foundation of analysis and optimization. Because predicated code changes the control flow representation, effectively handling it requires an extensive modification of the compiler infrastructure, particularly in the areas of classical and ILP optimizations, code scheduling, and register allocation. An effective compiler must balance the control flow and the use of predication.4 If resources become oversubscribed or dependence heights (the lengths of the chains of dependent operations) become unbalanced among paths, predicated execution can degrade performance.

Predicated execution started as a software approach to avoiding conditional branches in early supercomputers. Vector architectures such as the Cray 1 and array-processing architectures such as Illiac IV adopted predication in the form of mask registers to allow effective vectorization of loops with conditional branches. During the era of mini-supercomputers, the Cydrome Cydra 5 became the first machine to support generalized predication. Parallel to the Cydra 5, the Multiflow Trace machine adopted partial predication by introducing a single instruction with a predicate input, a select instruction. Contemporary processors, such as the DEC Alpha and the Sparc V9, have adopted the partial-predication approach so they can maintain a 32-bit instruction encoding.

In the future, integrating control and data speculation with predicated execution will enable advanced compiler techniques to increase the performance of future processors.5 With the adoption of advanced full predication support in IA-64 and perhaps many other architectures, predicated execution may become one of the most significant advances in the history of computer architecture and compiler design.

References

1. B.R. Rau et al., "The Cydra 5 Departmental Supercomputer," Computer, Jan. 1989, pp. 12-35.
2. S.A. Mahlke et al., "A Comparison of Full and Partial Predicated Execution Support for ILP Processors," Proc. Int'l Symp. Computer Architecture, ACM Press, New York, 1995, pp. 138-149.
3. V. Kathail, M.S. Schlansker, and B.R. Rau, HPL PlayDoh Architecture Specification: Version 1.0, Tech. Report HPL-93-80, Hewlett-Packard Labs, Palo Alto, Calif., 1994.
4. D.I. August, W.W. Hwu, and S.A. Mahlke, "A Framework for Balancing Control Flow and Predication," Proc. Micro-30, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 92-103.
5. W.W. Hwu et al., "Compiler Technology for Future Microprocessors," Proc. IEEE, IEEE Press, New York, 1995.

Wen-mei Hwu is a professor in the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. Contact him at [email protected].
