Performance and Flexibility for Multiple-Processor SoC Design

Yalagoud A. Patil

OUTLINE

• Introduction

• The limitations of traditional ASIC design

• Extensible processors as an alternative to RTL

• Toward multiple-processor SoCs

• Processors and disruptive technology

• Conclusions

Introduction

• The rapid evolution of silicon technology is bringing a new crisis to system-on-chip (SoC) design.

• One way to speed up the development of mega-gate SoCs is the use of multiple microprocessor cores to perform much of the processing currently relegated to RTL.

• A few characteristics of typical deep-submicron integrated circuit (IC) design illustrate the challenges facing SoC design teams:

In a generic 130-nm standard-cell foundry process, silicon density routinely exceeds 100K usable gates per mm².

In the past, silicon capacity and design-automation tools limited the practical size of a block of RTL to fewer than 100K gates.

The design complexity of a typical logic block grows much more rapidly than does its gate count, and system complexity increases much more rapidly than the number of constituent blocks.

The cost of a design bug is going up. Much is made of the rising cost of deep-submicron IC masks: the cost of a full 130-nm mask set is approaching $1M, and 90-nm masks may reach $2M.

All embedded systems now contain significant amounts of software.

Standard communication protocols are growing rapidly in complexity.

• In most markets, competitive forces drive the ever-increasing need to embrace new technologies.

• Just one CMOS process step, say from 180 nm to 130 nm, roughly doubles the available number of gates for a given die size and cost.

• The International Technology Roadmap for Semiconductors forecasts a slight slowing in the pace of density increases, but exponential capacity increases are expected to continue for at least the next decade, as shown in Figure.

• The trend toward the use of large numbers of RTL-based logic blocks and the mixing together of control processors and digital signal processors on the same chip is illustrated in Figure.

• This ceaseless growth in IC complexity is a central dilemma for SoC design.

• Unfortunately, general-purpose processors fall far short of the mark with respect to application throughput, cost, and power efficiency for the most computationally demanding problems.

• Designing custom RTL logic for these new, complex functions or emerging standards takes too long and produces designs that are too rigid to change easily.

• A closer look at the makeup of the typical RTL block in Figure gives insights into this paradox.

• In most RTL designs, the datapath consumes the vast majority of the gates in the logic block.

• For example, a packet-processing block will probably employ a datapath that closely corresponds to the packet header’s structure.

• This state machine may consume only a few percent of the block’s gate count, but it embodies most of the design and verification risk due to its complexity.

• One way to understand the risks associated with hardware state machines is to examine the combinatorial complexity of verification.

• A state machine with N states and I inputs may have as many as N^2 next-state equations, and each of these equations will be some function of the I inputs, or 2^I possible input combinations. Taken together, as many as N^2 × 2^I input combinations must be tried to test all the state transitions of this state machine exhaustively.

• Configurable, extensible processors, a fundamentally new form of microprocessor, provide a way of reducing the risk of state-machine design by replacing hard-to-design, hard-to-verify state-machine logic blocks with pre-designed, pre-verified processor cores and application firmware.
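
The verification estimate for hardware state machines (up to N squared next-state equations, each a function of 2 to the power I input combinations) can be made concrete with a short sketch. The function name and the sample sizes below are illustrative assumptions, not figures from the source; the point is only how quickly the bound grows.

```python
def exhaustive_transition_tests(num_states: int, num_inputs: int) -> int:
    """Upper bound N^2 * 2^I on the input combinations needed to exercise
    every possible state transition of a state machine exhaustively."""
    if num_states < 1 or num_inputs < 0:
        raise ValueError("need at least one state and a non-negative input count")
    return num_states ** 2 * 2 ** num_inputs


if __name__ == "__main__":
    # Hypothetical state-machine sizes, chosen only to show the growth rate.
    for n, i in [(4, 4), (16, 8), (32, 16)]:
        bound = exhaustive_transition_tests(n, i)
        print(f"N={n:>2}, I={i:>2}: up to {bound:,} input combinations")
```

Even a modest machine with 32 states and 16 inputs yields a bound in the tens of millions, which is why the state machine carries most of the verification risk despite its small gate count.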

THE LIMITATIONS OF TRADITIONAL ASIC DESIGN

• New chips are characterized by rapidly increasing logic complexity. Moore’s-law scaling of silicon density makes multi-million-gate designs feasible.

• When requirements change, however, especially when new modes and features must be added, RTL-level designs may not scale well, particularly if the original design and verification team is not available to do the redesign.

• The conventional SoC-design model closely follows the tradition of its predecessor: combining a standard microprocessor, standard memory, and RTL-built logic into an application-specific integrated circuit (ASIC).

• Most commonly, the processors used for these board-level designs are general-purpose reduced instruction set computing (RISC) processors originally designed in the 1980s for general-purpose UNIX desktops and servers.

• When all system components are combined on a single piece of silicon, clock frequency increases and power dissipation decreases relative to the equivalent board-level design.

• SoC architectures that are cloned from board-level designs are often organized around one or two 32-bit busses (often a fast memory bus plus a slow peripheral bus) because this approach saves pins, an expensive commodity in a board-level design but much less relevant to an SoC’s potential on-chip connections.

The Impact of SoC Integration

• Ironically, bus bottlenecks commonly disappear in SoC designs.

• Wide busses are efficient and appropriate to use between adjoining SoC logic blocks. The communications bandwidth between a processor and surrounding logic can exceed 1 GB per second on an SoC using these wider busses.

• Although few practical SoC designs will even approach this limit, wide on-chip busses create tremendous architectural headroom and invite a new, more effective approach to system architecture.
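
As a rough illustration of why wider on-chip busses raise the bandwidth ceiling, peak bus bandwidth can be estimated as bus width times clock rate. The sketch below assumes one transfer per clock cycle and uses hypothetical widths and clock frequencies that are not taken from the source; sustained bandwidth in a real design would be lower.

```python
def peak_bus_bandwidth(width_bits: int, clock_hz: int) -> int:
    """Peak bandwidth in bytes per second, assuming one transfer per cycle."""
    if width_bits <= 0 or clock_hz <= 0:
        raise ValueError("bus width and clock frequency must be positive")
    return width_bits * clock_hz // 8


if __name__ == "__main__":
    # Hypothetical configurations: a board-style 32-bit bus and a wide on-chip bus.
    for label, width, clock in [
        ("32-bit bus at 100 MHz", 32, 100_000_000),
        ("128-bit bus at 200 MHz", 128, 200_000_000),
    ]:
        gb_per_s = peak_bus_bandwidth(width, clock) / 1e9
        print(f"{label}: {gb_per_s:.1f} GB/s peak")
```

Under these assumptions the 32-bit bus peaks at 0.4 GB/s while the 128-bit bus peaks at 3.2 GB/s, consistent with the claim that wide on-chip busses can exceed 1 GB per second.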

The Limitations of General-Purpose Processors

• The traditional approach to SoC design is further constrained by the origins and evolution of microprocessors.

• These processors were designed to serve general-purpose applications and were structured for implementation as stand-alone integrated circuits.

• The general-purpose nature of these processors makes them well suited to the extremely diverse mix of applications run on computer systems.

• Even the most silicon-intensive, deeply pipelined, super-scalar, general-purpose processors can rarely sustain much more than two instructions per cycle (IPC), and the harder processor designers push against this IPC limit, the higher the cost and power per unit of useful performance extracted from the microprocessor architecture.

• A digital camera may perform a variety of complex image processing, but it never executes Structured Query Language (SQL) database queries.

• The specialized nature of individual embedded applications creates two issues for general-purpose processors in data-intensive embedded applications.

• First, there is a poor match between the critical functions of many embedded applications (e.g., image, audio, protocol processing) and a RISC processor’s basic integer instruction set and register file.

• Second, the more focused embedded devices cannot take full advantage of all of a general-purpose processor’s broad capabilities.

• Instead, designers have traditionally turned to hard-wired circuits to perform these data-intensive functions such as image manipulation, protocol processing, signal compression, encryption, and so on.

DSP as Application-Specific Processor

• DSPs are often used in tandem with RISC controllers on SoCs, especially when the end application calls for a mix of control and signal processing.

• The emergence of complex very long instruction word (VLIW) DSPs such as the Texas Instruments C6000 family and the StarCore architecture reflects this “quest for generality.”

• In many cases a programmable DSP would be attractive, but only if it could be sufficiently fast in the application to rival RTL performance.

• In the past 10 years, the wide availability of logic synthesis and ASIC design tools has made RTL design the standard for hardware developers.

• Because they are not attempts to solve application-arbitrary sequential problems, RTL designs avoid the general-purpose, single-processor performance bottlenecks.

Extensible processors as an alternative to RTL

• Hardwired RTL design has many attractive characteristics: small area, low power, and high throughput.

• Application-specific processors as a replacement for complex RTL fit this need.

• The Origins of Configurable Processors:

• A processor had to be “a jack of all trades, master of none.”

• Research in application-specific instruction-set processors (ASIPs), especially in Europe (code generation at IMEC, processor specification at the University of Dortmund, micro-code engines [“transport-triggered architectures”] at the Technical University of Delft, and fast simulation at the University of Aachen), confirmed the possibility of developing a fully automated system for designing processors.

Configurable, Extensible Processors

• Like RTL-based design using logic synthesis, extensible-processor technology allows the design of high-speed logic blocks tailored to the assigned task.

• All these software-development tools are built for exactly the same architecture by the processor generator from the same definition used to build the processor itself.

• By generating the processor from a high-level description, the system designer controls all the relevant cost, performance, and functional attributes of the processor subsystem without having to become a microprocessor design expert.

• The four key questions for the use of configurable and extensible processors in SoCs are these:

1. What target characteristics of the processor can be configured and extended?

2. How does the system designer capture the target characteristics?

3. What are the deliverables (the hardware and software components) to the system designer?

4. What are typical results for building new platforms to address emerging communications and consumer applications?

• To be useful for practical SoC development, configuration of the processor must meet two important criteria:

1. The configuration mechanism must accelerate and simplify the creation of useful configurations.

2. The generated processor must include complete hardware descriptions, software-development tools, and verification aids.

• A range of extensible or configurable processors is now widely available. Configurable products can be roughly categorized into five groups:

• Non-architectural processor configuration

• Fixed menu of processor architecture configurations

• User-modifiable processor RTL

• Processor extension using an instruction-set description language

• Fully automated processor synthesis.

• The logical equivalent of the RTL datapaths is implemented using the integer pipeline of the base processor and additional execution units, registers, and other functions added by the chip architect for a specific application.

• This design migration from hardwired state machine to firmware program control has important implications:

1. Flexibility

2. Software-based development

3. Faster, more complete system modeling

4. Unification of control and data

5. Time-to-market

• Configurable and Extensible Processor Feature

• Extending a Processor

• Exploiting Extensibility

• The Impact of Extensibility on Performance

• Extensibility and Energy Efficiency
