COPING WITH LATENCY IN SOC DESIGN

Luca P. Carloni
Alberto L. Sangiovanni-Vincentelli
University of California, Berkeley

LATENCY-INSENSITIVE DESIGN IS THE FOUNDATION OF A CORRECT-BY-CONSTRUCTION METHODOLOGY FOR SOC DESIGN. THIS APPROACH CAN HANDLE LATENCY'S INCREASING IMPACT ON DEEP-SUBMICRON TECHNOLOGIES AND FACILITATE THE REUSE OF INTELLECTUAL-PROPERTY CORES FOR BUILDING COMPLEX SYSTEMS ON CHIPS, REDUCING THE NUMBER OF COSTLY ITERATIONS IN THE DESIGN PROCESS.

0272-1732/02/$17.00 © 2002 IEEE
As commercial demand for system-on-a-chip (SOC)-based products grows, the effective reuse of existing intellectual-property design modules (also known as IP cores) is essential to meet the challenges posed by deep-submicron (DSM) technologies and to complete a reliable design within time-to-market constraints. Originally, IP cores were mostly functional blocks built for previous design generations within the same company. Frequently, today's IP cores are optimized modules marketed as off-the-shelf components by specialized vendors.

An IP core must be both flexible—to collaborate with other modules within different environments—and independent from the particular details of one-among-many possible implementations. The prerequisite for an easy trade, reuse, and assembly of IP cores is the ability to assemble predesigned components with little or no effort. The consequent challenge is addressing the communication and synchronization issues that naturally arise while assembling predesigned components.

We believe that the semiconductor industry will experience a paradigm shift from computation- to communication-bound design: The number of transistors that a signal can reach in a clock cycle—not the number that designers can integrate on a chip—will drive the design process. The "From computation- to communication-bound design" sidebar discusses this trend. The strategic importance of developing a communication-based design methodology naturally follows. Researchers have proposed the principle of orthogonalization of concerns as essential to dealing with the complexity of systems-on-a-chip design.1

Here, we address the orthogonalization of communication versus computation—in particular, separating the design of the block functionalities from communication architecture development.

An essential element of communication-based design is the encapsulation of predesigned functional modules within automatically generated interface structures. Such a strategy ensures a correct-by-construction composition of the system. Latency-insensitive design and the recycling paradigm are a step in this direction.

Coping with volatile latency

High-end-microprocessor designers have traditionally anticipated the challenges that ASIC designers are going to face in working with the



next process generation. Latency is increasingly affecting the design of state-of-the-art microprocessors. (Latency, for example, drove the design of so-called drive stages in the new hyperpipelined Netburst microarchitecture2 of Intel's Pentium 4.) The "Impact of latency in the one-billion-transistor microprocessor design" sidebar (on p. 27) discusses latency as a force shaping this future architecture, which is expected within the next 10 years.

We have argued that interconnect latency will have a significant impact on the design of the communication architecture among the modules of a system on a chip. However, it is difficult to estimate the actual interconnect latency early in the design cycle because several phenomena affect it. Process variations, crosstalk, and power supply drop variations all affect interconnect latency. Furthermore, their combined effect can vary across chip regions and periods of chip operation. Hence, finding the exact delay value for a global wire is often impossible. On the other hand, relying on conservative estimates to establish value intervals can lead to suboptimal designs.

We believe that recently proposed design flows advocating new CAD tools that couple logic synthesis and physical design will suffer from the impracticality of accurately estimating the latency of global wires. Indeed, logic

SEPTEMBER–OCTOBER 2002

According to the "2001 Technology Roadmap for Semiconductors," the historical half-pitch technology gap between microprocessors and ASICs on the one hand and DRAM on the other will disappear by 2005.1 As we proceed into deep-submicron (DSM) technologies, the 130-nm microprocessor half-pitch in 2002 will shrink to the projected 32-nm length in 2013, allowing the integration of more than 1 billion transistors on a single die. At the same time, the ITRS predicts that the on-chip local-clock frequency will rise from today's 1 to 2 GHz, to 19 to 20 GHz. The so-called interconnect problem, however, threatens the outstanding pace of technological progress that has shaped the semiconductor industry. Despite the increase in metal layers and in aspect ratio, the resistance-capacitance delay of an average metal line with constant length is becoming worse with each process generation.2-3 The current migration from aluminum to copper metallization is compensating for this trend by reducing the interconnect resistivity. The introduction of low-k dielectric insulators may also alleviate the problem, but these one-time improvements will not suffice in the long run as feature size continues to shrink.4

The increasing resistance-capacitance delays combined with the increases in operating frequency, die size, and average interconnect length cause interconnect delay to become the largest fraction of the clock cycle time. In 1997, Computer magazine published a study by D. Matzke5 containing a gloomy forecast of how on-chip interconnect latency (predicted to soon measure in the tens of clock cycles) will hamper Moore's law. Since then, researchers have fiercely debated the real magnitude of the wire-delay impact and the consequent need for revolutionizing established design methodologies already challenged by the timing-closure problem.

The debate's core has been about the scaling properties of interconnect wires relative to gate scaling. On one side, some researchers share an optimistic view based on the fact that wires that scale in length together with gate lengths offer approximately a constant resistance and a falling capacitance. Hence, as long as designers adopt a modular design approach and treat functional modules of up to 50,000 gates as the main components, these researchers' position is that current design flows can sustain the challenges of DSM design.6

On the other side of the debate, researchers argue that the previous argument does not account for the presence in SOCs of many global wires that cannot scale in length. As Figure A7 (next page) shows, such wires must span multiple modules to connect distant gates, so they aren't scalable.

As Table A3 illustrates, the intrinsic interconnect delay of a 1-mm length wire for a 35-nm technology will be longer than the transistor delay by two orders of magnitude. Some researchers argue that, even under the best

From computation- to communication-bound design

Table A. Interconnect delay of 1-mm line vs. transistor delay for various process technologies.3

Technology (type of    MOSFET* switching       Minimum scaled 1-mm        Reverse scaled 1-mm        Ratio of the wire size
wire and substrate)    delay, approx. (ps)     interconnect intrinsic     interconnect intrinsic     to the minimum
                                               delay, approx. (ps)        delay, approx. (ps)        lithographic size
10 µm (Al, SiO2)       20                      5                          5                          1
0.1 µm (Al, SiO2)      5                       30                         5                          1.5
35 nm (Cu, low-k)      2.5                     250                        5                          4.5

*Metal-oxide-semiconductor field-effect transistor



synthesis tools presently suffer from several drawbacks. They are

• inherently unstable, and
• based on the synchronous design methodology.

Small variations to the hardware description language's input specification—like those a designer might make to fix a slow path—can lead to major variations in the output netlist and, consequently, in the final layout.

The synchronous design methodology is the foundation of the design flows for the majority of commercial chips today, but, if left unchanged, will lead to an exacerbation of the timing-closure problem for tomorrow's design flows. In fact, the main assumption in synchronous design is that the delay of each combinational path is smaller than the system clock period. By combinational path, we mean each signal path leaving a latch and traversing only combinational logic and wires to reach another latch. The slowest combinational path (critical path) dictates the maximum operating frequency for the system. However, it is often the case that the desired operating frequency represents a fixed design constraint. Then, once designers derive the final layout, every path with a delay longer than the desired clock period simply represents a design exception to be fixed. To fix these exceptions, designers use techniques ranging from wire buffering and transistor resizing to rerouting wires, replacing modules, and even redesigning entire portions of the system.

Replacing, rerouting, and redesigning clearly do not alleviate the timing-closure problem. Buffering is an efficient technique but carries precise limitations, because there is a limit to the number of buffers that designers can insert


IEEE MICRO

conditions, the latency to transmit a signal across the chip in a top-level metal wire will vary between 12 and 32 cycles, depending on the clock rate.8

In fact, although the number of gates reachable in a cycle will not change significantly and the on-chip bandwidth will continue to grow, the percentage of the die reachable within one clock cycle is inexorably and dramatically decreasing. Designs will soon reach a point where more gates can fit on a chip than can communicate in one cycle.5,7 Hence, instead of being limited by the number of transistors integratable on a single die (computation bound), designs will be limited by the amount of state and logic reachable within the required number of clock cycles (communication bound).

References
1. A. Allan et al., "2001 Technology Roadmap for Semiconductors," Computer, vol. 35, no. 1, Jan. 2002, pp. 42-53.
2. M.T. Bohr, "Interconnect Scaling: The Real Limiter to High Performance ULSI," Proc. 1995 IEEE Int'l Electron Devices Meeting, IEEE Press, Piscataway, N.J., 1995, pp. 241-244.
3. J.A. Davis et al., "Interconnect Limits on Gigascale Integration (GSI) in the 21st Century," Proc. IEEE, vol. 89, no. 3, Mar. 2001, pp. 305-324.
4. M.J. Flynn, P. Hung, and K.W. Rudd, "Deep-Submicron Microprocessor Design Issues," IEEE Micro, vol. 19, no. 4, July 1999, pp. 11-13.
5. D. Matzke, "Will Physical Scalability Sabotage Performance Gains?" Computer, vol. 30, no. 9, Sept. 1997, pp. 37-39.
6. D. Sylvester and K. Keutzer, "Rethinking Deep-Submicron Circuit Design," Computer, vol. 32, no. 11, Nov. 1999, pp. 25-33.
7. R. Ho, K. Mai, and M. Horowitz, "The Future of Wires," Proc. IEEE, vol. 89, no. 4, Apr. 2001, pp. 490-504.
8. V. Agarwal et al., "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 248-259.


Figure A. Local wires scale in length; global wires do not. The figure shows the different impact of technology scaling on devices, local wires, and global wires. The newer technology process in SOC (2) allows a higher level of integration than SOC (1) as more (and smaller) devices and modules are accommodated on the chip. Local wires connect devices within a module and shrink with the module. Global wires connect devices located in different modules and do not shrink because they need to span significant proportions of the die.7


on a given wire and still reduce delay. In fact, as a last resort, designers often must break long wires by inserting latches, which is similar to the insertion of new stages in a pipeline.

This operation, which trades off fixing a wire exception with increasing its latency by one or more clock cycles, will become pervasive in deep-submicron design, where most global wires will be heavily pipelined anyway. In fact, latency (measured in clock cycles) between SOC components will vary considerably based on the components' reciprocal distances—even without considering the need for increasing latency to further pipeline long global wires. Inserting latches (stateful repeaters) has a different impact on the surrounding control logic with respect to inserting buffers (stateless repeaters). If the interface logic design of two communicating components assumes a certain latency, then designers must rework it to account for additional pipeline stages. Such rework has serious consequences on design productivity.

On the one hand, the increasing impact of latency variations will drive architectures toward modular designs with an explicit global latency mechanism. In the case of multiprocessor architectures, latency variation will lead the designers to expose computation/communication tradeoffs to the software compilers. At the same time, the focus on latency will open the way to new synthesis tools that can automatically generate the hardware interfaces responsible for implementing the appropriate synchronization and communication strategies, such as channel multiplexing, data coherency, and so on. All these considerations lead us to propose a design methodology that


The evolution toward communication-bound design in microprocessors implies that the amount of state reachable in a clock cycle, and not the number of transistors, becomes the major factor limiting the growth of instruction throughput. Furthermore, the increasing interconnect latency will particularly penalize current memory-oriented microprocessor architectures that strongly rely on the assumption of low-latency communication with structures such as caches, register files, and rename/reorder tables. Recent studies employ cache-delay analysis tools that account for cache parameters as well as technology generation figures. These studies predict that in a 35-nm design running at 10 GHz, accessing even a 4-Kbyte level-one cache will require three clock cycles.1 In fact, this trend has already started to affect design: The architects of the Alpha 21264 adopted clustered functional units and a partitioned register file to continue offering high computational bandwidth despite longer wire delays. Similarly, Intel's Pentium 4 microprocessor presents two so-called drive stages in its new hyperpipelined Netburst microarchitecture; designers dedicated these stages purely to instruction distribution and data movement.2 Exposing interconnect latency to the microarchitecture and possibly to the instruction set will be key in controlling system performance. Several research teams have already started investigating this idea.3-8

In particular, Bill Dally proposes a model for a hypothetical 2009 multicomputer chip. He divides the proposed chip into 64 tiles (each containing a processor and a memory), where the processor-memory round-trip intratile communication latency is two cycles (1 ns). The worst-case latency for a corner-to-corner communication with the most distant memory is 56 cycles. More importantly, Dally points out the following features of his approach:

• On-chip communication bandwidth is not an issue because of the many wiring tracks.
• Unlike modern multiprocessors with their all-or-nothing locality, latency on the 2009 multicomputer varies continuously with distance.
• This approach controls latency by placing data near their point of use, not just at their point of use.

References
1. V. Agarwal et al., "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 248-259.
2. P. Glaskowski, "Pentium 4 (Partially) Previewed," Microprocessor Report, vol. 14, no. 8, Aug. 2000, pp. 10-13.
3. W.J. Dally and S. Lacy, "VLSI Architecture: Past, Present and Future," Proc. Advanced Research in VLSI Conf., IEEE Press, Piscataway, N.J., 1999, pp. 232-241.
4. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, New York, 2000, pp. 161-171.
5. C.E. Kozyrakis et al., "Scalable Processors in the Billion-Transistor Era: IRAM," Computer, vol. 30, no. 9, Sept. 1997, pp. 75-78.
6. R. Ho, K. Mai, and M. Horowitz, "The Future of Wires," Proc. IEEE, vol. 89, no. 4, Apr. 2001, pp. 490-504.
7. E. Waingold et al., "Baring It All to Software: Raw Machines," Computer, vol. 30, no. 9, Sept. 1997, pp. 86-93.
8. R. Nagarajan et al., "A Design Space Evaluation of Grid Processor Architectures," Proc. 34th Ann. Int'l Symp. Microarchitecture (Micro-34), IEEE CS Press, Los Alamitos, Calif., 2001, pp. 40-51.

The impact of latency in the one-billion-transistor microprocessor design


guarantees the robustness of the system's functionality and performance with respect to arbitrary latency variations.

Latency-insensitive design

The foundation of latency-insensitive design is the theory of latency-insensitive protocols.3 A latency-insensitive protocol controls communication among components of a patient system—a synchronous system whose functionality depends only on the order of each signal's events and not on their exact timing. Designers can model a synchronous system as a set of modules. These modules communicate by exchanging signals on a set of point-to-point channels. The protocol guarantees that a system, if composed of functionally correct modules, behaves correctly, independent from delays in the channels connecting the modules. Consequently, it's possible to automatically synthesize a hardware implementation of the system such that its functional behavior is robust with respect to large variations in communication latency.4 In practice, the channel implementation can vary and does not necessarily follow the point-to-point structure. Still, this methodology orthogonalizes computation and communication because it separates the module design from the communication architecture options, while enabling automatic synthesis of the interface logic. This separation is useful in two ways:

• It simplifies module design because designers can assume the synchronous hypothesis. That is, intermodule communication will take no time (or, in other words, it's completed within one virtual clock cycle).
• It permits the exploration of tradeoffs in deriving the communication architecture up to the design process' late stages, because the protocol guarantees that the interface logic can absorb arbitrary latency variations.

In an earlier work, we discussed the application of the latency-insensitive methodology to the case where the channels are simply implemented as sets of point-to-point metal wires.3 Here, our discussion addresses an immediate advantage of this methodology: After physical design, it lets designers pipeline any long wire having a delay larger than the desired clock period into shorter segments by inserting special memory elements called relay stations. Hence, we divide the latency-insensitive design flow into four basic steps:

1. Specification of synchronous components. Designers must first specify the system as a collection of synchronous components, relying on the synchronous hypothesis. These components can be custom modules or IP cores; we refer to them as pearls.

2. Encapsulation. Next, it's necessary to encapsulate each pearl within an automatically generated shell. A shell is a collection of buffering queues—one for each port—plus the control logic that interfaces the pearl with the latency-insensitive protocol.

3. Physical layout. In this traditional step, designers use logic synthesis and place-and-route tools to derive the layout of the chip implementing the system.

4. Relay station insertion. Designers segment every wire whose latency is greater than the clock period by distributing the necessary relay stations. Relay stations are similar to regular pipeline latches and, therefore, inherently different from traditional buffering repeaters.

The only precondition this methodology requires is that the pearls are stallable, that is, they can freeze their operation for an arbitrary time without losing their internal state. This is a weak requirement because most hardware systems can be made stallable by, for instance, implementing a gated-clock mechanism. But this precondition is sufficient to permit the automatic synthesis of a shell around the pearl such that the shell-pearl combination satisfies the necessary, and much stronger, patience property.3
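As a rough software analogy (ours, not from the article), a stallable module can be modeled as a state machine whose clock tick is simply skipped while a stall input is asserted, so that its internal state is preserved for an arbitrary number of cycles:

```python
class StallablePearl:
    """Toy model of a stallable module: an accumulator whose state
    freezes while stall is asserted (roughly what a gated-clock
    mechanism achieves in hardware). Names are illustrative."""

    def __init__(self):
        self.state = 0

    def clock(self, x, stall=False):
        if stall:
            return None          # clock gated: state held, no new output
        self.state += x          # example next-state function
        return self.state
```

Stalling for any number of cycles and then resuming yields the same result sequence as an uninterrupted run, which is exactly the property the shell relies on.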

Figure 1 illustrates a system where five automatically generated shells encapsulate five pearls to make them patient. The resulting shell-pearl components communicate by means of eight point-to-point communication channels that implement the back-pressure mechanism described in the next section. Finally, the insertion of six relay stations enables channel pipelining to meet the system clock period.



Channels and back-pressure

Channels are point-to-point unidirectional links between a source and a sink module. Data are transmitted on a channel by means of packets that consist of a variable number of fields. Here, we consider only two basic fields: payload, which contains the transmitted data; and void, a one-bit flag that, if set to 1, denotes that no data are present in the packet. If a packet does contain meaningful payload data (that is, it has void set to 0), we call it a true packet.
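In software terms, this packet format might be sketched as follows (a minimal illustration of ours; the field names payload and void follow the text):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Packet:
    payload: Any = None   # transmitted data (meaningful only when void == 0)
    void: int = 1         # one-bit flag: 1 = no data present in the packet

def true_packet(data):
    """A packet carrying meaningful payload data."""
    return Packet(payload=data, void=0)

def sink_retrieve(queue, pkt):
    """A sink stores a true packet on its input queue for later use and
    discards a void one, basing its decision on the void field value."""
    if pkt.void == 0:
        queue.append(pkt.payload)
```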

A channel consists of wires and relay stations. The number of relay stations in a channel is finite and represents the channel's buffering capability. At each clock cycle, the source module can either put a new true packet on the channel or, when no output data are available, put a void packet on it. Conversely, at each clock cycle the sink module retrieves the incoming packet from the channel. It then discards or stores the packet on the input channel queue for later use, basing its decision on the void field value.

A source module might not be ready to send a true packet, and a sink module might not be ready to receive it if, for instance, its input queue is full. However, the latency-insensitive protocol demands fully reliable communication among the modules, and does not allow lossy communication links; it requires the proper delivery of all packets. Consequently, the sink module must interact with the channel (and ultimately with the corresponding source module) to momentarily stop the communication flow and avoid the loss of any packet. Therefore, we slightly relax our definition of a channel as unidirectional to allow a bit of information (the channel stop flag) to move in the opposite direction. The dashed wires in Figure 1 represent this signal, which is similar to the NACK signal in request/acknowledge protocols of asynchronous design. See the "Latency-insensitive versus asynchronous design" sidebar for a discussion of the two approaches.

By setting the stop flag equal to one during a certain clock cycle, the sink module informs the channel that it can't receive the next packet, and the channel must hold the packet until the sink module resets the stop flag. Because the sink module and channel have limited buffering resources, a channel dealing with a sink module that requires a long stall might fill up all its relay stations and have to send a stop flag to the source module to stall its packet pro-


Figure 1. Shell encapsulation, relay station insertion, and channel back-pressure.

Latency-insensitive versus asynchronous design

The latency-insensitive design methodology is clearly reminiscent of many ideas that the asynchronous-design community has proposed during the past three decades.1-2 Informally, asynchronous design's underlying assumption is that the delay between two subsequent events on a communication channel is completely arbitrary. In the case of a latency-insensitive system, designers constrain this arbitrary delay to be a multiple of the clock period. The key point is that this type of discretization lets designers leverage well-accepted design tools and methodologies for the design and validation of synchronous circuits. In fact, the basic distinction between the two approaches is the specification of a latency-insensitive system as a synchronous system. Notice that we say specified because, from an implementation point of view, you could realize a latency-insensitive protocol using handshake-driven signaling techniques (for example, request-acknowledge mechanisms), which are typically asynchronous. On the other hand, synchronous systems specify a complex system as a collection of modules whose state is updated collectively in one zero-time step. This update is naturally simpler than specifying the same system as the interaction of many components whose state is updated following an intricate set of interdependency relations. IC designers could legitimately see latency-insensitive design as a compromise that accommodates important properties of asynchronous circuits within the familiar synchronous design framework.

References
1. W.A. Clark and C.E. Molnar, "The Promise of Macromodular Systems," Digest of Papers, 6th Ann. IEEE Computer Soc. Int'l Conf., IEEE Press, Piscataway, N.J., 1972, pp. 309-312.
2. I.E. Sutherland, "Micropipelines," Comm. ACM, vol. 32, no. 6, June 1989, pp. 720-738.


duction. This back-pressure mechanism controls the flow of information on a channel while guaranteeing that no packets are lost.
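A behavioral sketch of this mechanism (our simplification, with assumed names: the channel's relay stations are lumped into one bounded queue, and stop() is the flag the source observes):

```python
from collections import deque

class Channel:
    """Point-to-point channel whose finite buffering capability is
    lumped into one bounded queue; the stop flag flows backward."""

    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity

    def stop(self):
        # Raised toward the source when the channel can no longer
        # absorb another packet without losing one.
        return len(self.buf) >= self.capacity

    def cycle(self, pkt, sink_stop):
        """One clock cycle: deliver the oldest packet to the sink unless
        it raised stop, then accept the source's packet. The source must
        not send while it sees stop, so no packet is ever lost."""
        delivered = self.buf.popleft() if (self.buf and not sink_stop) else None
        if pkt is not None:
            assert len(self.buf) < self.capacity, "protocol violated: packet lost"
            self.buf.append(pkt)
        return delivered
```

With a capacity of two, two cycles of a stalled sink fill the channel and raise the stop flag toward the source, which must then stall its packet production.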

Shell encapsulation

Given a particular module M, we can devise a method to automatically synthesize an instance of a shell as a wrapper to encapsulate M. We can further interface the shell with the channels so that M becomes a patient system. To do so, the only necessary condition is that M be stallable. At each clock cycle, the module's internal computation must fire only if all inputs have arrived. The module shell's first task is to guarantee this input synchronization. Its second task is output propagation: At each clock cycle, if M has produced new output values and no output channel has previously raised a stop flag, then the shell can transmit these output values by generating new true packets. If the shell does not verify either of these two conditions, then it must retransmit the packet transmitted in the previous cycle as a void packet. In summary, a shell for module M cyclically performs the following actions:

1. It obtains incoming packets from the input channels, filters away the void packets, and extracts the input values for M from the payload fields of the true packets.
2. When all input values are available for the next computation, it passes them to M and initiates the computation.
3. It obtains the computation results from M.
4. If no output channel has previously raised a stop flag, it routes the result into the output channels.
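The four actions above might be sketched in software as follows (an illustrative model of ours, not the article's implementation; a pearl is reduced to a function that fires on a tuple of inputs, and packets are (payload, void) pairs):

```python
from collections import deque

class Shell:
    """Behavioral sketch of a shell: input synchronization plus output
    propagation around a stallable pearl."""

    def __init__(self, pearl, n_inputs, n_outputs):
        self.pearl = pearl
        self.in_queues = [deque() for _ in range(n_inputs)]
        self.n_outputs = n_outputs

    def receive(self, packets):
        # Action 1: filter away void packets; queue true-packet payloads.
        for q, (payload, void) in zip(self.in_queues, packets):
            if void == 0:
                q.append(payload)

    def step(self, stop_flags):
        # Actions 2-4: fire the pearl only if every input is available
        # and no output channel has raised a stop flag; otherwise stall
        # the pearl and emit void packets.
        if all(self.in_queues) and not any(stop_flags):
            outs = self.pearl(tuple(q.popleft() for q in self.in_queues))
            return [(v, 0) for v in outs]        # true packets
        return [(None, 1)] * self.n_outputs      # void packets
```

For example, a two-input adder pearl emits void packets until both operands have arrived, and stalls (without losing state) while its output channel raises stop.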

Figure 2 illustrates a simple unoptimized implementation of a shell wrapping a module with three input and two output channels. It's also possible to insert queues between the shell I/Os and M's I/Os to increase buffering and avoid fragmented stalling. Using queues lets designers optimize implementations. Because such implementations follow the protocol outlined previously, the shell-pearl combination operates in a way similar to that of an actor in a static data flow. Hence, designers can take advantage of useful properties,

IEEE MICRO

Figure 2. Shell encapsulation: Making an IP core patient. (The shell wraps the pearl module; each of the three input channels carries dataIn, voidIn, and stopOut signals, and each of the two output channels carries dataOut, voidOut, and stopIn signals; all are clocked.)


such as the ability to statically size the shell queues and to statically compute the performance of the overall system.

Relay stations

A relay station is a patient process communicating with two channels, ci and co. Let si and so be the signals associated with the channels. Further, let I(l, k, si), l ≤ k, denote the sequence of informative events (true packets) of si between the lth and kth clock cycles, and let |I(l, k, si)| denote the number of such events. A relay station satisfies two conditions: si and so are latency equivalent, and for all k,

|I(1, k − 1, si)| − |I(1, k, so)| ≥ 0
|I(1, k, si)| − |I(1, k − 1, so)| ≤ 2

The following is an example of relay station behavior, where τ denotes a stalling event (void packet) and ιi a generic informative event:

si = ι1 ι2 ι3 τ τ ι4 ι5 ι6 τ τ τ ι7 τ ι8 ι9 ι10 …
so = τ ι1 ι2 ι3 τ τ τ ι4 τ τ τ ι5 ι6 ι7 τ ι8 ι9 ι10 …

Notice that we don't further specify signals si and so; we don't even say, for example, that si is the input and so is the output. The definition of a relay station simply involves a set of relations (a protocol) between si and so, without any implementation detail. Still, it's clear that each informative event received on channel ci is later emitted on co, while the presence of a stalling event on co might induce a stalling event on ci in a later cycle. In fact, an informative event takes at least one clock cycle to pass through a relay station (minimum forward latency equal to one), and at most two informative events can arrive on ci while no informative events are emitted on co (internal storage capacity equal to two). Finally, one extra stalling event on co will move onto ci in at least one cycle (minimum backward latency equal to one). The double storage capacity of a relay station permits, in the best case, communication with maximum throughput (equal to one).

Because relay stations are patient processes, and patience is a compositional property, the insertion of a relay station in a patient system guarantees that the system remains patient.3 Further, because relay stations have minimum latencies equal to one, designers can repeatedly insert them into a channel to segment it, thereby increasing the channel's latency. Figure 3 illustrates a possible relay station implementation. Researchers have recently proposed other interesting relay station implementations.5

Figure 4 shows a finite state machine representing the control logic in the relay station of Figure 3. Basically, the relay station switches between two main states: processing, when

SEPTEMBER–OCTOBER 2002

Figure 3. Example relay station implementation. (Control logic plus a main register and an auxiliary register; signals dataIn, voidIn, and stopOut on one side, dataOut, voidOut, and stopIn on the other, all clocked.)

Figure 4. Finite state machine abstracting the relay station logic from Figure 3. (The machine moves between the Processing and Stalling states through the WriteAux and ReadAux states, with transitions guarded by the stopIn and voidIn flags.)

a flow of true packets traverses the relay station; and stalling, when communication is interrupted. Switching through states WriteAux and ReadAux governs the transition between these two states. These states ensure proper control of the auxiliary register to avoid losing a packet, because the stop flag takes a cycle to propagate backward through the relay station.
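The relay station's behavior (one-cycle forward latency, storage capacity of two, back-pressure via stopOut) can be sketched as a simple cycle-accurate model. This is our own illustrative code, not the RTL of Figure 3:

```python
# Cycle-accurate sketch of a relay station. Informative packets are
# payload values; a void packet (tau) is None.

class RelayStation:
    def __init__(self):
        self.main = None       # main register: drives the output
        self.aux = None        # auxiliary register: second storage place
        self.stop_out = False  # back-pressure raised toward the producer

    def cycle(self, packet_in, stop_in):
        """One clock edge: consume packet_in, honor stop_in, emit a packet."""
        # Forward the main register unless downstream has stalled us.
        out = None if stop_in else self.main
        if stop_in:
            # Downstream stalled: hold main; park any new arrival.
            if packet_in is not None:
                if self.main is None:
                    self.main = packet_in
                else:
                    self.aux = packet_in  # assumes producer obeyed stop_out
            # Both registers full: stall the producer (capacity of two).
            self.stop_out = self.main is not None and self.aux is not None
        else:
            # Downstream accepted main: refill from aux first, then input.
            if self.aux is not None:
                self.main, self.aux = self.aux, packet_in
            else:
                self.main = packet_in
            self.stop_out = False
        return out
```

A packet entering on one cycle is emitted at the earliest on the next (forward latency one), and the auxiliary register absorbs the one extra packet that arrives before the stop flag reaches the producer.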

Recycling

In general, an automatic tool should complete the insertion of relay stations as part of the physical design process (similarly to the buffer insertion techniques available in current design flows). In fact, the latency-insensitive methodology's main advantage is the freedom offered to designers in moving the latency after deriving the final implementation. Designers can fix problematic layouts without changing the design of individual modules and also explore and optimize latency-throughput tradeoffs up to the late stages of the design process. The recycling paradigm formally captures the latency variations of the communication channels as you add (and move) relay stations. It lets you exactly compute the system's final throughput.6

For recycling, we model a latency-insensitive system as a directed (possibly cyclic) graph G that associates each vertex with a shell-pearl pair and in which each arc corresponds to a channel. We annotate each arc a with length l(a) and weight w(a). The length denotes the smallest multiple of the desired clock period that is larger than the delay of the corresponding channel. The weight represents the number of relay stations inserted on the channel. We call arc a illegal if w(a) < l(a) − 1. The presence of illegal wires in the layout implies that the final implementation is incorrect. Thanks to the latency-insensitive methodology, we can correct the final layout by introducing relay stations to ensure that each wire's delay is less than the desired clock period. This insertion corresponds to incrementing the weight of each illegal arc a by the quantity ∆w(a) = l(a) − 1 − w(a).
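The legalization rule above can be sketched in a few lines (the data representation is our own: each arc is a dict carrying its length l, in clock periods, and its weight w, the number of relay stations already on the channel):

```python
# Legalization sketch: raise every illegal arc's weight to the minimum
# legal value, w(a) >= l(a) - 1.

def legalize(arcs):
    """Make every arc legal by applying delta_w(a) = l(a) - 1 - w(a)."""
    for a in arcs.values():
        if a["w"] < a["l"] - 1:            # arc is illegal
            a["w"] += a["l"] - 1 - a["w"]  # increment by delta_w(a)

# a1 spans three clock periods and needs two relay stations;
# a2 fits in one clock period and is already legal.
arcs = {"a1": {"l": 3, "w": 0}, "a2": {"l": 1, "w": 0}}
legalize(arcs)
```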

Although recycling is an easy way to correct the system's final implementation and satisfy the timing constraints imposed by the clock, it does have a cost. In fact, augmenting the weights of some arcs of G could increase the graph's maximum cycle mean λ(G), defined as maxC ∈ G {[w(C) + |C|] / |C|}, where C is a cycle of G and w(C) the sum of the weights of the arcs of C. Increasing λ(G) corresponds to increasing the cycle time of the system modeled by G and, symmetrically, decreasing its throughput ϑ(G).6 Throughput always decreases if any arc a with augmented weight w(a) belongs to the set of critical cycles of G, that is, those cycles whose mean coincides with the maximum cycle mean. Increasing the weights of some arcs might make a noncritical cycle of G become a critical cycle of recycled graph G′. In any case, after completing the recycling transformation, we can exactly compute the consequent throughput degradation ∆ϑ(G, G′) = ϑ(G) − ϑ(G′) using one of the following methods:

• Solve the maximum cycle mean problem for graph G′, and simply set ϑ(G′) = 1/λ(G′).

• After establishing the set Α of cycles having at least one arc with an augmented weight, increment the cycle mean of each element C of Α by the quantity (1/|C|) × ∑i ∆w(ai), where the ai are the arcs of C that have been corrected. Let λ* be the maximum among all these cycle means; then

∆ϑ(G, G′) = 0 if λ* ≤ λ(G), or [λ* − λ(G)] / [λ(G) × λ*] otherwise.

The critical cycle dictates overall throughput because the rest of the system must slow down to wait for it and avoid packet loss. In general, if G contains more than one strongly connected component, you must decompose the recycling transformation into two steps:

1. Legalization. Legalize G by augmenting the weights of the wires by the appropriate quantity.

2. Equalization. Compute the maximum throughput ϑ(Sk) = ak/bk ∈ ]0, 1] that is sustainable by each strongly connected component Sk ∈ G′; recall that ϑ(Sk) is equal to the inverse of the maximum cycle mean λ(Sk). Equalize the throughputs by adding a quantity nk ∈ Ζ* to the denominator of each ϑ(Sk). This corresponds to augmenting by nk the weight of the critical cycle Ck ∈ Sk, that is, to distributing nk extra relay stations among the corresponding paths. Solving an optimization problem gives the quantities nk.6

One key to avoiding large performance losses during recycling is to avoid augmenting the weights of those arcs belonging to critical cycle C of G. Furthermore, from the definition of maximum cycle mean, we have that for the same w(C), the smaller the cycle's cardinality |C|, the greater the loss in throughput for G. The worst case is clearly represented by self-loops. Designers must keep these considerations in mind while partitioning the system functionality into tasks assigned to different IP cores. It's true that the latency-insensitive methodology guarantees that no matter how poor the final system implementation (in terms of wire lengths in the communication architecture), it's always possible to fix the system by adding relay stations. Still, to achieve acceptable performance, designers should adopt a design strategy based on the following guidelines:

• All modules should put comparable timing constraints on the global clock (that is, delays of the longest combinatorial paths inside each module should be similar).

• Modules whose corresponding vertices belong to the same cycle should be kept close to each other while deriving the final implementation.

Figure 5 illustrates the functional diagram of an MPEG-2 video encoder whose corresponding graph, reported in Figure 6, contains six distinct cycles:

• C1 = {a9, a10, a12}
• C2 = {a9, a11, a14, a12}
• C3 = {a16, a17, a18, a19, a20}
• C4 = {a4, a5, a6, a7, a8, a9, a10, a13}
• C5 = {a4, a5, a6, a7, a8, a9, a11, a14, a13}
• C6 = {a6, a7, a8, a9, a11, a15, a17, a18, a19, a20}

Most arcs are common to more than one cycle; for example, a9 is part of C1, C2, C4, C5, and C6. Others are contained only in one

Figure 5. Block diagram of an MPEG-2 video encoder. (Blocks include preprocessing, frame memories, motion estimation and motion compensation, discrete cosine transform, quantizer (Q), inverse quantizer (Q)−1, inverse discrete cosine transform, variable-length code encoder, buffer, and regulator, with motion vectors and predictive frames flowing between them.)


cycle; for example, a15 is only part of C6. Finally, some arcs, such as a3, are not contained in any cycle. We already know that increasing the weight of these arcs does not affect the system performance. However, can we compute performance degradation in advance for the arcs belonging to one or more cycles? Figure 7 shows the results of an analysis based on the recycling approach. The six curves in the chart represent the previously mentioned cycles. Each point of curve Ci shows the degradation in system throughput detected after setting the total sum w(Ci) of the weights of Ci's arcs equal to integer x, with x ∈ [0, 20]. Obviously, the underlying assumption is that Ci is a critical cycle of G, limiting the choice of those arcs of Ci with augmentable weights. For example, assume that w(C2) = 5, as a result of summing w(a9) = 4, w(a11) = 1, w(a14) = 0, and w(a12) = 0. In this case, C2 is definitely not a critical cycle. In fact, its cycle mean is λ(C2) = (5 + 4)/4 = 9/4 = 2.250. Even if w(a10) = w(a12) = 0, cycle C1, with w(C1) = w(a9) = 4, has a larger cycle mean: exactly λ(C1) = (4 + 3)/3 = 7/3 ≈ 2.333.

As Figure 7 confirms, the best way to avoid losing performance is to increase the weights of those arcs that belong to longer cycles. Performing the recycling transformation this way might or might not be possible; for example, if the length of arc a10 is large, cycle C1 will ultimately dictate the system throughput. However, in general the latency-insensitive methodology lets us move relay stations around without redesigning any module. This capability could be useful, for example, in reducing the length of an arc such as a9, which belongs to both large and small cycles, while in exchange increasing the length of a15, which is only part of C6. This example assumes that, before rebalancing, l(a9) = 3 and l(a15) = 1, and that after rebalancing they become respectively 1 and 3. It also assumes that all other arcs have unit lengths, giving a final throughput of 10/(10 + 2) ≈ 0.833 instead of 3/(3 + 2) = 0.6, a 38 percent improvement.
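The arithmetic of this rebalancing example can be checked directly, using the fact that a critical cycle C with |C| arcs and w(C) relay stations sustains throughput |C| / (w(C) + |C|):

```python
# Checking the rebalancing numbers (cycle sizes taken from Figure 6).

def cycle_throughput(n_arcs, n_relay_stations):
    """Sustainable throughput of a critical cycle: |C| / (w(C) + |C|)."""
    return n_arcs / (n_relay_stations + n_arcs)

# Before rebalancing: l(a9) = 3 puts w(a9) = 2 relay stations on a9,
# which lies on the three-arc cycle C1 = {a9, a10, a12}.
before = cycle_throughput(3, 2)    # 3 / (3 + 2) = 0.6

# After rebalancing: the two relay stations move to a15, which lies
# only on the ten-arc cycle C6.
after = cycle_throughput(10, 2)    # 10 / (10 + 2) = 0.833...

gain = after / before - 1          # roughly 0.39, the ~38 percent gain
```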

Deep-submicron technologies suffer, and in the foreseeable future will continue to suffer, from delays on long wires that often force costly redesigns. Latency among various SOC components will vary considerably and in a manner that is difficult to predict. Latency-insensitive design is the foundation of a correct-by-construction methodology that lets us increase the robustness of a design implementation. This approach enables recovering from the


Figure 6. Latency-insensitive design graph of the MPEG-2 video encoder shown in Figure 5. The vertices of the graph correspond to blocks in the MPEG diagram; Source and Target represent, respectively, the input and output of the MPEG encoder.

Figure 7. Analysis of throughput degradation for the MPEG-2 encoder. (Each of the six curves plots throughput ϑ(G) against the total weight w(Ci) of one cycle, for w(Ci) ∈ [0, 20].)



arbitrary delay variations of the wires by changing their latency, leaving overall system functionality unaffected. A promising avenue for further research is the application of these concepts to the optimization of the computation/communication tradeoffs that arise while designing software compilers for machines having a communication architecture with variable latency. MICRO

References

1. K. Keutzer et al., "System Level Design: Orthogonalization of Concerns and Platform-Based Design," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, Dec. 2000, pp. 1523-1543.

2. P. Glaskowski, "Pentium 4 (Partially) Previewed," Microprocessor Report, vol. 14, no. 8, Aug. 2000, pp. 10-13.

3. L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli, "Theory of Latency-Insensitive Design," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 9, Sept. 2001, pp. 1059-1076.

4. L.P. Carloni et al., "A Methodology for 'Correct-by-Construction' Latency Insensitive Design," Proc. Int'l Conf. Computer-Aided Design (ICCAD 99), IEEE Press, Piscataway, N.J., 1999, pp. 309-315.

5. T. Chelcea and S.M. Nowick, "Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols," Proc. 38th Design Automation Conf. (DAC 01), ACM Press, New York, 2001, pp. 21-26.

6. L.P. Carloni and A.L. Sangiovanni-Vincentelli, "Performance Analysis and Optimization of Latency Insensitive Systems," Proc. 37th Design Automation Conf. (DAC 00), IEEE Press, Piscataway, N.J., 2000, pp. 361-367.

Luca P. Carloni is a PhD candidate in the Electrical Engineering and Computer Science Department of the University of California, Berkeley. His research interests include computer-aided design and design methodologies for electronic systems, embedded system design, logic synthesis, and combinatorial optimization. He has a Laurea in electrical engineering from the University of Bologna, Italy, and an MS in electrical engineering and computer science from the University of California, Berkeley.

Alberto L. Sangiovanni-Vincentelli holds the Buttner Chair in electrical engineering and computer science at the University of California, Berkeley. His research interests include all aspects of computer-aided design of electronic systems, hybrid systems and control, and embedded system design. He cofounded Cadence and Synopsys and received the Kaufman Award from the Electronic Design Automation Consortium for pioneering contributions to EDA. He is a fellow of the IEEE and a member of the National Academy of Engineering.

Direct questions and comments about this article to Luca P. Carloni, 211-10 Cory Hall, University of California at Berkeley, Berkeley, CA 94720; [email protected].

