Emerging Technologies of Computation
Montek Singh
COMP790-084, Oct 27, 2011
Introduction to Asynchronous Design
◦ What is asynchronous design?
◦ Why do we want to do it?
Data Representation and Communication
◦ How is data represented in an asynchronous system?
◦ How is information exchanged?
3
Introduction: Clocked Digital Design
Most current digital systems are synchronous:
◦ Clock: a global signal that paces operation of all components
Benefit of clocking: enables discrete-time representation
◦ all components operate exactly once per clock tick
◦ component outputs need to be ready by next clock tick
◦ allows “glitchy” or incorrect outputs between clock ticks
[Figure: clock signal]
4
Microelectronics Trends
Current and Future Trends: Significant Challenges
◦ Large-Scale “Systems-on-a-Chip” (SoC)
  – 100 million to 1 billion transistors/chip
◦ Very High Speeds
  – multiple gigahertz clock rates
◦ Explosive Growth in Consumer Electronics
  – demand for ever-increasing functionality …
  – … with very low power consumption (limited battery life)
◦ Higher Portability/Modularity/Reusability
  – “plug ’n play” components, robust interfaces
5
Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
◦ Chip will be partitioned into multiple timing domains
  – challenge: gluing together multiple timing domains
  – glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
Increasing Difficulties with Clocked Design:
◦ Clock distribution: requires significant designer effort
◦ Performance bottleneck: a single slow component
◦ Clock burns large fraction of chip power (~40-70%)
◦ Fixed clock rate: poor match for
  – designing reusable components
  – interfacing with mixed-timing environments
6
What is Asynchronous Design?
◦ Digital design with no centralized clock
◦ Synchronization using local “handshaking”
[Figure: a Synchronous System (Centralized Control) driven by a clock, and an Asynchronous System (Distributed Control) whose components are joined by handshaking interfaces]
7
Why Asynchronous Design? (1)
◦ Higher Performance
  – May obtain “average-case” operation (not “worst-case”): not limited by slowest component
  – Avoids overheads of multi-GHz clock distribution
◦ Lower Power
  – No clock power expended
  – Inactive components consume negligible power
◦ Better Electromagnetic Compatibility
  – Smooth radiation spectra: no clock spikes
  – Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
◦ Greater Flexibility/Modularity
  – Naturally adapt to variable-speed environments
  – Supports reusable components
8
Why Asynchronous Design? (2)
The world already is mostly asynchronous!
◦ Events at the level of (or in between) large-scale systems are asynchronous
  – several seconds to several milliseconds
  – e.g., PC-printer communication, keyboard inputs, network comm.
◦ Events at the board level (or between chips) are often asynchronous
  – milliseconds to 100 nanoseconds
  – e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
◦ Events within a chip, at the level of functional units (e.g., adders, control logic) are currently mostly synchronous
  – several nanoseconds to 100 picoseconds
◦ Events at the level of a single logic gate are asynchronous
  – 10 picoseconds
◦ Events at the quantum level are asynchronous
  – picoseconds to femtoseconds
So, why bother with clocks at all?!
◦ make everything asynchronous: greater elegance and robustness
9
Challenges of Asynchronous Design
◦ Hazards: potential “glitches” on wire
  – communication must be hazard-free!
  – special design challenge = “hazard-free synthesis”
◦ Testability Issues: absence of clock means no “single-stepping”
◦ Lack of Commercial CAD Tools: chicken-and-egg problem
[Figure: clean signals versus hazardous signals relative to the clock tick; glitches between ticks are no problem for clocked systems]
10
Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but …
… many recent technical advances:
◦ Hazard-Free Circuit Design:
  – several practical techniques for controllers [Stanford/Columbia]
◦ Design for Testability:
  – several test solutions, e.g. Philips Research
◦ Maturing Computer-Aided-Design (“CAD”) Tools:
  – software tools for automated design [Philips, Columbia, Manchester]
  – recent DARPA program [Boeing, Philips, UNC, Columbia, …]
◦ Successful Fabricated Chips:
  – embedded processors, high-speed pipelines, consumer electronics…
11
Recent Commercial Interest (1)
Several commercial asynchronous chips:
◦ Philips: asynchronous 80c51 microcontrollers
  – used in commercial pagers [1998] and smartcards [2001]
◦ Univ. of Manchester: async ARM processor [2000]
◦ Motorola: async divider in PowerPC chip [2000]
◦ HAL: async floating-point divider
  – in HAL-I and II processors [early 1990’s]
Recent experimental chips:
◦ IBM, Sun and Intel: fast pipelines, arbiters, instruction-length decoder…
◦ IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
◦ Handshake Solutions, Theseus Logic, Codetronix, Fulcrum, Silistix, …
12
Recent Commercial Interest (2)
Major DARPA program:
◦ ~$13M
◦ Goals:
  – commercial-strength automated CAD tool (= silicon compiler)
      – direct translation from algorithms to chip layout
      – capable of producing chips with 50M transistors or more
      – rich suite of analysis and optimization tools
  – demonstration chip
      – Boeing application
      – show dramatic improvements in: design time, power consumption, noise pollution, speed (?)
◦ Team:
  – led by Boeing
  – async startups: Theseus, Handshake Solutions, Codetronix
  – universities: UNC, Columbia, UW, OrSU
13
Data Representation and Communication
14
A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
◦ Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight. Both have flashlights, but neither owns a watch. What should they do?
◦ Suggest several strategies, and discuss pros and cons of each.
[Figure: Alice and Bob on opposite banks of the river]
15
Solution 1
Alice uses 2 lamps:
◦ 1 to indicate that she is ready with the message, and
◦ 1 for the message itself
Bob uses 1 lamp:
◦ to indicate that he has received the message
[Figure: Alice signals “ready” and “yes/no” to Bob; Bob signals “got it” back to Alice]
16
Solution 2
Alice uses 2 lamps:
◦ Green lamp to indicate “yes”
◦ Red lamp to indicate “no”
Bob uses 1 lamp:
◦ to indicate that he has received the message
[Figure: Alice signals “yes” or “no” to Bob; Bob signals “got it” back to Alice]
17
Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
◦ At 12 midnight: turns on lamp if message = “yes”
◦ At 12:01: turns lamp off
Bob needs no lamps!
◦ Takes down the message between 12 and 12:01
Pros:
◦ Fewer signals, less processing needed
Cons:
◦ Alice and Bob must keep their clocks closely synchronized
◦ If Bob’s watch is off by a minute, incorrect communication is possible
18
Homework!
◦ Think of all scenarios in which Solution #1 can fail
◦ Are any of those scenarios a problem for Solution #2 as well?
19
Data Representation and Communication
◦ How is data represented in an asynchronous system?
◦ How is information exchanged? Control signaling (handshake styles)
20
Data Encoding: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach, widely used
Features:
◦ datapath: 1 wire per bit (e.g. standard sync blocks)
◦ matched delay: produces delayed “done” signal
  – worst-case delay: longer than slowest path
+ Practical style: can reuse sync components; small area
– Fixed (worst-case) completion time
[Figure: a function block with inputs bit 1 … bit n and outputs bit 1 … bit m; the request passes through a matched delay to produce done, which indicates valid data]
21
Bundled Data: Completion Sensing
Delay Matching:
◦ either single worst-case delay
◦ or, fine-grain delay
Speculative completion:
◦ choose delay “on the fly”
◦ start with shortest delay; increase as needed
[Figure: request feeds a bank of delays; a delay selector controls a MUX that picks which delayed request becomes done]
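The delay-bank idea can be illustrated with a small Python sketch. This is only one plausible reading of "start with shortest delay; increase as needed": the function name `select_delay`, the example delay values, and the notion that the selector knows how much time the current operation needs are all assumptions made for illustration, not details from the slides.

```python
from bisect import bisect_left

def select_delay(bank, needed):
    """Return the shortest delay in `bank` that is at least `needed`.

    `bank` models the bank of matched delays and `needed` models the time
    the current operation requires (hypothetical units, e.g. nanoseconds).
    """
    ordered = sorted(bank)
    i = bisect_left(ordered, needed)   # first delay that covers `needed`
    if i == len(ordered):
        raise ValueError("no delay in the bank covers this operation")
    return ordered[i]

bank = [2.0, 3.5, 5.0]   # assumed values; 5.0 plays the role of the worst case
for needed in (1.2, 3.5, 4.1):
    print(needed, "->", select_delay(bank, needed))
```

A single worst-case delay corresponds to a bank with one entry, so every operation pays the full 5.0 regardless of its data.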
22
Data Encoding: Dual-Rail
Dual-rail: uses 2 wires per data bit
Each Dual-Rail Pair: provides both data value and validity

Dual-rail code   Meaning
00               “reset” value
01               0 value
10               1 value
11               unused

+ provides robust data-dependent completion
– needs completion detectors
[Figure: a function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]
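The dual-rail code table can be sketched in Python as follows. The function names, the tuple representation of a wire pair, and the least-significant-bit-first word order are assumptions chosen for illustration; only the four code meanings come from the slide.

```python
RESET = (0, 0)   # code 00: no valid data on this pair

def encode_bit(value):
    """Map a data bit to its dual-rail pair: 0 -> 01, 1 -> 10."""
    return (1, 0) if value else (0, 1)

def decode_pair(pair):
    """Return 0 or 1 for a valid pair, None for reset; reject the unused 11."""
    if pair == RESET:
        return None
    if pair == (1, 1):
        raise ValueError("11 is not a legal dual-rail code")
    return 1 if pair == (1, 0) else 0

def encode_word(value, width):
    """Encode an unsigned integer as `width` dual-rail pairs, LSB first."""
    return [encode_bit((value // 2**i) % 2) for i in range(width)]

def decode_word(pairs):
    """Return the integer once every pair is valid, else None."""
    bits = [decode_pair(p) for p in pairs]
    if any(b is None for b in bits):
        return None          # at least one bit has not arrived yet
    return sum(b * 2**i for i, b in enumerate(bits))

word = encode_word(5, 3)
print(word)                                  # three wire pairs
print(decode_word(word))                     # 5
print(decode_word([RESET, (0, 1), (1, 0)]))  # None: bit 0 still reset
```

Because each pair carries its own validity, the receiver can tell "not yet arrived" apart from "arrived with value 0" without any timing assumption.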
23
Dual-Rail: Completion Sensing
Dual-Rail Completion Detector:
◦ combines dual-rail signals
◦ indicates when all bits are valid (or reset)
◦ OR together 2 rails per bit
◦ Merge results using a Muller “C-element”
C-element:
◦ if all inputs = 1, output becomes 1
◦ if all inputs = 0, output becomes 0
◦ else, maintain output value
[Figure: one OR gate per bit (bit 0, bit 1, …, bit n) feeding a C-element whose output is Done]
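The completion detector described above can be modeled behaviorally in Python. This is a minimal sketch, not a circuit description: the class name `CElement`, the function `completion_done`, and the representation of each bit as a pair of 0/1 rail values are assumptions for illustration.

```python
class CElement:
    """Behavioral model of a C-element: the output follows the inputs only
    when they all agree, and otherwise holds its previous value."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        inputs = list(inputs)
        if inputs and all(v == 1 for v in inputs):
            self.output = 1
        elif inputs and all(v == 0 for v in inputs):
            self.output = 0
        return self.output   # unchanged when inputs disagree

def completion_done(c_element, pairs):
    """OR the two rails of every bit, then merge the results in the C-element."""
    return c_element.update(a | b for a, b in pairs)

c = CElement()
steps = [
    [(0, 0), (0, 0)],   # all bits reset        -> done = 0
    [(0, 1), (0, 0)],   # only bit 0 valid      -> holds 0
    [(0, 1), (1, 0)],   # all bits valid        -> done = 1
    [(0, 0), (1, 0)],   # bit 0 resetting       -> holds 1
    [(0, 0), (0, 0)],   # all bits reset again  -> done = 0
]
for pairs in steps:
    print(pairs, "done =", completion_done(c, pairs))
```

The hold behavior is what makes Done change only once all bits have become valid, and change back only once all bits have been reset.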
24
Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: Request and Acknowledge waveforms with four events: start event, event done, get ready for next event, ready for next event]
+ “Level-sensitive”: simpler logic implementation
– Overhead of “return-to-zero” (RTZ or resetting)
  ◦ extra events which do no useful computation
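The four events of one 4-phase handshake can be traced with a short Python sketch. The mapping of the figure's labels onto specific wire transitions (request rises, acknowledge rises, request falls, acknowledge falls) is the conventional one and is assumed here; the function name is hypothetical.

```python
def four_phase_handshake():
    """Return the (event, req, ack) trace of one 4-phase handshake."""
    req, ack = 0, 0                     # both wires idle low
    trace = []
    req = 1                             # sender raises request
    trace.append(("start event", req, ack))
    ack = 1                             # receiver raises acknowledge
    trace.append(("event done", req, ack))
    req = 0                             # return-to-zero: sender lowers request
    trace.append(("get ready for next event", req, ack))
    ack = 0                             # return-to-zero: receiver lowers acknowledge
    trace.append(("ready for next event", req, ack))
    return trace

for event, req, ack in four_phase_handshake():
    print(f"{event:26s} req={req} ack={ack}")
```

The last two events carry no data and only restore the idle state, which is the return-to-zero overhead noted above.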
25
Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
◦ a.k.a. transition signaling
[Figure: Request and Acknowledge waveforms: start event, event done, start next event, next event done]
+ Elegant: no return-to-zero
– Slower logic implementation:
  ◦ logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
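Transition signaling can be sketched the same way: every toggle of a wire, in either direction, is an event. The function name and the trace format below are assumptions for illustration.

```python
def two_phase_handshakes(count):
    """Return the (event, req, ack) trace of `count` 2-phase handshakes."""
    req, ack = 0, 0
    trace = []
    for _ in range(count):
        req ^= 1                         # any transition on req starts an event
        trace.append(("request", req, ack))
        ack ^= 1                         # any transition on ack completes it
        trace.append(("acknowledge", req, ack))
    return trace

for event, req, ack in two_phase_handshakes(2):
    print(f"{event:12s} req={req} ack={ack}")
```

Two handshakes take four wire transitions in total, and the wires rest high after the first handshake and low after the second; a receiver therefore has to react to changes rather than levels.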
26
Handshaking Styles: Pulse Mode
Pulse Mode: combines benefits of 2-phase and 4-phase
◦ use pulses to represent events
[Figure: Request and Acknowledge pulses: start event, event done, start next event, next event done]
+ No return-to-zero (like 2-phase)
+ Level-based implementation (like 4-phase)
– Need a timing constraint on pulse width
27
Handshaking Styles: Single-Track
Single-Track: combines req and ack onto single wire!
◦ one wire used for bidirectional communication
◦ sender raises, receiver lowers
[Figure: separate Request and Acknowledge wires compared with a single “req + ack” wire carrying alternating req and ack events]
+ Efficient protocol: no return-to-zero, level-based
– Need aggressive low-level design techniques
  ◦ much effort to ensure reliability, satisfy timing constraints
28
Handshaking + Data Representation
Several combinations possible:
◦ dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
◦ dual-rail data: functions as an implicit “request”
◦ 4-phase cycle: between acknowledge and implicit request
[Figure: A sends dual-rail data bits 1 … m to B; B returns ack to A]
29
Other Data Representation Styles
Level-Encoded Dual-Rail (LEDR)
◦ 2 wires per bit: “data” and “phase”
◦ exactly one wire per bit changes value
  – if new value is different, “data” wire changes value
  – else “phase” wire changes value
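The LEDR update rule can be written as a small Python sketch. The function name `ledr_send`, the `(data, phase)` tuple, and the all-zero starting state are assumptions for illustration; the rule itself is the one stated in the bullets.

```python
def ledr_send(state, new_value):
    """Return the next (data, phase) wire pair when sending `new_value`.

    Exactly one wire toggles per transmitted bit: the data wire if the
    value changes, otherwise the phase wire.
    """
    data, phase = state
    if new_value != data:
        return (new_value, phase)    # value changed: toggle "data"
    return (data, phase ^ 1)         # value repeated: toggle "phase"

state = (0, 0)                       # assumed initial wire values
for bit in (1, 1, 0, 0):
    nxt = ledr_send(state, bit)
    changed = sum(a != b for a, b in zip(state, nxt))
    print(f"send {bit}: {state} -> {nxt} ({changed} wire changed)")
    state = nxt
```

Since one wire always changes, the receiver sees a fresh event even when the same value is sent twice in a row, and the data wire always holds the current value directly.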
M-of-N Codes
◦ N wires used for a data word
◦ M wires (M ≤ N) change value
◦ Values of N and M: have impact on…
  – information transmitted, power consumed and logic complexity
◦ Knuth codes, Huffman codes, …
[Figure: LEDR “data” and “phase” wires]
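How N and M affect the information transmitted can be explored with a short Python sketch. It assumes that a codeword is identified by which M of the N wires change, so the number of distinct codewords is the binomial coefficient C(N, M); the function names are hypothetical.

```python
from math import comb, log2

def m_of_n_codewords(n, m):
    """Number of distinct ways to choose which m of n wires change."""
    return comb(n, m)

def m_of_n_bits(n, m):
    """Information carried by one m-of-n codeword, in bits."""
    return log2(comb(n, m))

# Dual-rail is the 1-of-2 special case: 2 codewords, 1 bit per pair.
print(m_of_n_codewords(2, 1), m_of_n_bits(2, 1))
# A 1-of-4 code carries 2 bits while switching only one wire.
print(m_of_n_codewords(4, 1), m_of_n_bits(4, 1))
```

Varying `n` and `m` in these two functions shows the trade-off between wire count, the number of wires that switch (and hence power), and the information per codeword.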
30
Which to use?
Depends on several performance parameters:
◦ speed
  – single-rail vs. dual-rail
      – single-rail may be faster (if designed aggressively)
      – dual-rail may be faster (if completion times vary widely)
  – 2-phase vs. 4-phase
      – 2-phase may be faster (if logic overhead is small)
      – 4-phase may be faster (if overhead of return-to-zero is small)
◦ power consumption
  – 2-phase typically has fewer gate transitions (hence lower power)
◦ amount of logic used (#gates/wires/pins, hence chip area)
  – single-rail needs fewer gates/wires/pins
◦ design and verification effort
  – dual-rail, 1-of-N, M-of-N, Knuth codes…: delay-insensitive, i.e. robust in the presence of arbitrary delays
  – single-rail: requires greater timing verification effort
31
Homework!
◦ Suppose you are given N wires
  – Which M-of-N encoding (i.e. what M) encodes most information?
◦ Suppose you have to encode 4-bit values
  – Which M-of-N encoding yields fewest wires?
◦ Suppose you can switch at most 2 wires
  – Which M-of-N encoding yields fewest wires for 4-bit values?