Asynchronous Logic: A Computer Systems Perspective
Rajit Manohar
Computer Systems Lab, Cornell
http://vlsi.cornell.edu/
Approaches to computation
                   Discrete Time                 Continuous Time
Discrete Value     Digital, synchronous logic    Digital, asynchronous logic
Continuous Value   Switched-capacitor analog     General analog computation
Asynchronous logic
[Figure: a clocked link (sender, receiver, clock) with a plot of data signal value over time and its sampling window, next to two clockless links: one with wires data 0, data 1, and ack, and one with wires data, req, and ack. Waveforms show request, data, and acknowledge.]
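The request/acknowledge exchange named on this slide can be sketched as a small sequential model. This is a minimal sketch, assuming a four-phase (return-to-zero) handshake on the data/req/ack link; the slide does not say which handshake variant is drawn, and the function name `four_phase_transfer` is hypothetical.

```python
def four_phase_transfer(values):
    """Move each value across a data/req/ack link; return (received, trace).

    Assumption: four-phase, return-to-zero signalling. Each transfer is
    req+ -> ack+ -> req- -> ack-, and no clock is involved.
    """
    req, ack, data = 0, 0, None
    received, trace = [], []
    for value in values:
        assert (req, ack) == (0, 0)   # link is idle between transfers
        data, req = value, 1          # sender: drive data, then raise request
        trace.append("req+")
        received.append(data)         # receiver: latch data while req is high
        ack = 1                       # receiver: raise acknowledge
        trace.append("ack+")
        req = 0                       # sender: saw ack, withdraw request
        trace.append("req-")
        ack = 0                       # receiver: saw req fall, release ack
        trace.append("ack-")
    return received, trace


received, trace = four_phase_transfer([1, 0, 1])
print(received, trace[:4])
```

Each step waits only on the previous signal transition, so the sender and receiver may take any amount of time per step without corrupting the data.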
Explicit versus implicit synchronization
[Figure: two timelines of tasks A1 to A4, B1 to B4, and C1 to C4 on units A, B, and C, one for synchronous computation and one for asynchronous computation.]
Differences at a glance
Asynchronous | Synchronous
Continuous time computation. | Discrete time computation.
Observed delay is the average over possible circuit paths. | Observed delay is the maximum over possible circuit paths.
Local switching from data-driven computation. Can lead to low-power operation. | Global activity from clock-driven computation. Low power requires careful clock gating.
Throughput determined by device speed. (Margins needed for some circuit families.) | Throughput determined by device speed and additional operating margins.
Data must be encoded, requiring additional wires for signals. | Unencoded data acceptable due to global clock network.
A little bit of (biased) history…
[Timeline figure spanning 1940 to 2010. Milestones follow in extraction order; their positions on the time axis are not recoverable.]
Binary addition (von Neumann)
Illiac II (UIUC/Muller)
Macro modular systems (Clark, Molnar, Stucki,…)
Flow tables (Huffman)
System Timing (Seitz, in Mead/Conway)
Switching theory (Miller)
Micro pipelines (Sutherland)
General hazard-free synthesis
Asynchronous CPU (Martin)
ARM with caches (Furber)
Commercial 80C51 (Philips)
10G Ethernet switch (Fulcrum)
Event-driven CPU (Manohar)
Pipelined FPGAs (Manohar)
TrueNorth (IBM/Cornell)
Atlas, MU5 (Manchester)
SpiNNaker (Furber)
Neurogrid (Boahen)
AER (Mead lab)
Sensory systems (retina, cochlea, etc.)
17FO4 MIPS CPU (Martin)
UltraSparc IIIi mem controller (Sun)
Trace theory (Snepscheut)
Syntactic compilation (Martin, Rem)
HiAER (Cauwenberghs)
FACETS (Heidelberg)
I. Exploiting the gap between average and max delay
• Slow part: computing carries
  ❖ … but only in the worst-case!

    1 1 0 0 0        (carries)
    1 0 1 1 0 0
  + 0 1 1 0 1 0
  1 0 0 0 1 1 0

Theorem [von Neumann, 1946]. The average-case latency for a ripple carry binary adder is O(log N) for i.i.d. inputs.
A.W. Burks, H.H. Goldstine, J. von Neumann. Preliminary discussion of the logical design of an electronic computing instrument. (1946)
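The gap between worst-case and average-case carry propagation can be checked numerically. The sketch below is an illustration, not the construction from the cited report: it assumes the delay of a self-timed ripple-carry adder is proportional to the longest carry chain, and the function names are hypothetical.

```python
import random


def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Length of the longest carry chain when adding two n-bit numbers."""
    longest = run = 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:             # generate: a new carry starts at this bit
            run = 1
        elif (x ^ y) and run:  # propagate: a live carry travels one more bit
            run += 1
        else:                 # kill, or nothing to propagate
            run = 0
        longest = max(longest, run)
    return longest


def average_carry_chain(n: int, trials: int, seed: int = 0) -> float:
    """Average longest chain over uniformly random n-bit operand pairs."""
    rng = random.Random(seed)
    total = sum(
        longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
        for _ in range(trials)
    )
    return total / trials


print(longest_carry_chain(0b101100, 0b011010, 6))  # the example on this slide
print(longest_carry_chain((1 << 64) - 1, 1, 64))   # worst case: chain spans all bits
print(average_carry_chain(64, 2000))               # far below 64 on average
```

For the operands on the slide the longest chain is three bits, while the all-ones-plus-one case forces the carry through every position.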
I. Exploiting the gap between average and max delay
• Standard adder tree
  ❖ O(log N) latency
• Hybrid adder tree
  ❖ Tree adder + ripple adder
Theorem [Winograd, 1965]. The worst-case latency for binary addition is Ω(log N).
Theorem. The average-case latency of the hybrid asynchronous adder is O(log log N) for i.i.d. inputs.
Theorem. The average-case latency of the hybrid asynchronous adder is optimal for any input distribution.
R. Manohar and J. Tierno. Asynchronous parallel prefix computation. IEEE Transactions on Computers (1998)
S.O. Winograd. On the time required to perform addition. JACM (1965)
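For reference, the O(log N) depth of a tree adder comes from computing carries with a parallel-prefix scan. The sketch below is a generic Kogge-Stone-style software model, chosen for illustration; it is not the hybrid asynchronous adder from the cited paper, and `prefix_add` is a hypothetical name.

```python
def prefix_add(a: int, b: int, n: int):
    """Add two n-bit numbers with a parallel-prefix carry scan.

    Returns (sum, levels), where levels is the number of combining
    stages, i.e. ceil(log2(n)) for n > 1.
    """
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # per-bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # per-bit propagate
    half_sum = list(p)                                  # needed for the final sum bits
    levels, d = 0, 1
    while d < n:
        # Combine each position with the group ending d bits below it.
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
        levels += 1
    # After the scan, g[i] is the carry out of bits 0..i (carry-in is 0).
    carry_in = [0] + g[:-1]
    total = sum((half_sum[i] ^ carry_in[i]) << i for i in range(n))
    total |= g[-1] << n
    return total, levels


print(prefix_add(0b101100, 0b011010, 6))
```

The number of levels depends only on N, never on the operands, which is why a clocked tree adder always pays the full logarithmic delay.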
II. Exploiting data-driven power management
• Low power FIR filter
  ❖ External, synchronous I/O
  ❖ Monitor occupancy of FIFOs
  ❖ Speed up/slow down using adaptive power supply
• Break-point add/multiply
  ❖ Detect when top half of operand is required
  ❖ Dynamically switch between full width and half width arithmetic
[Figure: block diagram with FIR filter, State, DC/DC converter, and Vdd.]
L. Nielsen and J. Sparso. A Low-power Asynchronous Datapath for a FIR filter bank. Proc. ASYNC (1996)
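The occupancy-driven supply adaptation listed above can be sketched as a simple control step. This is a minimal sketch, assuming the monitored FIFO sits at the filter input (so a filling FIFO means the filter is running too slowly); the thresholds, step size, and voltage limits are hypothetical values chosen for illustration and are not taken from the cited paper.

```python
def next_vdd(occupancy: int, capacity: int, vdd: float,
             v_min: float = 0.8, v_max: float = 1.8, step: float = 0.1,
             low: float = 0.25, high: float = 0.75) -> float:
    """Return the supply voltage to request for the next control interval.

    All numeric defaults are illustrative assumptions.
    """
    fill = occupancy / capacity
    if fill > high:       # input backlog is growing: speed the filter up
        vdd += step
    elif fill < low:      # FIFO nearly empty: slow down and save power
        vdd -= step
    return min(v_max, max(v_min, vdd))   # stay inside the converter's range


print(next_vdd(7, 8, 1.0), next_vdd(4, 8, 1.0), next_vdd(1, 8, 1.0))
```

Because the asynchronous datapath has no clock to retime, its speed simply follows the supply voltage, and the synchronous I/O rate is maintained by the FIFOs.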
II. Exploiting data-driven power management
• Width-adaptive numbers
  ❖ Compress leading 0’s and 1’s
[Figure: two sender/receiver links on a 128-bit datapath, 8 bits per VDW block.]
Number: $1.4 \times 10^{14} \approx 2^{47}$
Number: $-29 \approx -2^{5}$
R. Manohar. Width-Adaptive Data Word Architectures. Proc. ARVLSI (2001)
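The effect of compressing leading 0's and 1's can be estimated by counting how many 8-bit blocks a value needs once redundant sign-extension bits are dropped. This is a minimal sketch, assuming plain two's-complement trimming; it ignores any delimiter or control bits the actual VDW encoding carries, and the function names are hypothetical.

```python
def min_twos_complement_bits(value: int) -> int:
    """Fewest bits that represent value in two's complement."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1    # +1 for the sign bit


def blocks_needed(value: int, block_bits: int = 8) -> int:
    """Number of block_bits-wide blocks needed after trimming sign extension."""
    bits = min_twos_complement_bits(value)
    return -(-bits // block_bits)        # ceiling division


for number in (140_000_000_000_000, -29):
    print(number, min_twos_complement_bits(number), blocks_needed(number))
```

Under these assumptions the two numbers on the slide occupy 6 blocks and 1 block respectively, compared with 16 blocks for a full 128-bit word.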
[Figure: the addition 0 0 0 · · · 0 0 0 1 + 0 0 0 · · · 0 0 0 1 = 0 0 0 · · · 0 0 1 0 at full width versus the compressed form 0 1 + 0 1 = 0 1 0.]
[Figure: cumulative distribution function (%) of operand widths for SPEC benchmarks 124.m88ksim, 132.ijpeg, 129.compress, and 126.gcc, each with no compaction, simple compaction, and full compaction. Averages shown on the plots: 16.3, 10.8, 10.0, 8.4, 10.5, 8.8, 10.8, 13.4, 10.9, 12.6, 11.8, 10.3; the mapping of averages to curves is not recoverable from the extraction.]
III. Exploiting elasticity of pipelines
[Figure: two versions of a pipeline with stages labeled f, g, and h.]
*R. Manohar, A. J. Martin. Slack Elasticity in Concurrent Computing. Proc. Mathematics of Program Construction (1998)
Asynchronous computation is* robust to changes in circuit-level pipelining!
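The robustness claim can be illustrated with a toy model: a linear pipeline whose stages fire in a random order and whose inter-stage FIFOs have a configurable depth. This is a minimal sketch under those assumptions, not the formal setting of the cited paper; `run_pipeline` and its parameters are hypothetical.

```python
import random
from collections import deque


def run_pipeline(inputs, stages, slack, seed=0):
    """Run a linear pipeline with `slack` buffer slots between stages.

    Stages fire in a random order; a stage fires only if it has an input
    token and room in its output FIFO (the final output is unbounded).
    """
    rng = random.Random(seed)
    items = list(inputs)
    # channels[0] feeds stage 0; channels[i + 1] holds stage i's output.
    channels = [deque(items)] + [deque() for _ in stages]
    while len(channels[-1]) < len(items):
        i = rng.randrange(len(stages))
        src, dst = channels[i], channels[i + 1]
        is_last = i == len(stages) - 1
        if src and (is_last or len(dst) < slack):
            dst.append(stages[i](src.popleft()))
    return list(channels[-1])


f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3
shallow = run_pipeline(range(10), [f, g, h], slack=1, seed=1)
deep = run_pipeline(range(10), [f, g, h], slack=8, seed=2)
print(shallow == deep)
```

Changing the buffer depth or the firing order changes when each result appears, but not the sequence of results.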
III. Exploiting elasticity of pipelines
• Asynchronous FIR filter
• Data arrives at variable rates
• Data rates are predictable
• Fixed amount of time for async computation
• Variable number of clock cycles for the fixed time budget
M. Singh, J.A. Tierno, A. Rylyakov, S. Rylov, S.M. Nowick. An adaptively pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3 GHz. Proc. ASYNC (2002)
III. Exploiting elasticity of pipelines
• FPGA throughput low due to overhead of programmable interconnect
• ~3x improvement in peak throughput by transparent interconnect pipelining
[Figure: FPGA fabric tiled from logic blocks (LB), connection blocks (CB), and switch blocks (SB).]
J. Teifel and R. Manohar. Highly pipelined asynchronous FPGAs. Proc. FPGA (2004)
IV. Exploiting timing robustness
[Figure: Asynchronous FPGA Test Data. Throughput (MHz, 0 to 1200) versus voltage (0.5 V to 2.5 V), with curves labeled 12K, 77K, 294K, and 400K, and Xilinx Virtex marked for comparison. Process: TSMC 0.18um. Nominal Vdd: 1.8V.]
D. Fang, J. Teifel, and R. Manohar. A High-Performance Asynchronous FPGA: Test Results. Proc. FCCM (2005)
IV. Exploiting timing robustness
• “GasP” logic family
• 6-10 FO4
• No margins on control signals
• Standard datapath
• Latest 40nm chip operates >6 GHz
[Figure: GasP stage schematic with labels: self-reset, single-track control wire, track reset, track set, wait for L set and R reset, datapath driver, RA(i), RA(i+1).]
I. Sutherland, S. Fairbanks. GasP: a minimal FIFO control. Proc. ASYNC (2001)
V. Exploiting low noise and low EMI
J. Kessels, T. Kramer, G. den Besten, A. Peeters, V. Timm. Applying Asynchronous Circuits in Contactless Smart Cards. Proc. ASYNC (2000)
[Figure: frequency-domain plots (0 to 400 MHz) and time-domain plots comparing the synchronous 80C51 (Philips) with the asynchronous 80C51 (Philips).]
VI. Exploiting continuous-time information processing
• Continuous-time digital signal processing
• Automatic adaptation to input bandwidth
• No aliasing from sampling process
[Figure: Nyquist sampling versus level-crossing sampling.]
B. Schell, Y. Tsividis. A clockless ADC/DSP/DAC system with activity-dependent power dissipation and no aliasing. Proc. ISSCC (2008)
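Level-crossing sampling emits a sample only when the input moves into a new quantization level, so the event rate tracks signal activity. The sketch below is an illustration only: it approximates continuous time with a fine step `dt`, and the function name and parameter values are hypothetical rather than taken from the cited chip.

```python
import math


def level_crossing_events(signal, t_end, delta, dt=1e-4):
    """Return (time, level) events each time signal(t) enters a new level.

    Levels are delta wide. Time is stepped in increments of dt purely to
    simulate a continuous-time comparator; dt is not a sampling clock of
    the modelled system.
    """
    events = []
    last_level = math.floor(signal(0.0) / delta)
    for k in range(1, int(round(t_end / dt)) + 1):
        t = k * dt
        level = math.floor(signal(t) / delta)
        if level != last_level:
            events.append((t, level))
            last_level = level
    return events


quiet = level_crossing_events(lambda t: 0.05, 1.0, 0.1)
slow = level_crossing_events(lambda t: math.sin(2 * math.pi * 1 * t), 1.0, 0.1)
fast = level_crossing_events(lambda t: math.sin(2 * math.pi * 5 * t), 1.0, 0.1)
print(len(quiet), len(slow), len(fast))
```

A constant input produces no events at all, and a faster input produces proportionally more, which is the activity-dependent behaviour the slide refers to.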
VI. Exploiting continuous-time information processing
• Address-event representation in neuromorphic systems
  ❖ “Time represents itself”
  ❖ Spike timing and coincidence for information processing
• Asynchronous logic
  ❖ Temporal resolution is decoupled from throughput
  ❖ Automatic power management
• Projects
  ❖ Silicon retinas, cochleas, …
  ❖ More recent: FACETS, HiAER, Neurogrid, neuroP, SpiNNaker, TrueNorth
A note on oscillations in asynchronous logic
• Well-known result in asynchronous circuit theory
  ❖ Signal transitions: look at the $i$th occurrence of transition $t$
  ❖ $\mathrm{time}(t, i)$ = the actual time of this event
  ❖ $\mathrm{time}'(t, i) = f(t) + p^{*} \times i$
• Then: $|\mathrm{time}(t, i) - \mathrm{time}'(t, i)| < B$
  ❖ $B$ is bounded, independent of $i$ and $t$
  ❖ If the circuit is strongly connected, this result holds
• More recently: stronger periodicity under more general conditions
S.M. Burns, A.J. Martin. Performance Analysis and Optimization of Asynchronous Circuits. Proc. ARVLSI (1990)
W. Hua and R. Manohar. Exact timing analysis for concurrent systems. To appear, work-in-progress session, DAC (2016)
J. Magott. Performance evaluation of concurrent systems using Petri nets. IPL (1984)
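The bounded deviation from a periodic schedule can be seen on a toy recurrence. This is a minimal sketch, assuming a single transition whose i-th occurrence waits for the previous occurrence (delay d1) and for the one before that (delay d2); the recurrence and its delays are hypothetical and not taken from the cited papers. For this system the period p* is the largest cycle ratio, max(d1/1, d2/2), and f(t) is taken as 0.

```python
def occurrence_times(n, d1=2.0, d2=5.0):
    """Times of the first n occurrences of the toy transition."""
    times = [0.0, d1]
    for i in range(2, n):
        # Fire once both dependencies have been satisfied.
        times.append(max(times[i - 1] + d1, times[i - 2] + d2))
    return times[:n]


def max_deviation(times, period):
    """Largest gap between the actual times and the schedule period * i."""
    return max(abs(t - period * i) for i, t in enumerate(times))


d1, d2 = 2.0, 5.0
p_star = max(d1 / 1, d2 / 2)          # cycle delay divided by occurrence distance
times = occurrence_times(100, d1, d2)
print(times[:7], p_star, max_deviation(times, p_star))
```

The occurrence times alternate between gaps of 2 and 3, so they never settle to an exact period of 2.5, yet they stay within a fixed bound of the ideal schedule for every i.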
Asynchronous logic in TrueNorth
• Spikes trigger activity…
  ❖ … so they should be asynchronous
• Neurons leak periodically…
  ❖ … so add a periodic event to the system (the clock)
[Figure: block diagram with SRAM, Neuron, Token Control, Scheduler, and Router, labeled "Digital, asynchronous logic".]
At the external timing “tick”:
• Update each neuron, delivering spikes plus any local state update
• If the neuron spikes, send output spike to destinations, with specified delay
Spikes travel through asynchronous network
• Receiver buffers spikes until it is time to deliver them
Asynchronous logic in TrueNorth
• Spike communication network
  ❖ Links are a shared resource
  ❖ Links might be contended
• Spike delivery times can differ based on traffic
  ❖ Different runs of the same network might produce different spike patterns
• Instead, we use a spike scheduler
  ❖ Guarantees deterministic computation
  ❖ … and hence, repeatability
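The role of the scheduler can be sketched as a buffer keyed by delivery tick: spikes may arrive in any order, but they are released only at their scheduled tick and in a fixed order. This is a minimal sketch of the idea, not the TrueNorth implementation; the class, its methods, and the use of integer axon identifiers are hypothetical.

```python
from collections import defaultdict


class SpikeScheduler:
    """Buffer incoming spikes and release them at their scheduled tick."""

    def __init__(self):
        self._slots = defaultdict(set)   # delivery tick -> set of target axons

    def receive(self, deliver_tick, axon):
        """Record a spike that arrived over the network at an arbitrary time."""
        self._slots[deliver_tick].add(axon)

    def tick(self, now):
        """Return the spikes due at tick `now`, in a fixed (sorted) order."""
        return sorted(self._slots.pop(now, set()))


def deliveries(arrivals, ticks):
    scheduler = SpikeScheduler()
    for deliver_tick, axon in arrivals:
        scheduler.receive(deliver_tick, axon)
    return [scheduler.tick(t) for t in range(ticks)]


run_a = deliveries([(2, 7), (1, 3), (2, 1)], ticks=4)
run_b = deliveries([(2, 1), (2, 7), (1, 3)], ticks=4)   # same spikes, different arrival order
print(run_a, run_a == run_b)
```

As long as every spike arrives before its delivery tick, network contention changes only the arrival order, which the scheduler hides from the neurons.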
Summary
• Asynchronous logic has a long history
• Some key properties exploited in projects
  ❖ Gap between average and maximum delay for a computation
  ❖ Automatic data-driven power management
  ❖ Elastic pipelining
  ❖ Robustness to timing uncertainty/variation
  ❖ Reduced EMI and noise
  ❖ Continuous-time information representation
Acknowledgments
• The “async” community
• Students:
  ❖ Ned Bingham, Wenmian Hua, Sean Ogden, Praful Purohit, Tayyar Rzayev, Nitish Srivastava
  ❖ John Teifel, Clinton Kelly, Virantha Ekanayake, Song Peng, David Biermann, David Fang, Chris LaFrieda, Filipp Akopyan, Basit Riaz Sheikh, Benjamin Tang, Nabil Imam, Sandra Jackson, Carlos Otero, Benjamin Hill, Stephen Longfield, Jonathan Tse, Robert Karmazin
• Collaborators:
  ❖ General: David Albonesi, Sunil Bhave, Francois Guimbretiere, Amit Lal, Yoram Moses, Ivan Sutherland, Yannis Tsividis
  ❖ Neuromorphic: John Arthur, Kwabena Boahen, Tobi Delbruck, Chris Eliasmith, Giacomo Indiveri, Paul Merolla, Dharmendra Modha
• Support: AFRL, DARPA, IARPA, NSF, ONR, IBM, Lockheed, Qualcomm, Samsung, SRC