Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis
![Page 1: Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis](https://reader035.fdocuments.us/reader035/viewer/2022062807/5681520e550346895dc05098/html5/thumbnails/1.jpg)
Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein
Department of Computer Science, Carnegie Mellon University
QDI: Orphans problem
• Early propagation:
– “A” arrives early => Z transitions
– Stale values on the other signals
• Incorrect behavior: inputs acknowledged before being received
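The early-propagation effect can be illustrated with a minimal sketch of a single dual-rail AND gate. The encoding (rail0, rail1) with (0, 0) as "no data yet" and the gate equations Z1 = A1·B1, Z0 = A0 + B0 are assumptions of this sketch; the slides do not spell out the gate templates.

```python
# Dual-rail value as (rail0, rail1): (0, 0) = no data yet (NULL),
# (1, 0) = logic 0, (0, 1) = logic 1. Assumed encoding for illustration.
NULL, ZERO, ONE = (0, 0), (1, 0), (0, 1)


def dual_rail_and(a, b):
    """Early-propagating dual-rail AND (assumed template):
    Z1 = A1 AND B1, Z0 = A0 OR B0."""
    a0, a1 = a
    b0, b1 = b
    return (a0 | b0, a1 & b1)


# A = 0 arrives while B is still NULL: Z already resolves to logic 0,
# so B's eventual arrival is never observed at the output (an orphan).
early = dual_rail_and(ZERO, NULL)

# A = 1 arrives while B is still NULL: Z has to wait for B.
waiting = dual_rail_and(ONE, NULL)
```

In the first case the output fires before B has been received, which is the situation the slide describes as inputs being acknowledged too early.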
[Figure: dual-rail circuit with input rails A0/A1, B0/B1, C0/C1, D0/D1, internal rails X0/X1, Y0/Y1, and output rails Z0/Z1]
NCL-X solution
• Add completion detection
[Figure: the dual-rail circuit with gates N1, N2, N3 and added completion detection (labels DoneC, DoneA)]
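A rough behavioral sketch of completion detection follows. It assumes one 2-input OR per monitored gate output and a chain of 2-input C-elements that merges the per-gate signals, which matches the OR and C-element cost terms in the mILP formulation later in the slides. The class and function names are hypothetical, and this is not the exact NCL-X netlist.

```python
class CElement:
    """Two-input Muller C-element: the output follows the inputs
    when they agree and holds its previous value otherwise."""

    def __init__(self):
        self.out = 0

    def update(self, x, y):
        if x == y:
            self.out = x
        return self.out


def gate_done(rails):
    """A dual-rail output carries data once either rail is high."""
    rail0, rail1 = rails
    return rail0 | rail1


class CompletionDetector:
    """Merges per-gate done signals with n - 1 chained C-elements."""

    def __init__(self, n_gates):
        self.chain = [CElement() for _ in range(n_gates - 1)]

    def update(self, gate_outputs):
        dones = [gate_done(r) for r in gate_outputs]
        acc = dones[0]
        for c_elem, done in zip(self.chain, dones[1:]):
            acc = c_elem.update(acc, done)
        return acc


cd = CompletionDetector(3)
partial = cd.update([(1, 0), (0, 0), (0, 1)])   # one gate still NULL
complete = cd.update([(1, 0), (0, 1), (0, 1)])  # all gates have data
held = cd.update([(0, 0), (0, 1), (0, 0)])      # reset not finished yet
cleared = cd.update([(0, 0), (0, 0), (0, 0)])   # all gates back to NULL
```

Done rises only after every monitored gate has produced data and falls only after every one has returned to NULL.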
QDI Gate Delays
QDI implementations always assume the worst: equal probability for any gate delay
Motivation
• Quasi-Delay Insensitive (QDI) circuits:
– One timing constraint
– Naturally tolerate parametric variation, but…
• Have large area overheads
– Added completion detection for correctness
[Chart: share of total area (0% to 100%) taken by gates vs. completion detection (cd) for add_bk_32, lsr16, and C880]
Parametric Variation and Gate Delays
Goal: pay only what is necessary
ITRS’05: 35% parametric variation by 2020
Goal: Optimizing Sync→Async Flow
• Use timing information to reduce size of completion detection
• Use mixed gates to further reduce area
– w/ early propagation
– w/o early propagation
[Chart: area breakdown (0% to 100%) into gates and cd for NCL-X and Direct]
[Chart: area breakdown (0% to 100%) into regular gates, strict gates, and cd for NCL-X, Direct, and Exact]
Contributions
Three new relative-timing area optimizations:
• Direct method:
– Timing analysis + simple CD elimination
• Greedy method: fast but not optimal
– Uses strict gates, but may increase area
• Exact method: optimal, but slow
– Solves an mILP problem
[Chart: bar values (axis 0.00 to 1.00): Direct 0.83, Greedy 0.55, Exact 0.43]
Outline
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
Basics
• QDI circuits:
– Unbounded but finite delays on gates and wires
– One timing assumption: isochronic fork
• Timed circuits:
– Delays on gates and wires: bounded time intervals
– Given input arrival times: compute propagation intervals for each gate and wire
Timing Computation
• Conservative assumption: any input change can trigger an output change
[Figure: example circuit (inputs A, B, C, D; gates N1, N2, N3; internal signals X, Y; output Z) annotated with (min, max) delay and propagation intervals; the output interval, GlobalPI, is (2.0,5.6)]
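The conservative rule can be sketched as interval arithmetic. This is a minimal sketch that assumes the earliest output time is the earliest input arrival plus the minimum gate delay, and the latest output time is the latest input arrival plus the maximum gate delay. The delay values are made up for illustration and are not the ones in the figure.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (earliest, latest)


def add(a: Interval, b: Interval) -> Interval:
    """Shift an arrival interval by a delay interval (e.g. a wire)."""
    return (a[0] + b[0], a[1] + b[1])


def regular_gate_output(inputs: List[Interval], gate_delay: Interval) -> Interval:
    """Conservative propagation: any input change may trigger the output."""
    earliest = min(lo for lo, _ in inputs) + gate_delay[0]
    latest = max(hi for _, hi in inputs) + gate_delay[1]
    return (earliest, latest)


# Hypothetical two-input gate: both primary inputs arrive at (0, 0) and
# travel over wires with different (min, max) delays.
a_at_gate = add((0.0, 0.0), (0.5, 0.75))
b_at_gate = add((0.0, 0.0), (0.25, 1.0))
out = regular_gate_output([a_at_gate, b_at_gate], (1.0, 1.25))
```

Applying the same rule gate by gate in topological order would yield a propagation interval for every signal, ending with the interval at the outputs.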
Direct Optimization Method
• Gate completion detection iff gate may not be stable when outputs are produced
[Figure: the example circuit with its propagation intervals and a completion-detection output Done; GlobalPI is (2.0,5.6)]
Under any input change, gate quiescent when output produced
1.9 < 2.0
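The quiescence test can be written as a one-line comparison. The sketch below assumes a gate can skip completion detection exactly when the upper bound of its propagation interval lies strictly below the lower bound of GlobalPI. The interval values appear on the slide, but which gate each belongs to is not recoverable from the transcript, so the gate names are placeholders.

```python
def needs_completion_detection(gate_pi, global_pi):
    """True unless the gate is guaranteed to have settled before the
    earliest moment an output can be produced."""
    gate_latest = gate_pi[1]
    output_earliest = global_pi[0]
    return not (gate_latest < output_earliest)


global_pi = (2.0, 5.6)

# Placeholder gate names; the intervals are taken from the slide labels.
intervals = {"g_early": (1.5, 1.9), "g_late": (3.0, 4.0)}

monitored = sorted(
    name for name, pi in intervals.items()
    if needs_completion_detection(pi, global_pi)
)
```

Here `g_early` is dropped from completion detection because 1.9 < 2.0, while `g_late` stays monitored.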
Strict Gates
• All inputs must arrive before producing an output
• Eliminate early propagation effect
– Extremely expensive
+ Decrease length of propagation interval
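The effect on the propagation interval can be sketched by comparing the two gate types. It is an assumption of this sketch that a strict gate's earliest output time is driven by the last input to arrive rather than the first. The same gate delay is used for both variants to isolate that effect, even though a real strict gate is larger and would have different delays; all numbers are made up.

```python
def regular_interval(inputs, delay):
    """Any input may trigger the output (early propagation)."""
    earliest = min(lo for lo, _ in inputs) + delay[0]
    latest = max(hi for _, hi in inputs) + delay[1]
    return (earliest, latest)


def strict_interval(inputs, delay):
    """The output waits for all inputs (assumed strict-gate model)."""
    earliest = max(lo for lo, _ in inputs) + delay[0]
    latest = max(hi for _, hi in inputs) + delay[1]
    return (earliest, latest)


def width(interval):
    return interval[1] - interval[0]


inputs = [(0.5, 1.0), (2.0, 3.0)]  # hypothetical arrival intervals
delay = (1.0, 1.5)                 # hypothetical gate delay

regular = regular_interval(inputs, delay)
strict = strict_interval(inputs, delay)
```

With these numbers the regular gate's interval is 3.0 wide and the strict gate's is 1.5 wide, with a later start.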
[Figure: strict gate schematic with inputs A and B and three elements labeled C]
Timing Computation with Strict Gates
• Entire completion detection: single OR gate
[Figure: the example circuit with strict gates; the output interval becomes (5.0,6.8), and Done is annotated (1.4,1.9)]
• This circuit: area not reduced
• Goal: smart insertion of strict gates
Greedy Optimization (1)
• Strict gates: area implications
– GlobalPI may be narrower and delayed
– Fewer gates non-quiescent
– Smaller completion detection
• Greedy optimization framework:
– Flip gates in the circuit from normal to strict
– Select most promising candidate
– Continue until no improvements possible
Greedy Optimization (2)
Algorithm:
1. For each gate G_i in the circuit:
a. Flip G_i from regular to strict
b. Perform timing analysis, compute GlobalPI_i
c. Flip G_i back to regular
2. Select G_k with the narrowest GlobalPI_k
3. If GlobalPI_k is narrower than the previous best:
a. Flip G_k to strict permanently
b. Continue (go to 1)
Else: finish
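The loop above can be sketched in Python with the timing analysis abstracted away. The callback `global_pi_width`, which returns the width of GlobalPI for a given set of strict gates, is a hypothetical stand-in for the real analysis, and `toy_width` is an invented cost model used only to exercise the loop.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_strict_selection(
    gates: Iterable[str],
    global_pi_width: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Flip one gate to strict per round while GlobalPI keeps narrowing."""
    gates = list(gates)
    strict: Set[str] = set()
    best = global_pi_width(frozenset(strict))
    while True:
        candidates = [g for g in gates if g not in strict]
        if not candidates:
            return strict
        # Step 1: try each remaining gate as strict, one at a time.
        trial = {g: global_pi_width(frozenset(strict | {g})) for g in candidates}
        # Step 2: pick the gate that yields the narrowest GlobalPI.
        k = min(trial, key=trial.get)
        # Step 3: keep the flip only if it improves on the previous best.
        if trial[k] < best:
            strict.add(k)
            best = trial[k]
        else:
            return strict


def toy_width(strict: FrozenSet[str]) -> float:
    """Invented model: N3 narrows GlobalPI a lot, N2 a little, N1 not at all."""
    width = 4.0
    if "N3" in strict:
        width -= 2.0
    if "N2" in strict:
        width -= 0.5
    return width


chosen = greedy_strict_selection(["N1", "N2", "N3"], toy_width)
```

Each round costs one timing analysis per remaining gate, so the sketch makes the quadratic number of analyses in the worst case easy to see.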
Greedy Optimization (3)
• Algorithm does not optimize for area directly
• Instead: may reduce the completion detection by narrowing the output interval
• Results promising, but individual benchmarks may result in larger area
Exact Optimization Method
• mixed Integer Linear Programming (mILP)
• Transform circuit graph into an optimization problem:
– Introduce variables for each gate, wire and primary input/output
– Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files
– Decision variables (GS): should a gate be strict?
mILP formulation
• Minimize: $TotalArea = GateArea + CDArea$
• $GateArea = \sum_i \left( GS_i \cdot SArea_i + (1 - GS_i) \cdot NArea_i \right)$
• $CDArea = SCD \cdot Or2Area + (SCD - 1) \cdot CArea$
– SCD: # gates that need completion detection
• NeedsCD: does a gate need CD?
– NeedsCD = 0 if $PI_M < GlobalPI_m$ or successor is strict; otherwise 1
• Rest of the model implements timing computation
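The objective can be evaluated directly for a fixed assignment, which is a handy way to check the cost terms. This is a minimal sketch: the function name and the area numbers are hypothetical, and treating SCD = 0 as zero completion-detection area is an assumption, since the slide formula would otherwise give a negative term.

```python
def total_area(gs, s_area, n_area, needs_cd, or2_area, c_area):
    """TotalArea = GateArea + CDArea for one candidate assignment.

    gs[i] is 1 if gate i is strict, 0 if regular.
    needs_cd[i] is 1 if gate i needs completion detection.
    """
    gate_area = sum(
        g * s + (1 - g) * n for g, s, n in zip(gs, s_area, n_area)
    )
    scd = sum(needs_cd)
    if scd == 0:
        cd_area = 0  # assumption: no monitored gates, no CD logic
    else:
        cd_area = scd * or2_area + (scd - 1) * c_area
    return gate_area + cd_area


# Hypothetical three-gate circuit with made-up area units.
area = total_area(
    gs=[0, 1, 0],
    s_area=[6, 6, 6],
    n_area=[2, 2, 2],
    needs_cd=[1, 0, 1],
    or2_area=1,
    c_area=2,
)
```

In this example the gate area is 2 + 6 + 2 = 10 and the completion detection costs 2·1 + 1·2 = 4, for a total of 14.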
Improving the mILP Model
• Basic mILP model: too slow even for small circuits (hours for a dozen gates)
• Leverage problem knowledge into model improvements:
– Branching order: gates closer to the output are more likely to become strict => inspected first
– Single input gates: never strict
– Provide initial solution (result of greedy opt)
• Can solve problems with hundreds of gates in minutes
Related Work: Optimizations
• Cortadella et al:
– logical function decompositions
– can achieve substantial area savings
– can be the starting point for our methods
• Zhou et al: consider strict gates in optimization, but no timing information
• Sokolov et al: two timing optimizations
– Alternate levels: unrealistic assumptions for gate delays
– Longest path: applicable only for small circuits
Experimental Setup
• Tool flow:
– Synthesis & tech-mapping with Synopsys Design Compiler
– Perl scripts for dual-rail implementations
– Optimization tool reads structural Verilog and timing back-annotations
– End result: optimized circuits (Verilog)
• Experiments:
– Arithmetic and ISCAS’89 benchmarks
– Pre-layout runs in 0.18 µm technology
Area: Ratio vs. NCL-X method
[Chart: per-benchmark area ratio vs. NCL-X (y-axis, 0 to 1.2) for direct, greedy, and ilp]
• Greedy: 2.83x NCL-X area for le32
• mILP does not finish in less than 1 hour; partial results
• Direct: 0.83x, Greedy: 0.55x, mILP: 0.43x
Area breakdown
[Chart: area breakdown (y-axis, 0 to 1) into regular gates, strict gates, and cd for add_bk_32, Lsr16, and C880 under Direct, Greedy, ILP, and NCL-X]
• 8/168 strict
• 4.7% before → 40% after
• More than twice as small as NCL-X
Parametric Variation: BK adder
[Chart: ratio vs. NCL-X area (y-axis, 0 to 1) against parametric variation (x-axis, 0% to 35%) for Direct, Greedy, and Exact]
Conclusions
• Paper introduced:
– a method to translate synchronous circuits into optimized asynchronous circuits
– Three new relative timing optimizations for improving area
• Direct: extremely simple
• Greedy: fast, good results
• Exact: optimal, may be extremely slow
– Analyzed the impact of parametric variation on these circuits
Backup slides
Outline
• Background
• Timing analysis & Direct Optimization
• Greedy optimization method
• Exact optimization method
• Results
• Conclusions
Introduction
• Future deep sub-micron technologies:
– large parametric variations (ITRS’05 predicts 35% by 2020).
– Asynchronous design a natural fit
– Asynchronous handshaking: widespread
• Acceptance for asynchronous circuits is predicated on quality CAD tools:
– “Pure” async: from scratch
– Sync to async translation
Synchronous to Asynchronous Translation
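• Synchronous circuit: Z = (A·B)·(C+D)
[Figure: synchronous circuit with inputs A, B, C, D, gates N1, N2, N3, internal signals X, Y, and output Z]
• Template-based replacement of each sync gate
• Dual-rail circuit
[Figure: dual-rail circuit with rails A0/A1, B0/B1, C0/C1, D0/D1, X0/X1, Y0/Y1, Z0/Z1 and gates N1, N2, N3]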
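The template-based replacement can be sketched for the example function Z = (A·B)·(C+D). The dual-rail AND and OR templates below (each rail computed by a plain AND or OR) are an assumed, commonly used mapping, not necessarily the exact templates of the authors' flow. Which gate is labeled N1, N2, or N3 is also assumed.

```python
from itertools import product


def encode(bit):
    """Single-rail bit to dual-rail (rail0, rail1)."""
    return (1, 0) if bit == 0 else (0, 1)


def dr_and(a, b):
    """Assumed AND template: Z1 = A1 AND B1, Z0 = A0 OR B0."""
    return (a[0] | b[0], a[1] & b[1])


def dr_or(a, b):
    """Assumed OR template: Z1 = A1 OR B1, Z0 = A0 AND B0."""
    return (a[0] & b[0], a[1] | b[1])


def dual_rail_circuit(a, b, c, d):
    x = dr_and(a, b)      # assumed to be N1: X = A·B
    y = dr_or(c, d)       # assumed to be N2: Y = C+D
    return dr_and(x, y)   # assumed to be N3: Z = X·Y


# Compare the dual-rail result with the single-rail function on all inputs.
matches = all(
    dual_rail_circuit(*map(encode, bits))
    == encode((bits[0] & bits[1]) & (bits[2] | bits[3]))
    for bits in product((0, 1), repeat=4)
)
```

Once all inputs carry valid data, the dual-rail circuit computes the same function as the synchronous one; the difference lies in what happens while some inputs are still empty.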
Related Work
• Numerous approaches for translating synchronous circuits into asynchronous
• Dealing with the orphans problem:
– Kondratiev et al: NCL-X (discussed below)
– Brej: anti-tokens
• Allows for early propagation
• Completion detection in background
• Even larger area overheads
ILP optimization for 32-bit BK adder
[Chart: % error (y-axis, 0% to 65%) vs. time in seconds (x-axis) for "% Crt Sol" and "% Best Estimation"]
• CrtSol: current best integer solution
• Best Estimation: best guess of how far the optimum is; when 0, optimum found
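mILP Run Time
| Bench | #Inps | #Outs | #Gates | #Vars | #Constr | Runtime |
|---|---|---|---|---|---|---|
| Eq32 | 64 | 1 | 37 | 731 | 1158 | 0.23s |
| Decode32 | 5 | 32 | 49 | 1239 | 2068 | 12.2s |
| C432 | 36 | 7 | 80 | 2391 | 4223 | 27m46s |
| Lsl16 | 32 | 16 | 81 | 1819 | 3534 | 10m24s |
| Lsr16 | 32 | 16 | 81 | 2315 | 4080 | 19m15s |
| Absval32 | 32 | 32 | 92 | 2420 | 4149 | 6m7s |
| C880 | 60 | 26 | 168 | 4385 | 7724 | 39m25s |
| C1908 | 33 | 25 | 190 | 3263 | 5300 | 20m23s |
| Bk32 | 64 | 32 | 285 | 4923 | 8293 | 78s |
| Clf32 | 64 | 32 | 309 | 5195 | 8737 | 71s |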