Speed and Power Trade-offs: Applied to Adder Design: Vojin G. Oklobdzija, Ram Krishnamurthy Intel...
Speed and Power Trade-offs: Applied to Adder Design
Vojin G. Oklobdzija, Ram Krishnamurthy
Intel AMR / ACSEL Laboratory
Intel Corp / University of California Davis
www.ece.ucdavis.edu/acsel
From: Tutorial Presentation, 16th International Symposium on Computer Arithmetic
Santiago de Compostela, SPAIN
June 18, 2003
June 18, 2003 16th International Symposium on Computer Arithmetic, Santiago de Compostela, SPAIN
Issues to be addressed
• How do we compare different topologies for their efficiency?
• How do we estimate the speed and efficiency of our algorithm?
• What criteria should we use when developing a new algorithm?
• How does power enter into this equation?
Additional Issues
• Determine which topology is the best for a given power or delay budget
• Determine which topology can stretch the furthest in terms of speed or power
Previously used estimates
• Counting the number of gates (logic levels): not accurate
[Figure: carry-lookahead adder with carries C_in, C4, C8, C12, C16, C20, C24, C28, C_out. Individual adders (inputs a_i, b_i) generate g_i, p_i, and sum S_i; carry-lookahead blocks of 4 bits generate G_i, P_i, and C_in for the adders; carry-lookahead super-blocks of 4-bit blocks generate G*_i, P*_i, and C_in for the 4-bit blocks; a final group produces the carry C_out and C16.]
Critical path delay = 1Δ (for g_i, p_i) + 2×2Δ (for G, P) + 3×2Δ (for C_in) + 1 XOR-Δ (for Sum) = approx. 12Δ of delay
Critical path in Motorola's 64-bit CLA
Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63
[Figure: block diagram of the 64-bit CLA, built from PG blocks and a carry block, showing the group generate/propagate signals (G3:0, P3:0 through G63:0, P63:0) and the carries C0 through C64.]
Motorola's 64-bit CLA
Modified PG Block
Intermediate propagate signals Pi:0 are generated to speed up C3
Fan-In and Fan-Out Dependency (Oklobdzija, Barnes: IBM 1985)
Delay Comparison: Variable Block Adder
(Oklobdzija, Barnes: IBM 1985)
Delay Complexity
Design Objective
• Design takes time:
– finding results afterward is not of much value
• There is a disconnect between the measures used in computer arithmetic when developing an algorithm and what is obtained after implementation
– we want estimates that are as close as possible to the measured results
• A simple tool that can evaluate different design trade-offs for a given technology is needed
• The power trade-off is the most important
– speed and power are tradable
Logical Effort Theory
• "Back of the Envelope" complexity: good for estimating speed
• Gate delay = linear function of load
– Slope: logical effort (gate driving characteristics)
– Intercept: parasitic delay (gate internal load)
• "Logical Effort" accuracy is not sufficient
– We needed to extend and refine the method
– However, that becomes more than "Back of the Envelope"
• Logical Effort does not account for possible power-delay trade-offs
Logical Effort Theory
• Excel: a platform of choice (ARITH-16)
– Simple enough
– Can provide computation quickly
– Easy to enter a given design
• Technology characterization is needed:
– This needs to be done only once: available for every design afterwards
– Domino gate = 2 stages, dynamic and static
  • Different driving characteristics of these stages
  • Multi-output gates (carry-lookahead, Ling/conditional sum)
• Energy model needs to be included
Energy Motivation
• AGUs: performance and peak-current limiters
• High activity: thermal hotspot
• Goal: high-performance, energy-efficient design
[Figure: processor thermal map showing the cache, the execution core, and the AGU, with a temperature scale (°C) and a 120°C label. Courtesy of Intel Corp.]
Critical Paths of Representative 64-bit Adders
Kogge-Stone Adder
• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, Propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient
[Figure: Kogge-Stone prefix tree over bit positions 0 to 31: a PG stage, carry-merge gate stages, and an XOR stage.]
Sparse-tree Adder Architecture
• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates: energy-efficient
[Figure: sparse-tree adder over bit positions 0 to 31, producing carries C3, C7, C11, C15, C19, C23, C27.]
Kogge-Stone adder (8-stage)

Stage  Logical     Branch      Int. Pitch  Effective Branch  Parasitic
       Effort (G)  Effort (B)  (C)         Effort (B+I·C)    Comp.
PG     0.6         2           1           2.1               1.3
CM0    1.48        2           2           2.2               2.5
CM1    0.59        2           4           2.4               1.6
CM2    1.48        2           8           2.8               2.5
CM3    0.59        2           16          3.6               1.6
CM4    1.48        1           0           1.0               2.5
XOR    1.69        1           0           1.0               3.0
Inv    1           1           0           1.0               1.0

Path Branch Effort = $\prod B_i$ = 108.92
Path Logical Effort = $\prod G_i$ = 1.14
Path Effort = 124.63
Path Delay (ps) = 93.97

Design Parameters
Adder Pitch (um)                   10
Interconnect Cap (fF/um)           0.157
Gate Cap (fF/um)                   1.15
Avg inp. Cap / gate (um)           14
% int to gate cap / pitch (I)      10%
Inv. L.E.                          2.24
Parasitic delay                    3.8

$D = 8 \cdot (GBH)^{1/8} \cdot 2.2 + 3.8 \cdot P$
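The table's path numbers can be rebuilt from its rows. The following is a minimal sketch, not the authors' spreadsheet: it assumes H = 1, takes I as the ratio of one bit pitch of wire capacitance to the average gate input capacitance, and uses the 2.24 inverter value from the design parameters as the delay scale. All variable names are illustrative.

```python
from math import prod

# Design parameters copied from the slide.
ADDER_PITCH_UM = 10.0
WIRE_CAP_FF_PER_UM = 0.157
GATE_CAP_FF_PER_UM = 1.15
AVG_INPUT_WIDTH_UM = 14.0
INV_LE = 2.24          # "Inv. L.E."
INV_PARASITIC = 3.8    # "Parasitic delay"

# (stage, logical effort G, branch effort B, interconnect pitches C, parasitic)
STAGES = [
    ("PG",  0.60, 2, 1,  1.3),
    ("CM0", 1.48, 2, 2,  2.5),
    ("CM1", 0.59, 2, 4,  1.6),
    ("CM2", 1.48, 2, 8,  2.5),
    ("CM3", 0.59, 2, 16, 1.6),
    ("CM4", 1.48, 1, 0,  2.5),
    ("XOR", 1.69, 1, 0,  3.0),
    ("Inv", 1.00, 1, 0,  1.0),
]

# I: wire cap of one bit pitch relative to the average gate input cap (about 10%).
I = (WIRE_CAP_FF_PER_UM * ADDER_PITCH_UM) / (GATE_CAP_FF_PER_UM * AVG_INPUT_WIDTH_UM)

path_logical_effort = prod(g for _, g, _, _, _ in STAGES)
path_branch_effort = prod(b + I * c for _, _, b, c, _ in STAGES)
H = 1.0  # assumption of this sketch: path load equals path input capacitance
path_effort = path_logical_effort * path_branch_effort * H

n = len(STAGES)
path_parasitic = sum(p for *_, p in STAGES)
path_delay_ps = n * path_effort ** (1 / n) * INV_LE + INV_PARASITIC * path_parasitic

print(f"I = {I:.4f}")
print(f"B = {path_branch_effort:.2f}, G = {path_logical_effort:.2f}, F = {path_effort:.2f}")
print(f"D = {path_delay_ps:.2f} ps")
```

Under these assumptions the branch, logical, and path efforts match the slide's 108.92, 1.14, and 124.63. The delay comes out close to, but not exactly at, the tabulated 93.97 ps, presumably because the slide's parasitic values are rounded.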
MXA2 – Architecture & Result
• Multiplexer-based
• Generate carries using radix-2 (P,G)
• 4-bit conditional sum selected by carries
• 4-b cell width = 17 µm
• 9-stage critical path
– Per-stage effort = 3.7
– Total effort delay = 33.3
– Total parasitic = 22.5
– Total delay = 55.8
[Figure: 64-bit multiplexer-based adder made of sixteen PG4 blocks and sixteen S4 conditional-sum blocks (bits 0..3 through 60..63), with the PG group circuit and the multiplexer-based sum selection for Sum0 to Sum3.]
HC2 – Architecture
• Generate even carries using radix-2 (P,G)
• Generate odd carries from even carries
• CMOS adder for sum
• 1-b cell width 4 µm
• 10-stage critical path
[Figure: Han-Carlson prefix tree over bits 0 to 63 with levels L1 to L6 followed by the odd-carry and sum stages, and the critical-path gate chain from (p,g) through carry-merge stages CM0 to CM6 (NAND2/NOR2 and AOI/OAI gates) to the XOR2 sum, for even and odd bits.]
HC2 – Circuits & Results
[Figure: static and dynamic circuits for the g, p, G, P, and Sum cells.]

          Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Static    2.8τ              28.0τ               34.5τ            62.5τ
KS2 – Architecture & Results
• Generate carries using radix-2 (P,G)
• CMOS adder for sum
• Similar circuits as HC2
• 1-b cell width 4 µm
• 9-stage critical path

          Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Static    3.0τ              27.0τ               30.6τ            57.6τ
Dynamic   2.11τ             19.0τ               23.6τ            42.6τ

[Figure: Kogge-Stone prefix tree over bits 0 to 63 with levels L1 to L6, followed by the inverter and sum stages.]
KS4 – Architecture
• Generate carries using redundant radix-4 (P,G)
• Dynamic circuit
• 1-b cell width 4 µm
• 6-stage critical path
[Figure: radix-4 Kogge-Stone prefix tree over bits 0 to 63 with G4/P4 and G16/P16 stages followed by the carry-out and sum stage.]
KS4 – Circuits & Result
[Figure: dynamic circuits for the G4, P4, G16, P16, and Sum cells.]

          Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic   2.3τ              13.8τ               16.3τ            30.1τ
CLA4 – Architecture
• Generate carries using radix-4 (P,G,C)
• 1-b cell width 4 µm
• 15-stage critical path
[Figure: (P,G,C) network of PGC blocks over bits b0 to b63, producing carries C4 through C60 from Cin = C0, with the G-path and P-path marked.]
CLA4 – Circuits & Result
[Figure: dynamic circuits for the G, P, K signals and the sum, and the 4-bit PGC block with inputs G0 to G3, P0 to P3, C0 and outputs P1:0 to P3:0, G1:0 to G3:0, C1 to C3.]

          Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic   1.4τ              21.0τ               33.3τ            54.3τ
LNG4 – Architecture
• Generate carries using Ling pseudo-carries
• Conditional sums selected by local & long carries
• 1-b cell width 5.1 µm; 9-stage critical path
LNG4 – Circuits & Result
[Figure: dynamic circuits for the G3, G4, P4, long-carry (LC), and conditional-sum (SumH, SumL) cells.]

          Per-Stage Effort  Total Effort Delay  Total Parasitic  Total Delay
Dynamic   2.4τ              21.6τ               22.3τ            43.9τ
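The result rows on the architecture slides above all follow the same arithmetic: total effort delay is the stage count times the per-stage effort, and total delay adds the total parasitic. The sketch below recomputes them from the slide values; the tuple layout and names are illustrative, and MXA2 is labeled static only because the simulation table lists it under Static.

```python
# (adder, style, stages N, per-stage effort f, total parasitic P), delays in units of tau
RESULTS = [
    ("MXA2", "static",  9,  3.7,  22.5),
    ("HC2",  "static",  10, 2.8,  34.5),
    ("KS2",  "static",  9,  3.0,  30.6),
    ("KS2",  "dynamic", 9,  2.11, 23.6),
    ("KS4",  "dynamic", 6,  2.3,  16.3),
    ("CLA4", "dynamic", 15, 1.4,  33.3),
    ("LNG4", "dynamic", 9,  2.4,  22.3),
]


def total_delay(n_stages: int, stage_effort: float, parasitic: float) -> float:
    """Total delay D = N * f + P for a path with equal stage efforts."""
    return n_stages * stage_effort + parasitic


totals = {
    (name, style): total_delay(n, f, p) for name, style, n, f, p in RESULTS
}

for (name, style), d in totals.items():
    print(f"{name:5s} {style:8s} total delay = {d:.1f} tau")
```

The printed totals agree with the "Total Delay" column of each slide to within rounding of the per-stage effort.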
Results from Simulation
• Fairly consistent with logical effort analysis
• Per-stage delay
– 1.4 FO4 (static)
– 0.8 FO4 (dynamic)

Type     Adder  # Stages  LE (FO4)  SPICE (FO4)  Diff (FO4)
Static   KS2    9         11.8      10.9         -0.88
Static   MX2    9         11.4      12.8         1.41
Static   HC2    10        12.8      13.3         0.46
Dynamic  KS4    6         6.2       7.4          1.27
Dynamic  KS2    9         8.7       9.2          0.44
Dynamic  LNG4   9         9.0       9.5          0.51
Dynamic  HC2    10        9.8       9.9          0.08
Dynamic  CLA4   16        11.4      14.2         2.74

[Figure: bar chart of HSPICE delay and its difference from the logical-effort estimate (FO4) for each adder.]
Delay of Representative 64-b Adders
[Figure: bar chart of total delay (FO4), static and dynamic, for MXA2, HC2, KS2, QTA2, KS4, LNG4.]
What happens when Power is considered?
[Figure: energy vs. delay curves for Adder A and Adder B, with design points A and B and two regions, Region 1 and Region 2.]
What happens when Power is considered?
[Figure: energy vs. delay curves for Adder A and Adder B across Region 1 and Region 2, with points A, B, A', B', A", B" and the speeds of A and B marked. Annotations: "A is faster", "Less power", "Point where B becomes better than A", "With better E-D tradeoff B can achieve more speed with less power than A".]
• Must look at Energy-Delay Space of designs
Energy-Delay Space
[Figure: energy vs. delay curves for different adders, bounded by the speed barrier (Dmin) and the power limit (Emin).]
Logical Effort in Energy-Delay Space
[Figure: energy vs. total delay curve with the LE point marked, and the directions of lower and higher stage effort. Annotation: "Most design approaches focus here".]
• Is it possible to lower energy by trading delay? or …
Logical Effort
Delay in a Logic Gate
Delay of a logic gate has two components:
$d = f + p$, where f is the effort delay (stage effort) and p is the parasitic delay
$f = g h$, where g is the logical effort and h is the electrical effort, $h = C_{out}/C_{in}$
• Logical effort describes the relative ability of a gate topology to deliver current (defined to be 1 for an inverter)
• Electrical effort is the ratio of output to input capacitance; it is also called "fanout"
*from Mathew Sanu / D. Harris
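As a quick illustration of the two-component delay model, the sketch below evaluates d = gh + p for an inverter driving four times its own input capacitance. The function names are illustrative, and the parasitic delay p = 1 is the normalized-inverter convention, taken here as an assumption.

```python
def electrical_effort(c_out: float, c_in: float) -> float:
    """Electrical effort (fanout) h = Cout / Cin."""
    return c_out / c_in


def gate_delay(g: float, h: float, p: float) -> float:
    """Gate delay d = f + p, with effort delay f = g * h."""
    effort_delay = g * h
    return effort_delay + p


# Normalized inverter (g = 1, p = 1) loaded with four times its input capacitance.
h = electrical_effort(c_out=4.0, c_in=1.0)
fo4_delay = gate_delay(g=1.0, h=h, p=1.0)
print(f"h = {h}, d = {fo4_delay}")
```

In these normalized units the fanout-of-4 inverter delay is 5.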
Logical Effort Parameters: Inverter
• d = gh + p
• Delay increases linearly with fanout
• More complex gates have greater g and p
[Figure: inverter delay vs. fanout h = Cout/Cin (0 to 6). The line d = gh + p has slope g = 2.2 (logical effort) and intercept p = 3.8 ps (parasitic delay).]
*from Mathew Sanu / D. Harris
Normalized Logical Effort: Inverter
• Define delay of unloaded inverter = 1
• Define logical effort 'g' of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
[Figure: normalized delay d vs. fanout h = Cout/Cin for the inverter, split into parasitic delay and effort delay: g = 1, p = 1, d = gh + p = h + 1.]
*from Mathew Sanu / D. Harris
Computing Logical Effort
DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current
• Measured from delay vs. fanout plots of simulated gates
• Or estimated, counting capacitance in units of transistor W
*from Mathew Sanu / D. Harris
L.E. for Adder Gates
[Figure: delay (ps) vs. fanout (0 to 6) for the Inverter, Static CM, Dyn PG, Dyn CM, and Mux gates.]
• Logical effort parameters obtained from simulation for std cells
• Define logical effort 'g' of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1
*from Mathew Sanu / D. Harris
Normalized L.E.
• Logical effort & parasitic delay normalized to that of inverter

Gate type     Logical Eff. (g)  Parasitics (Pinv)
Inverter      1                 1
Dyn. Nand     0.6               1.34
Dyn. CM       0.6               1.62
Dyn. CM-4N    1                 3.71
Static CM     1.48              2.53
Mux           1.68              2.93
XOR           1.69              2.97

*from Mathew Sanu
Delay of a string of gates
• Delay of a path: $D = \sum d_i = \sum g_i h_i + \sum p_i$
• $g_i$ and $p_i$ are constants
• To minimize path delay, optimal values of $h_i$ are to be determined
D is minimized when each stage bears the same effort, i.e. $g_i h_i = g_{i+1} h_{i+1}$
*from Mathew Sanu / D. Harris
Minimizing path delay
• Logical effort of a string of gates: $G = \prod g_i$
• Path electrical effort: $H = \prod h_i = \frac{C_{out}(path)}{C_{in}(path)}$
• Branching effort: $b = \frac{C_{on\text{-}path} + C_{off\text{-}path}}{C_{on\text{-}path}}$
• Path branching effort: $B = \prod b_i$
• Path effort: $F = GBH$
Delay is minimized when each stage bears the same effort:
$f = g_i h_i = F^{1/N}$
The minimum delay of an N-stage path is: $N F^{1/N} + P$
*from Mathew Sanu / D. Harris
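The minimum-delay result above can be sketched in a few lines. The three-inverter path used as input is a hypothetical example, not one of the tutorial's adders, and the function names are illustrative.

```python
from math import prod


def path_effort(g: list[float], b: list[float], H: float) -> float:
    """Path effort F = G * B * H, with G and B the products of the stage values."""
    return prod(g) * prod(b) * H


def min_path_delay(g: list[float], b: list[float], p: list[float], H: float):
    """Return (minimum delay N * F**(1/N) + P, optimal per-stage effort F**(1/N))."""
    n = len(g)
    stage_effort = path_effort(g, b, H) ** (1.0 / n)
    return n * stage_effort + sum(p), stage_effort


# Hypothetical path: three normalized inverters, no branching, load = 64x input cap.
g = [1.0, 1.0, 1.0]
b = [1.0, 1.0, 1.0]
p = [1.0, 1.0, 1.0]
delay, stage_effort = min_path_delay(g, b, p, H=64.0)

# Each stage's electrical effort follows from f = g_i * h_i.
stage_h = [stage_effort / gi for gi in g]
print(f"stage effort = {stage_effort:.3f}, delay = {delay:.3f}, h_i = {stage_h}")
```

For this path F = 64, so each stage bears an effort of 4 and the minimum delay is 3 × 4 + 3 = 15 in normalized units.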
Inclusion of Wire Delay into Logical Effort
Wiring Load
• Wiring in hand analysis
– Only lumped capacitance included
• Wiring in HSPICE
– Short wire: 1-segment π-model RC network
– Long wire: 4-segment π-model RC network
– Using worst-case wire capacitance
• Wire length
– Estimated from most critical 1-bit pitch
Modeling interconnect cap.
• Include interconnect cap in branching factor
Without interconnect: $b = \frac{C_{on\text{-}path} + C_{off\text{-}path}}{C_{on\text{-}path}} = 2$
With interconnect: $b = \frac{C_{on\text{-}path} + C_{off\text{-}path} + C_{int}}{C_{on\text{-}path}} = 2 + \frac{C_{int}}{C_{on\text{-}path}} = 2 + I$
I: % int. cap to gate cap in 1 adder bit pitch
[Figure: a PG gate driving two CM0 gates one adder bit pitch apart, shown without and with the interconnect capacitance Cint.]
Branching
[Figure: input CIN fans out to two two-gate branches: gates g0, g1 (stage efforts f0, f1) driving COUT1, and gates g2, g3 (stage efforts f2, f3) driving COUT2.]
Logical Effort assumes the "branching" factor of this circuit to be 2. This is incorrect and can create inaccuracies.
Correction on Branching
[Figure: input CIN fans out to two two-gate branches: gates g0, g1 (stage efforts f0, f1) driving COUT1, and gates g2, g3 (stage efforts f2, f3) driving COUT2.]
f0 = f1, f2 = f3
Td1 = (f0 + f1 + parasitics), Td2 = (f2 + f3 + parasitics)
Minimum delay occurs when Td1 = Td2
"Real" Branching Calculation
$F_1 = \frac{g_0 g_1 C_{out1}}{C_{in}}$, $F_2 = \frac{g_2 g_3 C_{out2}}{C_{in}}$
$B_1 = \frac{F_1 + F_2}{F_1} = \frac{g_0 g_1 C_{out1} + g_2 g_3 C_{out2}}{g_0 g_1 C_{out1}}$
$B_2 = \frac{F_1 + F_2}{F_2} = \frac{g_0 g_1 C_{out1} + g_2 g_3 C_{out2}}{g_2 g_3 C_{out2}}$
Branching only equals 2 when: $g_0 g_1 C_{out1} = g_2 g_3 C_{out2}$
This explains why we had to resort to Excel!
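A small numerical sketch of the branching calculation follows. It implements the ratio of the combined branch weight g·g·Cout to each branch's own weight, which is how the slide's formulas read; the function name and the example values are hypothetical, and the sketch assumes both branches have the same number of stages and comparable parasitics.

```python
def real_branching(g0: float, g1: float, c_out1: float,
                   g2: float, g3: float, c_out2: float) -> tuple[float, float]:
    """Branching factors (B1, B2) seen by the two branches of a two-way fork.

    Each branch is weighted by the product of its logical efforts and its load;
    the common input capacitance cancels out of the ratio.
    """
    w1 = g0 * g1 * c_out1
    w2 = g2 * g3 * c_out2
    return (w1 + w2) / w1, (w1 + w2) / w2


# Balanced fork: identical gates and loads give the textbook branching of 2.
b1_equal, b2_equal = real_branching(1.0, 1.0, 10.0, 1.0, 1.0, 10.0)

# Unbalanced fork (hypothetical values): the second branch drives a 3x larger load.
b1_skew, b2_skew = real_branching(1.0, 1.0, 10.0, 1.0, 1.0, 30.0)

print(b1_equal, b2_equal)
print(b1_skew, b2_skew)
```

In the unbalanced case the lightly loaded branch sees a branching factor of 4 while the heavily loaded one sees 4/3, so assuming 2 for both would misestimate each path.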
Technology Characterization
Characterization Setup
• Logical Effort requirements:
– Equalize input and output transitions.
• Logical Effort is characterized by varying the h (Cout/Cin) of a gate. By using a variable load of inverters, each gate can be characterized over the same range of loads.
• The Logical Effort of each gate is characterized for each input.
• Energy is characterized for each output transition of the gate caused by each input transition, e.g. for an inverter, energy is measured for tLH and tHL.
LE Characterization Setup for Static Gates
[Figure: input In drives a chain of gates terminated by a variable load. Measured quantities: tLH, tHL, average, energy.]
LE Characterization Setup for Dynamic Gates
[Figure: input In drives a chain of gates terminated by a variable load. Measured quantities: tHL, energy.]
LE Table (Static CMOS)
• Technology: P/N Ratio = 2, g_INV = 3.67, p_INV = 4.29
• Measured on worst-case single-input switching

Fan-out   INV    NAND2  NAND3  NOR2   TGXORi  TGXORs  TGMUXi  TGMUXs  AOI    OAI
2         11.6   16.3   22.2   20.5   34.9    22.3    8.0     26.0    23.2   21.3
3         15.3   20.0   26.6   25.4   42.6    28.2    9.9     33.0    28.5   26.7
4         19.0   24.0   31.2   30.6   50.2    34.2    12.0    39.0    34.1   32.1
6         26.4   32.4   40.6   41.1   64.4    45.7    16.0    53.0    45.3   43.6
8         33.6   40.6   50.0   51.9   79.8    56.5    20.2    68.0    56.7   55.3
g (ps)    3.67   4.08   4.65   5.25   7.43    5.71    2.04    6.97    5.60   5.68
p (ps)    4.29   7.90   12.74  9.77   20.19   11.12   3.85    11.76   11.82  9.69
g (norm)  1.00   1.11   1.27   1.43   2.03    1.56    0.55    1.90    1.52   1.55
p (norm)  1.00   1.84   2.97   2.28   4.71    2.59    0.90    2.74    2.76   2.26
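The g and p rows of the table are consistent with a straight-line fit of delay against fan-out. The sketch below uses an ordinary least-squares fit, which is an assumption about how the values were obtained rather than something the slides state; the helper name is illustrative and only the INV and NAND2 columns are used.

```python
def fit_logical_effort(fanouts: list[float], delays_ps: list[float]) -> tuple[float, float]:
    """Least-squares line d = g * h + p; returns (slope g, intercept p) in ps."""
    n = len(fanouts)
    mean_h = sum(fanouts) / n
    mean_d = sum(delays_ps) / n
    sxx = sum((h - mean_h) ** 2 for h in fanouts)
    sxy = sum((h - mean_h) * (d - mean_d) for h, d in zip(fanouts, delays_ps))
    g = sxy / sxx
    p = mean_d - g * mean_h
    return g, p


FANOUTS = [2, 3, 4, 6, 8]
INV_DELAY_PS = [11.6, 15.3, 19.0, 26.4, 33.6]
NAND2_DELAY_PS = [16.3, 20.0, 24.0, 32.4, 40.6]

g_inv, p_inv = fit_logical_effort(FANOUTS, INV_DELAY_PS)
g_nand2, p_nand2 = fit_logical_effort(FANOUTS, NAND2_DELAY_PS)

# Normalize to the inverter, as in the "(norm)" rows of the table.
g_nand2_norm = g_nand2 / g_inv
p_nand2_norm = p_nand2 / p_inv

print(f"INV:   g = {g_inv:.2f} ps, p = {p_inv:.2f} ps")
print(f"NAND2: g = {g_nand2:.2f} ps, p = {p_nand2:.2f} ps")
print(f"NAND2 normalized: g = {g_nand2_norm:.2f}, p = {p_nand2_norm:.2f}")
```

With this fit the inverter comes out near g = 3.67 ps and p = 4.29 ps, and NAND2 near 1.11 and 1.84 after normalization, in line with the tabulated rows.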
Static CMOS Gates: Delay Graphs
[Figure: two plots of delay vs. fanout (0 to 9). Left: INV, NAND2, NAND3, NOR2, AOI, OAI. Right: INV, TGXORi, TGXORs, TGMUXi, TGMUXs.]
Static Gates: Pull-up Delay Graph
[Figure: pull-up delay vs. fanout (0 to 9) for INV, NAND2, NAND3, NOR2, AOI, OAI.]
LE Table (Dynamic CMOS)
• Technology:
• Minimum-sized keeper included
• Measured on all-input switching of worst path

Fan-out   DN2    DN3    DN4    Dk1ND2  Dk1NR2  DAOI_A  DOAI_O
2         9.9    12.7   16.0   13.7    10.6    10.1    8.8
3         12.6   14.7   19.1   16.7    13.2    12.1    11.3
4         16.0   18.3   23.2   20.7    16.7    14.7    14.0
6         21.7   24.7   30.2   27.9    23.2    20.0    19.2
8         27.3   31.2   37.8   36.1    29.5    24.8    24.0
g (ps)    2.92   3.15   3.65   3.75    3.19    2.49    2.55
p (ps)    4.04   5.82   8.46   5.76    3.95    4.86    3.75
g (norm)  0.80   0.86   1.00   1.02    0.87    0.68    0.69
p (norm)  0.94   1.36   1.97   1.34    0.92    1.13    0.87
Dynamic CMOS: Delay Graphs
[Figure: two plots of delay vs. fanout (0 to 10). Left: N2, N3, N4, k1ND2, k1NR2, AOI_A, OAI_O. Right: G4, P4, C4, STBSum.]
Dynamic CMOS: Delay Graphs
[Figure: two plots of delay vs. fanout (0 to 10). Left: LG3, LP4, G4, P4, LC, Lsum. Right: KSG4, KSP4, KSG16, KSP16, KSSum.]
Energy Calculation
Energy Calculation
8X Minimal Size Dyn-NAND
16X Minimal Size Dyn-NAND
Energy Calculation
Offset (parasitic + wiring energy) vs. Size (in multiples of the gate size)
[Figure: offset vs. gate size (x, 0 to 45) with linear fits for inv, dgck, oai_o, daoi, tgxor, aoi_o, na2s, tgmuxs. Fitted lines shown on the plot:]
y = 0.8931x + 4.6411
y = 1.1413x + 10.22
y = 1.6382x + 11.988
y = 0.5538x + 12.338
y = 3.89x + 14.5
y = 1.9595x + 9.621
y = 1.2559x + 6.762
y = 1.0592x + 1.71
Energy Calculation
[Figure: inverter energy [fJ] as a function of load [u] and size.]
Energy Calculation
INV: output capacitance (u) and energy [fJ] for multiplier factors 1, 5, 10, 15, 20

     Output Capacitance (u)                Energy [fJ]
M    1      5      10     15     20        1         5         10        15        20
0    1.12   5.6    11.2   16.8   22.4      2.51E+00  1.26E+01  2.51E+01  3.77E+01  5.02E+01
1    2.24   11.2   22.4   33.6   44.8      3.70E+00  1.85E+01  3.70E+01  5.54E+01  7.39E+01
2    3.36   16.8   33.6   50.4   67.2      4.85E+00  2.42E+01  4.85E+01  7.27E+01  9.70E+01
3    4.48   22.4   44.8   67.2   89.6      6.16E+00  3.08E+01  6.16E+01  9.24E+01  1.23E+02
4    5.6    28     56     84     112       7.45E+00  3.73E+01  7.45E+01  1.12E+02  1.49E+02
5    6.72   33.6   67.2   100.8  134.4     8.74E+00  4.37E+01  8.74E+01  1.31E+02  1.75E+02
6    7.84   39.2   78.4   117.6  156.8     1.02E+01  5.08E+01  1.02E+02  1.52E+02  2.03E+02
7    8.96   44.8   89.6   134.4  179.2     1.15E+01  5.75E+01  1.15E+02  1.72E+02  2.30E+02
8    10.08  50.4   100.8  151.2  201.6     1.27E+01  6.36E+01  1.27E+02  1.91E+02  2.54E+02
9    11.2   56     112    168    224       1.42E+01  7.08E+01  1.42E+02  2.13E+02  2.83E+02
10   12.32  61.6   123.2  184.8  246.4     1.55E+01  7.76E+01  1.55E+02  2.33E+02  3.10E+02
11   13.44  67.2   134.4  201.6  268.8     1.69E+01  8.44E+01  1.69E+02  2.53E+02  3.37E+02
12   14.56  72.8   145.6  218.4  291.2     1.81E+01  9.05E+01  1.81E+02  2.71E+02  3.62E+02
13   15.68  78.4   156.8  235.2  313.6     1.97E+01  9.85E+01  1.97E+02  2.96E+02  3.94E+02
14   16.8   84     168    252    336       2.09E+01  1.04E+02  2.09E+02  3.13E+02  4.18E+02
15   17.92  89.6   179.2  268.8  358.4     2.26E+01  1.13E+02  2.26E+02  3.39E+02  4.52E+02
16   19.04  95.2   190.4  285.6  380.8     2.39E+01  1.20E+02  2.39E+02  3.59E+02  4.79E+02
17   20.16  100.8  201.6  302.4  403.2     2.53E+01  1.27E+02  2.53E+02  3.80E+02  5.06E+02
18   21.28  106.4  212.8  319.2  425.6     2.67E+01  1.34E+02  2.67E+02  4.01E+02  5.34E+02
19   22.4   112    224    336    448       2.81E+01  1.40E+02  2.81E+02  4.21E+02  5.61E+02

NAND-2 factors (labels on the slide: Output Capacitance Factor, Energy Factors): 1.211300121, 7.39E-01
Examples
64-Bit Adders
• Han-Carlson (prefix-2, HC2): Static and Dynamic
• Han-Carlson (prefix-2, HC2-2): Dynamic-Static
• Kogge-Stone (prefix-2, KS2): Static and Dynamic
• Kogge-Stone (prefix-2, KS2-2): Dynamic-Static
• Quaternary-Tree (prefix-2, QT2): Static and Dynamic
Included wire delay: $t_{delay} = 0.7 R_{wire} C_{wire}$
Included wire energy: $E_w = C_{wire} V^2$

Len (um)    10    20    30    40    60    80    120   160   240   320   480
Delay (ps)  0.01  0.04  0.09  0.17  0.38  0.67  1.50  2.67  6.01  10.7  24.1
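Because both the resistance and the capacitance of a wire grow with its length, the 0.7·R·C delay grows with the square of the length. The sketch below checks that the tabulated delays follow this trend; the per-unit-length parameters of `wire_delay` are hypothetical placeholders, since the slides do not give the wire resistance.

```python
def wire_delay(r_per_um: float, c_per_um: float, length_um: float) -> float:
    """Lumped wire delay 0.7 * R * C with R = r * L and C = c * L."""
    return 0.7 * (r_per_um * length_um) * (c_per_um * length_um)


LENGTH_UM = [10, 20, 30, 40, 60, 80, 120, 160, 240, 320, 480]
DELAY_PS = [0.01, 0.04, 0.09, 0.17, 0.38, 0.67, 1.50, 2.67, 6.01, 10.7, 24.1]

# If delay scales as L^2, then delay / L^2 should be roughly constant.
k = [d / (length ** 2) for length, d in zip(LENGTH_UM, DELAY_PS)]
spread = max(k) / min(k)

for length, ratio in zip(LENGTH_UM, k):
    print(f"L = {length:4d} um  delay / L^2 = {ratio:.3e} ps/um^2")
print(f"max/min ratio = {spread:.3f}")
```

The ratio stays within a few percent across the table, with the shortest wires limited by the two-decimal rounding of their delays.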
Test Setup
[Figure: adder with inputs A0 to A63 and outputs S0 to S63, each output loaded by Cwire of a 1 mm wire.]
H = (Cin + Cwire)/Cin
Energy-Delay Estimates
Adders: Energy
Energy vs. Delay
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: energy [pJ] (0 to 900) vs. delay [pS] (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary Dynamic (2-2), Quaternary Static. Curve groups are labeled Dynamic (KS, HC), Static, and Dynamic-Static (QT, KS, HC).]
Dynamic-Static Implementation of Carry-Merge Stage
[Figure: regular domino implementation vs. compound-domino implementation of the carry-merge stage. Clocked dynamic gates combine Gi, Gi-1, Pi and Gi-2, Gi-3, Pi-2 and Pi-1, Pi; the compound-domino version uses a static gate and a delayed clock, and marks the inverters to be eliminated.]
Energy-Delay comparison of 64-bit KS, HC and QT adders
[Figure: normalized energy (0 to 3) vs. normalized delay (0.9 to 2.1) for QT Static, HC Static, KS Static, QT compound-domino, HC compound-domino, KS compound-domino.]
Adders: Critical Path Energy
Critical Path Energy vs. Delay (no internal wire energy)
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input
[Figure: energy [fJ] (0 to 12000) vs. delay [pS] (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary (2-2), Quaternary Static (2-2). Curves are labeled QT dynamic-static, HC dynamic-static, QT static, KS dynamic-static, HC-dynamic, KS dynamic, HC-static, KS-static.]
Intel 32-bit Adder 0.13u 1.2V [VLSI-2002]
Comparison with Intel Measured Data
[Figure: energy [fJ] (0 to 50) vs. delay [pS] (0 to 200) for Kogge-Stone (2-0), Quaternary (2-2), Intel Kogge-Stone (2-0), Intel Quaternary (2-2). Curves are labeled QT, KS, KS estimated, QT estimated.]
Energy-Delay comparison of 32-bit QT and KS adders: estimated vs. simulation in 0.10 µm technology
[Figure: energy [pJ] (0 to 60) vs. delay [pS] (90 to 160) for KS [9], QT [9], KS Estimate, QT Estimate, with differences of 55% and 35% marked.]
Est. Results: All Adders w/o Wires
[Figure: estimated energy (J), 0 to 1E-10, vs. delay (FO4), 7 to 15, for sKS, sHC, sQT9, dKS, dHC, dQT9, dQT7, dCLA, dIBM, dLNG.]
Est. Results: All Adders w/ Wires
[Figure: estimated energy (J), 0 to 2.0E-10, vs. delay (FO4), 8 to 18, for sKS_LE, sHC_LE, sQT9_LE, dKS_LE, dHC_LE, dQT9_LE, dQT7_LE, dIBM_LE, dLNG_LE.]
Energy-Delay Trade-offs
[Figure: energy [pJ] (0 to 80) vs. delay [ns] (0 to 4.0), showing the initial design and the optimized design with the energy saving and delay saving marked. Worst-case energy vector with 100% input activity; 90nm technology; collaboration with Intel AMR.]
Conclusion
• Using realistic measures for comparing various designs leads to better design choices
• Power is as important as speed
• Making comparisons in Energy-Delay space is necessary:
– power can always be traded for speed and vice versa
• Wire effects are significant
• Leakage currents?