L15: VLSI Integration and Performance...
-
Upload
nguyendiep -
Category
Documents
-
view
226 -
download
0
Transcript of L15: VLSI Integration and Performance...
L15: 6.111 Spring 2009 1Introductory Digital Systems Laboratory
L15: VLSI Integration and Performance Transformations
Acknowledgements: • Lecture prepared by Professor Anantha Chandrakasan• Lecture material adapted from J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital
Integrated Circuits: A Design Perspective” Copyright 2003 Prentice Hall/Pearson.• Curt Schurgers
• Moore’s Law and Integration• IC Design
ASIC DesignClocksTest
• Performance Transformations• Trends
L15: 6.111 Spring 2009 2Introductory Digital Systems Laboratory
0.0000001
0.000001
0.00001
0.0001
0.001
0.01
0.1
1
10
'68 '70 '72 '74 '76 '78 '80 '82 '84 '86 '88 '90 '92 '94 '96 '98 '00 '02
$
Aver
age
Cos
t of o
ne tr
ansi
stor Gordon Moore, Keynote
Presentation at ISSCC 2003
Cost of Transistor
L15: 6.111 Spring 2009 3Introductory Digital Systems Laboratory
Moore’s Law
In 1965, Gordon Moore was preparing a speech andmade a memorable observation. When he started tograph data about the growth in memory chipperformance, he realized there was a strikingtrend. Each new chip contained roughly twice as muchcapacity as its predecessor, and each chip was releasedwithin 18-24 months of the previous chip. If this trendcontinued, he reasoned, computing power would riseexponentially over relatively brief periods of time.
Electronics, April 19, 1965.
Source: Dataquest/Intel, 12/02
'68 '70 '72 '74 '76 '78 '80 '82 '84 '86 '88 '90 '92 '94 '96 '98 '00 '02
1017
1016
1015
1014
1013
1012
1011
1010
109
Tran
sist
ors
Ship
ped
Per Y
ear
1018
“…E.O. Wilson, the famous Harvard biologist who is an expert on ants, estimates that there are 10 to the 16th and 10 to the 17th ants on earth. But if you look at this curve, this year we’re making one transistor for every ant.” –Gordon Moore, “An Update on Moore’s Law”
Gordon Moore, KeynotePresentation at ISSCC 2003
40048008
80808085 8086
286386
486Pentium® proc
P6
0.001
0.01
0.1
1
10
100
1000
1970 1980 1990 2000 2010Year
Tran
sist
ors
(MT)
2X growth in 1.96 years!
Courtesy of S. Borkar (Intel)
L15: 6.111 Spring 2009 4Introductory Digital Systems Laboratory
Evolution of Transistor Integration
Moore’s Law: transistor density doubles every 1.5 - 2 years
1970 1974 1978 1982 1986 1990 1994 1998 2002
1K
4K16K
64K
1M
4M
16M
128MDRAMCost / Bit ($, 95)
2006
256M 1G
4004
8086
8080
Pentium Pro
8028680386
Pentium
Pentium IIPentium III
Cos
t/Bit
($, 9
5)
10-1
10-2
10-3
10-4
10-5
10-6
10-7
10-8
Tran
sist
ors
/ Chi
p
107
106
105
104
103
108
109
Cos
t/Fun
ctio
n
Pentium IV
#1
256K 80486
64M
#2
Microprocessor
102
10-9
L15: 6.111 Spring 2009 5Introductory Digital Systems Laboratory
Layout 101
N-channel MOSFET P-channel MOSFET
GND
VDD
metal poly p+ diff
contactfrommetalto ndiff
Ln
Wn
Lp
Wp
IN OUT
n-type well
p-type substrate
metal/pdiffcontact
n+ diff
IN OUT
VDD
SG
D
D
S
G
3-D Cross-Section
Circuit Representation
Layout
Follow simple design rules (contractbetween process and circuit designers)
L15: 6.111 Spring 2009 6Introductory Digital Systems Laboratory
Custom Design/Layout
Adder stage 1
Wiring
Adder stage 2
Wiring
Adder stage 3B
it slice 0
Bit slice 2
Bit slice 1
Bit slice 63
Sum Select
Shifter
Multiplexers
Loopback Bus
From register files / Cache / Bypass
To register files / Cache
Loopback Bus
Loopback Bus
Die photograph of the Itanium integer datapath
Bit-slice Design Methodology
Hand crafting the layout to achieve maximum clock rates (> 1Ghz)Exploits regularity in datapath structure to optimize interconnects
9-1
Mux
9-1
Mux
5-1
Mux
2-1
Mux
ck1
CARRYGEN
SUMGEN+ LU
1000um
b
s0
s1
g64
sum sumb
LU : LogicalUnit
SUM
SEL
a
to Cache
node1
REG
Itanium has 6 integer execution units like this
L15: 6.111 Spring 2009 7Introductory Digital Systems Laboratory
The ASIC Approach
Verilog (or VHDL )
Logic Synthesis
Floorplanning
Placement
Routing
Tape-out
Circuit Extraction
Pre-Layout Simulation
Post-Layout Simulation
Structural
Physical
BehavioralDesign Capture
Des
ign
Itera
tion
Most Common Design Approach for Designs up to 500Mhz Clock Rates
L15: 6.111 Spring 2009 8Introductory Digital Systems Laboratory
Standard Cell Example
3-input NAND cell(from ST Microelectronics):C = Load capacitanceT = input rise/fall time
Power Supply Line (VDD)
Ground Supply Line (GND)
Each library cell (FF, NAND, NOR, INV, etc.) and the variations on size (strength of the gate) is fully characterized across temperature, loading, etc.
Delay in (ns)!!
L15: 6.111 Spring 2009 9Introductory Digital Systems Laboratory
Standard Cell Layout Methodology
Cell-structure hidden under interconnect layers
2-level metal technology Current Day Technology
With limited interconnect layers, dedicated routing channels between rows of standard cells are needed
Width of the cell allowed to vary to accommodate complexityInterconnect plays a significant role in speed of a digital circuit
L15: 6.111 Spring 2009 10Introductory Digital Systems Laboratory
module adder64 (a, b, sum); input [63:0] a, b; output [63:0] sum;
assign sum = a + b;endmodule
Verilog to ASIC Layout (the push button approach)
After Synthesis
After Placement
After Routing
L15: 6.111 Spring 2009 11Introductory Digital Systems Laboratory
Macro Modules
256×32 (or 8192 bit) SRAM Generated by hard-macro module generator
Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code
Verilog models for memories automatically generated based on size
L15: 6.111 Spring 2009 12Introductory Digital Systems Laboratory
Clock Distribution
For 1Ghz clock, skew budget is 100ps.Variations along different paths arise
from:• Device: VT, W/L, etc.• Environment: VDD, °C• Interconnect: dielectric thickness
variation
D Q
D Q
Clock skew, courtesy Alpha
IBM Clock Routing
L15: 6.111 Spring 2009 13Introductory Digital Systems Laboratory
Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)
Courtesy M. Perrott
VCO produces high frequency square waveDivider divides down VCO frequencyPFD compares phase of ref and divLoop filter extracts phase error information
Used widely in digital systems for clock synthesis(a standard IP block in most ASIC flows)
PLL
Intel 486, 50Mhz
up
down
L15: 6.111 Spring 2009 14Introductory Digital Systems Laboratory
Scan Testing
CLK
shift inScanShift
shift out
01
ScanShift
shift in
01
ScanShift
... Idea: have a mode in which all registers are chainedinto one giant shift register which can be loaded/read-out bit serially. Test remaining (combinational)logic by(1) in “test” mode, shift in new values for all
register bits thus setting up the inputs to thecombinational logic
(2) clock the circuit once in “normal” mode, latchingthe outputs of the combinational logic back intothe registers
(3) in “test” mode, shift out the values of allregister bits and compare against expectedresults.
Courtesy of C. Terman and IEEE Press
L15: 6.111 Spring 2009 15Introductory Digital Systems Laboratory
There are a large number of implementations of the same functionality
These implementations present a different point in the area-time-power design space
Behavioral transformations allow exploring the design space a high-level
Optimization metrics:
area
time
power1. Area of the design2. Throughput or sample time TS3. Latency: clock cycles between
the input and associated output change
4. Power consumption5. Energy of executing a task6. …
Behavioral Transformations
L15: 6.111 Spring 2009 16Introductory Digital Systems Laboratory
Fixed-Coefficient Multiplication
X3 X2 X1 X0
Y3 Y2 Y1 Y0
X3 · Y0 X2 · Y0 X1 · Y0 X0 · Y0
X3 · Y1 X2 · Y1 X1 · Y1 X0 · Y1
X3 · Y2 X2 · Y2 X1 · Y2 X0 · Y2
X3 · Y3 X2 · Y3 X1 · Y3 X0 · Y3
Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0
Z = X · Y
X3 X2 X1 X0
1 0 0 1X3 X2 X1 X0
X3 X2 X1 X0
Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0
Conventional Multiplication
Z = X · (1001)2
Constant multiplication (become hardwired shifts and adds)
X Z
<< 3Y = (1001)2 = 23 + 20
shifts using wiring
L15: 6.111 Spring 2009 17Introductory Digital Systems Laboratory
Transform: Canonical Signed Digits (CSD)
0 1 1 … 1 1
0 1 1 0 1 1 1 1 0 1 1 1 0 0 0 -1
Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.
Iterative encoding: replace string of consecutive 1’s
2N-2 + … + 21 + 20
1 0 0 … 0 -1
2N-1 - 20
01101111
10010001
=
1 0 0 -1 0 0 0 -1
Worst case CSD has 50% non zero bits
X << 7 Z
<< 4Shift translates to re-wiring
L15: 6.111 Spring 2009 18Introductory Digital Systems Laboratory
Algebraic Transformations
A B B A
⇔
Commutativity
⇔
Associativity
A
CB
C
A B
⇔
Distributivity
A BA B
⇔
Common sub-expressions
C
A BA C B
A + B = B + A (A + B) C = AB + BC
(A + B) + C = A + (B+C)
X YX Y X
L15: 6.111 Spring 2009 19Introductory Digital Systems Laboratory
Transforms for Efficient Resource Utilization
CA B FD E
2
1
IG H
2
1
CA B
distributivity
Time multiplexing: mapped to 3 multipliers and 3 adders
Reduce number of operators to 2 multipliers and 2 adders
FD E IG H
L15: 6.111 Spring 2009 20Introductory Digital Systems Laboratory
A Very Useful Transform: Retiming
Retiming is the action of moving delay around in the systemsDelays have to be moved from ALL inputs to ALL outputs or vice versa
D
D
D
D
D
Cutset retiming: A cutset intersects the edges, such that this would result in two disjoint partitions of these edges being cut. To retime, delays are moved from the ingoing to the outgoing edges or vice versa.
Benefits of retiming:• Modify critical path delay• Reduce total number of registers
D
D
D
L15: 6.111 Spring 2009 21Introductory Digital Systems Laboratory
Retiming Example: FIR Filter
associativity of the addition
D D Dx(n)
h(0) h(1) h(2) h(3)
y(n)
D D Dx(n)
h(0) h(1) h(2) h(3)
y(n)
D D D
x(n)
h(0) h(1) h(2) h(3)
y(n)
retime
Direct form
Transposed form
Symbol for multiplication
∑=
⋅−=⊗=K
i
ihinxnxnhny0
)()()()()(
(10)
(4)
Tclk = 22 ns
Tclk = 14 ns
Note: here we use a first cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.
L15: 6.111 Spring 2009 22Introductory Digital Systems Laboratory
Pipelining, Just Another Transformation(Pipelining = Adding Delays + Retiming)
How to pipeline:1. Add extra registers at
all inputs2. Retime
D
D
D
D
D
D
D
D
D
retime
add input registers
Contrary to retiming, pipelining adds extra registers to the system
L15: 6.111 Spring 2009 23Introductory Digital Systems Laboratory
The Power of Transforms: Lookahead
D
x(n) y(n)
A
2D
x(n) y(n)
D AA
2D
x(n) y(n)
DAAA
2D
x(n) y(n)
DA2A
D
x(n) y(n)
A2
A DD
loop unrolling
distributivity
associativity
retiming
precomputed
y(n) = x(n) + A[x(n-1) + A y(n-2)]
y(n) = x(n) + A y(n-1)
Try pipeliningthis structure
How about pipeliningthis structure!
L15: 6.111 Spring 2009 24Introductory Digital Systems Laboratory
Key Concern in Modern VLSI: Variations!
10
100
1000
10000
1000 500 250 130 65 32
Technology Node (nm)
Mea
n N
umbe
r of
Dop
ant
Ato
ms
Random Dopant Fluctuations
Temp Variation & Hot spots
ToxD
GATE
SOURCE DRAIN
Leff
BODY
Delay
Path Delay Prob
abili
ty Due to variations in:Vdd, Vt, and
Temp
Deterministic design techniques inadequate in the futureCourtesy of S. Borkar (Intel)
With 100b transistors, 1b unusable (variations)
L15: 6.111 Spring 2009 25Introductory Digital Systems Laboratory
Trends: “Chip in a Day”(Matlab/Simulink to Silicon…)
Mult2
Mac2Mult1 Mac1
S reg X reg Add,Sub,Shift
Courtesy of R. Brodersen
Map algorithms directly to silicon - bypass writing Verilog!
L15: 6.111 Spring 2009 26Introductory Digital Systems Laboratory
(courtesy of G. Qu, M. Potkonjak)
Trends: Watermarking of Digital Designs
Fingerprinting is a technique to deter people from illegallyredistributing legally obtained IP by enabling the author of the IP touniquely identify the original buyer of the resold copy.The essence of the watermarking approach is to encode the author'ssignature. The selection, encoding, and embedding of the signaturemust result in minimal performance and storage overhead.
same functionality, same area, same performance watermark of 4768 bits embedded
L15: 6.111 Spring 2009 27Introductory Digital Systems Laboratory
L15: 6.111 Spring 2009 28Introductory Digital Systems Laboratory
Evolution of Transistor Integration
Moore’s Law: transistor density doubles every 1.5 - 2 years
1970 1974 1978 1982 1986 1990 1994 1998 2002
1K
4K16K
64K
1M
4M
16M
128MDRAMCost / Bit ($, 95)
2006
256M 1G
4004
8086
8080
Pentium Pro
8028680386
Pentium
Pentium IIPentium III
Cos
t/Bit
($, 9
5)
10-1
10-2
10-3
10-4
10-5
10-6
10-7
10-8
Tran
sist
ors
/ Chi
p
107
106
105
104
103
108
109
Cos
t/Fun
ctio
n
Pentium IV
#1
256K80486
64M
#2
Microprocessor
102
10-9
L15: 6.111 Spring 2009 29Introductory Digital Systems Laboratory
Processor Performance Trends
Processor performance follows Moore’s Law- doubles every 2 years
Intel microprocessor
0.01
0.1
1
10
100
1000
10000
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00
Date of Introduction
MIP
S
386
4004
8080
8086 80286
486
PentiumPentium Pro
Pentium II
XeonPentium III
L15: 6.111 Spring 2009 30Introductory Digital Systems Laboratory
Clock Frequency of Microprocessors
P6Pentium ® proc486
38628680868085
8080800840040.1
1
10
100
1000
10000
1970 1980 1990 2000 2010Year
Freq
uenc
y (M
hz) Doubles every
2 years
S. Borkar
L15: 6.111 Spring 2009 31Introductory Digital Systems Laboratory
Power Consumption Trends
Courtesy S. Borkar, Intel
5KW
18KW
1.5KW
500W
40048008
80808085
8086286 386
486
Pentium® proc
0.1
1
10
100
1000
10000
100000
1971 1974 1978 1985 1992 2000 2004 2008
Year
Pow
er (W
atts
)
Power per gate goes down but total power …
L15: 6.111 Spring 2009 32Introductory Digital Systems Laboratory
Interconnect Metallization
Six layers of Cu metallization
Lower layers are finer and are used for “local” interconnection between cellsMiddle layers are wider and are used for global interconnection between blocksUpper layers are wider and are used for clocks, ground and power distributionOxide is the Inter Metal Dielectric (etched)
L15: 6.111 Spring 2009 33Introductory Digital Systems Laboratory
Interconnect Metallization