
L15: 6.111 Spring 2009, Introductory Digital Systems Laboratory

L15: VLSI Integration and Performance Transformations

Acknowledgements:
• Lecture prepared by Professor Anantha Chandrakasan
• Lecture material adapted from J. Rabaey, A. Chandrakasan, B. Nikolic, “Digital Integrated Circuits: A Design Perspective” Copyright 2003 Prentice Hall/Pearson.
• Curt Schurgers

• Moore’s Law and Integration
• IC Design
    ASIC Design
    Clocks
    Test
• Performance Transformations
• Trends


Cost of Transistor

[Figure: average cost of one transistor ($, log scale from 0.0000001 to 10) versus year, '68 to '02. Gordon Moore, Keynote Presentation at ISSCC 2003]


Moore’s Law

In 1965, Gordon Moore was preparing a speech and made a memorable observation. When he started to graph data about the growth in memory chip performance, he realized there was a striking trend. Each new chip contained roughly twice as much capacity as its predecessor, and each chip was released within 18-24 months of the previous chip. If this trend continued, he reasoned, computing power would rise exponentially over relatively brief periods of time.

Electronics, April 19, 1965.

[Figure: transistors shipped per year (log scale, 10^9 to 10^18) versus year, '68 to '02. Source: Dataquest/Intel, 12/02]

“…E.O. Wilson, the famous Harvard biologist who is an expert on ants, estimates that there are 10 to the 16th and 10 to the 17th ants on earth. But if you look at this curve, this year we’re making one transistor for every ant.” –Gordon Moore, “An Update on Moore’s Law”

Gordon Moore, Keynote Presentation at ISSCC 2003

[Figure: transistors per chip (MT, log scale from 0.001 to 1000) versus year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processor and P6. Annotation: 2X growth in 1.96 years! Courtesy of S. Borkar (Intel)]


Evolution of Transistor Integration

Moore’s Law: transistor density doubles every 1.5 - 2 years
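As a rough illustration of what a fixed doubling period implies, the sketch below computes the growth factor 2^(t/T) after t years for a doubling period of T years. The function name and the 10-year horizon are illustrative choices, not taken from the lecture.

```python
def growth_factor(years, doubling_period_years):
    """Factor by which density grows after `years` if it doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Compare the two ends of the "1.5 - 2 years" range over one decade.
for period in (1.5, 2.0):
    factor = growth_factor(10, period)
    print(f"doubling every {period} years -> about {factor:.0f}x in 10 years")
```

The half-year difference in doubling period compounds into roughly a threefold difference in density after a decade, which is why the exact period matters when extrapolating.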

[Figure: transistors per chip (log scale, 10^2 to 10^9) and cost per function (DRAM cost per bit in 1995 dollars, log scale, 10^-1 to 10^-9) versus year, 1970 to 2006. DRAM generations: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 1G. Microprocessors: 4004, 8080, 8086, 80286, 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium IV.]


Layout 101

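[Figure: an N-channel MOSFET (in the p-type substrate) and a P-channel MOSFET (in an n-type well) connected between VDD and GND with input IN and output OUT, shown in three views: 3-D Cross-Section, Circuit Representation (terminals S, G, D) and Layout. Layout labels: metal, poly, n+ diff, p+ diff, contact from metal to ndiff, metal/pdiff contact, device dimensions Ln, Wn, Lp, Wp.]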

Follow simple design rules (contract between process and circuit designers)


Custom Design/Layout

[Figure: die photograph of the Itanium integer datapath. Labeled regions: bit slice 0, bit slice 1, bit slice 2, ... bit slice 63; adder stages 1, 2 and 3 separated by wiring; sum select; shifter; multiplexers; loopback bus; from register files / cache / bypass; to register files / cache.]

Bit-slice Design Methodology

Hand crafting the layout to achieve maximum clock rates (> 1 GHz)
Exploits regularity in datapath structure to optimize interconnects

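[Figure: block diagram of one integer execution unit, about 1000 um across: two 9-1 muxes, a 5-1 mux and a 2-1 mux feeding CARRYGEN and SUMGEN + LU (LU: Logical Unit), followed by SUM SEL and REG, with output to cache. Signal labels: a, b, ck1, s0, s1, g64, sum, sumb, node1.]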

Itanium has 6 integer execution units like this


The ASIC Approach

[Flow diagram, with design iteration loops from later steps back to earlier ones:]

Behavioral: Design Capture in Verilog (or VHDL)
Structural: Logic Synthesis, checked by Pre-Layout Simulation
Physical: Floorplanning, Placement, Routing, then Circuit Extraction and Post-Layout Simulation
Tape-out

Most Common Design Approach for Designs up to 500 MHz Clock Rates


Standard Cell Example

3-input NAND cell (from ST Microelectronics):
C = Load capacitance
T = input rise/fall time

Power Supply Line (VDD)

Ground Supply Line (GND)

Each library cell (FF, NAND, NOR, INV, etc.) and the variations on size (strength of the gate) is fully characterized across temperature, loading, etc.

Delay in (ns)!!


Standard Cell Layout Methodology

Cell-structure hidden under interconnect layers

[Figure panels: 2-level metal technology; current-day technology]

With limited interconnect layers, dedicated routing channels between rows of standard cells are needed

Width of the cell allowed to vary to accommodate complexity
Interconnect plays a significant role in speed of a digital circuit


// 64-bit combinational adder; the carry out of bit 63 is discarded
module adder64 (a, b, sum);
  input  [63:0] a, b;
  output [63:0] sum;

  assign sum = a + b;
endmodule

Verilog to ASIC Layout (the push button approach)

[Figure panels: After Synthesis; After Placement; After Routing]


Macro Modules

256×32 (or 8192 bit) SRAM Generated by hard-macro module generator

Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code

Verilog models for memories automatically generated based on size


Clock Distribution

For 1 GHz clock, skew budget is 100 ps.
Variations along different paths arise from:
• Device: VT, W/L, etc.
• Environment: VDD, °C
• Interconnect: dielectric thickness variation

[Figures: two D-Q registers clocked through different paths; clock skew, courtesy Alpha; IBM Clock Routing]


Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)

Courtesy M. Perrott

VCO produces high frequency square wave
Divider divides down VCO frequency
PFD compares phase of ref and div
Loop filter extracts phase error information

Used widely in digital systems for clock synthesis (a standard IP block in most ASIC flows)

[Figure labels: PLL; Intel 486, 50 MHz; PFD outputs up, down]


Scan Testing

[Figure: registers chained through 0/1 multiplexers controlled by ScanShift, sharing CLK, with shift in at one end and shift out at the other]

Idea: have a mode in which all registers are chained into one giant shift register which can be loaded/read-out bit serially. Test remaining (combinational) logic by
(1) in “test” mode, shift in new values for all register bits thus setting up the inputs to the combinational logic
(2) clock the circuit once in “normal” mode, latching the outputs of the combinational logic back into the registers
(3) in “test” mode, shift out the values of all register bits and compare against expected results.
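The three steps above can be sketched as a small behavioral model. This is only an illustration: `scan_test`, `example_logic` and the 3-bit register width are hypothetical names and sizes chosen for the sketch, and the chain is modeled as a Python list rather than real flip-flops.

```python
def scan_test(comb_logic, pattern):
    """Apply one scan test to a block of combinational logic.

    comb_logic: function mapping a list of register bits to the next list of bits
    pattern:    list of 0/1 values to load into the registers through the chain
    """
    n = len(pattern)
    regs = [0] * n

    # (1) "test" mode: shift the pattern in, one bit per clock.
    # Bits enter at regs[0] and move toward regs[-1], so the last bit goes first.
    for bit in reversed(pattern):
        regs = [bit] + regs[:-1]

    # (2) one clock in "normal" mode: registers capture the logic outputs.
    regs = list(comb_logic(regs))

    # (3) "test" mode again: shift the captured bits out of the end of the chain.
    shifted_out = []
    for _ in range(n):
        shifted_out.append(regs[-1])
        regs = [0] + regs[:-1]

    # regs[-1] came out first, so reverse to restore register order.
    return shifted_out[::-1]


def example_logic(bits):
    """Hypothetical 3-bit combinational block used only for this example."""
    a, b, c = bits
    return [a ^ b, b & c, a | c]


pattern = [1, 0, 1]
observed = scan_test(example_logic, pattern)
expected = example_logic(pattern)
print("pass" if observed == expected else "fail")
```

In a real flow the expected values come from simulating a fault-free model; a mismatch on the shifted-out bits points to a defect in the combinational logic or in the chain itself.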

Courtesy of C. Terman and IEEE Press


Behavioral Transformations

There are a large number of implementations of the same functionality

These implementations present different points in the area-time-power design space

Behavioral transformations allow exploring the design space at a high level

Optimization metrics:
1. Area of the design
2. Throughput or sample time TS
3. Latency: clock cycles between the input and associated output change
4. Power consumption
5. Energy of executing a task
6. …

[Figure: design space with axes area, time and power]


Fixed-Coefficient Multiplication

Conventional Multiplication: Z = X · Y

                              X3     X2     X1     X0
                            × Y3     Y2     Y1     Y0
                            --------------------------
                              X3·Y0  X2·Y0  X1·Y0  X0·Y0
                       X3·Y1  X2·Y1  X1·Y1  X0·Y1
                X3·Y2  X2·Y2  X1·Y2  X0·Y2
         X3·Y3  X2·Y3  X1·Y3  X0·Y3
    --------------------------------------------------
    Z7   Z6     Z5     Z4     Z3     Z2     Z1     Z0

Constant multiplication: Z = X · (1001)2

                   X3  X2  X1  X0
                 ×  1   0   0   1
                 -----------------
                   X3  X2  X1  X0
       X3  X2  X1  X0
    ------------------------------
    Z7 Z6  Z5  Z4  Z3  Z2  Z1  Z0

Constant multiplications become hardwired shifts and adds

Y = (1001)2 = 2^3 + 2^0

[Figure: X and X << 3 feed an adder that produces Z; shifts using wiring]
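A minimal sketch of the idea, assuming unsigned integer operands: the general multiplier adds one shifted partial product per set bit of Y, while the fixed-coefficient version for (1001)2 needs only one shift and one add. The function names are illustrative.

```python
def shift_add_multiply(x, y, width=4):
    """Conventional multiplication: one partial product per bit of y."""
    z = 0
    for i in range(width):
        if (y >> i) & 1:      # row i contributes only when bit Yi is 1
            z += x << i       # partial product X * Yi, shifted left by i places
    return z


def times_1001(x):
    """Multiply by the constant (1001)_2 = 2**3 + 2**0 with one shift and one add."""
    return (x << 3) + x


for x in range(16):
    assert times_1001(x) == shift_add_multiply(x, 0b1001) == 9 * x
print("fixed-coefficient and conventional results agree for all 4-bit X")
```

In hardware the shift costs nothing because it is only a choice of which wires connect to the adder, so the constant multiplier reduces to a single adder.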


Transform: Canonical Signed Digits (CSD)

Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.

Iterative encoding: replace string of consecutive 1’s

    0 1 1 … 1 1     =  2^(N-2) + … + 2^1 + 2^0

with

    1 0 0 … 0 -1    =  2^(N-1) - 2^0

Example:

    0 1 1  0 1 1 1  1
    0 1 1  1 0 0 0 -1
    1 0 0 -1 0 0 0 -1

so 01101111 = 1 0 0 -1 0 0 0 -1

Worst case CSD has 50% non zero bits

[Figure: Z built from X, X << 7 and X << 4, that is Z = (X << 7) - (X << 4) - X. Shift translates to re-wiring]
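The encoding can be checked with a short sketch. Note that it uses the standard non-adjacent-form recurrence (look at the value modulo 4) rather than the string-replacement procedure on the slide; both should produce the same digits for this example. The function names are illustrative, and operands are assumed to be non-negative integers.

```python
def to_csd(value):
    """Return signed digits in {-1, 0, 1} for a non-negative integer, MSB first."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1, value mod 4 == 3 -> digit -1
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits[::-1] or [0]


def csd_multiply(x, digits):
    """Multiply x by the constant given as signed digits, using shifts and add/subtract."""
    z = 0
    for position, d in enumerate(reversed(digits)):
        if d:
            z += d * (x << position)
    return z


digits = to_csd(0b01101111)
print(digits)
print(csd_multiply(5, digits) == 5 * 0b01101111)
```

For the constant 01101111 the plain binary form needs six terms, while the signed-digit form needs three: (X << 7) - (X << 4) - X.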


Algebraic Transformations

Commutativity: A + B = B + A

Associativity: (A + B) + C = A + (B + C)

Distributivity: (A + B) C = AC + BC

Common sub-expressions: [Figure: an operation on X and Y that appears in two places is computed once and its result shared]


Transforms for Efficient Resource Utilization

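[Figure: three dataflow graphs over inputs (A, B, C), (D, E, F) and (G, H, I), each scheduled in time steps 1 and 2, shown before and after applying distributivity]

Time multiplexing: mapped to 3 multipliers and 3 adders

Reduce number of operators to 2 multipliers and 2 adders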


A Very Useful Transform: Retiming

Retiming is the action of moving delays around in the system
Delays have to be moved from ALL inputs to ALL outputs or vice versa

[Figure: an operator with delays D on its inputs, and the equivalent operator with the delay moved to its output]

Cutset retiming: A cutset is a set of edges that, when cut, splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cutset or vice versa.

Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers

[Figure: cutset retiming example with delays D moved across the cut]


Retiming Example: FIR Filter

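$$y(n) = h(n) \otimes x(n) = \sum_{i=0}^{K} h(i) \cdot x(n-i)$$

Direct form: [Figure: x(n) passes through three delays D; taps h(0), h(1), h(2), h(3) are multiplied and summed in a chain of adders to give y(n)] Tclk = 22 ns

associativity of the addition: [Figure: same taps and delays, with the adder chain evaluated in the opposite order]

retime

Transposed form: [Figure: x(n) feeds all four multipliers h(0), h(1), h(2), h(3) directly; the three delays D sit between the adders on the way to y(n)] Tclk = 14 ns

Delays marked in the figure: multiplication symbol (10), adder (4)

Note: here we use a first cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.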
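The sketch below models both structures behaviorally to show that they compute the same y(n), and reproduces the two clock periods with the same first cut analysis, assuming the (10) and (4) in the figure are the multiplier and adder delays in ns. Function names and the test signal are illustrative.

```python
def fir_direct(h, xs):
    """Direct form: delay line on the input, then multiply and sum all taps."""
    delay_line = [0] * len(h)              # holds x(n), x(n-1), ..., x(n-K)
    ys = []
    for x in xs:
        delay_line = [x] + delay_line[:-1]
        ys.append(sum(hi * xi for hi, xi in zip(h, delay_line)))
    return ys


def fir_transposed(h, xs):
    """Transposed form: every tap sees x(n); registers sit between the adders."""
    state = [0] * (len(h) - 1)             # one register between each adder pair
    ys = []
    for x in xs:
        products = [hi * x for hi in h]
        ys.append(products[0] + (state[0] if state else 0))
        # Each register captures its tap product plus the register behind it.
        state = [
            products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
            for i in range(len(state))
        ]
    return ys


def tclk_direct(taps, t_mult, t_add):
    """First cut critical path: one multiplier followed by the whole adder chain."""
    return t_mult + (taps - 1) * t_add


def tclk_transposed(t_mult, t_add):
    """First cut critical path: one multiplier followed by a single adder."""
    return t_mult + t_add


h = [1, 2, 3, 4]
xs = [1, 0, 0, 0, 5, -2, 7]
print(fir_direct(h, xs) == fir_transposed(h, xs))
print(tclk_direct(4, 10, 4), tclk_transposed(10, 4))
```

The transposed form keeps the same number of registers and the same input/output behavior, but its critical path no longer grows with the number of taps.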


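Pipelining, Just Another Transformation (Pipelining = Adding Delays + Retiming)

How to pipeline:
1. Add extra registers at all inputs
2. Retime

[Figure: a dataflow graph shown in three stages: the original, after "add input registers" (a delay D on every input), and after "retime" (the delays D moved into the graph)]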

Contrary to retiming, pipelining adds extra registers to the system


The Power of Transforms: Lookahead

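y(n) = x(n) + A y(n-1)
[Figure: first-order recursion with a single delay D in the feedback loop] Try pipelining this structure

loop unrolling:
y(n) = x(n) + A[x(n-1) + A y(n-2)]
[Figure: feedback loop with 2D and two multiplications by A; feedforward path with D and A]

distributivity:
y(n) = x(n) + A x(n-1) + A·A y(n-2)

associativity:
y(n) = x(n) + A x(n-1) + A^2 y(n-2), with A^2 precomputed

retiming:
[Figure: feedback loop with two delays D around the A^2 multiplication; feedforward path with D and A] How about pipelining this structure!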
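A small behavioral check of the lookahead transformation, assuming zero initial conditions and integer data so that the comparison is exact. Function names are illustrative.

```python
def recursion(a, xs):
    """Original structure: y(n) = x(n) + A*y(n-1), one delay in the loop."""
    y_prev = 0
    ys = []
    for x in xs:
        y = x + a * y_prev
        ys.append(y)
        y_prev = y
    return ys


def lookahead(a, xs):
    """Transformed structure: y(n) = x(n) + A*x(n-1) + A^2*y(n-2), two delays in the loop."""
    a_squared = a * a          # precomputed constant
    x_prev = 0                 # x(n-1)
    y_prev1 = 0                # y(n-1)
    y_prev2 = 0                # y(n-2)
    ys = []
    for x in xs:
        y = x + a * x_prev + a_squared * y_prev2
        ys.append(y)
        x_prev = x
        y_prev2, y_prev1 = y_prev1, y
    return ys


xs = [3, -1, 4, 1, -5, 9, 2, -6]
print(recursion(3, xs) == lookahead(3, xs))
```

The feedback path now contains two delays instead of one, so one of them can be retimed into the multiply-add to pipeline it, which is not possible in the original loop.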


Key Concern in Modern VLSI: Variations!

[Figure: Random Dopant Fluctuations: mean number of dopant atoms (log scale, 10 to 10000) versus technology node (1000, 500, 250, 130, 65, 32 nm)]

[Figure: Temp Variation & Hot spots]

[Figure: transistor cross-section with GATE, SOURCE, DRAIN and BODY, showing Tox, D and Leff]

[Figure: path delay probability versus delay, due to variations in Vdd, Vt, and Temp]

Deterministic design techniques inadequate in the future

With 100b transistors, 1b unusable (variations)

Courtesy of S. Borkar (Intel)


Trends: “Chip in a Day” (Matlab/Simulink to Silicon…)

[Figure: chip layout with blocks Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add, Sub, Shift]

Courtesy of R. Brodersen

Map algorithms directly to silicon - bypass writing Verilog!


Trends: Watermarking of Digital Designs

Fingerprinting is a technique to deter people from illegally redistributing legally obtained IP by enabling the author of the IP to uniquely identify the original buyer of the resold copy. The essence of the watermarking approach is to encode the author's signature. The selection, encoding, and embedding of the signature must result in minimal performance and storage overhead.

[Figure: two layouts with the same functionality, same area, same performance; watermark of 4768 bits embedded]

(courtesy of G. Qu, M. Potkonjak)

Presentation notes:
• Software piracy (www.nopiracy.com): 38% of all software used is illegally copied; $11 billion revenue loss in 1998
• Hardware piracy: $12 billion market on hardware IP by 2002; 100 reverse engineering labs (most not public)
• One of the key problems in SoC design




Processor Performance Trends

Processor performance follows Moore’s Law: it doubles every 2 years

[Figure: Intel microprocessor performance in MIPS (log scale, 0.01 to 10000) versus date of introduction, '71 to '00, for the 4004, 8080, 8086, 80286, 386, 486, Pentium, Pentium Pro, Pentium II, Xeon and Pentium III]


Clock Frequency of Microprocessors

[Figure: clock frequency (MHz, log scale, 0.1 to 10000) versus year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processor and P6. Annotation: doubles every 2 years. S. Borkar]


Power Consumption Trends

[Figure: power (watts, log scale, 0.1 to 100000) versus year, 1971 to 2008, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processor, with annotated points at 500 W, 1.5 kW, 5 kW and 18 kW. Courtesy S. Borkar, Intel]

Power per gate goes down but total power …


Interconnect Metallization

Six layers of Cu metallization

Lower layers are finer and are used for “local” interconnection between cells
Middle layers are wider and are used for global interconnection between blocks
Upper layers are wider and are used for clocks, ground and power distribution
Oxide is the Inter Metal Dielectric (etched)

