Source: ilp.mit.edu/images/conferences/2012/Japan/Stojanovic.pdf

Building Modern Integrated Systems:

A Cross-cut Approach (The Electrical, The Optical and The Mechanical)

Vladimir Stojanović

Integrated Systems Group

Massachusetts Institute of Technology

Acknowledgments

Devices: Tsu-Jae King Liu, Rajeev Ram, Miloš Popović, Henry Smith

Architecture: Krste Asanović, Christopher Batten, Ajay Joshi

Circuits: Elad Alon, Dejan Marković

Students:

Devices - Jason Orcutt, Anatoly Khilo, Jie Sun, Cheryl Sorace, Eugen Zgraggen, Jaeseok Jeon, Rhesa Nathanael, Hei Kam

Circuits – Michael Georgas, Jonathan Leu, Ben Moss, Chen Sun, Fred Chen, Byungsub Kim, Hossein Fariborzi, Matthew Spencer, Chengcheng Wang, Kevin Dwan

Architecture - Yong-Jin Kwon, Scott Beamer, Chen Sun, Imran Shamim

DARPA MTO

Texas Instruments – Dennis Buss and Tom Bonifield

IBM and Trusted Foundry

Intel Corporation – Ian Young and Alex Kern


Chip design is going through a change

“The Processor is the new Transistor” [Rowen]

Intel 4004 (1971):

4-bit processor,

2312 transistors,

~100 KIPS,

10 micron PMOS,

11 mm² chip

Sun Niagara 8 GPP cores (32 threads)

[Figure: Intel IXP2800 block diagram: Intel XScale core (32K IC, 32K DC), MEv2 microengines 1-16, Rbuf and Tbuf (64 @ 128B each), hash unit, 16KB scratch, QDR SRAM channels 1-4, RDRAM channels 1-3, PCI (64b, 66 MHz), SPI4 or CSIX interface, gasket, and CSRs (Fast_wr, UART, Timers, GPIO, BootROM/SlowPort)]

Intel Network Processor 1 GPP Core 16 ASPs (128 threads)

IBM Cell 1 GPP (2 threads) 8 ASPs

Picochip DSP 1 GPP core 248 ASPs

Cisco CSR-1 188 Tensilica GPPs

1000s of processor cores and accelerators per die [Asanovic]

We already have more devices than we can use at once

Limited by power density and bandwidth

Subthreshold leakage: Game over for CMOS

CMOS circuits have well-defined minimum energy

Caused by leakage and finite sub-threshold swing

Need to balance leakage and active energy

Limits energy-efficiency, regardless how slow the circuit runs
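The trade-off between leakage and active energy can be sketched with a first-order subthreshold model. This is only an illustration, not the model behind the plots in the talk: it assumes cycle time scales as exp(-Vdd/(n·φt)), so leakage energy per operation grows as Vdd drops while dynamic energy shrinks as Vdd². The activity factor, logic depth and slope factor `n` are hypothetical values chosen so the minimum falls in the 0.1-0.5 V range.

```python
import math

THERMAL_VOLTAGE = 0.026  # kT/q at room temperature, in volts


def energy_per_op(vdd, activity=0.1, logic_depth=100.0, n=1.5):
    """Normalized (total, dynamic, leakage) energy per operation.

    Assumed subthreshold model: the cycle time grows as exp(-vdd / (n * phi_t)),
    so every gate leaks for longer as the supply drops. All parameters are
    hypothetical and the energies are in units of C_eff * (1 V)^2.
    """
    slope = n * THERMAL_VOLTAGE
    e_dynamic = activity * vdd ** 2
    e_leak = logic_depth * vdd ** 2 * math.exp(-vdd / slope)
    return e_dynamic + e_leak, e_dynamic, e_leak


def minimum_energy_point(v_lo=0.10, v_hi=0.50, steps=400):
    """Grid search for the supply voltage that minimizes total energy/op."""
    grid = [v_lo + (v_hi - v_lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda v: energy_per_op(v)[0])


if __name__ == "__main__":
    v_min = minimum_energy_point()
    total, dyn, leak = energy_per_op(v_min)
    print(f"Vdd at minimum energy: {v_min:.3f} V")
    print(f"dynamic share: {dyn / total:.2f}, leakage share: {leak / total:.2f}")
```

Below the minimum, running slower costs more energy because leakage integrates over a longer cycle, which is the point of the last bullet above.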

[Figure: two panels. Energy/op vs. Vdd: normalized energy/cycle against Vdd (V), with Etotal, Edynamic and Eleak curves. Energy/op vs. 1/throughput: normalized energy/op against 1/throughput (ps/op) when scaling Vdd & VT]

Wire and I/O scaling

Increased wire resistivity makes wire caps scale very slowly

Can’t get both energy-efficiency and high-data rate in I/O

[Figure: two panels. On-chip wires: copper resistivity. I/O: energy-cost (pJ/b) vs. data-rate (Gb/s) for the best electrical links, chip-to-chip (loss ~10 dB) and backplane (loss ~20-25 dB)]

Opportunity for integrated system design

Energy-efficient computation and communication

CMOS – need cross-cut approach to keep scaling performance

[Figure: cross-cut design diagram linking circuits & logic (Tx, Rx, Ctrl, Meas), Cu interconnect and switch technology (MOSFET), circuit modeling and characterization, design optimization, network & µArchitecture, and communications (Eq., Mod, Coding). Inset plot: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for repeated wires and equalized wires with 30mV, 50mV and 90mV eyes. Inset schematic: clocked receiver with in+, in- and IPHOTO inputs]

Manycore SOC roadmap fuels bandwidth demand

64-tile system (64-256 cores)

- 4-way SIMD FMACs @ 2.5 – 5 GHz

- 5-10 TFlops on one chip

- Need 5-10 TB/s of off-chip I/O

- Even higher on-chip bandwidth
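The 5-10 TFlops and 5-10 TB/s figures can be reproduced with a short calculation. This is a sketch under two assumptions not stated on this slide: each FMAC counts as 2 floating-point operations per cycle, and off-chip traffic is 1 Byte/Flop (the ratio quoted on the following slide). It uses the upper end of the core count, 256 cores.

```python
def peak_tflops(cores, simd_width, freq_ghz, flops_per_fmac=2):
    """Peak throughput in TFlop/s; flops_per_fmac=2 is an assumption
    (one multiply plus one add per fused multiply-accumulate)."""
    return cores * simd_width * flops_per_fmac * freq_ghz / 1000.0


def offchip_tbytes_per_s(tflops, bytes_per_flop=1.0):
    """Off-chip bandwidth in TB/s for an assumed bytes-per-flop ratio."""
    return tflops * bytes_per_flop


if __name__ == "__main__":
    for freq in (2.5, 5.0):
        tf = peak_tflops(cores=256, simd_width=4, freq_ghz=freq)
        print(f"{freq} GHz: {tf:.2f} TFlop/s, {offchip_tbytes_per_s(tf):.2f} TB/s")
```

With 256 cores the two clock rates give about 5.1 and 10.2 TFlop/s, which matches the range on the slide.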

[Figure: die photo labeled "Intel 48 core -Xeon", 2 cm x 2 cm]

Bandwidth, pin count and power scaling

Need 16k pins in 2017 for HPC*

*> half pins for power supply

[Figure: package pin count scaling from 2,4 cores to 256 cores; annotations: 1 Byte/Flop, 2 TFlop/s, signal pins @ 20 Gb/s/link]
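A rough pin budget can be worked out from the numbers on this slide. The sketch below assumes differential signaling (2 pins per link) and that signal pins are exactly half of the package; the slide only says more than half the pins go to the power supply, so both values are assumptions, and the function names are illustrative.

```python
import math


def signal_links(tflops, bytes_per_flop=1.0, gbps_per_link=20.0):
    """Number of links needed to carry the off-chip traffic."""
    total_gbps = tflops * 1000.0 * bytes_per_flop * 8.0
    return math.ceil(total_gbps / gbps_per_link)


def package_pins(tflops, pins_per_link=2, signal_fraction=0.5):
    """Total package pins, assuming differential links and a fixed
    fraction of pins available for signals (both are assumptions)."""
    signal_pins = signal_links(tflops) * pins_per_link
    return math.ceil(signal_pins / signal_fraction)


if __name__ == "__main__":
    print(signal_links(2.0))    # links for 2 TFlop/s at 1 Byte/Flop
    print(package_pins(10.0))   # pins for 10 TFlop/s under these assumptions
```

Under these assumptions 2 TFlop/s needs 800 links, and 10 TFlop/s needs 16,000 pins, which is in line with the 16k figure, although the slide does not spell out how that number was derived.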

Emerging devices can help




Post-CMOS – need cross-cut approach to guide new devices/systems

[Figure: the cross-cut design diagram repeated, with Si-Photonics added next to Cu interconnect and switch technology]

Monolithic Si-Photonics for core-to-core and

core-to-DRAM networks


Supercomputers

Embedded apps

Si-photonics in advanced

CMOS and DRAM process

NO costly process changes

Bandwidth density – need dense WDM

Energy-efficiency – need monolithic integration

Many architectural studies show promise

[Shacham’07] [Petracca’08] [Vantrease’08] [Psota’07] [Kirman’06] [Joshi’09] [Pan’09] [Batten’08] [Beamer’10] [Koka’08-10]

Laser energy increases with data-rate

– Limited Rx sensitivity

– Modulation more expensive -> extinction ratio / insertion loss trade-off

Tuning costs decrease with data-rate

Moderate data rates most energy-efficient
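Why a moderate data rate wins can be illustrated with a toy energy model. This is a sketch with made-up coefficients, not the optimization from Georgas CICC 2011: tuning is treated as a fixed power amortized over the bit rate, laser energy per bit is assumed to grow linearly with data rate, and a constant circuit term is added.

```python
def link_energy_fj_per_bit(rate_gbps, tuning_uw=200.0,
                           laser_fj_per_gbps=8.0, circuit_fj=20.0):
    """Energy per bit in fJ/b for one wavelength channel.

    All three coefficients are hypothetical. Note that uW / (Gb/s) = fJ/b,
    so a fixed tuning power costs less per bit at higher data rates.
    """
    tuning = tuning_uw / rate_gbps
    laser = laser_fj_per_gbps * rate_gbps
    return circuit_fj + tuning + laser


def best_rate(rates_gbps):
    """Data rate with the lowest energy per bit among the candidates."""
    return min(rates_gbps, key=link_energy_fj_per_bit)


if __name__ == "__main__":
    for rate in (1, 2, 5, 10, 20):
        print(rate, "Gb/s:", round(link_energy_fj_per_bit(rate), 1), "fJ/b")
    print("most efficient candidate:", best_rate(range(1, 21)), "Gb/s")
```

With these coefficients the minimum sits where the tuning and laser terms are equal; real links would need measured laser, modulator and receiver characteristics in place of the linear laser term.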

[Figure: WDM link schematic with channels 1 to 4. Transmit side: register, mux, pre-driver, modulator driver. Receive side: receiver front-end, samplers & monitoring, demux, register. Clocking: PLL or optical clock with phase adjust]

512 Gb/s aggregate throughput, assuming 32nm CMOS [Georgas CICC 2011]

Optimize carefully – start at the link level

DWDM link efficiency optimization

Optimize for min energy-cost

Bandwidth density dominated by circuit and photonics area (not coupler pitch)

- 10x better than electrical bump limited

- 200x better than electrical package pin limit

Photonic DRAM Network Organization

Important Concepts

- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)

- Vertical die-to-die coupling (minimizes cabling - 8 dies per DRAM cube)

- Command distributed electrically (broadcast)

- Data photonic (single writer multiple readers)

[Figure: photonic DRAM network, Beamer ISCA 2010. The processor die holds the CPU, memory scheduler and memory controllers MC 1 to MC 16 (MC K); each controller drives a super DIMM (Super DIMM K) of DRAM cubes 1 to 4; each cube stacks dies 1 to 8 with cmd, Dwr and Drd connections and a die-die switch. Legend: laser in, modulator bank, receiver/PD bank, tunable filterbank, through silicon via, through silicon via hole]

Enables energy-efficient throughput and capacity scaling per memory channel

Optimizing DRAM with photonics

[Figure: floorplan, and laser power guiding effectiveness for configurations P1 to P4, Beamer ISCA 2010]

Enables capacity scaling per channel and significant savings in laser energy

Significant integration activity, but hybrid and older processes …

[Figure: integration examples from [Luxtera/Oracle/Kotura], [IBM], [HP], [Watts/Sandia/MIT], [Intel], [Lipson/Cornell], [Kimerling/MIT] and [Many schools]; process labels: 130nm thick BOX SOI (twice), bulk CMOS, backend monolithic]

Monolithic CMOS photonic integration

[Figure: optical mode. Photo credit: Intel]

Polysilicon - transistor gates, local interconnect and resistors

- Use for photonic components instead of, or together with, the silicon body in SOI

Sub-100nm lithography has 1-5 nm design grid

- Enables the low edge roughness necessary for photonic devices

65 nm bulk CMOS Texas Instruments

90 nm bulk CMOS IBM cmos9sf

45 nm SOI CMOS IBM 12SOIs0


32 nm bulk CMOS Texas Instruments

EOS Platform for Monolithic CMOS photonic integration

[Figure: transmission (dB) vs. frequency (GHz), comparing 2007 and 2011 results]

Create integration platform to accelerate technology development and adoption

Joint work with Ram and Popovic

A 32nm bulk CMOS photonic platform

Monolithic CMOS photonic platform integrated with CMOS circuits

32nm process – fabrication support from Texas Instruments

Robust post-processing steps at MIT

Second-order resonator filterbank shows process precision

Great on-die matching (rings track within 40 GHz)

Record thermal heating efficiency 25 µW/K

Orcutt et al – CLEO 2008, Optics Express 2011

Polysilicon and Silicon Photonics on Thin BOX IBM SOI

[Figure: WDM link schematic (register, mux, pre-driver, modulator driver, receiver front-end, samplers & monitoring, demux, register, PLL or optical clock with phase adjust)]

Electrical and photonic integration – test row

EOS: A 45nm SOI Monolithic Photonic Platform

6 rows of electronic-photonic WDM links with body and polysilicon photonic devices

54 transmit-receive test-sites, ~3M transistors and hundreds of photonic devices

Body and polysilicon photonic devices

Filterbanks, waveguide paperclips, rings, stand-alone modulators and photodetectors

Integration of photonics into VLSI tools

modulator.LEF

    VERSION 5.6 ;
    BUSBITCHARS "[]" ;
    DIVIDERCHAR "/" ;
    MACRO block_electronic_etch_row_1
      CLASS BLOCK ;
      ORIGIN -208 -1794 ;
      FOREIGN block_electronic_etch_row_1 208 1794 ;
      SIZE 2488 BY 165 ;
      SYMMETRY X Y R90 ;
      PIN heater_a_1
        DIRECTION INOUT ;
        USE SIGNAL ;
        PORT
          LAYER ua ;
          RECT 431 1870.5 436.5 1882 ;
        END
      END heater_a_1
      ...
      OBS
        LAYER m1 ;
        RECT 208 1794 2696 1959 ;
        ...
      END
    END block_electronic_etch_row_1
    END LIBRARY
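To show what a place-and-route tool sees of a photonic block, here is a minimal sketch that pulls the macro name, size and pin names out of a LEF abstract like the one above. It is a line-based reader for illustration only, not a full LEF parser, and `parse_macros` and `LEF_TEXT` are hypothetical names; the sample text is a trimmed copy of the slide's macro.

```python
LEF_TEXT = """
MACRO block_electronic_etch_row_1
  CLASS BLOCK ;
  ORIGIN -208 -1794 ;
  SIZE 2488 BY 165 ;
  PIN heater_a_1
    DIRECTION INOUT ;
    USE SIGNAL ;
    PORT
      LAYER ua ;
      RECT 431 1870.5 436.5 1882 ;
    END
  END heater_a_1
END block_electronic_etch_row_1
"""


def parse_macros(text):
    """Return {macro_name: {"size": (w, h), "pins": [names]}}.

    Only MACRO, SIZE, PIN and the closing END <macro> lines are interpreted;
    everything else in the abstract is skipped.
    """
    macros = {}
    current = None
    for raw in text.splitlines():
        tokens = raw.replace(";", " ").split()
        if not tokens:
            continue
        if tokens[0] == "MACRO":
            current = {"size": None, "pins": []}
            macros[tokens[1]] = current
        elif current is not None and tokens[0] == "SIZE":
            # LEF syntax: SIZE <width> BY <height>
            current["size"] = (float(tokens[1]), float(tokens[3]))
        elif current is not None and tokens[0] == "PIN":
            current["pins"].append(tokens[1])
        elif tokens[0] == "END" and len(tokens) > 1 and tokens[1] in macros:
            current = None  # end of this macro
    return macros


if __name__ == "__main__":
    for name, info in parse_macros(LEF_TEXT).items():
        print(name, info["size"], info["pins"])
```

The router needs only this outline, the pin shapes and the obstruction layers, which is why a photonic device row can be dropped into a digital flow as an ordinary block.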

[Figure: design flow. Layouts of photonics (photonic device p-cells) and of circuit blocks are abstracted to LEF. Together with the LEF of standard cells and I/O pads (provided by ARM), chip-level Verilog (instantiation of .LEF macros and connectivity), technology files and the floorplan (macro placement, power grid, routing constraints), they feed SOC Encounter place and route. The placed-and-routed layout then receives custom photonics-friendly auto-fill]

Platform Organization

A full electro-optical test setup

[Figure: test setup with DUT chip board, HS clocks, FPGA control board, two fiber positioners, microscope, and USB link to laptop]

Extremely good dimensional tolerances in 45nm SOI

Good body waveguide loss: 3.7 dB/cm at ~1220 nm

Integrated Delta-Sigma Heat Control

Tuning efficiency 2.6 mW/nm (32.4 mW/2π)

On fully substrate-removed die

~10 mW required to retune all 8 rings

Thermal tuning BW lower than 500 kHz

Tuning control overhead negligible
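The idea behind delta-sigma heater control can be sketched as a first-order modulator that turns a target duty cycle into a 1-bit on/off stream; because the thermal bandwidth is far below the switching rate, the ring presumably sees only the average power. This is an assumed, simplified model rather than the circuit on the chip. The 2.6 mW/nm tuning efficiency is from the slide, while the heater full-scale power is a hypothetical value.

```python
TUNING_MW_PER_NM = 2.6        # tuning efficiency quoted on the slide
HEATER_FULL_SCALE_MW = 5.2    # hypothetical power when the heater is fully on


def delta_sigma_bits(duty, n_cycles):
    """First-order delta-sigma modulator: returns a list of 0/1 heater states
    whose running average tracks `duty` (a value between 0 and 1)."""
    acc = 0.0
    bits = []
    for _ in range(n_cycles):
        acc += duty
        if acc >= 1.0:
            bits.append(1)   # heater on for this cycle
            acc -= 1.0
        else:
            bits.append(0)   # heater off for this cycle
    return bits


def average_shift_nm(duty, n_cycles=1000):
    """Resonance shift produced by the time-averaged heater power."""
    bits = delta_sigma_bits(duty, n_cycles)
    avg_power_mw = HEATER_FULL_SCALE_MW * sum(bits) / len(bits)
    return avg_power_mw / TUNING_MW_PER_NM


if __name__ == "__main__":
    print(delta_sigma_bits(0.25, 8))
    print(average_shift_nm(0.25), "nm")
```

A purely digital 1-bit driver keeps the control circuit small, which is consistent with the slide's note that tuning control overhead is negligible.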


Current-sensing optical data receiver

Georgas ESSCIRC 2011

Receiver detects photo current

50 fJ/b, µA sensitivities, 3-5 Gb/s
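For scale, the energy per bit converts to receiver power as follows. The sketch only multiplies the two numbers quoted above (fJ/b times Gb/s gives µW) and assumes the 50 fJ/b figure holds across the whole 3-5 Gb/s range.

```python
def receiver_power_uw(energy_fj_per_bit, rate_gbps):
    """Receiver power in microwatts: fJ/b * Gb/s = uW."""
    return energy_fj_per_bit * rate_gbps


if __name__ == "__main__":
    for rate in (3, 5):
        print(rate, "Gb/s:", receiver_power_uw(50, rate), "uW")
```

That puts a single receiver in the range of a few hundred microwatts.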

Modulator test site

• Extinction ratio 9 dB at 1280 nm

• 60 GHz 3 dB bandwidth

• Carrier lifetime ~2-3 ns

• Requires flexible drive circuits

  - Sub-bit pre-emphasis

  - Split-supplies

Silicon carrier injection modulator monolithically integrated with transistors

First dynamic electro-optic test in 45nm SOI

[Figure: modulator driver and modulator]

Transistors and Photonics can be built together in advanced CMOS!

Modulation data-rate up to 1Gb/s

5-10 Gb/s achievable with device and biasing optimization

Lots of room to improve circuit/device designs

Power and pins required for 10TFlop/s

[Figure: total memory channel power (W, 0 to 1600) vs. number of socket pins required for memory channels (log scale, 100 to 100000), at 80 Tb/s sustained bandwidth assuming 1 B/Flop. Points: Mobile LPDDR2-1066, Mobile LPDDRX-1666, Mobile LPDDRX 2017, DDR3-1333 4GB, DDR4-2667 8GB, GDDR5, HMC-Gen1, HMC-Gen2, POEM Phase 1, POEM Phase 2, POEM Post-phase 2. Group labels: HMC, LPDDR, POEM, PIM, DDR4, GDDR5]

Improving computation efficiency




Post-CMOS – need cross-cut approach to guide new devices/systems

[Figure: the cross-cut design diagram repeated, with the NEMS relay added next to the MOSFET as a switch technology]

Nano-electro-mechanical (NEM) relays

Nearly ideal switching characteristics:

- Low on-state resistance (Ron < 1 kΩ)

- Infinite off-state resistance

- Zero off-state leakage

[Figure: relay schematic and A-A’ cross-section showing body, gate, gate oxide, channel, source and drain; dimension labels as extracted: 30mm, 27.5mm, 90nm]

Joint work with T-J. King Liu, E. Alon and D. Markovic (UCB, UCLA)


Why not use relays to compute?

- Need to compare at block level -

Delay Comparison vs. CMOS

Single mechanical delay vs. several electrical gate delays

For reasonable load, NEMS delay unaffected by fan-out/fan-in

Area Comparison vs. CMOS

Larger individual devices

But often need fewer devices to implement same function
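The block-level argument can be illustrated with a relay network whose gates are all driven directly by primary inputs, so every relay moves at the same time and the output settles after roughly one mechanical delay. The sketch below models the carry-out of a full adder as series/parallel switch paths. It is a behavioral illustration with hypothetical helper names, not a circuit from Chen et al.

```python
def series(*switches):
    """A series chain conducts only if every relay in it is closed."""
    return all(switches)


def parallel(*branches):
    """Parallel branches conduct if any one of them conducts."""
    return any(branches)


def carry_relay_network(a, b, cin):
    """Carry-out of a full adder as one stage of relay paths.

    Each relay is actuated by a primary input (or its complement), so no
    relay waits on another relay's output. Exactly one of the two networks
    conducts for any input combination.
    """
    pull_up = parallel(series(a, b), series(cin, parallel(a, b)))
    pull_down = parallel(series(not a, not b),
                         series(not cin, parallel(not a, not b)))
    assert pull_up != pull_down  # output is never floating or shorted
    return int(pull_up)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, "->", carry_relay_network(a, b, cin))
```

A static CMOS version of the same function would typically pass through more than one gate stage, whereas here the electrical path through the stacked relays is fast compared with the single mechanical switching event.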

[Figure: CMOS implementation, 4 gate delays; NEMS implementation, 12 relays, 1 mechanical delay]

F. Chen et al., “Integrated Circuit Design with NEM Relays,” ICCAD 2008

Scaled NEMS vs. CMOS adders

Compare vs. Sklansky CMOS adder* in 90nm technology

For similar area: >9x lower E/op, >10x greater delay

Scaled relays limited by contact surface energy

- 2aJ for 90nm litho – 50x better than 90nm CMOS

30x less capacitance

- Lower device Cg, Cd

- Fewer devices

2.4x lower Vdd

No leakage energy

[Figure: energy/op vs. delay/op across Vdd, annotated with the 9x energy and 10x delay gaps]

*D. Patil et al., “Robust Energy-Efficient Adder Topologies,” in Proc. 18th IEEE Symp. on Computer Arithmetic (ARITH'07).

Contact resistance

- Feedback from system level -

Low contact R not critical

Good news for reliability…

Can build test-platforms that work

[Figure: energy/op vs. delay/op across Vdd & CL]

CLICKR technology development platform:

NEM relay-based circuits ISSCC 2010 – TD Award


F. Chen et al, ISSCC2010

M. Spencer et al, JSSC Jan’11

Towards more complex designs

[Figure: energy/op (fJ) vs. delay (ns), log-log, for Scaled MEM Relay, OTCT (90nm), Dadda/HC (45nm) and 16X Parallel. Die photo with outputs Y2, Y1, Y0; dimension labels as extracted: 700μm, 8mm]

Multiplier building block: 7:3 compressor

98 relays – largest working relay circuit to date
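For reference, a 7:3 compressor counts how many of its 7 input bits are set and reports the count as a 3-bit number. The sketch below is a behavioral model of that function only; it says nothing about how the 98 relays are wired, and the name `compressor_7_3` is illustrative.

```python
def compressor_7_3(bits):
    """Return (Y2, Y1, Y0), the binary count of ones among 7 input bits."""
    if len(bits) != 7 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected exactly 7 bits, each 0 or 1")
    count = sum(bits)
    return (count >> 2) & 1, (count >> 1) & 1, count & 1


if __name__ == "__main__":
    print(compressor_7_3([1, 0, 1, 1, 0, 0, 1]))  # four ones set
```

In a multiplier, compressors like this reduce columns of partial-product bits to a few weighted outputs before the final addition.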

[Figure: relay schematics of the 7:3 compressor with inputs A0 to A6 (input code), generate and kill paths, and the Y2 and Y0 output networks; panels (a)-(d) and (a)-(c)]

Energy-benefit preserved even in more complex functions

16-bit multipliers

Fariborzi ASSCC 2011

Verilog-A model and Logic Synthesis created for NEMS technology

The flow supports multiple device designs and foundries

NEM Relay VLSI design infrastructure

[Figure: design flow blocks: device, Verilog-A model, schematic, Spectre, layout P-cell, DRC, LVS, Verilog, logic synthesis, place & route]

Toward full systems - NEM Relay scaling

1um litho

Scaled Relay size

20um x 20um

Sematech

Relay size

120um x 150um

0.25um litho


Microcontroller Test-Chip

- 64x8b scratchpad

- 64x18b program memory

- 32x10b program stack

- 2 x 72 I/O pads

- Instruction decode

- Register file + ALU

- Control logic

12k relays

9mm x 6mm (using 85um x 53um devices)
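A quick area check ties these numbers together. The sketch takes "12k relays" as exactly 12,000, which is an assumption since the slide gives a rounded count, and multiplies by the quoted device footprint.

```python
def relay_footprint_mm2(n_relays, width_um, height_um):
    """Total area covered by the relay devices, in mm^2."""
    return n_relays * width_um * height_um / 1e6


def die_area_mm2(width_mm, height_mm):
    """Die area in mm^2."""
    return width_mm * height_mm


if __name__ == "__main__":
    print(relay_footprint_mm2(12_000, 85, 53), "mm^2 of relays")
    print(die_area_mm2(9, 6), "mm^2 die")
```

The relay footprints alone come to about 54 mm², essentially the whole 9mm x 6mm die, so at this device size chip area is set almost entirely by the relays, which is the motivation for the scaled devices on the previous slide.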

Summary

Cross-layer modeling and design key to continued system performance scaling

- Fast design-space exploration

- Feedback to all layers of design hierarchy

Building early technology development platforms

- Feedback to device and circuit designers

- Accelerated adoption

EOS Platform designed for multi-project wafer runs

- 50 fJ/b receivers with µA sensitivities

- Record-high tuning efficiency with undercut ~ 25 µW/K

- First modulation demonstrated in 45nm process

CLICKR Platform designed for multiple foundries and devices

- Energy-gains preserved for larger blocks

- Designs moving toward scaled devices and full VLSI systems