Building Modern Integrated Systems:
A Cross-cut Approach (The Electrical, The Optical and The Mechanical)
Vladimir Stojanović
Integrated Systems Group
Massachusetts Institute of Technology
Acknowledgments
Devices: Tsu-Jae King Liu, Rajeev Ram, Miloš Popović, Henry Smith
Architecture: Krste Asanović, Christopher Batten, Ajay Joshi
Circuits: Elad Alon, Dejan Marković
Students:
Devices - Jason Orcutt, Anatoly Khilo, Jie Sun, Cheryl Sorace, Eugen Zgraggen, Jaeseok Jeon, Rhesa Nathanael, Hei Kam
Circuits – Michael Georgas, Jonathan Leu, Ben Moss, Chen Sun, Fred Chen, Byungsub Kim, Hossein Fariborzi, Matthew Spencer, Chengcheng Wang, Kevin Dwan
Architecture - Yong-Jin Kwon, Scott Beamer, Chen Sun, Imran Shamim
DARPA MTO
Texas Instruments – Dennis Buss and Tom Bonifield
IBM and Trusted Foundry
Intel Corporation – Ian Young and Alex Kern
Chip design is going through a change
“The Processor is the new Transistor” [Rowen]
Intel 4004 (1971):
4-bit processor,
2312 transistors,
~100 KIPS,
10 micron PMOS,
11 mm2 chip
Sun Niagara 8 GPP cores (32 threads)
[Figure: Intel IXP2800 block diagram: Intel XScale core (32K IC, 32K DC), 16 MEv2 microengines, Rbuf and Tbuf (64 @ 128B each), hash unit (48/64/128), 16KB scratch, four QDR SRAM channels, three RDRAM channels, PCI (64b, 66 MHz), SPI-4 or CSIX interface, and CSRs (Fast_wr, UART, Timers, GPIO, BootROM/SlowPort)]
Intel Network Processor 1 GPP Core 16 ASPs (128 threads)
IBM Cell 1 GPP (2 threads) 8 ASPs
Picochip DSP 1 GPP core 248 ASPs
Cisco CSR-1 188 Tensilica GPPs
1000s of processor cores and accelerators per die [Asanović]
Already have more devices than can use at once
Limited by power density and bandwidth
Subthreshold leakage: Game over for CMOS
CMOS circuits have well-defined minimum energy
Caused by leakage and finite sub-threshold swing
Need to balance leakage and active energy
Limits energy-efficiency, regardless how slow the circuit runs
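The minimum-energy argument above can be illustrated with a small numerical sketch. It is not the model behind the slide's plots: it assumes a simple subthreshold model in which the on/off current ratio is exp(Vdd / (n·vT)), so leakage energy per operation grows as the circuit slows down. The constants `N_VT` and `LEAK_RATIO` are hypothetical values chosen only to make the trade-off visible.

```python
import math

# Assumed constants (not from the slides).
N_VT = 1.5 * 0.026    # subthreshold slope factor n times thermal voltage, in volts
LEAK_RATIO = 2000.0   # hypothetical: (leaking gates x logic depth) / activity factor


def energy_per_op(vdd):
    """Return (dynamic, leakage, total) energy per op, normalized to C_eff * (1 V)^2."""
    e_dyn = vdd ** 2
    # Leakage energy = I_off * Vdd * cycle time, with cycle time ~ C * Vdd / I_on
    # and I_on / I_off = exp(Vdd / (n * vT)) in subthreshold operation.
    e_leak = LEAK_RATIO * vdd ** 2 * math.exp(-vdd / N_VT)
    return e_dyn, e_leak, e_dyn + e_leak


def min_energy_vdd(v_lo=0.10, v_hi=0.50, steps=401):
    """Sweep Vdd on a uniform grid and return the supply with the lowest total energy."""
    grid = [v_lo + i * (v_hi - v_lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda v: energy_per_op(v)[2])


if __name__ == "__main__":
    v_opt = min_energy_vdd()
    e_dyn, e_leak, e_tot = energy_per_op(v_opt)
    print(f"minimum-energy Vdd: {v_opt:.3f} V")
    print(f"dynamic {e_dyn:.4f}, leakage {e_leak:.4f}, total {e_tot:.4f} (normalized)")
```

Below the minimum-energy supply the leakage term dominates because each operation takes exponentially longer; above it the dynamic term dominates. Running slower than the minimum-energy point therefore costs energy instead of saving it.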
[Figure, two panels. Energy/op vs. Vdd: normalized energy/cycle against Vdd (V), showing Etotal, Edynamic, and Eleak. Energy/op vs. 1/throughput: normalized energy/op against 1/throughput (ps/op), scaling Vdd & VT.]
Wire and I/O scaling
Increased wire resistivity makes wire caps scale very slowly
Can’t get both energy-efficiency and high-data rate in I/O
[Figure, two panels. On-chip wires: copper resistivity. I/O: energy cost (pJ/b) vs. data rate (Gb/s) for the best electrical links, chip-to-chip (loss ~10 dB) and backplane (loss ~20-25 dB).]
Opportunity for integrated system design
Energy-efficient computation and communication
CMOS – need cross-cut
approach to keep scaling
performance
[Figure: cross-cut design stack: Cu interconnect and MOSFET switch technology; circuit modeling and characterization; circuits & logic (Tx, Rx, Ctrl, Meas); communications (Eq., Mod, Coding); network & µArchitecture; design optimization. Inset plot: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for repeated and equalized (30 mV, 50 mV, 90 mV eye) interconnect.]
Manycore SOC roadmap fuels bandwidth demand
64-tile system (64-256 cores)
- 4-way SIMD FMACs @ 2.5 – 5 GHz
- 5-10 TFlops on one chip
- Need 5-10 TB/s of off-chip I/O
- Even higher on-chip bandwidth
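The throughput figures above follow from simple arithmetic, sketched below. Two assumptions are not stated on the slide: each FMAC counts as 2 floating-point operations per cycle, and the off-chip requirement is taken at a balance of 1 byte per flop. The function names are illustrative only.

```python
def peak_tflops(cores, simd_width, clock_ghz, flops_per_fmac=2):
    """Peak throughput in TFlop/s, assuming one SIMD FMAC unit per core."""
    return cores * simd_width * flops_per_fmac * clock_ghz / 1000.0


def offchip_tbytes_per_s(tflops, bytes_per_flop=1.0):
    """Off-chip bandwidth in TB/s for an assumed bytes-per-flop balance."""
    return tflops * bytes_per_flop


if __name__ == "__main__":
    for ghz in (2.5, 5.0):
        tf = peak_tflops(cores=256, simd_width=4, clock_ghz=ghz)
        print(f"256 cores @ {ghz} GHz: {tf:.2f} TFlop/s, "
              f"{offchip_tbytes_per_s(tf):.2f} TB/s off-chip")
```

With 256 cores this gives 5.12 TFlop/s at 2.5 GHz and 10.24 TFlop/s at 5 GHz, which matches the 5-10 TFlops range on the slide under these assumptions.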
[Figure: 2 cm x 2 cm die; Intel 48 core -Xeon]
Bandwidth, pin count and power scaling
Need 16k pins in 2017 for HPC*
[Figure: package pin count trend, from 2 and 4 cores to 256 cores; annotations: 1 Byte/Flop, 2 TFlop/s, signal pins @ 20 Gb/s/link]
*> half of pins for power supply
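A rough pin-count estimate can be sketched from the numbers on this slide. The slide does not show its own derivation, so the following are assumptions: links are differential (2 pins per link), bandwidth is sized at 1 byte per flop, and exactly half of the package pins carry signals (the footnote says more than half go to power, so this is a lower bound on the total).

```python
import math


def signal_pins(tflops, gbps_per_link=20.0, bytes_per_flop=1.0, pins_per_link=2):
    """Signal pins needed to sustain the given compute rate off-chip."""
    tbps = tflops * bytes_per_flop * 8             # off-chip bandwidth in Tb/s
    links = math.ceil(tbps * 1000.0 / gbps_per_link)
    return links * pins_per_link


def package_pins(tflops, signal_fraction=0.5, **link_args):
    """Total package pins if only `signal_fraction` of them carry signals."""
    return math.ceil(signal_pins(tflops, **link_args) / signal_fraction)


if __name__ == "__main__":
    for tf in (2.0, 10.0):
        print(f"{tf:>4} TFlop/s: {signal_pins(tf)} signal pins, "
              f"{package_pins(tf)} package pins")
```

Under these assumptions a 2 TFlop/s part needs 1600 signal pins (3200 package pins) and a 10 TFlop/s part needs 8000 signal pins (16000 package pins), which is in line with the 16k figure quoted above.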
Emerging devices can help
Energy-efficient computation and communication
CMOS – need cross-cut
approach to keep scaling
performance
Post-CMOS – need cross-cut
approach to guide new
devices/systems
[Figure: cross-cut design stack, now with Si-photonics added alongside Cu interconnect and MOSFET switch technology; circuit modeling and characterization; circuits & logic (Tx, Rx, Ctrl, Meas); communications (Eq., Mod, Coding); network & µArchitecture; design optimization. Inset plot: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for repeated and equalized (30 mV, 50 mV, 90 mV eye) interconnect.]
Monolithic Si-Photonics for core-to-core and
core-to-DRAM networks
Supercomputers
Embedded apps
Si-photonics in advanced
CMOS and DRAM process
NO costly process changes
Bandwidth density – need dense WDM
Energy-efficiency – need monolithic integration
Many architectural studies show promise
[Shacham’07]
[Petracca’08]
[Vantrease’08]
[Psota’07]
[Kirman’06]
[Joshi’09]
[Pan’09]
[Batten’08] [Beamer’10] [Koka’08-10]
Laser energy increases with data-rate
– Limited Rx sensitivity
– Modulation more expensive -> extinction ratio / insertion loss trade-off
Tuning costs decrease with data-rate
Moderate data rates most energy-efficient
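The trade-off in the bullets above can be sketched with a toy per-wavelength energy model. This is an illustration, not the optimization from the cited work: tuning is treated as a fixed power amortized over the bit rate, laser energy per bit is assumed to grow linearly with data rate, and all three default parameter values are hypothetical.

```python
def energy_per_bit_pj(rate_gbps, tuning_mw=1.25, laser_pj_per_gbps=0.05,
                      circuit_pj=0.1):
    """Energy per bit in pJ for one wavelength channel (toy model)."""
    tuning = tuning_mw / rate_gbps            # mW / (Gb/s) = pJ/b, falls with rate
    laser = laser_pj_per_gbps * rate_gbps     # assumed to rise linearly with rate
    return tuning + laser + circuit_pj


def best_rate(rates_gbps):
    """Return the candidate data rate with the lowest energy per bit."""
    return min(rates_gbps, key=energy_per_bit_pj)


if __name__ == "__main__":
    rates = range(1, 31)
    r = best_rate(rates)
    print(f"most efficient rate: {r} Gb/s at {energy_per_bit_pj(r):.3f} pJ/b")
```

With these placeholder values the minimum falls at 5 Gb/s per wavelength. The exact location depends entirely on the assumed parameters; the point is only that a falling tuning term and a rising laser term produce an interior optimum.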
[Figure: electronic-photonic link block diagram, shown twice. Transmit side: register, mux, pre-driver, modulator driver. Receive side: receiver front-end, samplers & monitoring, demux, register. Clocking: PLL or optical clock with phase adjust.]
512 Gb/s aggregate throughput
assuming 32nm CMOS
Georgas CICC 2011
Optimize carefully – start at the link level
DWDM link efficiency optimization
Optimize for min energy-cost
Bandwidth density dominated by circuit and photonics area (not coupler pitch)
– 10x better than electrical bump limit
– 200x better than electrical package pin limit
Photonic DRAM Network Organization
Important Concepts
- Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)
- Vertical die-to-die coupling (minimizes cabling - 8 dies per DRAM cube)
- Command distributed electrically (broadcast)
- Data photonic (single writer, multiple readers)
[Figure: photonic DRAM network. Processor die (CPU) with memory scheduler and memory controllers MC 1 ... MC K ... MC 16; each controller drives a super DIMM (Super DIMM 1 ... K) of DRAM cubes 1-4; within a cube, cmd, Dwr, and Drd reach dies 1-8 through a die-die switch. Legend: laser in, modulator bank, receiver/PD bank, tunable filterbank, through silicon via, through silicon via hole. Beamer ISCA 2010]
Enables energy-efficient throughput and capacity scaling per memory channel
Optimizing DRAM with photonics
[Figure: floorplan (P1, P4). Beamer ISCA 2010]
[Figure: laser power guiding effectiveness. Beamer ISCA 2010]
Enables capacity scaling per channel and significant savings in laser energy
Significant integration activity,
but hybrid and older processes …
[Luxtera/Oracle/Kotura] [IBM]
[HP]
[Watts/Sandia/MIT]
[Intel]
130nm
thick BOX SOI
130nm
thick BOX SOI
Bulk CMOS
Backend
monolithic
[Lipson/Cornell]
[Kimerling/MIT]
[Many schools]
Optical Mode
Monolithic CMOS photonic integration
Photo credit: Intel
Polysilicon - transistor gates, local interconnect and resistors
Use for photonic components instead or with silicon body in SOI
Sub-100nm lithography has 1-5 nm design grid
Enables the edge-roughness control necessary for photonic devices
65 nm bulk CMOS Texas Instruments
90 nm bulk CMOS IBM cmos9sf
45 nm SOI CMOS IBM 12SOIs0
32 nm bulk CMOS Texas Instruments
EOS Platform for Monolithic CMOS
photonic integration
[Figure: filterbank transmission (dB) vs. frequency (GHz), comparing 2007 and 2011 results]
Create integration platform to accelerate
technology development and adoption
Joint work with Ram and Popovic
A 32nm bulk CMOS photonic platform
Monolithic CMOS photonic platform integrated with CMOS circuits
32nm process – fabrication support from Texas Instruments
Robust post-processing steps at MIT
Second-order resonator filterbank shows process precision
Great on-die matching (rings track within 40GHz)
Record thermal heating efficiency 25uW/K
Orcutt et al. – CLEO 2008, Optics Express 2011
Polysilicon and Silicon Photonics on Thin BOX IBM SOI
[Figure: electronic-photonic link block diagram. Transmit side: register, mux, pre-driver, modulator driver. Receive side: receiver front-end, samplers & monitoring, demux, register. Clocking: PLL or optical clock with phase adjust.]
Electrical and photonic integration – test row
EOS: A 45nm SOI Monolithic Photonic Platform
6 rows of electronic-photonic WDM links with body and polysilicon photonic devices
54 transmit-receive test sites, ~3M transistors and hundreds of photonic devices
Body and polysilicon photonic devices
Filterbanks, waveguide paperclips, rings, stand-alone modulators and photodetectors
Integration of photonics into VLSI tools
VERSION 5.6 ;
BUSBITCHARS "[]" ;
DIVIDERCHAR "/" ;
MACRO block_electronic_etch_row_1
  CLASS BLOCK ;
  ORIGIN -208 -1794 ;
  FOREIGN block_electronic_etch_row_1 208 1794 ;
  SIZE 2488 BY 165 ;
  SYMMETRY X Y R90 ;
  PIN heater_a_1
    DIRECTION INOUT ;
    USE SIGNAL ;
    PORT
      LAYER ua ;
        RECT 431 1870.5 436.5 1882 ;
    END
  END heater_a_1
  ...
  OBS
    LAYER m1 ;
      RECT 208 1794 2696 1959 ;
    ...
  END
END block_electronic_etch_row_1
END LIBRARY
modulator.LEF
[Figure: design flow]
- Layout of photonics -> abstract -> LEF
- Layout of circuit blocks -> abstract -> LEF
- LEF of standard cells, I/O pads (provided by ARM)
- Chip-level Verilog (instantiation of .LEF macros and connectivity)
- Technology files
- Floorplan (macro placement, power grid, routing constraints)
- SOC Encounter place and route -> placed-and-routed layout
- Photonic device p-cell abstract
- Custom photonics-friendly auto-fill -> layout
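As an illustration of how a photonic block can be abstracted for place and route, the sketch below emits a LEF MACRO in the same field layout as the modulator.LEF excerpt. It is not the authors' tool: the function name `lef_macro` and its argument format are hypothetical, it covers only the statements that appear in the excerpt, and it blocks the whole macro footprint on one obstruction layer.

```python
def lef_macro(name, origin, size, pins, obs_layer):
    """Return LEF text for one BLOCK macro.

    origin:    (x, y) as written in the ORIGIN statement
    size:      (width, height)
    pins:      list of (pin_name, layer, (x1, y1, x2, y2))
    obs_layer: layer blocked over the whole macro footprint
    """
    ox, oy = origin
    w, h = size
    x0, y0 = -ox, -oy   # lower-left corner, as in the FOREIGN and OBS lines of the excerpt
    out = [
        f"MACRO {name}",
        "  CLASS BLOCK ;",
        f"  ORIGIN {ox:g} {oy:g} ;",
        f"  FOREIGN {name} {x0:g} {y0:g} ;",
        f"  SIZE {w:g} BY {h:g} ;",
        "  SYMMETRY X Y R90 ;",
    ]
    for pin, layer, (x1, y1, x2, y2) in pins:
        out += [
            f"  PIN {pin}",
            "    DIRECTION INOUT ;",
            "    USE SIGNAL ;",
            "    PORT",
            f"      LAYER {layer} ;",
            f"        RECT {x1:g} {y1:g} {x2:g} {y2:g} ;",
            "    END",
            f"  END {pin}",
        ]
    out += [
        "  OBS",
        f"    LAYER {obs_layer} ;",
        f"      RECT {x0:g} {y0:g} {x0 + w:g} {y0 + h:g} ;",
        "  END",
        f"END {name}",
    ]
    return "\n".join(out)


if __name__ == "__main__":
    print(lef_macro(
        "block_electronic_etch_row_1",
        origin=(-208, -1794),
        size=(2488, 165),
        pins=[("heater_a_1", "ua", (431, 1870.5, 436.5, 1882))],
        obs_layer="m1",
    ))
```

Called with the values from the excerpt, the generator reproduces its SIZE, pin RECT, and obstruction RECT lines. A real flow would also write the file header and `END LIBRARY`, and would take pin shapes from the photonic p-cell abstract.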
Platform Organization
A full electro-optical test setup
[Figure: test setup with DUT chip board, HS clocks, FPGA control board, two fiber positioners, microscope, and USB link to laptop]
Extremely good dimensional tolerances
in 45nm SOI
Good body waveguide loss
3.7dB/cm at ~1220nm
Integrated Delta-Sigma Heat Control
Tuning efficiency 2.6 mW/nm (32.4 mW/2π) on fully substrate-removed die
~10 mW required to retune all 8 rings
Thermal tuning BW lower than 500 kHz
Tuning control overhead negligible
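A minimal sketch of how delta-sigma control can set average heater power with a 1-bit drive follows. The slides do not give the modulator order or the drive scheme, so a first-order modulator with an on/off heater is assumed here. The 5.2 mW full-scale power and the filter coefficient `alpha` are hypothetical; only the 2.6 mW/nm tuning efficiency comes from the slide.

```python
def delta_sigma_bits(target, n):
    """First-order delta-sigma: 1-bit stream whose mean approaches target in [0, 1]."""
    acc = 0.0
    bits = []
    for _ in range(n):
        acc += target                 # integrate the requested duty
        bit = 1 if acc >= 1.0 else 0  # 1-bit quantizer
        acc -= bit                    # feed the output back into the integrator
        bits.append(bit)
    return bits


def thermal_lowpass(bits, full_scale_mw, alpha=0.01):
    """Single-pole low-pass standing in for the slow thermal response of the ring."""
    p = 0.0
    trace = []
    for b in bits:
        p += alpha * (b * full_scale_mw - p)
        trace.append(p)
    return trace


def wavelength_shift_nm(power_mw, mw_per_nm=2.6):
    """Resonance shift for a given average heater power (2.6 mW/nm from the slide)."""
    return power_mw / mw_per_nm


if __name__ == "__main__":
    bits = delta_sigma_bits(target=0.25, n=1000)
    settled_mw = thermal_lowpass(bits, full_scale_mw=5.2)[-1]
    print(f"duty {sum(bits) / len(bits):.3f}, filtered power {settled_mw:.2f} mW, "
          f"shift {wavelength_shift_nm(settled_mw):.2f} nm")
```

Because the thermal bandwidth is far below the bit-stream rate, the ring responds to the average of the stream, so a purely digital driver can set heater power finely without an analog DAC.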
Current-sensing optical data receiver
Georgas ESSCIRC 2011
Receiver detects photo current
50 fJ/b, µA sensitivities, 3-5 Gb/s
Modulator test site
• Extinction ratio 9dB at 1280nm
• 60GHz 3dB bandwidth
• Carrier lifetime ~2-3ns
• Requires flexible drive circuits
• Sub-bit pre-emphasis
• Split-supplies
Silicon carrier injection modulator monolithically integrated with transistors
First dynamic electro-optic test in 45nm SOI
Modulator Driver
Modulator
Transistors and photonics can be built together in advanced CMOS!
Modulation data-rate up to 1Gb/s
5-10 Gb/s achievable with device and biasing optimization
Lots of room to improve circuit/device designs
Power and pins required for 10TFlop/s
[Figure: total memory channel power [W] vs. number of socket pins required for memory channels, at 80 Tb/s sustained bandwidth (assuming 1 B/Flop). Technologies plotted: Mobile LPDDR2-1066, Mobile LPDDRX-1666, Mobile LPDDRX 2017, DDR3-1333 4GB, DDR4-2667 8GB, GDDR5, HMC-Gen1, HMC-Gen2, POEM Phase 1, POEM Phase 2, POEM Post-phase 2. Group labels: LPDDR, DDR4, GDDR5, HMC, PIM, POEM.]
Improving computation efficiency
Energy-efficient computation and communication
CMOS – need cross-cut
approach to keep scaling
performance
Post-CMOS – need cross-cut
approach to guide new
devices/systems
[Figure: cross-cut design stack, now with NEMS relay added alongside Si-photonics, Cu interconnect, and MOSFET switch technology; circuit modeling and characterization; circuits & logic (Tx, Rx, Ctrl, Meas); communications (Eq., Mod, Coding); network & µArchitecture; design optimization. Inset plot: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for repeated and equalized (30 mV, 50 mV, 90 mV eye) interconnect.]
Nearly ideal switching characteristics:
- Low on-state resistance (Ron < 1 kΩ)
- Infinite off-state resistance
- Zero off-state leakage
Nano-electro-mechanical (NEM) relays
[Figure: relay schematic and A-A' cross-section showing body, gate, gate oxide, channel, source, and drain; dimension labels 30mm, 27.5mm, and 90nm]
Joint work with T-J. King Liu, E. Alon and D. Markovic (UCB, UCLA)
Why not use relays to compute?
- Need to compare at block level -
Delay Comparison vs. CMOS
Single mechanical delay vs. several electrical gate delays
For reasonable load, NEMS delay unaffected by fan-out/fan-in
Area Comparison vs. CMOS
Larger individual devices
But often need fewer devices to implement same function
[Figure: example block in CMOS (4 gate delays) and in NEMS (1 mechanical delay, 12 relays)]
F. Chen et al., “Integrated Circuit Design with NEM Relays,” ICCAD 2008
Scaled NEMS vs. CMOS adders
For similar area: >9x lower E/op, >10x greater delay
Scaled relays limited by contact surface energy
- 2aJ for 90nm litho – 50x better than 90nm CMOS
*D. Patil et al., “Robust Energy-Efficient Adder Topologies,” in Proc. 18th IEEE Symp. on Computer Arithmetic (ARITH'07).
[Figure: energy/op vs. delay/op across Vdd, with the 9x energy and 10x delay gaps marked]
30x less capacitance
Lower device Cg, Cd
Fewer devices
2.4x lower Vdd
No leakage energy
Compare vs. Sklansky CMOS adder*, 90nm technology
Contact resistance
- Feedback from system level -
Low contact R not critical
Good news for reliability…
Can build test platforms that work
[Figure: energy/op vs. delay/op across Vdd & CL]
CLICKR technology development platform:
NEM relay-based circuits ISSCC 2010 – TD Award
F. Chen et al, ISSCC2010
M. Spencer et al, JSSC Jan’11
Towards more complex designs
[Figure: energy/op (fJ) vs. delay (ns), both on log axes, for scaled MEM relay, OTCT (90nm), Dadda/HC (45nm), and 16X parallel designs. Die photo with outputs Y2, Y1, Y0; dimension labels 700μm and 8mm.]
Multiplier building block: 7:3 compressor
98 relays – largest working relay circuit to date
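For readers unfamiliar with the block, a 7:3 compressor outputs the 3-bit binary count of how many of its seven inputs are 1. The sketch below is a behavioral model only; it says nothing about the 98-relay implementation. The input and output names A0-A6 and Y2, Y1, Y0 follow the figure labels, and the function name is illustrative.

```python
from itertools import product


def compress_7_3(bits):
    """Return (Y2, Y1, Y0): the binary count of ones among seven inputs A0..A6."""
    if len(bits) != 7 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected seven input bits, each 0 or 1")
    count = sum(bits)
    return (count >> 2) & 1, (count >> 1) & 1, count & 1


if __name__ == "__main__":
    # Exhaustive check over all 128 input codes.
    for code in product((0, 1), repeat=7):
        y2, y1, y0 = compress_7_3(code)
        assert 4 * y2 + 2 * y1 + y0 == sum(code)
    print(compress_7_3((1, 0, 1, 1, 0, 0, 1)))   # four ones -> (1, 0, 0)
```

In a multiplier, such compressors reduce columns of partial-product bits before the final carry-propagate addition.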
[Figure: 7:3 compressor relay schematics, panels (a)-(d): input code A0-A6, generate and kill networks, outputs Y2 and Y0]
Energy-benefit preserved even in more complex functions
16-bit multipliers
Fariborzi ASSCC 2011
Verilog-A model and Logic Synthesis created for NEMS technology
The flow supports multiple device designs and foundries
NEM Relay VLSI design infrastructure
[Figure: design flow linking the device and its Verilog-A model to schematic (Spectre), P-cell layout, DRC, LVS, Verilog, logic synthesis, and place & route; example relay gate with inputs A, B and output Vout]
Toward full systems - NEM Relay scaling
1um litho
Scaled Relay size
20um x 20um
Sematech
Relay size
120um x 150um
0.25um litho
Microcontroller Test-Chip
64x8b scratchpad
64x18b program memory
32x10b program stack
2 x 72 I/O pads
Instruction decode
Register file + ALU
Control logic
12k relays
9mm x 6mm (using 85um x 53um devices)
Summary
Cross-layer modeling and design key to continued system performance scaling
- Fast design-space exploration
- Feedback to all layers of design hierarchy
Building early technology development platforms
- Feedback to device and circuit designers
- Accelerated adoption
EOS Platform designed for multi-project wafer runs
- 50 fJ/b receivers with uA sensitivities
- Record-high tuning efficiency with undercut ~ 25uW/K
- First modulation demonstrated in 45nm process
CLICKR Platform designed for multiple foundries and devices
- Energy-gains preserved for larger blocks
- Designs moving toward scaled devices and full VLSI systems