Major Challenges to Achieve Exascale Performance · 1 Major Challenges to Achieve Exascale...
Transcript of Major Challenges to Achieve Exascale Performance · 1 Major Challenges to Achieve Exascale...
1
Major Challenges to Achieve
Exascale Performance
Shekhar Borkar
Intel Corp.
April 29, 2009
Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO)
2
Outline
Exascale performance goals
Major challenges
Potential solutions
Paradigm shift
Summary
3
Performance Roadmap
1.E-04
1.E-02
1.E+00
1.E+02
1.E+04
1.E+06
1.E+08
1960 1970 1980 1990 2000 2010 2020
GF
LO
P
MFLOP
GFLOP
TFLOP
PFLOP
EFLOP
12 Years 11 Years 10 Years
4
From Giga to Exa, via Tera & Peta
1
10
100
1000
1986 1996 2006 2016
Re
lati
ve
Tr
Pe
rfo
rma
nc
e
G
Tera
Peta
30X 250X
0.001
0.01
0.1
1
1986 1996 2006 2016
Re
lati
ve
En
erg
y/O
p G
Tera
Peta
5V
Vcc scaling
1.E+00
1.E+02
1.E+04
1.E+06
1.E+08
1986 1996 2006 2016
G
Tera
Peta
36X
Exa
4,000X
Concurrency
2.5M X
Transistor Performance
1.E+00
1.E+02
1.E+04
1.E+06
1.E+08
1986 1996 2006 2016
G
Tera
Peta
80X
Exa
4,000X
1M X
Power
5
Building with Today’s Technology
200pJ per FLOP200W
150W
100W
100W
4450W
Decode and controlTranslations…etcPower supply lossesCooling…etc
5KW
Compute
Memory
Com
Disk
TFLOP Machine today
10TB disk @ 1TB/disk @10W
0.1B/FLOP @ 1.5nJ per Byte
100pJ com per FLOP
KW Tera, MW Peta, GW Exa?
6
The Power & Energy Challenge
200W
150W
100W
100W
4550W
5KW
Compute
Memory
Com
Disk
TFLOP Machine today
5W2W
~5W~3W5W
TFLOP Machine thenWith Exa Technology
~20W
7
1.5mm
2.0
mm
FPMAC0
Router
IMEM
DMEMRF
RIB
CLK
FPMAC1
MSINT
Glo
ba
l clk
sp
ine
+ c
lk b
uff
ers
1.5mm
2.0
mm
2.0
mm
FPMAC0
Router
IMEM
DMEMRF
RIB
CLK
FPMAC1
FPMAC0
Router
IMEM
DMEMRF
RIB
CLK
FPMAC1
MSINT
Glo
ba
l clk
sp
ine
+ c
lk b
uff
ers
Starting Point: Optimistic yet Realistic2
1.7
2m
m
12.64mmI/O Area
I/O Area
PLL
single tile
1.5mm
2.0mm
TAP
21
.72
mm
12.64mmI/O Area
I/O Area
PLL
single tile
1.5mm
2.0mm
TAP
100 MillionTransistors
275mm2Die Area
3mm2Tile area
1248 pin LGA, 14 layers, 343 signal pins
Package
1 poly, 8 metal (Cu)Interconnect
65nm CMOS ProcessTechnology
100 MillionTransistors
275mm2Die Area
3mm2Tile area
1248 pin LGA, 14 layers, 343 signal pins
Package
1 poly, 8 metal (Cu)Interconnect
65nm CMOS ProcessTechnology
80 Core TFLOP Chip
8
Scaling Assumptions
Technology
(High Volume)
45nm
(2008)
32nm
(2010)
22nm
(2012)
16nm
(2014)
11nm
(2016)
8nm
(2018)
5nm
(2020)
Transistor density 1.75 1.75 1.75 1.75 1.75 1.75 1.75
Frequency scaling 15% 10% 8% 5% 4% 3% 2%
Vdd scaling -10% -7.5% -5% -2.5% -1.5% -1% -0.5%
Dimension & Capacitance 0.75 0.75 0.75 0.75 0.75 0.75 0.75
SD Leakage scaling/micron 1X Optimistic to 1.43X Pessimistic
65nm Core + Local Memory
Memory 0.35MB
5mm2 (50%)
DP FP Add, MultiplyInteger Core, RF
Router
5mm2 (50%)
10mm2, 3GHz, 6GF, 1.8W
8nm Core + Local Memory
Memory 0.35MB0.17mm2 (50%)
DP FP Add, MultiplyInteger Core, RF
Router0.17mm2 (50%)
0.34mm2, 4.6GHz, 9.2GF, 0.24 to 0.46W
~0.6mm
9
Processor Chip
2018, 8nm technology node20mm
400mm2
20mm
Cores/Module 1150
Total Local Memory 400 MB
Frequency 4.61 GHz
Peak performance 10.6 TF
Power 300 - 600W
Energy efficiency 34 - 18 GF/Watt
30-60 MW for Exascale
0
5000
10000
15000
20000
65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm
Ch
ip P
erf
orm
an
ce (
GF
)
GFLOPs
0
100
200
300
400
500
65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm
Ch
ip P
ow
er
(W)
Power(W)
10
Processor Node
128 GB 128 GB
256GB/s 64b
128 GB 128 GB
Peak performance 10.6 TF
Total DRAM Capacity 512GB
Total DRAM BW 1TB/s (0.1B/FLOP)
DRAM Power 800 W*
Total Power 1100 - 1400W
Energy efficiency 9.5 - 8 GF/Watt
*Assumes 5% Vdd scaling each technology generation140 pJ energy consumed per accessed bit
110-140 MW for Exascale
11
Node Power Breakdown
DRAM
Fabric
Compute
10 TF, ~ 1KWAggressive voltage scaling
Hierarchical heterogeneous topologies
Efficient signalingRepartitioning
12
Voltage Scaling
0
0.2
0.4
0.6
0.8
1
0.3 0.5 0.7 0.9
Vdd (Normal)
No
rma
lize
d
0
2
4
6
8
10
Freq
Total Power
Leakage
Energy Efficiency
When designed to voltage scale
13
Energy Efficiency with Vdd Scaling
0
20
40
60
80
100
120
140
160
65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm
En
erg
y E
ffic
ien
cy (
GF
/W)
Vdd
0.7x
0.5x
~3X Compute energy efficiency with Vdd Scaling
14
On-die Mesh Interconnect
On-die network (mesh) power is highWorse if link width scales up each generation
20mm45nm
20mm32nm
20mm22nm
20mm16nm
70 Cores 123 Cores 214 Cores 375 Cores
0
100
200
300
400
500
65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm
Ch
ip P
ow
er
(W)
Network
Compute
15
Mesh—RetrospectiveBus: Good at board level, does not extend well
• Transmission line issues: loss and signal integrity, limited frequency
• Width is limited by pins and board area
• Broadcast, simple to implement
Point to point busses: fast signaling over longer distance
• Board level, between boards, and racks
• High frequency, narrow links
• 1D Ring, 2D Mesh and Torus to reduce latency
• Higher complexity and latency in each node
Hence, emergence of packet switched network
But, pt-to-pt packet switched network on a chip?
16
Interconnect Delay & Energy
1
10
100
1000
10000
0 5 10 15 20
Length (mm)
Dela
y (
ps)
0
0.5
1
1.5
2
pJ/B
it
65nm, 3GHz
Router Delay
17
Bus—The Other Extreme…Issues:Slow, < 300MHzShared, limited scalability?
Solutions:Repeaters to increase freqWide busses for bandwidthMultiple busses for scalability
Benefits:Power?Simpler cache coherency
Move away from frequency, embrace parallelism
18
Hierarchical & Heterogeneous
Bus
C C
C C
Bus to connect over short distances
Bus
C C
C C
Bus
C C
C C
Bus
C C
C C
Bus
C C
C C
2nd Level Bus
Hierarchy of BussesOr hierarchical circuit and packet switched networks
R R
R R
19
Revise DRAM Architecture
Page Page Page
RAS
CAS
Activates many pagesLots of reads and writes (refresh)Small amount of read data is usedRequires small number of pins
Traditional DRAM New DRAM architecture
Page Page PageAddr
Addr
Activates few pagesRead and write (refresh) what is neededAll read data is usedRequires large number of IO’s (3D)
Energy cost today:~175 pJ/bit
Signaling
DRAM
Array
M Control
20
Data Locality
Core-to-coreCommunication on the chip:~10pJ per Byte
Chip to memoryCommunication:~1.5nJ per Byte~150pJ per Byte
Chip to chipCommunication:~100pJ per Byte
Data movement is expensive—keep it local(1) Core to core, (2) Chip-to-chip, (3) Memory
21
Impact of Exploding Parallelism
100
150
200
250
300
350
400
450
65nm 45nm 32nm 22nm 16nm 11nm 8nm 5nm
Millio
n C
ore
s/E
FL
OP
1x Vdd
0.7x Vdd
0.5x Vdd
Almost flat because Vdd close to Vt
4X increase in the number of cores (Parallelism)
Increased communication and related energy
Increased HW, and unreliability
1. Strike a balance between Com & Computation2. Resiliency (Gradual, Intermittent, Permanent faults)
22
Road to Unreliability?
From Peta to Exa Reliability Issues
1,000X parallelism More hardware for something to go wrong
>1,000X intermittent faults due to soft errors
Aggressive Vcc scaling
to reduce power/energy
Gradual faults due to increased variations
More susceptible to Vcc droops (noise)
More susceptible to dynamic temp variations
Exacerbates intermittent faults—soft errors
Deeply scaled
technologies
Aging related faults
Lack of burn-in?
Variability increases dramatically
Resiliency will be the corner-stone
23
ResiliencyFaults Example
Permanent faults Stuck-at 0 & 1
Gradual faults Variability
Temperature
Intermittent faults Soft errors
Voltage droops
Aging faults Degradation
Faults cause errors (data & control)
Datapath errors Detected by parity/ECC
Silent data corruption Need HW hooks
Control errors Control lost (Blue screen)
Minimal overhead for resiliency
Circuit & Design
Microarchitecture
Microcode, Platform
Programming system
Applications Error detectionFault isolationFault confinementReconfigurationRecovery & Adapt
System Software
24
Needs a Paradigm Shift
Evaluate each (old) architecture feature with new priorities
Single thread performance Frequency
Programming productivity Legacy, compatibility
Architecture features for productivity
Constraints (1) Cost
(2) Reasonable Power/Energy
Throughput performance Parallelism
Power/Energy Architecture features for energy
Simplicity
Constraints (1) Programming productivity
(2) Cost
Past and present priorities—
Future priorities—
25
Summary
Von-Neumann computing & CMOS technology
(nothing else in sight)
Voltage scaling to reduce power and energy
• Explodes parallelism
• Cost of communication vs computation—critical balance
• Resiliency to combat side-effects and unreliability
Programming system for extreme parallelism
System software to harmonize all of the above