Hardware Evolution Trends of Extreme Scale Computing
TeraFlop Embedded
PetaFlop Departmental
ExaFlop Data Center
Peter M. Kogge
McCourtney Chair in Computer Science & Engineering, Univ. of Notre Dame
IBM Fellow (retired)
CalTech AyBi 199b, April 26, 2011
Topics
• What Is Exascale
• Moore’s Law, Multicore, & Energy 101
• HPC Computing: How We Got To Today
• The DARPA Exascale Study
• Exascale Strawmen
• A Deep Dive into Memory Energies
• Rockster: The Problem Isn’t Just at the High End
Materials presented here are based on an 18-month DARPA-sponsored study on "Exascale Computing" and later studies
Background-Computing
• Classical "Measures of Performance"
  Flop/s: floating point operations per second
  Bytes per Flop/s: memory capacity per unit of performance
    Classical goal: 1 byte per flop/s
  Bytes/s per Flop/s: memory bandwidth per unit of performance
    Classical goal: 1 byte/s per flop/s
• Last 30 years of HPC: focus on flop/s
  1980s: "Gigaflop" (10^9/s) first in a single vector processor
  1994: "Teraflop" (10^12/s) first via thousands of microprocessors
  2009: "Petaflop" (10^15/s) first via several hundred thousand cores
• Let’s call technology to reach “xyz flop” as “xyz scale”
Background-Technology
• Commercial technology to date:
  Riding Moore's Law to denser, "faster" chips
  Always shrunk prior "XXX"-scale systems to smaller form factors
  Shrink, with clock speedup, enabled the next "XXX" scale
• But microprocessor clock rates flat-lined in 2004
  Driven by power dissipation limits
• "Exascale" now on the horizon (Exaflop = 10^18/s)
  But technology change is radically changing our path to get there
  Especially Energy/Power
Exascale
• Exascale = 1,000X the capability of today
• Exascale != Exaflops, but:
  Exascale at the data center size => Exaflops
  Exascale at the "rack" size => Petaflops for departmental systems
  Exascale embedded => Teraflops in a cube
• It took us 14+ years to get from the 1st Petaflops workshop (1994), through NSF studies, HTMT, HPCS, ... to Petaflops in 2009
• Study questions: Can we ride silicon to Exa? What will such systems look like? Where are the challenges?
For Those of You Who Want an Exaflop…
You should be OVERJOYED if all you will need is:
• JUST a Million cores
• ONLY 1 Nuclear Power Plant
• MINIMAL programming support
Moore's Law, Energy 101, & Multicore
Today’s CMOS Silicon Structures
[Figure: cross-sections of a single transistor, multiple layers of wire, and a trench capacitor for memory, showing the cell plate, capacitor insulator, storage-node poly, field oxide, and Si substrate]
"Feature Size:" minimum linear dimension
• Width of transistor gate
• ½ pitch between two wires at the lowest levels of metal
Moore's Law
• 1965, Gordon Moore: the number of transistors that can be integrated on a die would double every 18 to 24 months
  Equivalent to feature size decreasing by a factor of √2
  Memory density 4X every 3 years
[Chart: logic transistor density (T/cm²) and DRAM chip capacity (Kb) vs. year, 1980–2025, growing 4X every 3 years, annotated with feature sizes from 2.4 µm down to 0.07 µm and with capacity milestones from a page of text, through 2 hrs of CD audio and 30 sec of HDTV, up to an encyclopedia, human DNA, and human memory]
Smaller Feature Sizes Also Enable Higher Clock Rates
[Chart: historical and ITRS-projected max clock rate (12 inverters, MHz) and feature size vs. year, 1975–2020; clock rates flatten at "The 3 GHz Wall" around 2004]
Why the Clock Flattening? POWER
[Charts: watts per die and watts per square cm, 1976–2006; power density climbs past a light bulb and an iron toward a rocket nozzle. Hot, Hot, Hot!]
CMOS Energy 101
[Figure: a CMOS gate driving a load capacitance C through an effective resistance R, with voltage waveforms for Vin and the gate output]
• Charging the load dissipates CV²/2 in the resistance and stores CV²/2 on the capacitance
• Discharging dissipates the stored CV²/2
• One clock cycle therefore dissipates C·V²
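To make the C·V² arithmetic concrete, here is a minimal sketch; the switched capacitance, supply voltage, and clock rate are illustrative assumptions, not numbers from the slides:

```c
/* Dynamic switching energy and power from E = C*V^2 per clock cycle.
   All numeric values below are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    double C = 1e-15;   /* switched capacitance: 1 fF (assumed) */
    double V = 1.0;     /* supply voltage: 1.0 V (assumed) */
    double f = 3e9;     /* clock rate: 3 GHz (assumed) */

    double e_cycle = C * V * V;    /* energy per full charge/discharge cycle */
    double p_dyn   = e_cycle * f;  /* dynamic power if switched every cycle */

    printf("energy/cycle = %.3g J, dynamic power = %.3g W\n", e_cycle, p_dyn);
    return 0;
}
```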
How Did CV² Improve With Time?
• Assume the capacitance of a circuit scales as feature size
[Charts: ITRS feature size and Vdd vs. year (1985–2025) for high-performance logic, low-operating-power logic, and DRAM processes; and CV² relative to 90 nm high-performance logic, roughly a 330X improvement before 90 nm but only about 15X after. 90 nm is picked as the breakpoint because that is when Vdd, and thus clocks, flattened]
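As a rough illustration of that scaling assumption (energy per cycle ∝ feature size × Vdd²), the sketch below normalizes CV² to a 90 nm point; the (feature size, Vdd) pairs are round illustrative numbers, not the ITRS values plotted on the slide:

```c
/* CV^2 scaling sketch: capacitance taken proportional to feature size F,
   so relative energy per cycle is F * Vdd^2 normalized to 90 nm.
   The (F, Vdd) pairs are assumed example points, not slide data. */
#include <stdio.h>

int main(void) {
    double F[]   = { 500, 250, 130, 90, 65, 45, 32 };   /* nm (assumed) */
    double Vdd[] = { 5.0, 2.5, 1.5, 1.2, 1.1, 1.0, 0.9 }; /* V (assumed) */
    double ref   = 90 * 1.2 * 1.2;                       /* 90 nm reference */

    for (int i = 0; i < 7; i++) {
        double cv2 = F[i] * Vdd[i] * Vdd[i];
        printf("%3.0f nm @ %.1f V: CV^2 = %6.2fx relative to 90 nm\n",
               F[i], Vdd[i], cv2 / ref);
    }
    return 0;   /* most of the gain comes before 90 nm, little after */
}
```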
Loss of Clock Increase Left Brute Parallelism as the Only Performance Option
[Figure: "the same straw" — up to ~2002, a single core drawing from memory; now, many cores sharing the same memory straw]
The Multi-core Architectural Spectrum
[Figure: cores, caches/memories, and interconnect arranged three ways]
(a) Hierarchical designs: Intel Core Duo, IBM Power5, AMD Opteron, SUN Niagara, …
(b) Pipelined designs: IBM Cell, most router chips, many video chips
(c) Array designs: Tilera, Terasys, Execube, Yukon, Intel Teraflop
And Then There's "Heterogeneous" Multi-Core
NVIDIA GeForce Heterogeneous Architecture (not shown: explicit graphics engines)
[Figure: a conventional computer (host processor + host memory), a.k.a. the "Host," attached to the GPU, a.k.a. the "Device": a kernel scheduler feeds many multithreaded Streaming Multiprocessors (SMs), each with an instruction unit, stream processors, and local memory, all sharing L2 & memory channel interfaces to the graphics memory chips, a.k.a. the "Device Global Memory"]
Die Photos
[Die photos: AMD 12-core (hierarchical), IBM Execube (tiled), Tile64 (tiled), Nvidia (heterogeneous), IBM Cell (pipelined/heterogeneous)]
Modern High Performance SuperComputers
A Modern "Heavyweight" HPC System
[Pie charts, approximate breakdowns:
• Silicon area: Memory 86%, Random 8%, Processors 3%, Routers 3%
• Power: Processors 56%, Routers 33%, Memory 9%, Random 2%
• Board area: White space 50%, Processors 24%, Memory 10%, Routers 8%, Random 8%]
A "Lightweight" Node Alternative
• 2 nodes per "Compute Card." Each node: a low-power compute chip, some memory chips, "nothing else"
• Compute chip: 2 simple dual-issue cores, each with dual FPUs; memory controller; large eDRAM L3; 3D message interface; collective interface; all at a sub-GHz clock
• System architecture: multiple identical boards per rack; each board holds multiple Compute Cards; "nothing else"
"Packaging the Blue Gene/L supercomputer," IBM J. R&D, March/May 2005
"Blue Gene/L compute chip: Synthesis, timing, and physical design," IBM J. R&D, March/May 2005
Heterogeneous Systems
• Mix of heavyweight masters and GPU compute engines
• GPUs: a hierarchy of cores: host microprocessors, streaming multiprocessors, multiple thread engines, specialized function engines
http://www.nvidia.com/object/fermi_architecture.html
Where Are We Now? A Tour through TOP500 Land
Top500 & The LinPack Benchmark
• TOP500: www.top500.org
  Ranking of the top 500 supercomputers in the world
  Updated every 6 months
• Benchmark used: LinPack
  Solves a dense system of linear equations
  Assumes LU matrix factorization with partial pivoting
  Key parameter: floating point ops/second
• LinPack parameters:
  Rpeak: maximum possible # of flops per second
  Rmax: max observed flops/sec for any size LinPack
  Nmax: size of problem when achieving Rmax
  N1/2: problem size when ½ of Rmax is achieved
• Key kernel: DAXPY: Y[i] = Y[i] + αX[i] (a minimal sketch follows)
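For reference, a minimal sketch of the DAXPY kernel named above; this is the generic BLAS-1 operation, not the actual HPL/LinPack source:

```c
/* DAXPY sketch: y[i] = y[i] + alpha * x[i]. */
#include <stddef.h>

void daxpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] += alpha * x[i];   /* one multiply + one add = 2 flops per element */
    }
}
```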
Overall Metrics
[Chart: Rpeak, Rmax, and memory capacity of TOP500 systems vs. year, 1992–2016, on a log scale, with trend lines for each]
What Kind of Systems Are These?
[Chart: Rpeak vs. year, 2004–2012, by system class (heavyweight, lightweight, heterogeneous), annotated with the 1st BlueGene and with Roadrunner, the 1st heterogeneous system]
Flop Efficiency
[Chart: Rmax/Rpeak vs. year, 1987–2007, by system class (heavyweight, lightweight, heterogeneous)]
But 80% is NOT typical of more complex applications
Bytes per Flop (Peak)
[Chart: Bytes/Flop (relative to Rpeak) vs. year, 1987–2007, by system class]
Systems are getting less memory rich
Bytes per Flop (Rmax)
[Chart: Bytes/Flop (relative to Rmax) vs. year, 1987–2007, by system class]
Even on sustained performance, heterogeneous systems are memory poor
How Has "Processor Count" Changed?
[Chart: total socket and core counts vs. year, 1992–2012, on a log scale]
What About Clock Rate?
[Chart: clock rate (MHz) vs. year, 1992–2012, by system class (heavyweight, lightweight, heterogeneous)]
And Thread Level Concurrency?
[Chart: thread-level concurrency, TLC (flops/core/cycle), vs. year, 1987–2007, by system class]
Finally, What About Total Concurrency?
[Chart: total concurrency (TC) vs. year, 1992–2012, on a log scale, by system class]
Total Power
[Chart: total system power (MW) vs. year, 1987–2007, by system class]
Energy per Flop
[Chart: energy per flop (pJ) vs. year, 1987–2007, by system class, against a historical CMOS projection]
The DARPA Exascale Study
http://www.darpa.mil/tcto/docs/ExaScale_Study_Initial.pdf
The Exascale Study
Study directive: can we ride silicon to Exa by 2015?
1. Baseline today's: commodity technology, architectures, performance (Linpack)
2. Articulate scaling of potential application classes
3. Extrapolate roadmaps for: "mainstream" technologies; possible offshoots of mainstream technologies; alternative and emerging technologies
4. Use technology roadmaps to extrapolate use in "strawman" designs
5. Analyze results & identify "Challenges"
Architectures Considered
• Evolutionary strawmen
  "Heavyweight" strawman based on commodity-derived microprocessors
  "Lightweight" strawman based on custom microprocessors
• "Aggressive" strawman: "clean sheet of paper" CMOS silicon
In the process of adding Heterogeneous
1 EFlop/s "Clean Sheet of Paper" Strawman
Sizing done by "balancing" power budgets with achievable capabilities:
• 4 FPUs + RegFiles per core (= 6 GF @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s); 213 MB of L1 I&D; 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes/rack; 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores; 680 MILLION FPUs; 3.6 PB = 0.0036 bytes/flop
• 68 MW with aggressive assumptions
Largely due to Bill Dally, Stanford (a back-of-envelope check of this sizing appears below)
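A back-of-envelope check of the strawman sizing, using only the per-core, per-chip, and per-rack numbers quoted above; the assumption that each FPU retires one flop per cycle is mine:

```c
/* Strawman sizing check: all counts come from the slide; the arithmetic
   (and the 1 flop/FPU/cycle assumption) is the only thing added. */
#include <stdio.h>

int main(void) {
    double clock_hz   = 1.5e9;            /* 1.5 GHz */
    double flops_core = 4 * clock_hz;     /* 4 FPUs/core -> 6 GF (assumed 1 flop/FPU/cycle) */
    double cores_chip = 742;
    double nodes_grp  = 12, grps_rack = 32, racks = 583;

    double flops_chip = flops_core * cores_chip;            /* ~4.45 TF */
    double flops_rack = flops_chip * nodes_grp * grps_rack; /* ~1.7 PF  */
    double flops_sys  = flops_rack * racks;                 /* ~1 EF    */
    double cores_sys  = cores_chip * nodes_grp * grps_rack * racks;
    double mem_bytes  = 16e9 * nodes_grp * grps_rack * racks; /* 16 GB per node */

    printf("chip %.2f TF, rack %.2f PF, system %.3f EF\n",
           flops_chip / 1e12, flops_rack / 1e15, flops_sys / 1e18);
    printf("cores %.0f M, memory %.2f PB, %.4f bytes/flop\n",
           cores_sys / 1e6, mem_bytes / 1e15, mem_bytes / flops_sys);
    return 0;
}
```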
A Single Node (No Router)
[Figure: a processor chip with "stacked" memory, via (a) quilt packaging or (b) through-via chip stack]
Characteristics: 742 cores; 4 FPUs/core; 16 GB DRAM; 290 Watts; 1.08 TF peak @ 1.5 GHz; ~3000 flops per cycle
Node power breakdown: Leakage 28%, FPU 21%, Cache access 20%, Off-chip 13%, Reg file 11%, On-chip 6%, DRAM access 1%
Strawman Power Distribution
[Pie charts: (a) overall system power, split among processors, memory, interconnect, router/network, and disk; (b) memory power, split among DRAM, L1, L2/L3, register file, and leakage; (c) interconnect power]
• Assumed a 60X per-bit decrease in access energy
• Growing DRAM capacity 30X at least doubles overall memory power
• 12 nodes per group; 32 groups per rack; 583 racks
• 1 EFlops / 3.6 PB; 166 million cores; 67 MWatts
Data Center Performance Projections
[Chart: GFlops vs. year, 2000–2020; projections made in mid-2007 for heavyweight and lightweight evolutionary paths, with Exascale marked at 10^9 GFlops around 2020]
But not at 20 MW!
Data Center Core Parallelism
[Chart: processor parallelism vs. year, 1996–2020: historical, Top 10, top system, Exa strawman, evolutionary light node, evolutionary heavy node; the Exa strawman reaches ~170 million cores]
AND we will need 10-100X more threading for latency management
Data Center Total Concurrency
[Chart: total concurrency vs. year, 1996–2020: historical, Top 10, top system, Top 1 trend, Exa strawman, evolutionary light and heavy nodes; rising from thousand-way through million-way to billion-way concurrency]
"Business as Usual" Energy Projections
[Chart: energy per flop (pJ) vs. year, 1980–2020: historical, Top 10, Green 500, top-system trend line, CMOS technology, UHPC targets (57 KW, 50 GF/W, 80 GF/W), and Exa heavyweight/lightweight optimistic and pessimistic projections; note the "bend" when voltage stopped dropping]
But At What Memory Capacity?
[Chart: bytes per flop-per-second vs. year, 1993–2015, Top 10 systems vs. the Exa strawman, with a 300 TB/Petaflop reference level. How many useful apps are down here?]
Key Take-Aways
• Developing Exascale systems will be really tough: in any time frame, for any of the 3 classes
• Evolutionary progression is at best 2020ish, with limited memory
• 4 key challenge areas: Power!!!!!!!!, Concurrency, Memory Capacity, Resiliency
• Will require coordinated, cross-disciplinary efforts
A Deep Dive into Interconnect to Deliver Operands
Assumed Technology Numbers
Operation                    Energy (pJ/bit)
Register file access         0.16
SRAM access                  0.23
DRAM access                  1
Thru-silicon vias (TSV)      0.011
On-chip movement             0.0187
Chip-to-board                2
Chip-to-optical              10
Router on-chip               2
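To get a feel for these numbers at operand granularity, the sketch below scales each per-bit figure to a 64-bit word; the 64-bit word width is an assumption, not something stated in the table:

```c
/* Per-operand energy from the per-bit figures above,
   assuming a 64-bit operand (the word width is an assumption). */
#include <stdio.h>

int main(void) {
    const char  *op[] = { "RegFile", "SRAM", "DRAM", "TSV",
                          "On-chip wire", "Chip-to-board", "Chip-to-optical", "Router" };
    const double pj_per_bit[] = { 0.16, 0.23, 1.0, 0.011, 0.0187, 2.0, 10.0, 2.0 };

    for (int i = 0; i < 8; i++)
        printf("%-16s %7.2f pJ per 64-bit word\n", op[i], 64.0 * pj_per_bit[i]);
    return 0;
}
```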
Aggressive Strawman Access Path: Interconnect-Driven
[Figure: a core on one microprocessor generates a request, which travels across the chip, through one or more routers, to the node holding the actual memory (some sort of memory structure), and back]
Tracking the Energy
Access        Total pJ   RF    Router  Tag   32KB SRAM  DRAM   On-chip  TSV   Chip-to-board  Chip-to-optical
L1 hit              39   30%      0%   28%        42%     0%       0%    0%             0%               0%
L2/L3 hit          385    7%      0%    6%        34%     0%      53%    0%             0%               0%
Local            1,380    1%      0%    2%         2%    68%      26%    1%             0%               0%
Global          13,819    0%     24%    0%         0%     1%      13%    0%            10%              52%
Ave read           706    2%     19%    2%         4%     8%      16%    0%             8%              41%
Write to L1         39   30%      0%   28%        42%     0%       0%    0%             0%               0%
Flush L1           891    0%      0%    1%        26%    32%      39%    1%             0%               0%
(Columns give the share of each access's energy spent in register file access, router logic, tag access, 32 KB SRAM access, DRAM access, on-chip transport, through-silicon vias, chip-to-board links, and chip-to-optical links.)
Linpack Energy
• Linpack: standard benchmark for TOP500 rankings
• Solve Ax = b by LU matrix decomposition & back solve
• Computationally intensive step: update rows after pivot
• Basic function: Y[j,k] = α·Y[i,k] + Y[j,k] (a sketch of this update loop follows)
  Read in C αs (C = "size" of cache line in dwords) (remote memory)
  Read in C Y's from row i (remote memory)
  Repeat for all elements of each row k:
    Read in C Y's from row k (local memory)
    Do processing
    Write C Y's (to cache)
    At some time flush C Y's back to local memory from cache
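A minimal sketch of the row-update loop described above, processed one cache line (C doubles) at a time; the names and the value of C are illustrative assumptions, not the HPL source:

```c
/* Row update: Y[j][k] = alpha * Y[i][k] + Y[j][k], one cache line at a time. */
#include <stddef.h>

#define C 8   /* doubles per cache line (assumed) */

/* Update row j of an n-column matrix against pivot row i. */
void update_row(double *Yj, const double *Yi, double alpha, size_t n) {
    for (size_t k = 0; k < n; k += C) {          /* one cache line at a time */
        size_t end = (k + C < n) ? k + C : n;
        for (size_t kk = k; kk < end; kk++)
            Yj[kk] = alpha * Yi[kk] + Yj[kk];    /* 2 flops per element */
        /* the C updated Y's sit in cache and are flushed back to local memory later */
    }
}
```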
Resulting Energy per Flop
Step              Target   pJ       # Occurrences   Total pJ   % of Total
Read alphas       Remote   13,819         4           55,276     16.5%
Read pivot row    Remote   13,819         4           55,276     16.5%
Read 1st Y[i]     Local     1,380        88          121,400     36.3%
Read other Y[i]s  L1           39       264           10,425      3.1%
Write Y's         L1           39       352           13,900      4.2%
Flush Y's         Local       891        88           78,380     23.4%
Total                                                334,656
Ave per flop                                             475
And a core flop in 2015 is only 10 pJ!!!
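A quick arithmetic check of the table: summing the per-step totals reproduces the quoted total, and dividing by the 475 pJ/flop average implies roughly 704 flops are being amortized (about 2 flops per Y element updated). The implied flop count is an inference from these numbers, not something stated on the slide:

```c
/* Sum the per-step totals above and back out the implied flop count. */
#include <stdio.h>

int main(void) {
    double step_pj[] = { 55276, 55276, 121400, 10425, 13900, 78380 };
    double total = 0;
    for (int i = 0; i < 6; i++) total += step_pj[i];

    printf("total   = %.0f pJ\n", total);   /* ~334,657, vs. 334,656 quoted (rounding) */
    printf("implied = %.1f flops at 475 pJ/flop\n", total / 475.0);  /* ~704.5 */
    return 0;
}
```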
Rockster on a Tiled Processor
"Extreme computing" in another dimension
The Problem: Take a Picture – Find the Rocks
• NxN pixel image (1Kx1K)
• Grey-scale pixels (8 bits)
• Intermediate "Rock Mask"
• Output: "Rock Descriptions"
• Image starts in memory (streamed from sensors)
The Rockster Algorithm
1. Move the image from memory to the tiles
2. 5x5 Gaussian smoothing (a generic sketch of this step appears below)
3. 3x3 Canny edge sharpener, including magnitude/angle computations per pixel
4. Horizon determination (sky fill): homogeneous regions touching the "top"
5. Edge following, including gap fix
6. Background fill to generate binary "Rock Mask" (1 bit/pixel: rock/not rock); write mask back to memory
7. Connected component to trace & report "Edge List" of rocks
Algorithm from JPL
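A generic sketch of the 5x5 Gaussian smoothing step; the binomial kernel weights and the clamped border handling are conventional choices, not taken from the JPL code:

```c
/* Generic 5x5 Gaussian smoothing of an 8-bit 1Kx1K image. */
#include <stdint.h>

#define N 1024

/* Binomial (1 4 6 4 1) outer product; weights sum to 256. */
static const int K[5][5] = {
    { 1,  4,  6,  4, 1 },
    { 4, 16, 24, 16, 4 },
    { 6, 24, 36, 24, 6 },
    { 4, 16, 24, 16, 4 },
    { 1,  4,  6,  4, 1 },
};

void gaussian5x5(const uint8_t in[N][N], uint8_t out[N][N]) {
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            int acc = 0;
            for (int dr = -2; dr <= 2; dr++) {
                for (int dc = -2; dc <= 2; dc++) {
                    int rr = r + dr, cc = c + dc;
                    /* clamp at the image borders */
                    if (rr < 0) rr = 0; else if (rr >= N) rr = N - 1;
                    if (cc < 0) cc = 0; else if (cc >= N) cc = N - 1;
                    acc += K[dr + 2][dc + 2] * in[rr][cc];
                }
            }
            out[r][c] = (uint8_t)(acc / 256);   /* normalize by the kernel sum */
        }
    }
}
```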
Simplified Tiled Multicore Architecture for Study
[Figure: a grid of P rows by Q columns of tiles, Tile(1,1) … Tile(P,Q), with a memory controller attached at Tile(1,1)]
• PxQ "Tiles"; all tiles identical
• Each tile has 64 KB of L2 today
• Only X-Y neighbor routing
• Memory access only from Tile(1,1)
What Does All This Mean?
• Under some reasonable assumptions, with no off-tile memory references, the energy of a "compute cycle" breaks down as: core logic 44%, L2 21%, L1 20%, register file 11%, instruction fetch 4%
• Each off-tile memory reference costs the equivalent of an additional 225 compute cycles of energy
How Much Energy to Move the Image?
• Assume PxP tiles
• Each tile accesses N²/P² pixels = 2^14/P² cache lines
• Each access: 1 read from DRAM + 1 write to L2; transport through (i + j - 1) tiles
• Total: 3.2 + 0.15*P million compute cycles (see the sketch below)
• Note: more tiles – more energy!
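The slide's image-move cost is a simple linear model in P; the sketch below just evaluates 3.2 + 0.15·P million compute cycles for a few PxP configurations (the constants come from the slide, the choice of P values is mine):

```c
/* Evaluate the slide's image-move cost model in equivalent compute cycles. */
#include <stdio.h>

int main(void) {
    for (int P = 2; P <= 32; P *= 2) {
        double mcycles = 3.2 + 0.15 * P;   /* millions of compute cycles */
        printf("P = %2d (%4d tiles): ~%.2f M compute cycles to move the image\n",
               P, P * P, mcycles);
    }
    return 0;   /* the cost grows with P: more tiles, more energy */
}
```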
Partitioning the Image onto PxQ Tiles
(a) Subimage per tile: p = N/P; q = N/Q
(b) Slice per tile: p = N/PQ; q = N
• Subimages of size "p" rows of pixels by "q" columns of pixels
• Equivalently: p×q = N²/(PQ) pixels
• Partitioning works only if 2N²/(PQ) < L2 cache size (a quick feasibility check follows)
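A quick feasibility check of that constraint, assuming the per-tile footprint is 2·N²/(PQ) bytes for the 1Kx1K 8-bit image (reading the factor of 2 as input plus working data is my assumption) against a 64 KB L2:

```c
/* Does each tile's share of the image fit in a 64 KB L2?
   The 2x footprint interpretation and the example (P,Q) pairs are assumptions. */
#include <stdio.h>

int main(void) {
    const long N = 1024, L2_BYTES = 64 * 1024;
    long dims[][2] = { {2,2}, {4,4}, {8,8}, {4,16}, {8,16} };  /* example P,Q */

    for (int i = 0; i < 5; i++) {
        long P = dims[i][0], Q = dims[i][1];
        long per_tile = 2 * N * N / (P * Q);   /* bytes per tile */
        printf("P=%ld Q=%ld: %ld bytes per tile -> %s\n",
               P, Q, per_tile, per_tile < L2_BYTES ? "fits" : "does NOT fit");
    }
    return 0;
}
```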
Ghost Cells
• Many steps need "neighboring cell values"
• Ghost cells: copies of neighboring cells from neighboring tiles
• Data exchanges cost energy
Total energy vs. P:
Partitioning   P=2       P=4       P=8       P=16        P=32
Subimage       149,760   299,520   599,040   1,198,080   2,396,160
Sliced          78,336    92,160   147,456     368,640   1,253,376
Explosive growth in energy as we grow the # of tiles!
Cache Access Energy for Filters
Filter energy   P=2 (4 tiles)   P=4 (16)   P=8 (64)   P=16 (256)   P=32 (1024)
Subimage        3.9E+07         3.9E+07    3.9E+07    4.0E+07      4.0E+07
Sliced          3.2E+07         3.2E+07    3.2E+07    3.2E+07      3.2E+07
• Above #s are in units of equivalent compute cycles, summed over all tiles
• Equivalent to about 30-40 compute cycles of energy per pixel
• Remember: need to add in the ghost-cell energy
Rockster Conclusions
• Computation is cheap energy-wise
• Accessing local memory is expensive
• Accessing DRAM global memory is huge!
• The way you partition data affects energy
• Some odd, counter-intuitive effects: some energy grows as the # of tiles increases
New DARPA Program: UHPC
• UHPC = Ubiquitous High Performance Computing
• Foci on "Extreme Scale" (1000X) technology: power consumption, cyber resiliency, productivity
• Target research machines by 2018: Petascale in a rack; Terascale module
• Challenge problems: streaming sensor data; a large dynamic graph-based informatics problem; a decision-class problem that encompasses search, hypothesis testing, and planning; 2 problems from the HPCMP or CREATE suites
• 4 teams: Intel, Nvidia, Tilera, Sandia National Labs
Overall Conclusions
• The world has gone to multi-core to continue Moore's Law
• Pushing performance another 1000X will be tough
• The major problem is in energy
• And that energy is in memory & interconnect: for both scientific and embedded applications, and even for data search applications
• We need to begin rearchitecting to reflect this: we need alternative metrics to "flops"; DON'T MOVE THE DATA!
• We need to begin redesigning our architectures & codes for energy minimization – now!