Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000...
Transcript of Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000...
![Page 1: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/1.jpg)
Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care
Mattan Erez
The University of Texas at Austin
1
![Page 2: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/2.jpg)
Arch-focused whole-system approach
•Efficiency requirements require crossing layers
•Algorithms are key– Compute less, move less, store less
•Proportional systems– Minimize waste
•Utilize and improve emerging technologies
•Explore (and act) up and down– Programming model to circuits
•Preferably implement at micro-arch/system
(C) Mattan Erez 2
![Page 3: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/3.jpg)
Big problems and emerging platforms
•Memory systems– Capacity, bandwidth, efficiency – impossible to balance
– Adaptive and dynamic management helps
– New technologies to the rescue?
– Opportunities for in-memory computing
•GPUs, supercomputers, clouds, and more– Throughput oriented designs are a must
– Centralization trend is interesting
•Reliability and resilience– More, smaller devices – danger of poor reliability
– Trimmed margins – less room for error
– Hard constraints – efficiency is a must
– Scale exacerbates reliability and resilience concerns
(C) Mattan Erez 3
![Page 4: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/4.jpg)
Lots of interesting multi-level projects
Resilience/Reliability
GPU/CPUMemory
(C) Mattan Erez 4
![Page 5: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/5.jpg)
•A superscale computer is a high-performance system for solving big problems
(c) Mattan Erez 5
![Page 6: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/6.jpg)
•A supercomputer is a high-performance system for solving big cohesive problems
(c) Mattan Erez 6
![Page 7: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/7.jpg)
•A supercomputer is a high-performance system for solving big cohesive problems
(c) Mattan Erez 7
![Page 8: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/8.jpg)
•Simulate
• Analyze
• Predict
(c) Mattan Erez 8
![Page 9: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/9.jpg)
•Simulate
• Analyze
• Predict
(c) Mattan Erez 9
![Page 10: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/10.jpg)
•Supercomputers are general
– Not domain specific
•Cloud computers are general
– Algorithms keep changing
(c) Mattan Erez 10
![Page 11: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/11.jpg)
•Superscale computers must balance
– Compute
– Storage
– Communication
– Constraints
• OpEx and CapEx
11(c) Mattan Erez
![Page 12: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/12.jpg)
•Super is never super enough
(c) Mattan Erez 12
![Page 13: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/13.jpg)
1E+8
1E+9
1E+10
1E+11
1E+12
1E+13
1E+14
1E+15
1E+16
1E+17
1E+18
Su
sta
ine
d F
LOP
/s
Num. 1
Num. 10
Num.500
Idealized VLSI
(c) Mattan Erez 13
![Page 14: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/14.jpg)
•What’s keeping us from exascale?
(c) Mattan Erez 14
![Page 15: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/15.jpg)
•The power problem
1
10
100
1000
10000
100000
1000000 Power
Efficiency
Tota
l P
ow
er
[kW
]
Eff
icie
nc
y [
GFLO
PS/k
W]
(c) Mattan Erez 15
![Page 16: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/16.jpg)
•Solution: build more efficient systems
(c) Mattan Erez 16
![Page 17: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/17.jpg)
•Exascale computers are a lunacy:
– Crazy big (1M processors, >10MW)
– Need efficiency of embedded devices
– Effective for many domains
– Fully programmable
17(c) Mattan Erez
![Page 18: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/18.jpg)
•Why should you care?
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
18(c) Mattan Erez
![Page 19: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/19.jpg)
•Where does the power go?
(c) Mattan Erez 19
![Page 20: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/20.jpg)
•Cooling and infrastructure
– Amazing mechanical and electrical systems
– Getting close to optimal efficiency
17 PFLOP/s
18,688 nodes
>8 MW
~200 cabinets
~400 m2 floorspace
(c) Mattan Erez 20
![Page 21: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/21.jpg)
•Actual processing
+
−
×√ x2
Arithmetic…100101001
0100…
I/O
MemoryControl
Comm.
Idle/margin
How much of each component?
(c) Mattan Erez 21
![Page 22: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/22.jpg)
•The budget: Power ≤ 20MW
(c) Mattan Erez 22
![Page 23: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/23.jpg)
• 20MW / 1 exa-FLOP/sEnergy ≤
20pJ/op
• 50 GFLOPs/W sustained
(c) Mattan Erez 23
![Page 24: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/24.jpg)
• 20MW / 1 exa-FLOP/sEnergy ≤
20pJ/op
• 50 GFLOPs/W sustained
• Best supercomputer today: ~300pJ/op
(c) Mattan Erez 24
![Page 25: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/25.jpg)
Arithmetic
64-bit floating-point operation
Rough estimated numbers
+
−
×√ x2
40
2015
5
0
10
20
30
40
50
Today Scaled Researchy
Arith. Headroom
(c) Mattan Erez 25
260
![Page 26: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/26.jpg)
•Enough headroom?
(c) Mattan Erez 26
![Page 27: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/27.jpg)
•Unfortunately, hard tradeoffs
20mm
64-bit DP
15pJ
15 pJ 80 pJ
100 pJ
500 pJEfficient
off-chip
link
10nm
256-bit
buses
2 nJDRAM
Rd/Wr
256-bit access
8 kB SRAM
15 pJ
(c) Mattan Erez 27
![Page 28: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/28.jpg)
•Need more headroom
– Minimize waste
+
−
×√ x2
(c) Mattan Erez 28
![Page 29: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/29.jpg)
Do we care about single-unit performance?
•Must all results be equally precise?
•Must all results be correct?
•Lunacy?
+
−
×√ x2
(c) Mattan Erez 29
![Page 30: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/30.jpg)
•Relaxed reliability and precision– Some lunacy
(rare easy-to-detect errors + parallelism)
– Lunatic fringe: bounded imprecision
– Lunacy: live with real unpredictable errors
40
2015 12 8
2
5 8 1218
0
10
20
30
40
50
Today Scaled Researchy Some
lunacy
Lunatic
fringe
Lunacy
Arith. Headroom
Rough estimated numbers for illustration purposes
+
−
×√ x2
(c) Mattan Erez 30
![Page 31: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/31.jpg)
•Bounded Approximate Duplication +
−
×√ x2
(c) Mattan Erez 31
![Page 32: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/32.jpg)
•Bounded Approximate Duplication +
−
×√ x2
(c) Mattan Erez 32
![Page 33: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/33.jpg)
•Supercomputers are general
• Dynamic adaptivity and tuning
– Some programmers crazier than others
– Large effort in error-tolerant algorithms
+
−
×√ x2
(c) Mattan Erez 33
![Page 34: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/34.jpg)
•Proportionality and overprovisioning
(c) Mattan Erez 34
Arithmetic
Control
Memory
Margin
Dense
Sparse
Latency
![Page 35: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/35.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
35(c) Mattan Erez
![Page 36: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/36.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture concepts so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
– Locality
– Parallelism
36(c) Mattan Erez
![Page 37: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/37.jpg)
•How to control a billion arithmetic units?
(c) Mattan Erez 37
![Page 38: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/38.jpg)
•Hierarchy minimizes control cost
– Amortize control decisions
Program
Teams
Processes
Threads
Instructions
Machine
Cabinets
/modules
Nodes
Cores
HW units
(c) Mattan Erez 38
![Page 39: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/39.jpg)
•Hierarchy minimizes control cost
– Amortize control decisions
(c) Mattan Erez 39
Program
Teams
Processes
Threads
Instructions
![Page 40: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/40.jpg)
•What does a HW instruction do?
++++
xxxx
+ + + +
x x x x
x x
+
+
√
- +
Amortize/specialize more
CPU
GPU
Embedded
(c) Mattan Erez 40
![Page 41: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/41.jpg)
•Single-thread or bulk-thread
latency
orientedthroughput
oriented
CPU GPU
(c) Mattan Erez 41
![Page 42: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/42.jpg)
Heterogeneity is a necessity disciplined specialization
(c) Mattan Erez 42
![Page 43: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/43.jpg)
•Must balance generality and control cost
•Over-specialization stifles innovation
– Algorithms more important than hardware
(c) Mattan Erez 43
![Page 44: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/44.jpg)
Heterogeneity is lunacy
– Programing and tuning extremely tricky
– Diverse and rapidly evolving algorithms
(c) Mattan Erez 44
![Page 45: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/45.jpg)
•Disciplined heterogeneityScarcity of choice
– Throughput oriented
– Latency oriented
– Disciplined reconfigurable accelerators
(c) Mattan Erez 45
![Page 46: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/46.jpg)
•Heterogeneous computing already here
– “GPU” for throughput
– CPU for latency
– Abstracted FPGA accelerators
46(c) Mattan Erez
![Page 47: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/47.jpg)
•Heterogeneous computing already here
Titan node (Cray XK7)
NVIDIA GPU + AMD CPU
Stampede node
Intel Xeon CPU + Xeon Phi
(c) AMD. Cray, Intel, NVIDIA, xSede 47
Copyrighted
picture of
an XK7 node
Copyrighted
picture of
Kepler die
Copyrighted
picture of
Opteron die
Copyrighted
picture of
Xeon Phi card
Copyrighted
picture of
Xeon Phi die
Copyrighted
picture of
Sany Bridge die
Copyrighted
picture of
Stampede
node
![Page 48: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/48.jpg)
•Choose most efficient core
48(c) Mattan Erez
![Page 49: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/49.jpg)
•Choose most efficient core
49(c) Mattan Erez
Arithmetic
Control
Memory
Margin
![Page 50: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/50.jpg)
+
+
+
+
+ + + +=?
•Generalize the throughput core
50(c) Mattan Erez
52
![Page 51: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/51.jpg)
•Generalize the throughput core
51(c) Mattan Erez
![Page 52: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/52.jpg)
•Generalize the throughput core
52(c) Mattan Erez
![Page 53: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/53.jpg)
•Synchronization may impair execution
– Predict when necessary robust optimization
– Adaptive compaction
: Compaction barrier
(c) Mattan Erez 53
![Page 54: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/54.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• Heterogeneity
• SW-HW co-tuning
– Locality
– Parallelism
– Hierarchy
54(c) Mattan Erez
![Page 55: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/55.jpg)
Heterogeneity is lunacy
– Programing and tuning extremely tricky
– Diverse and rapidly evolving algorithms
•How can we program these machines
(c) Mattan Erez 55
![Page 56: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/56.jpg)
Hierarchical languages restore some sanity
– Abstract key tuning knobs
• Parallelism
• Locality
• Hierarchy
• Throughput
(c) Mattan Erez 56
![Page 57: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/57.jpg)
Hierarchical languages restore some sanity
– CUDA and OpenCL
– Sequoia (Stanford)
– Habanero (Rice)
– Phalanx (NVIDIA)
– Legion (Stanford)
– Working into X10, Chapel
– …
•Domain specific languages even better
57(c) Mattan Erez
![Page 58: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/58.jpg)
•A counter example:DE Shaw Research Anton
58(c) DE Shaw Research
Copyrighted
picture of
Anton package
image search)
Copyrighted
figure demonstrating
molecular dynamics
image search)
Copyrighted
figure of
Anton system
image search)
![Page 59: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/59.jpg)
•Another counter example?
•Microsoft Catapult
– Disciplined reconfigurability for the cloud
– Distributed FPGA accelerator
• Within form-factor and cost constraints
– Architected interfaces and compiler
• Compiler for software, not (just) hardeare
59(c) Mattan Erez
Copyrighted
figure of
datacenter
(Google image search)
Copyrighted
figure of
Catapult diagram
(from ISCA 2014 paper)
![Page 60: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/60.jpg)
Memory
Cheap
Efficient
LargeFast
Reliable
(c) Mattan Erez 60
![Page 61: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/61.jpg)
•Fast + large hierarchy
(c) Mattan Erez 61
![Page 62: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/62.jpg)
TSVSilicon
Interposer
•Fast + large hierarchical
(c) Micron, Intel, Mattan Erez 62
mem
mem
![Page 63: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/63.jpg)
•Large + efficient heterogeneity
(c) Mattan Erez 63
![Page 64: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/64.jpg)
TSVSilicon
Interposer
•Large + efficient heterogeneous
(c) Micron, Intel, Mattan Erez 64
DRAM +
NVRAM
![Page 65: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/65.jpg)
•Heterogeneous + hierarchical big mess
•Research just starting
– New mechanisms needed
– Very interesting tradeoffs with reliability
(c) Mattan Erez 65
![Page 66: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/66.jpg)
•Fast + efficient control hierarchy
+ parallelism
(c) Mattan Erez 66
![Page 67: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/67.jpg)
•Fast + efficient hierarchy + parallelism
Access cell
(c) Mattan Erez 67
![Page 68: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/68.jpg)
•Fast + efficient hierarchy + parallelism
Access many
cells in parallel
Many pins
in parallel
(c) Mattan Erez 68
![Page 69: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/69.jpg)
•Fast + efficient hierarchy + parallelism
Access even more
cells in parallel
Many pins
in parallel over
multiple cycles
(c) Mattan Erez 69
![Page 70: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/70.jpg)
•Fast + efficient hierarchy + parallelism
Access even more
cells in parallel
Many pins
in parallel over
multiple cycles
(c) Mattan Erez 70
![Page 71: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/71.jpg)
•Hierarchy + parallelism potential waste
(c) Mattan Erez 71
![Page 72: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/72.jpg)
•Is fine-grained access better?
– Wasteful when all data is needed
D0 E0
DataC
C0
C1
D1 E1 D2 E2 D3 E3C2
C3
C4
D4 E4
CG
FGC5
C6
C7
D5 E5 D6 E6 D7 E7
ECC
D0 E0
DataC
C0
CG
FG
ECC
(c) Mattan Erez 72
72
![Page 73: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/73.jpg)
•Dynamically adapt granularity best of both worlds
73
ProcessorCore
$I$D
L2
Last Level Cache
MC
DRAM
Core0
Core1
CoreN-1
…
SLP
Co-scheduling CG/FG w/ split CG
Sector cache (8 8B sectors per line)
Sub-Ranked DRAM w/ 2X ABUS Support both CG and FG
Locality Predictor
(c) Mattan Erez
![Page 74: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/74.jpg)
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
mcf x8 omnetpp
x8
mst x8 em3d x8 canneal
x8
SSCA2 x8 linked list
x8
gups x8 MIX-A MIX-B MIX-C MIX-D
Sy
ste
m T
hro
ug
hp
ut
CG+ECC
FG+ECC
AG+ECC
•Dynamically adapt granularity
(c) Mattan Erez 74
![Page 75: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/75.jpg)
•Cost + … poor reliability
•Efficiency + … poor reliability
•New + … poor reliability
(c) Mattan Erez 75
![Page 76: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/76.jpg)
•Compensate with error protection
– ECC
(c) Mattan Erez 76
![Page 77: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/77.jpg)
•Compensate with proportional protection
(c) Mattan Erez 77
![Page 78: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/78.jpg)
•Adaptive reliability
– Adapt the mechanism
– Adapt the error rate
• precision (stochastic precision)
(c) Mattan Erez 78
![Page 79: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/79.jpg)
•Adaptive reliability
– By type
• Application guided
– By location
• Machine-state guided
(c) Mattan Erez 79
![Page 80: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/80.jpg)
•Example: Virtualized ECC
– Adapt the mechanism
(c) Mattan Erez 80
![Page 81: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/81.jpg)
Physical Memory
Data ECC
Virtual Address space
LowVirtual Page to Physical Frame mapping
Page frame – x
Page frame – y
Page frame – z
Virtual page – x
Virtual page – y
Virtual page – z
Protection Level
Medium
High
(c) Mattan Erez 81
![Page 82: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/82.jpg)
Physical Memory
Data
Virtual Address space
LowVirtual Page to Physical Frame mapping
Physical Frame to ECC Page mapping
ECC
StrongerECC
Page frame – x
Page frame – y
Page frame – z
ECC page – y
ECC page – z
Virtual page – x
Virtual page – y
Virtual page – z
Protection Level
Medium
High
(c) Mattan Erez 82
![Page 83: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/83.jpg)
Physical Memory
Data T1EC
Virtual Address space
LowVirtual Page to Physical Frame mapping
Physical Frame to ECC Page mapping
T2EC for Chipkill
T2EC for DoubleChipkill
Page frame – x
Page frame – y
Page frame – z
ECC page – y
ECC page – z
Virtual page – x
Virtual page – y
Virtual page – z
Protection Level
Medium
High
(c) Mattan Erez 83
![Page 84: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/84.jpg)
Two-Tiered ECC
Write Read
T1ECDecoding
Data T1EC
T1EC/T2ECEncoding
T2EC
(c) Mattan Erez 84
![Page 85: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/85.jpg)
85(c) Mattan Erez
Write Read
T1ECDecoding
Data T1EC
Virtualized
T2EC
T1EC/T2ECEncoding
T2ECMapping
![Page 86: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/86.jpg)
•Caching work great
0.98
0.99
1.00
1.01
1.02
Chipkill Detect ChipkillCorrect
2 ChipkillCorrect
Normalized Execution Time
(c) Mattan Erez 86
![Page 87: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/87.jpg)
•Bonus: flexibility of ECC
0.75
0.80
0.85
0.90
0.95
1.00
Chipkill Detect Chipkill Correct 2 ChipkillCorrect
Normalized Energy-Delay Product
(c) Mattan Erez 87
![Page 88: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/88.jpg)
•Can we do even better?
– Hide second-tier
88(C) Mattan Erez
![Page 89: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/89.jpg)
FrugalECC = Compression + VECC
89(C) Mattan Erez
![Page 90: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/90.jpg)
Adapt resilience scheme or adapt storage?
90(C) Mattan Erez
![Page 91: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/91.jpg)
Adapt compression scheme too!Coverage-oriented-Compression (CoC)
91(C) Mattan Erez
![Page 92: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/92.jpg)
92(C) Mattan Erez
![Page 93: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/93.jpg)
93(C) Mattan Erez
![Page 94: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/94.jpg)
Experiments promising:
Match or exceed reliability
Improve power
Minimal impact on performance
94(C) Mattan Erez
![Page 95: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/95.jpg)
95(C) Mattan Erez
![Page 96: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/96.jpg)
96(C) Mattan Erez
![Page 97: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/97.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
97(c) Mattan Erez
![Page 98: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/98.jpg)
•The reliability challenge
(c) Mattan Erez 98
1
10
100
1000
2013
(15PF)
2015
(40PF)
2017
(130PF)
2019
(450PF)
2021
(1500PF)
Pro
ject
ed
MT
TI
[ho
urs
]
Year (Expected performance in PetaFlops)
Hard
Soft
Intermittent SW
Overall
![Page 99: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/99.jpg)
Detect Contain Repair Recover
(c) Mattan Erez 99
![Page 100: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/100.jpg)
•Today’s techniques not good enough
– Hardware support expensive
– Checkpoint-restart doesn’t scale
(c) Mattan Erez 100
![Page 101: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/101.jpg)
•Exascale resilience must be:
– Proportional
– Parallel
– Adaptive
– Hierarchical
– Analyzable
(c) Mattan Erez 101
![Page 102: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/102.jpg)
•Containment domains
– Embed resilience within application
– Match system hierarchy and characteristics
– Analyzable abstraction consistent across layersRoot CD
Child CD
CDs don’t communicate
erroneous data
CDs have
means to recover
(c) Mattan Erez 102
![Page 103: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/103.jpg)
Hardware Abstraction Layer
Microkernel
Runtime Library Interface
Machine
efficiency-oriented
programming model
int main(int argc, char **argv){
main_task here = phalanx::initialize(argc, argv);
… Create test arrays here …
// Launch kernel on default CPU (“host”)openmp_event e1 = async(here, here.processor(), n)
(host_saxpy, 2.0f, host_x, host_y);
// Launch kernel on default GPU (“device”)cuda_event e2 = async(here, here.cuda_gpu(0), n)
(device_saxpy, 2.0f, device_x, device_y);
wait(here, e1&e2);return 0;
}
Phalanx C++ program
CD Annotations
resiliency model
Error
Reporting
Architecture
ECC, status
CD control and persistence
CD API
resiliency interface
(c) Mattan Erez 103
101, 106
![Page 104: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/104.jpg)
Mapping example: SpMV
104(c) Mattan Erez
void task<inner> SpMV( in M, in Vi, out Ri){
forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);
}
void task<leaf> SpMV(…){
for r=0..N
for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];
prevC=c;
}
}
𝑴
Matrix M
𝑽
Vector V
![Page 105: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/105.jpg)
Mapping example: SpMV
105(c) Mattan Erez
void task<inner> SpMV( in M, in Vi, out Ri){
forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);
}
void task<leaf> SpMV(…){
for r=0..N
for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];
prevC=c;
}
}
𝑴𝟎𝟎 𝑴𝟎𝟏
𝑴𝟏𝟎 𝑴𝟏𝟏
Matrix M
𝑽𝟎
Vector V
𝑽𝟏
![Page 106: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/106.jpg)
•Mapping example: SpMV
106(c) Mattan Erez
𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏
𝑴𝟎𝟎 𝑴𝟎𝟏
𝑴𝟏𝟎 𝑴𝟏𝟏
Matrix M
𝑽𝟎
Vector V
𝑽𝟏
Distributed to 4 nodes
void task<inner> SpMV( in M, in Vi, out Ri){
forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);
}
void task<leaf> SpMV(…){
for r=0..N
for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];
prevC=c;
}
}
![Page 107: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/107.jpg)
•Mapping example: SpMV
107(c) Mattan Erez
𝑴𝟎𝟎 𝑴𝟎𝟏
𝑴𝟏𝟎 𝑴𝟏𝟏
Matrix M
𝑽𝟎
Vector V
𝑽𝟏
void task<inner> SpMV( in M, in Vi, out Ri){
forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);
}
void task<leaf> SpMV(…){
for r=0..N
for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];
prevC=c;
}
}
Distributed to 4 nodes
![Page 108: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/108.jpg)
•Mapping example: SpMV
108(c) Mattan Erez
𝑴𝟎𝟎 𝑽𝟎
Preserve
DetectRecover
𝑴𝟏𝟎 𝑽𝟎
Preserve
DetectRecover
𝑴𝟎𝟏 𝑽𝟏
Preserve
DetectRecover
𝑴𝟏𝟏 𝑽𝟏
Preserve
DetectRecover
Preserve
DetectRecover
M V
Parent CD
Child CD
Preserve (Parent)
Detect (Parent)
Recover (Parent)
Child
DetectRecover
Child
DetectRecover
Child
DetectRecover
Child
DetectRecover
![Page 109: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/109.jpg)
void task<inner> SpMV(in M, in Vi, out Ri) {
cd = begin(parentCD);
preserve_via_copy(cd, matrix, …);
forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);
complete(cd);
}
void task<leaf> SpMV(…) {
cd = begin(parentCD);
preserve_via_copy(cd, matrix, …);
preserve_via_parent(cd, veci, …);
for r=0..N
for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];
check {fault<fail>(c > prevC);}
prevC=c;
}
complete(cd);
}
109(c) Mattan Erez
API
begin
configure
preserve
check
complete
![Page 110: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/110.jpg)
•SpMV preservation tuning
110(c) Mattan Erez
Natural redundancy
void task<leaf> SpMV(…) {cd = begin(parentCD);preserve_via_copy(cd, matrix, …);preserve_via_parent(cd, veci, …);for r=0..N for c=rowS[r]..rowS[r+1] {
resi[r]+=data[c]*Vi[cIdx[c]];check {fault<fail>(c > prevC);}prevC=c;
}complete(cd);}
𝑴𝟎𝟎 𝑴𝟎𝟏
𝑴𝟏𝟎 𝑴𝟏𝟏
Matrix M
𝑽𝟎
Vector V
𝑽𝟏
Hierarchy
𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏
![Page 111: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/111.jpg)
•Concise abstraction for complex behavior
111(c) Mattan Erez
void task<leaf> SpMV(…) {
cd = begin(parentCD);
preserve_via_copy(cd, matrix, …);
preserve_via_parent(cd, veci, …);
for r=0..N
for c=rowS[r]..rowS[r+1] {
resi[r]+=data[c]*Vi[cIdx[c]];
check {fault<fail>(c > prevC);}
prevC=c;
}
complete(cd);}
Local copy or regen Sibling Parent (unchanged)
![Page 112: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/112.jpg)
Recovery: concurrent, hierarchical, adaptive
112(c) Mattan Erez
Compute
Preserve
Detect
Re-execution overhead
Tim
e
![Page 113: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/113.jpg)
•Analyzable
– Application model (tree)
– Overhead model
– Fault model
113(c) Mattan Erez
Ex
ec
utio
n t
ime
![Page 114: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/114.jpg)
•𝑞 0,𝑚, 𝑛 = 1 − 𝑝𝑐𝑚∙𝑛
– Probability that all child CDs experienced no failures
•𝑞 𝑥,𝑚, 𝑛 = 𝑖=0𝑥 𝑖+𝑚−1
𝑖𝑝𝑐𝑖 1 − 𝑝𝑐
𝑚𝑛
– Probability that all child CDs experienced at most x failures in the m serial CDs
•𝑑 0,𝑚, 𝑛 = 𝑞[0,𝑚, 𝑛]– Probability that all child CDs experienced exactly 0
failures
•𝑑 𝑥,𝑚, 𝑛 = 𝑞 𝑥,𝑚, 𝑛 − 𝑞 𝑥 − 1,𝑚, 𝑛– Probability that sibling with the most failure
experiences exactly x failures
•𝑇𝑝𝑎𝑟𝑒𝑛𝑡 𝑥,𝑚, 𝑛 = 𝑖=0∞ 𝑖 + 𝑚 𝑇𝑐𝑑[𝑖,𝑚, 𝑛]
•Heterogeneous CDs
•Asymmetric preservation and restoration
114
Re
-exe
cu
tio
n t
ime
Parallel domains
Execution
Re-execution
Analyzable
(c) Mattan Erez
![Page 115: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/115.jpg)
•Analyzable?
115(c) Mattan Erez
Ex
ec
utio
n t
ime
108
![Page 116: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/116.jpg)
•CDs are scalable
116(c) Mattan Erez
Peak System Performance
NT
SpMV
HPCCG
0%20%40%60%80%
100%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
Pe
rfo
rma
nc
e
Effic
ien
cy CDs, NT
h-CPR, 80%
g-CPR, 80%
0%20%40%60%80%
100%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
Pe
rfo
rma
nc
e
Effic
ien
cy CDs, SpMV
h-CPR, 50%
g-CPR, 50%
0%20%40%60%80%
100%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
Pe
rfo
rma
nc
e
Effic
ien
cy CDs, HPCCG
h-CPR, 10%
g-CPR, 10%
![Page 117: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/117.jpg)
•CDs are proportional and efficient
117(c) Mattan Erez
0%
10%
20%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
En
erg
y
Ov
erh
ea
d
CDs, SpMV
h-CPR, 50%
g-CPR, 50%
0%
10%
20%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
En
erg
y
Ov
erh
ea
d
CDs, NT
h-CPR, 80%
gCPR, 80%
Peak System Performance
0%
10%
20%
2.5PF 10PF 40PF 160PF 640PF 1.2EF 2.5EF
En
erg
y
Ov
erh
ea
d
CDs, HPCCG
h-CPR, 10%
g-CPR, 10%
NT
SpMV
HPCCG
![Page 118: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/118.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•It’s all about the algorithm
118(c) Mattan Erez
![Page 119: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/119.jpg)
•Architecture goals:
– Balance possible and practical
– Enable generality
– Don’t hurt the common case
•Architecture so far:
– Proportionality
• Adaptivity
• SW-HW co-tuning + analyzability
• Heterogeneity
– Locality
– Parallelism
– Hierarchy
119(c) Mattan Erez
Arithmetic
Control
Memory
Reliability
Programming
![Page 120: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/120.jpg)
•Exascale computers are lunacy
120(c) Mattan Erez
![Page 121: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/121.jpg)
•Exascale computers are lunacyscience
121(c) Mattan Erez
![Page 122: Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000 10000 100000 1000000 Power ... –Getting close to optimal efficiency 17 PFLOP/s 18,688](https://reader035.fdocuments.us/reader035/viewer/2022071102/5fdb24a79ee8486334111570/html5/thumbnails/122.jpg)
•Harbingers of things to come
•Discovery awaits
•Enable and promote open innovation
•Math + Systems +
Engineering +
• Technology
122(c) Mattan Erez