Computing on the Lunatic Fringe - University of Texas at ... · The power problem 1 10 100 1000...

Computing on the Lunatic Fringe: Exascale Computers and Why You Should Care

Mattan Erez

The University of Texas at Austin

Arch-focused whole-system approach

•Efficiency requirements require crossing layers

•Algorithms are key– Compute less, move less, store less

•Proportional systems– Minimize waste

•Utilize and improve emerging technologies

•Explore (and act) up and down– Programming model to circuits

•Preferably implement at micro-arch/system

(C) Mattan Erez 2

Big problems and emerging platforms

•Memory systems– Capacity, bandwidth, efficiency – impossible to balance

– Adaptive and dynamic management helps

– New technologies to the rescue?

– Opportunities for in-memory computing

•GPUs, supercomputers, clouds, and more– Throughput oriented designs are a must

– Centralization trend is interesting

•Reliability and resilience– More, smaller devices – danger of poor reliability

– Trimmed margins – less room for error

– Hard constraints – efficiency is a must

– Scale exacerbates reliability and resilience concerns

(C) Mattan Erez 3

Lots of interesting multi-level projects

Resilience/Reliability

GPU/CPUMemory

(C) Mattan Erez 4

•A superscale computer is a high-performance system for solving big problems

(c) Mattan Erez 5

•A supercomputer is a high-performance system for solving big cohesive problems

(c) Mattan Erez 6

•A supercomputer is a high-performance system for solving big cohesive problems

(c) Mattan Erez 7

•Simulate

• Analyze

• Predict

(c) Mattan Erez 8

•Simulate

• Analyze

• Predict

(c) Mattan Erez 9

•Supercomputers are general

– Not domain specific

•Cloud computers are general

– Algorithms keep changing

(c) Mattan Erez 10

•Superscale computers must balance

– Compute

– Storage

– Communication

– Constraints

• OpEx and CapEx

11(c) Mattan Erez

•Super is never super enough

(c) Mattan Erez 12

Num. 1

Num. 10

Num.500

Idealized VLSI

(c) Mattan Erez 13

•What’s keeping us from exascale?

(c) Mattan Erez 14

•The power problem

100000

1000000 Power

Efficiency

(c) Mattan Erez 15

•Solution: build more efficient systems

(c) Mattan Erez 16

•Exascale computers are a lunacy:

– Crazy big (1M processors, >10MW)

– Need efficiency of embedded devices

– Effective for many domains

– Fully programmable

17(c) Mattan Erez

•Why should you care?

•Harbingers of things to come

•Discovery awaits

•Enable and promote open innovation

18(c) Mattan Erez

•Where does the power go?

(c) Mattan Erez 19

•Cooling and infrastructure

– Amazing mechanical and electrical systems

– Getting close to optimal efficiency

17 PFLOP/s

18,688 nodes

~200 cabinets

~400 m2 floorspace

(c) Mattan Erez 20

•Actual processing

×√ x2

Arithmetic…100101001

0100…

MemoryControl

Idle/margin

How much of each component?

(c) Mattan Erez 21

•The budget: Power ≤ 20MW

(c) Mattan Erez 22

• 20MW / 1 exa-FLOP/sEnergy ≤

20pJ/op

• 50 GFLOPs/W sustained

(c) Mattan Erez 23

• 20MW / 1 exa-FLOP/sEnergy ≤

20pJ/op

• 50 GFLOPs/W sustained

• Best supercomputer today: ~300pJ/op

(c) Mattan Erez 24

Arithmetic

64-bit floating-point operation

Rough estimated numbers

×√ x2

Today Scaled Researchy

Arith. Headroom

(c) Mattan Erez 25

•Enough headroom?

(c) Mattan Erez 26

•Unfortunately, hard tradeoffs

64-bit DP

15 pJ 80 pJ

100 pJ

500 pJEfficient

off-chip

256-bit

2 nJDRAM

256-bit access

8 kB SRAM

(c) Mattan Erez 27

•Need more headroom

– Minimize waste

×√ x2

(c) Mattan Erez 28

Do we care about single-unit performance?

•Must all results be equally precise?

•Must all results be correct?

•Lunacy?

×√ x2

(c) Mattan Erez 29

•Relaxed reliability and precision– Some lunacy

(rare easy-to-detect errors + parallelism)

– Lunatic fringe: bounded imprecision

– Lunacy: live with real unpredictable errors

2015 12 8

5 8 1218

Today Scaled Researchy Some

lunacy

Lunatic

fringe

Lunacy

Arith. Headroom

Rough estimated numbers for illustration purposes

×√ x2

(c) Mattan Erez 30

•Bounded Approximate Duplication +

×√ x2

(c) Mattan Erez 31

•Bounded Approximate Duplication +

×√ x2

(c) Mattan Erez 32

•Supercomputers are general

• Dynamic adaptivity and tuning

– Some programmers crazier than others

– Large effort in error-tolerant algorithms

×√ x2

(c) Mattan Erez 33

•Proportionality and overprovisioning

(c) Mattan Erez 34

Arithmetic

Control

Memory

Margin

Sparse

Latency

•Architecture goals:

– Balance possible and practical

– Enable generality

– Don’t hurt the common case

•It’s all about the algorithm

35(c) Mattan Erez

•Architecture concepts so far:

– Proportionality

• Adaptivity

• SW-HW co-tuning

– Locality

– Parallelism

36(c) Mattan Erez

•How to control a billion arithmetic units?

(c) Mattan Erez 37

•Hierarchy minimizes control cost

– Amortize control decisions

Program

Processes

Threads

Instructions

Machine

Cabinets

/modules

HW units

(c) Mattan Erez 38

•Hierarchy minimizes control cost

– Amortize control decisions

(c) Mattan Erez 39

Program

Processes

Threads

Instructions

•What does a HW instruction do?

+ + + +

x x x x

Amortize/specialize more

Embedded

(c) Mattan Erez 40

•Single-thread or bulk-thread

latency

orientedthroughput

oriented

CPU GPU

(c) Mattan Erez 41

Heterogeneity is a necessity disciplined specialization

(c) Mattan Erez 42

•Must balance generality and control cost

•Over-specialization stifles innovation

– Algorithms more important than hardware

(c) Mattan Erez 43

Heterogeneity is lunacy

– Programing and tuning extremely tricky

– Diverse and rapidly evolving algorithms

(c) Mattan Erez 44

•Disciplined heterogeneityScarcity of choice

– Throughput oriented

– Latency oriented

– Disciplined reconfigurable accelerators

(c) Mattan Erez 45

•Heterogeneous computing already here

– “GPU” for throughput

– CPU for latency

– Abstracted FPGA accelerators

46(c) Mattan Erez

•Heterogeneous computing already here

Titan node (Cray XK7)

NVIDIA GPU + AMD CPU

Stampede node

Intel Xeon CPU + Xeon Phi

(c) AMD. Cray, Intel, NVIDIA, xSede 47

Copyrighted

picture of

an XK7 node

Copyrighted

picture of

Kepler die

Copyrighted

picture of

Opteron die

Copyrighted

picture of

Xeon Phi card

Copyrighted

picture of

Xeon Phi die

Copyrighted

picture of

Sany Bridge die

Copyrighted

picture of

Stampede

•Choose most efficient core

48(c) Mattan Erez

•Choose most efficient core

49(c) Mattan Erez

Arithmetic

Control

Memory

Margin

+ + + +=?

•Generalize the throughput core

50(c) Mattan Erez

51(c) Mattan Erez

52(c) Mattan Erez

•Synchronization may impair execution

– Predict when necessary robust optimization

– Adaptive compaction

: Compaction barrier

(c) Mattan Erez 53

•Architecture so far:

– Proportionality

• Adaptivity

• Heterogeneity

• SW-HW co-tuning

– Locality

– Parallelism

– Hierarchy

54(c) Mattan Erez

Heterogeneity is lunacy

– Programing and tuning extremely tricky

– Diverse and rapidly evolving algorithms

•How can we program these machines

(c) Mattan Erez 55

Hierarchical languages restore some sanity

– Abstract key tuning knobs

• Parallelism

• Locality

• Hierarchy

• Throughput

(c) Mattan Erez 56

Hierarchical languages restore some sanity

– CUDA and OpenCL

– Sequoia (Stanford)

– Habanero (Rice)

– Phalanx (NVIDIA)

– Legion (Stanford)

– Working into X10, Chapel

– …

•Domain specific languages even better

57(c) Mattan Erez

•A counter example:DE Shaw Research Anton

58(c) DE Shaw Research

Copyrighted

picture of

Anton package

(Google

image search)

Copyrighted

figure demonstrating

molecular dynamics

(Google

image search)

Copyrighted

figure of

Anton system

(Google

image search)

•Another counter example?

•Microsoft Catapult

– Disciplined reconfigurability for the cloud

– Distributed FPGA accelerator

• Within form-factor and cost constraints

– Architected interfaces and compiler

• Compiler for software, not (just) hardeare

59(c) Mattan Erez

Copyrighted

figure of

datacenter

(Google image search)

Copyrighted

figure of

Catapult diagram

(from ISCA 2014 paper)

Memory

Efficient

LargeFast

Reliable

(c) Mattan Erez 60

•Fast + large hierarchy

(c) Mattan Erez 61

TSVSilicon

Interposer

•Fast + large hierarchical

(c) Micron, Intel, Mattan Erez 62

•Large + efficient heterogeneity

(c) Mattan Erez 63

TSVSilicon

Interposer

•Large + efficient heterogeneous

(c) Micron, Intel, Mattan Erez 64

DRAM +

•Heterogeneous + hierarchical big mess

•Research just starting

– New mechanisms needed

– Very interesting tradeoffs with reliability

(c) Mattan Erez 65

•Fast + efficient control hierarchy

+ parallelism

(c) Mattan Erez 66

•Fast + efficient hierarchy + parallelism

Access cell

(c) Mattan Erez 67

Access many

cells in parallel

Many pins

in parallel

(c) Mattan Erez 68

Access even more

cells in parallel

Many pins

in parallel over

multiple cycles

(c) Mattan Erez 69

Access even more

cells in parallel

Many pins

in parallel over

multiple cycles

(c) Mattan Erez 70

•Hierarchy + parallelism potential waste

(c) Mattan Erez 71

•Is fine-grained access better?

– Wasteful when all data is needed

D1 E1 D2 E2 D3 E3C2

D5 E5 D6 E6 D7 E7

(c) Mattan Erez 72

•Dynamically adapt granularity best of both worlds

ProcessorCore

Last Level Cache

CoreN-1

Co-scheduling CG/FG w/ split CG

Sector cache (8 8B sectors per line)

Sub-Ranked DRAM w/ 2X ABUS Support both CG and FG

Locality Predictor

(c) Mattan Erez

mcf x8 omnetpp

mst x8 em3d x8 canneal

SSCA2 x8 linked list

gups x8 MIX-A MIX-B MIX-C MIX-D

CG+ECC

FG+ECC

AG+ECC

•Dynamically adapt granularity

(c) Mattan Erez 74

•Cost + … poor reliability

•Efficiency + … poor reliability

•New + … poor reliability

(c) Mattan Erez 75

•Compensate with error protection

– ECC

(c) Mattan Erez 76

•Compensate with proportional protection

(c) Mattan Erez 77

•Adaptive reliability

– Adapt the mechanism

– Adapt the error rate

• precision (stochastic precision)

(c) Mattan Erez 78

•Adaptive reliability

– By type

• Application guided

– By location

• Machine-state guided

(c) Mattan Erez 79

•Example: Virtualized ECC

– Adapt the mechanism

(c) Mattan Erez 80

Physical Memory

Data ECC

Virtual Address space

LowVirtual Page to Physical Frame mapping

Page frame – x

Page frame – y

Page frame – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

(c) Mattan Erez 81

Physical Memory

Physical Frame to ECC Page mapping

StrongerECC

Page frame – x

Page frame – y

Page frame – z

ECC page – y

ECC page – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

(c) Mattan Erez 82

Physical Memory

Data T1EC

Physical Frame to ECC Page mapping

T2EC for Chipkill

T2EC for DoubleChipkill

Page frame – x

Page frame – y

Page frame – z

ECC page – y

ECC page – z

Virtual page – x

Virtual page – y

Virtual page – z

Protection Level

Medium

(c) Mattan Erez 83

Two-Tiered ECC

Write Read

T1ECDecoding

Data T1EC

T1EC/T2ECEncoding

(c) Mattan Erez 84

85(c) Mattan Erez

Write Read

T1ECDecoding

Data T1EC

Virtualized

T1EC/T2ECEncoding

T2ECMapping

•Caching work great

Chipkill Detect ChipkillCorrect

2 ChipkillCorrect

Normalized Execution Time

(c) Mattan Erez 86

•Bonus: flexibility of ECC

Chipkill Detect Chipkill Correct 2 ChipkillCorrect

Normalized Energy-Delay Product

(c) Mattan Erez 87

•Can we do even better?

– Hide second-tier

88(C) Mattan Erez

FrugalECC = Compression + VECC

89(C) Mattan Erez

Adapt resilience scheme or adapt storage?

90(C) Mattan Erez

Adapt compression scheme too!Coverage-oriented-Compression (CoC)

91(C) Mattan Erez

92(C) Mattan Erez

93(C) Mattan Erez

Experiments promising:

Match or exceed reliability

Improve power

Minimal impact on performance

94(C) Mattan Erez

95(C) Mattan Erez

96(C) Mattan Erez

– Proportionality

• Adaptivity

• SW-HW co-tuning

• Heterogeneity

– Locality

– Parallelism

– Hierarchy

97(c) Mattan Erez

•The reliability challenge

(c) Mattan Erez 98

(15PF)

(40PF)

(130PF)

(450PF)

(1500PF)

Year (Expected performance in PetaFlops)

Intermittent SW

Overall

Detect Contain Repair Recover

(c) Mattan Erez 99

•Today’s techniques not good enough

– Hardware support expensive

– Checkpoint-restart doesn’t scale

(c) Mattan Erez 100

•Exascale resilience must be:

– Proportional

– Parallel

– Adaptive

– Hierarchical

– Analyzable

(c) Mattan Erez 101

•Containment domains

– Embed resilience within application

– Match system hierarchy and characteristics

– Analyzable abstraction consistent across layersRoot CD

Child CD

CDs don’t communicate

erroneous data

CDs have

means to recover

(c) Mattan Erez 102

Hardware Abstraction Layer

Microkernel

Runtime Library Interface

Machine

efficiency-oriented

programming model

int main(int argc, char **argv){

main_task here = phalanx::initialize(argc, argv);

… Create test arrays here …

// Launch kernel on default CPU (“host”)openmp_event e1 = async(here, here.processor(), n)

(host_saxpy, 2.0f, host_x, host_y);

// Launch kernel on default GPU (“device”)cuda_event e2 = async(here, here.cuda_gpu(0), n)

(device_saxpy, 2.0f, device_x, device_y);

wait(here, e1&e2);return 0;

Phalanx C++ program

CD Annotations

resiliency model

Reporting

Architecture

ECC, status

CD control and persistence

CD API

resiliency interface

(c) Mattan Erez 103

101, 106

Mapping example: SpMV

104(c) Mattan Erez

void task<inner> SpMV( in M, in Vi, out Ri){

forall(…) reduce(…)SpMV(M[…],Vi[…],Ri[…]);

void task<leaf> SpMV(…){

for r=0..N

for c=rowS[r]..rowS[r+1] {resi[r]+=data[c]*Vi[cIdx[c]];

prevC=c;

Matrix M

Vector V

Mapping example: SpMV

105(c) Mattan Erez

for r=0..N

prevC=c;

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

•Mapping example: SpMV

106(c) Mattan Erez

𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

Distributed to 4 nodes

for r=0..N

prevC=c;

107(c) Mattan Erez

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

for r=0..N

prevC=c;

Distributed to 4 nodes

108(c) Mattan Erez

𝑴𝟎𝟎 𝑽𝟎

Preserve

DetectRecover

𝑴𝟏𝟎 𝑽𝟎

Preserve

DetectRecover

𝑴𝟎𝟏 𝑽𝟏

Preserve

DetectRecover

𝑴𝟏𝟏 𝑽𝟏

Preserve

DetectRecover

Preserve

DetectRecover

Parent CD

Child CD

Preserve (Parent)

Detect (Parent)

Recover (Parent)

DetectRecover

void task<inner> SpMV(in M, in Vi, out Ri) {

cd = begin(parentCD);

preserve_via_copy(cd, matrix, …);

complete(cd);

void task<leaf> SpMV(…) {

preserve_via_parent(cd, veci, …);

for r=0..N

check {fault<fail>(c > prevC);}

prevC=c;

complete(cd);

109(c) Mattan Erez

configure

preserve

complete

•SpMV preservation tuning

110(c) Mattan Erez

Natural redundancy

void task<leaf> SpMV(…) {cd = begin(parentCD);preserve_via_copy(cd, matrix, …);preserve_via_parent(cd, veci, …);for r=0..N for c=rowS[r]..rowS[r+1] {

resi[r]+=data[c]*Vi[cIdx[c]];check {fault<fail>(c > prevC);}prevC=c;

}complete(cd);}

𝑴𝟎𝟎 𝑴𝟎𝟏

𝑴𝟏𝟎 𝑴𝟏𝟏

Matrix M

𝑽𝟎

Vector V

𝑽𝟏

Hierarchy

𝑴𝟎𝟎 𝑴𝟎𝟏𝑴𝟏𝟎 𝑴𝟏𝟏𝑽𝟎 𝑽𝟏𝑽𝟎 𝑽𝟏

•Concise abstraction for complex behavior

111(c) Mattan Erez

void task<leaf> SpMV(…) {

preserve_via_parent(cd, veci, …);

for r=0..N

for c=rowS[r]..rowS[r+1] {

resi[r]+=data[c]*Vi[cIdx[c]];

check {fault<fail>(c > prevC);}

prevC=c;

complete(cd);}

Local copy or regen Sibling Parent (unchanged)

Recovery: concurrent, hierarchical, adaptive

112(c) Mattan Erez

Compute

Preserve

Detect

Re-execution overhead

•Analyzable

– Application model (tree)

– Overhead model

– Fault model

113(c) Mattan Erez

•𝑞 0,𝑚, 𝑛 = 1 − 𝑝𝑐𝑚∙𝑛

– Probability that all child CDs experienced no failures

•𝑞 𝑥,𝑚, 𝑛 = 𝑖=0𝑥 𝑖+𝑚−1

𝑖𝑝𝑐𝑖 1 − 𝑝𝑐

𝑚𝑛

– Probability that all child CDs experienced at most x failures in the m serial CDs

•𝑑 0,𝑚, 𝑛 = 𝑞[0,𝑚, 𝑛]– Probability that all child CDs experienced exactly 0

failures

•𝑑 𝑥,𝑚, 𝑛 = 𝑞 𝑥,𝑚, 𝑛 − 𝑞 𝑥 − 1,𝑚, 𝑛– Probability that sibling with the most failure

experiences exactly x failures

•𝑇𝑝𝑎𝑟𝑒𝑛𝑡 𝑥,𝑚, 𝑛 = 𝑖=0∞ 𝑖 + 𝑚 𝑇𝑐𝑑[𝑖,𝑚, 𝑛]

•Heterogeneous CDs

•Asymmetric preservation and restoration

Parallel domains

Execution

Re-execution

Analyzable

(c) Mattan Erez

•Analyzable?

115(c) Mattan Erez

•CDs are scalable

116(c) Mattan Erez

Peak System Performance

0%20%40%60%80%