Failure, Resilience, Opportunity and Innovation John Daly, U.S. Department of Defense Salishan High Speed Computing Conference April 27, 2015

Page 1: Failure, Resilience, Opportunity and Innovation (salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/daly.pdf)

Failure, Resilience, Opportunity and Innovation

John Daly, U.S. Department of Defense Salishan High Speed Computing Conference

April 27, 2015

Page 2

Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future

Page 3

http://en.wikipedia.org/wiki/Tandem_Computers

Resilience at Scale

Page 4

http://en.wikipedia.org/wiki/Tandem_Computers

Resilience at Scale

“Diversity jolts us into cognitive action in ways that homogeneity simply does not…. For this reason, diversity appears to lead to higher-quality scientific research.” - Scientific American, Volume 331, Issue 4, 2014.

“If you always do what you always did; you will always get what you always got.” - Albert Einstein

Page 5

Innovation and Failure

https://www.youtube.com/watch?v=iJAq6drKKzE

Page 6

Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future

Page 7

Scientific – Bell Labs Relay Calculator

http://www.computerhistory.org/revolution/birth-of-the-computer/4/85/342

Page 8

Scientific – Bell Labs Relay Calculator

“Starting with the Model III delivered to the Armed Forces in 1944, not one of our customers has reported their computers giving out a wrong answer as the result of a machine error.”

- Second Symposium on Large Scale Digital Calculating Machinery, 1949.

http://www.computerhistory.org/revolution/birth-of-the-computer/4/85/342

Bi-quinary notation
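
Bi-quinary notation represents each decimal digit with exactly one active element out of a group of two and one out of a group of five, so a single stuck or dropped element yields an invalid pattern that can be detected. The sketch below illustrates that self-checking property in Python; the function names are hypothetical and it is not a description of the Bell Labs relay wiring.

```python
def encode(digit):
    """Return (bi, quinary) one-hot tuples for a decimal digit 0-9."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0..9")
    bi = tuple(int(i == digit // 5) for i in range(2))
    quinary = tuple(int(i == digit % 5) for i in range(5))
    return bi, quinary


def is_valid(bi, quinary):
    """A code word is valid only if exactly one element is set in each group."""
    return sum(bi) == 1 and sum(quinary) == 1


def decode(bi, quinary):
    """Decode a code word, refusing to return an answer for an invalid pattern."""
    if not is_valid(bi, quinary):
        raise ValueError("machine error detected: invalid bi-quinary pattern")
    return 5 * bi.index(1) + quinary.index(1)
```

Any single element that flips changes the count in its group to zero or two, so `decode` stops rather than returning a wrong answer.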

Page 9

Business – UNIVAC I (1951)
• 5200 vacuum tubes
• 29,000 lbs
• 125 kW
• 2.25 MHz clock
• 66 hours mean time to system failure

470 million instructions per hard stop

http://commons.wikimedia.org/wiki/File:UNIVAC-I-PRL61-0977.jpg

Page 10

http://en.wikipedia.org/wiki/Colossus_computer

Intel – Colossus (1944)

“speed was the essence”

- Dorothy Du Boisson

“I was instructed to destroy all the records, which I did. I took all the drawings and the plans and all the information about Colossus on paper and put it in the boiler fire. And saw it burn.” - Tommy Flowers

• 2400 vacuum tubes
• ??? lbs
• ??? kWatts
• 5.8 MHz clock
• ??? MTTF

Page 11

A call to confront faults scientifically

John von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components”, from a lecture delivered at the California Institute of Technology, January 1952

Page 12

Rapid pace of innovation

385 instructions per second

http://en.wikipedia.org/wiki/ENIAC

1945 – ENIAC becomes first general purpose electronic computer
1947 – Bardeen, Brattain and Shockley develop transistor at Bell Labs
1947 – Richard Hamming develops codes for error correction and detection
1954 – IBM 608 becomes first commercial all-transistor calculator

Page 13

Fault-tolerance is about redundancy
• In space…
  – Hardware
    • Prediction and migration
    • Detection and standby sparing
    • Modular redundancy
  – Software
    • Redundant copies of code
    • Redundant versions of code
• In time…
  – Coding techniques
    • Error correcting codes (data)
    • Residue codes (arithmetic)
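
A residue code checks arithmetic by carrying a small remainder alongside each operand: the remainder of a sum must equal the sum of the remainders, taken modulo the same check modulus. The following is a minimal sketch assuming a check modulus of 3; `checked_add` and the injectable `adder` argument are hypothetical names used only to show how a faulty adder would be caught.

```python
CHECK_MODULUS = 3  # illustrative choice; errors that are multiples of 3 go undetected


def residue(x, m=CHECK_MODULUS):
    """Remainder carried alongside a value as its check symbol."""
    return x % m


def checked_add(a, b, adder=lambda x, y: x + y, m=CHECK_MODULUS):
    """Add with the (possibly faulty) adder and verify the result's residue."""
    result = adder(a, b)
    expected = (residue(a, m) + residue(b, m)) % m
    if residue(result, m) != expected:
        raise ArithmeticError("residue check failed: arithmetic error detected")
    return result
```

For example, an adder with a fault that adds 4 to every result shifts the residue by 1 modulo 3, so the check raises instead of returning the wrong sum.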

Page 14

Fault-tolerance is about redundancy
• In space…
  – Hardware
    • Prediction and migration
    • Detection and standby sparing
    • Modular redundancy
  – Software
    • Redundant copies of code
    • Redundant versions of code
• In time…
  – Coding techniques
    • Error correcting codes (data)
    • Residue codes (arithmetic)
  – Recovery blocks
    • Checkpoint and rollback
    • Transactional programming
    • Fault containment domains
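
A recovery block can be sketched as: save the state, run a primary routine, apply an acceptance test, and on failure roll back and try an alternate. The Python below is an illustration under that reading; `recovery_block` and its arguments are hypothetical names, and rollback is modeled simply by giving each attempt a fresh copy of the saved state.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a private copy of state until one passes the test."""
    for alternate in alternates:
        working_copy = copy.deepcopy(state)  # checkpoint: original state stays untouched
        try:
            result = alternate(working_copy)
        except Exception:
            continue  # a crash counts as a failed attempt; discard the copy (rollback)
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def is_sorted(values):
    return all(x <= y for x, y in zip(values, values[1:]))


def buggy_sort(values):
    values.reverse()  # deliberately wrong primary, and it mutates its input
    return values
```

Calling `recovery_block([3, 1, 2], [buggy_sort, sorted], is_sorted)` rejects the primary's output and falls through to the alternate.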

Page 15

Hardware / software reliability

Fault → (Activation) → Error → (Propagation) → Failure

Underlying system state is erroneous

Delivered service deviates from specified service

System state observed to be erroneous

Page 16

Hardware / software reliability

Fault → (Activation) → Error → (Propagation) → Failure

Fault Latency Error Latency

Error latency cannot be measured on a real system

Page 17

Tandem NonStop VLX (1986)
• 2-16 procs
• 256 MBytes
• 12 MHz
• 240,000 hours mean time to system failure

Non-Stop II System (1981)

“Unlike the situation with hardware components, it is possible to develop perfect, defect-free, failure proof software. It is only a matter of cost to the manufacturer and inconvenience to the customer who must wait much longer for some needed software to be delivered.”

- Bartlett, et al., Fault Tolerance in Tandem Computer Systems, 1990.

http://en.wikipedia.org/wiki/Tandem_Computers

Page 18

Tandem NonStop VLX (1986)
• 2-16 procs
• 256 MBytes
• 12 MHz
• 240,000 hours mean time to system failure

Non-Stop II System (1981)

Fail-Fast Operation = fault-intolerant

http://en.wikipedia.org/wiki/Tandem_Computers

2.6 peta instructions per hard stop

(>1,000,000x in 35 years)

Page 19

Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future

Page 20

ASCI Q (2002)
• ES-45 cluster
• 4096 cores
• 10 Tflops
• 6.5 hours MTBF

http://ya-ru.ru/10-samyx-bystryx-kompyuterov-v-mire

23 Pflops per hard stop

(<10x in 17 years)
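
The growth factors quoted on the slides can be checked with simple arithmetic on the "work per hard stop" figures. This sketch uses only the numbers stated in the deck (470 million for UNIVAC I, 2.6 peta for the Tandem NonStop VLX, 23 peta for ASCI Q) and, like the slides, treats instructions and floating-point operations as comparable units.

```python
# Work completed per hard stop, as quoted on the slides.
univac_1951 = 470e6    # instructions per hard stop
tandem_1986 = 2.6e15   # instructions per hard stop
asci_q_2002 = 23e15    # floating-point operations per hard stop

# Improvement from UNIVAC I to the Tandem NonStop VLX.
tandem_gain = tandem_1986 / univac_1951

# Improvement from the Tandem NonStop VLX to ASCI Q.
asci_q_gain = asci_q_2002 / tandem_1986

print(f"UNIVAC I -> Tandem VLX: {tandem_gain:,.0f}x")
print(f"Tandem VLX -> ASCI Q:   {asci_q_gain:.1f}x")
```

The first ratio comes out at roughly 5.5 million and the second at just under 9, consistent with the ">1,000,000x" and "<10x" annotations.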

Page 21

Operated by the Los Alamos National Security, LLC for the DOE/NNSA LA-UR-07-4292/5853/6490

Defining Solve Efficiency in Terms of How the System is Spending its Time*

$\text{Solve Efficiency} = \frac{t_s}{t_r} \cdot \frac{t_r}{t_{op}} = \frac{t_s}{t_{op}}$

Functional Status Key: Fully-Functional (t), Partially-Functional (t!), Non-Functional (t)

• Operation Time: t_op
• Integration Time: t_int
• Total Time: t_tot
• Defensive IO Time: t_dio / t!_dio
• Restart Time: t_rst / t!_rst
• Rework Time: t_rwk / t!_rwk
• Compute Time: t_cmp / t!_cmp
• Productive IO Time: t_pio / t!_pio
• Solve Time: t_s / t!_s
• Fault Tolerant Time: t_f / t!_f
• Archival Storage Time: t_a / t!_a
• Production Time: t_pr / t!_pr
• Unscheduled Downtime: t_usch
• Scheduled Downtime: t_sch
• External Failure Time: t_ef
• Internal Failure Time: t_if
• External Unusable Time: t_eu
• Internal Unusable Time: t_iu
• Reserved Time: t_rsv / t!_rsv
• Idle Time: t_idle / t!_idle
• Run Time: t_r / t!_r

Application States

Inspired by Jon Stearley (based on SEMI-E10)

J. Stearley. Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS). In Proceedings of the Linux Clusters Institute Conference, 2005. See http://www.cs.sandia.gov/~jrstear/ras.

* Proposed in collaboration with S. Michalak (LANL) and L. Davey (LANL)

What are we trying to measure?
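
A minimal sketch of the solve-efficiency ratio, assuming it is the product of solve time over run time and run time over operation time, which reduces to solve time over operation time. The function name and the sample hours are hypothetical.

```python
def solve_efficiency(t_solve, t_run, t_operation):
    """Fraction of operation time spent making forward progress on the solution."""
    if not (0 < t_run <= t_operation and 0 <= t_solve <= t_run):
        raise ValueError("expected 0 <= t_solve <= t_run <= t_operation, with t_run > 0")
    solve_fraction = t_solve / t_run    # share of run time that is solve time
    run_fraction = t_run / t_operation  # share of operation time that is run time
    return solve_fraction * run_fraction


# Hypothetical week: 100 h of operation, 80 h of jobs running, 60 h of solve time.
efficiency = solve_efficiency(t_solve=60.0, t_run=80.0, t_operation=100.0)
```

Splitting the ratio into two factors separates losses inside the application run (checkpointing, restart, rework) from losses outside it (downtime, idle time).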

Page 22

http://ascii.jp/elem/000/00/982/982434/index-4.html

Red Storm (2005)

Operations Rate Only Tells Part of the Story: Red Storm From The Application's Perspective

[Figure: 5000 Node Job Daily Availability, 7-Day Average MTTI, and Efficiency (Cumulative Availability = 60% and Cumulative Efficiency = 63%). Horizontal axis: dates from 01/14/06 to 02/04/06. Vertical axis: Number of Interrupts or Time in Hours (0 to 24). Series: Production Availability, System Interrupts, Application Interrupts, Runtime Efficiency, MTTI (System Only), MTTI (Sys + App).]

• 10,000 cores
• 36 Tflops
• < 10 hours MTBF

(early adopter)

Page 23

Ah, Checkpoint Restart

[Figure: Checkpointing Efficiency and the Optimum Checkpoint Interval as Functions of the Dump Time, System MTBI, and Restart Overhead. Axis and legend labels: Tsolve / Twall, tc / M, td / M, R / M.]
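
The relationship between checkpoint interval, dump time, mean time between interrupts (M), and restart overhead (R) can be sketched with the commonly cited Young first-order estimate and Daly's higher-order estimate of the optimum interval. The formulas below are taken from the published checkpoint-interval literature rather than from the slide text, and the function names and sample values are assumptions for illustration.

```python
import math


def young_interval(dump_time, mtbi):
    """First-order optimum compute interval between checkpoints."""
    return math.sqrt(2.0 * dump_time * mtbi)


def daly_interval(dump_time, mtbi):
    """Higher-order optimum interval; falls back to M when dumps are very slow."""
    if dump_time >= 2.0 * mtbi:
        return mtbi
    x = dump_time / (2.0 * mtbi)
    return math.sqrt(2.0 * dump_time * mtbi) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - dump_time


def efficiency(interval, dump_time, mtbi, restart):
    """Expected solve time / wall time, assuming exponentially distributed interrupts."""
    wall_per_solve = (
        mtbi * math.exp(restart / mtbi)
        * (math.exp((interval + dump_time) / mtbi) - 1.0) / interval
    )
    return 1.0 / wall_per_solve


# Hypothetical system: 6-minute dumps, 10-hour MTBI, 6-minute restarts (all in hours).
tau = daly_interval(0.1, 10.0)
eff = efficiency(tau, 0.1, 10.0, 0.1)
```

Checkpointing much more or much less often than `tau` lowers the efficiency, which is the trade-off the figure on this slide plots.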

Page 24

Silent Data Corruption

CC BY-NC 4.0

Page 25

Karl-Heinz Winkler (2006)

“Resilience = keep the application going at scale, despite component failures”

Page 26

Karl-Heinz Winkler (June, 2007)

Page 27

National HPC Workshop (2009)

Page 28

Some more reliability data*

*DeBardeleben, Laros, Daly, Scott, Engelmann, Harrod. High-End Computing Resilience. http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf

Page 29

Some more reliability data*

*DeBardeleben, Laros, Daly, Scott, Engelmann, Harrod. High-End Computing Resilience. http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf

Page 30

Some more reliability data*

*DeBardeleben, Laros, Daly, Scott, Engelmann, Harrod. High-End Computing Resilience. http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf

Page 31

Some more reliability data*

Page 32

Why are fault rates rising?
• Number of components is going up which will increase hard and soft faults
• Smaller circuit sizes, running at lower voltages to reduce power increase the impact of thermal noise and radiation induced faults
• Power management cycling significantly decreases component lifetimes due to thermal and mechanical stresses
• Resistance to adding additional detection and recovery logic on the chip because of additional power consumption and chip costs
• Heterogeneous systems make error detection and recovery even harder
• Increasing system and algorithm complexity makes faulty interaction of components more likely

Thanks to Al Geist (ORNL) and Sudip Dosanjh (LBNL)
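
The effect of component count can be illustrated with the standard series-system approximation: if components fail independently at a constant rate and any single failure interrupts the system, the system MTBF is the component MTBF divided by the number of components. This is a simplified model, and the node MTBF and node counts below are hypothetical.

```python
def system_mtbf(component_mtbf_hours, n_components):
    """Series-system MTBF assuming independent, exponentially distributed failures."""
    if n_components < 1:
        raise ValueError("need at least one component")
    return component_mtbf_hours / n_components


# Hypothetical node with a 5-year MTBF (43,800 hours).
node_mtbf = 43_800.0
for nodes in (100, 1_000, 10_000):
    print(f"{nodes:>6} nodes -> system MTBF {system_mtbf(node_mtbf, nodes):8.2f} hours")
```

Under these assumptions a ten-thousand-node system would be interrupted every few hours even though each node fails only once in five years.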

Page 33

Fault Classification
• Type
  – Permanent/Hard – continuous and stable events on the system
  – Intermittent/Soft – occasional events, cause intrinsic to system
  – Transient/Soft – occasional events, cause extrinsic to system
• Extent
  – Single-event – independent events that alter only a single component of system hardware or software state
  – Multi-event/common cause – correlated events that alter more than one component of system state
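
One way to carry this classification through fault logs or analysis tools is as a pair of enumerations attached to each record. The sketch below is a hypothetical data structure, not something defined in the talk; the class and field names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    PERMANENT = "permanent/hard"        # continuous and stable
    INTERMITTENT = "intermittent/soft"  # occasional, cause intrinsic to the system
    TRANSIENT = "transient/soft"        # occasional, cause extrinsic to the system


class FaultExtent(Enum):
    SINGLE_EVENT = "single-event"             # alters one component of system state
    MULTI_EVENT = "multi-event/common cause"  # correlated, alters several components


@dataclass(frozen=True)
class FaultRecord:
    component: str
    fault_type: FaultType
    extent: FaultExtent

    def is_soft(self):
        """Intermittent and transient faults are both classed as soft."""
        return self.fault_type is not FaultType.PERMANENT
```

A record such as `FaultRecord("dimm-12", FaultType.TRANSIENT, FaultExtent.SINGLE_EVENT)` can then be filtered or counted by type and extent.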

Page 34

Defining resilience
• “The persistence of service delivery that can justifiably be trusted, when facing changes.” (Laprie, 2008)

• “The persistence of performability when facing changes.” (Meyer, 2009)

• “The ability of a system to keep applications running and maintain an acceptable level of service in the face of transient, intermittent, and permanent faults.” (HEC Resilience Report, 2009)

Page 35

Nailing down resilience

Resilience is a cross-domain challenge!

• Data & Information Collection
• Anomaly detection
• Visualization
• Statistical Analysis
• Machine Learning
• Efficiency Modeling & Uncertainty Quantification
• Metrics & Measurement
• Simulation & Emulation
• Formal Methods
• Statistics & Optimal Control
• Soft Errors
• Silent Data Corruption
• Fault-tolerant Design
• Fault Injection
• Forward Migration & Verification
• Degraded Modes
• Platform & Application Monitoring
• Application & Platform Knobs
• Tunable Fidelity & Quality of Service
• RAS Theory & Performability
• Response & Recovery
• Next-generation Architectures
• Programming Models
• System Software & Middleware
• RAS Systems
• Tools
• Standards & Standard Framework

Page 36

Fault-Tolerance Workshop (2009)

Resilience Layer

The architecture of a resilience feedback-control infrastructure

User-Centric Requirements

Job Input Parameters

Job Control and Resource Allocation + Application Configuration

System State

Application and System Monitoring

Performability Model

Resilience is a cross-stack challenge!

Page 37

Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future

Page 38

When HPC gives you lemons…

Page 39

Opportunities for Innovation

Fault Characterization

Algorithm Based Fault Tolerance

Fault Analysis Tools

Fault Prediction and Detection

Fault-Tolerant System Software

Fault Aware Programming Models

RESILIENCE
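
As one concrete instance of algorithm-based fault tolerance, a checksum-protected matrix multiply appends a row of column sums to one operand and a column of row sums to the other, so the product carries checksums that can locate and repair a single corrupted element. The sketch below uses small integer matrices in pure Python; the function names are hypothetical and it is an illustration of the general checksum idea, not code from the talk.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def correct_single_error(c_full):
    """Locate one corrupted data element via checksums and repair it in place."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if not bad_rows and not bad_cols:
        return c_full  # checksums agree, nothing to fix
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        others = sum(c_full[i][:n_cols]) - c_full[i][j]
        c_full[i][j] = c_full[i][n_cols] - others  # rebuild from the row checksum
        return c_full
    raise RuntimeError("error pattern not correctable with single checksums")
```

The protected product is `matmul(with_column_checksum(a), with_row_checksum(b))`; its top-left block is the ordinary product and its last row and column are the checksums.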

Page 40

Sorting as iterative optimization

Sloan, Kesler, Kumar and Rahimi, “A Numerical Optimization-based Methodology for Application Robustification”, Dependable Systems and Networks (DSN), 2010.

Page 41

Iterative asynchronous algorithms

Charr, J. and Couturier, R. and Laiymani, D., “JACEP2P-V2: A Fully Decentralized and Fault Tolerant Environment for Executing Parallel Iterative Asynchronous Applications on Volatile Distributed Architectures,” FGCS, 2011, pp. 606-613.

Bahi, J. and Couturier, R. and Vuillermin, P., “Asynchronous iterative algorithms for computational science on the grid: three case studies,” VECPAR, 2004, pp. 302-314.
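
One reason iterative methods are attractive for resilience is that a convergent iteration can absorb a transient corruption of its iterate and still reach the same answer. The sketch below shows this with an ordinary synchronous Jacobi iteration on a small diagonally dominant system; it is not the asynchronous environment described in the cited papers, and the matrix, fault size, and iteration counts are hypothetical.

```python
def jacobi(a, b, iterations, fault_at=None):
    """Jacobi iteration from a zero start, optionally corrupting x[0] once."""
    n = len(b)
    x = [0.0] * n
    for k in range(iterations):
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if k == fault_at:
            x[0] += 1.0e6  # injected transient fault in the iterate
    return x


# Strictly diagonally dominant system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [9.0, 17.0, 23.0]

clean = jacobi(A, b, 200)
faulty = jacobi(A, b, 200, fault_at=5)
```

Both runs end at the same solution; the fault only costs extra iterations, because each sweep contracts the error regardless of where it came from.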

Page 42

Developing fault-tolerant solvers

Hoemmen, M. and Heroux M., “Fault-Tolerant Iterative Methods via Selective Reliability,” Tech. Rep. SAND2011-3915 C, Sandia National Laboratories, 2011.

Page 43

Developing fault-tolerant solvers

Hoemmen, M. and Heroux M., “Fault-Tolerant Iterative Methods via Selective Reliability,” Tech. Rep. SAND2011-3915 C, Sandia National Laboratories, 2011.

Can we use approaches like this for discrete mathematics?

Page 44

Probabilistic computing: a starting point?

George, J., “Harnessing Resilience: Biased Voltage Overscaling for Probabilistic Signal Processing,” Doctoral Dissertation, 2011.

Page 45

Probabilistic computing: a starting point?

George, J., “Harnessing Resilience: Biased Voltage Overscaling for Probabilistic Signal Processing,” Doctoral Dissertation, 2011.

Page 46

A case for recovery-driven design

- Sartori, J. and Sloan, J. and Kumar, R. “Stochastic Computing: Embracing Errors in Architecture and Design of Processors and Applications,” CASES, 2011.

Page 47

A case for recovery-driven design

- Sartori, J. and Sloan, J. and Kumar, R. “Stochastic Computing: Embracing Errors in Architecture and Design of Processors and Applications,” CASES, 2011.

Page 48

Counting the costs
• How much am I willing to pay for reliability?
• How much am I already paying?
• What am I giving up?
  – Power?
  – Performance?
  – Other?
• Can I give up reliability and get something useful back in exchange?

Page 49

Resilience Tradeoffs

Thanks to John Shalf, Lawrence Berkeley National Laboratory

• TMR
• Pairing
• Checksum Arrays
• FT-HPL
• ABFT
• Resilient Math Formulation
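
Triple modular redundancy (TMR) sits at the expensive end of this list: the same computation runs three times and a majority vote masks one faulty result. A minimal sketch, assuming the replicas are plain Python callables with hashable return values; the function name `tmr` is hypothetical.

```python
from collections import Counter


def tmr(replicas, *args):
    """Run three replicas of a computation and return the majority result."""
    if len(replicas) != 3:
        raise ValueError("TMR expects exactly three replicas")
    results = [replica(*args) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: replicas all disagree")
    return value


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1  # stands in for a replica hit by a fault
```

With `tmr([square, faulty_square, square], 7)` the voter returns 49 despite the faulty replica, at the cost of roughly three times the computation.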

Page 50

Conclusion

Resilience is a call to innovation in HPC