Reliability, Availability, and Serviceability (RAS) for High-Performance Computing

Presented by

Reliability, Availability, and Serviceability (RAS) for High-Performance Computing

Stephen L. ScottChristian Engelmann

Computer Science Research GroupComputer Science and Mathematics Division

2 Scott_RAS_SC07

Research and development goals

Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems

Eliminate many of the numerous single points of failure and control in today’s HPC systems Develop techniques to enable HPC systems to run computational

jobs 24/7 Develop proof-of-concept prototypes and production-type

RAS solutions

3 Scott_RAS_SC07

MOLAR: Adaptive runtime support for high-end computing operating and runtime systems Addresses the challenges for operating and runtime systems to

run large applications efficiently on future ultrascale high-end computers

Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)

MOLAR is a collaborative research effort (www.fastos.org/molar)

4 Scott_RAS_SC07

Many active head nodes Workload distribution Symmetric replication

between head nodes Continuous service Always up to date No fail-over necessary No restore-over necessary Virtual synchrony model Complex algorithms Prototypes for Torque and

Parallel Virtual File System metadata server

Active/active head nodes

Symmetric active/active redundancy

Compute nodes

5 Scott_RAS_SC07

Writing throughput Reading throughput

Symmetric active/active Parallel Virtual File System metadata server

Nodes Availability Est. annual downtime1 98.58% 5d, 4h, 21m

2 99.97% 1h, 45m

3 99.9997% 1m, 30s

0

20

40

60

80

100

120

1 2 4 8 16 32Number of clients

Thro

ughp

ut (R

eque

sts/

sec)

PVFSA/A 1

A/A 2A/A 4

050

100150200250300350400

1 2 4 8 16 32Number of clients

Thro

ughp

ut (R

eque

sts/

sec)

1 server

2 servers

4 servers

6 Scott_RAS_SC07

Operational nodes: Pause BLCR reuses existing

processes LAM/MPI reuses existing

connections Restore partial process state

from checkpoint Failed nodes: Migrate

Restart process on new node from checkpoint

Reconnect with paused processes

Scalable MPI membership management for low overhead

Efficient, transparent, and automatic failure recovery

Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism

Paused MPI process

Live node

Migrated MPI process

Spare node

Paused MPI process

Live node

Failed MPI process

Failed node

Shared storage

Failed

Processmigration

New connection

New connection

Faile

d

Existingconnection

7 Scott_RAS_SC07

3.4% overhead over job restart, but No LAM reboot overhead Transparent continuation of execution

0123456789

10

BT CG EP FT LU MG SP

Seco

nds

Job pause and migrate LAM reboot Job restart

No requeue penalty Less staging overhead

LAM/MPI+BLCR job pause performance

8 Scott_RAS_SC07

Proactive fault tolerance for HPC using Xen virtualization

Standby Xen host (spare node without guest VM)

Deteriorating health Migrate guest VM to

spare node New host generates

unsolicited ARP reply Indicates that guest VM

has moved ARP tells peers to resend

to new host Novel fault-tolerance

scheme that acts before a failure impacts a system

Xen VMM

Ganglia

Privileged VM

PFT daemon

H/w BMC

Xen VMM

Privileged VM Guest VM

MPItaskGanglia

PFT daemon

H/w BMC

Xen VMM


MPItaskGanglia

PFT daemon

H/w BMC

Xen VMM


MPItaskGanglia

PFT daemon

H/w BMC

Migrate

9 Scott_RAS_SC07

0

50

100

150

200

250

300

350

400

450

500

BT CG EP LU SP

Seco

nds

Single node failure

Single node failure: 0.5–5% additional cost over total wall clock time Double node failure: 2–8% additional cost over total wall clock time

VM migration performance impact

without migrationone migration

Double node failure

BT CG EP LU SP0

50

100

150

200

250

300without migrationone migrationtwo migrations

Seco

nds

10 Scott_RAS_SC07

HPC reliability analysis and modeling Programming paradigm and system scale impact reliability Reliability analysis Estimate mean time to failure (MTTF) Obtain failure distribution: exponential, Weibull, gamma, etc. Feedback into fault-tolerance schemes for adaptation

010

Tota

l sys

tem

MTT

F (h

rs)

200

400

600

800

50 100 500 1000 2000 5000Number of Participating Nodes

System reliability (MTTF) for k-of-n AND Survivability (k=n) Parallel Execution Model

Node MTTF 1000 hrsNode MTTF 3000 hrsNode MTTF 5000 hrsNode MTTF 7000 hrs

Exponencial

2653.3

Weibull 3532.8

Lognormal 2604.3

Gamma 2627.4

Negative likelihood value

Cum

ulat

ive

prob

abili

ty

1

0

0.2

0.4

0.6

0.8

0 50 100 150 200 250 300Time between failure (TBF)

11 Scott_RAS_SC0711 Scott_RAS_SC07

Contacts regarding RAS research

Stephen L. ScottComputer Science Research GroupComputer Science and Mathematics Division(865) [email protected]

Christian EngelmannComputer Science Research GroupComputer Science and Mathematics Division(865) [email protected]

Reliability, Availability, and Serviceability (RAS) for High-Performance Computing

Documents

Transcript of Reliability, Availability, and Serviceability (RAS) for High-Performance Computing