Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Internet Services
Kiran Nagaraja, Xiaoyan Li, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Dept. of Computer Science, Rutgers University
USITS’03
Thu D. Nguyen, Rutgers U., The Vivo Project
Motivation
Accumulating evidence that today’s services only achieve ~99-99.9% availability
[Gray 2001], [Patterson et al. 2002]
Compare to public telephone system: close to 99.999%
Unavailability is costly (downtime cost per hour):
Brokerage operations $6,450,000
Credit card authorization $2,600,000
Ebay (one 22-hour outage) $225,000
Amazon.com $180,000
Source: InternetWeek 4/3/2000
Motivation
Complexity of Internet services: large design space
Many software and hardware components → numerous fault points and types
Currently used ad-hoc techniques (e.g., unplugging cables) are not sufficient
Need a methodology to systematically quantify availability as well as performance; availability may conflict with performance
→ performability: a metric combining performance and availability
Contributions
Methodology for quantitative evaluation of cluster-based services
Availability and Performability
Mendosus: cluster-based fault injection and network emulation
Support injection of network faults such as switch failure
Capable of injecting multiple types of faults appropriate to cluster-based environments
Case study of a high-performance cluster-based web server: effect of faults on overall behavior
Tradeoff of performance against availability
Effects of design and environmental decisions
Methodology: Overview
Phase I – Fault injection experiments
Define set of fault types
Inject each fault (and subsequent recovery) into “live” system
Measure system behavior under each fault type
Case study: throughput under constant load
Phase II - Use analytical model to quantify overall service performability
Inputs:
Measured throughputs from phase I
MTTF and MTTR for each fault type
Environmental parameters: operator response time and server reset time
Outputs: average availability and average throughput
Assumed Platform: Clusters
Phase II: Per-Fault Seven-Stage Model
Phase II: Computing Average Throughput and Average Availability
Assume: faults arrive independently and do not overlap
Performability Metric
T – throughput under normal execution
A_I – availability of an “ideal” system (e.g., 0.99999)
A – average availability
U = 1 − A and U_I = 1 − A_I – the corresponding unavailabilities

Performability = T · log(U) / log(U_I)

T is the normal-performance component; the log ratio is the penalty component, equal to 1 when A = A_I and approaching 0 as availability drops
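One plausible reading of the slide’s formula is Performability = T · log(U)/log(U_I), with U = 1 − A and U_I = 1 − A_I; a minimal sketch under that assumption:

```python
import math

def performability(T, A, A_I=0.99999):
    """Throughput T scaled by a penalty for availability A falling
    short of the ideal availability A_I (log-unavailability ratio)."""
    U = 1.0 - A      # measured unavailability
    U_I = 1.0 - A_I  # ideal unavailability
    # Penalty factor: 1 when A == A_I, shrinking toward 0 as A drops.
    return T * math.log(U) / math.log(U_I)

# A server at 6000 req/s with 0.999 availability keeps only ~60% credit.
score = performability(6000, 0.999)
```

The log scale makes each lost “nine” of availability cost a fixed fraction of the credited throughput, rather than an imperceptible linear sliver.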
Current Limitations
Does not quantify the effect of correlated faults (insufficient data)
Sensitivity analysis in the future?
Explosion of fault-injection experiments?
Does not consider session and data-integrity faults; restricts the class of cluster-based servers
Only considers averages: does not capture the potential importance of variance in throughput
Does not capture resiliency to sudden changes in load
Case Study: PRESS Web Server
Cluster-based web server: nodes cooperate to globally manage memory and cache content; requests distributed based on locality and load balancing
Several versions developed over time for increasing performance:
VIA-PRESS: cooperative caching using VIA
VIA connection break for fault detection
Dynamic reconfiguration to tolerate node and application crashes
ReTCP-PRESS: cooperative caching using TCP; heartbeats for fault detection
Dynamic reconfiguration to tolerate node and application crashes
TCP-PRESS: TCP timeouts for fault detection; no dynamic reconfiguration
I-PRESS: independent servers
Question: did increased performance come at a cost in availability?
Phase I: Single-Fault Experiments
Setup: 4-PC cluster running at 90% utilization
800 MHz CPUs, 2 SCSI disks, 1 Gbps network
4 client nodes make HTTP requests
Discussion of scaling to larger clusters in paper
Fault Set
Link down
Switch down
SCSI timeout
Node crash
Application crash
All faults are modeled as fail-stop
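The Phase I loop can be sketched as follows; `inject`, `recover`, and `measure_throughput` are hypothetical callbacks standing in for the Mendosus controller and the client load generator, whose actual APIs the slides do not show:

```python
import time

# Fault set from the slide; all are injected as fail-stop faults.
FAULTS = ["link_down", "switch_down", "scsi_timeout", "node_crash", "app_crash"]

def run_phase1(inject, recover, measure_throughput, settle=5.0):
    """Inject each fault into the live system under constant load and
    record throughput before, during, and after recovery."""
    results = {}
    for fault in FAULTS:
        normal = measure_throughput()
        inject(fault)                 # e.g., ask the injector to fail a link
        time.sleep(settle)            # let the fault take effect
        degraded = measure_throughput()
        recover(fault)                # undo the fault / restart the component
        time.sleep(settle)
        results[fault] = (normal, degraded, measure_throughput())
    return results
```

The per-fault throughput triples are exactly the Phase I measurements that feed the Phase II model.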
Single Faults – Link Down
[Figure: throughput over time during a link-down fault, marking the operator-response and reset stages]
Phase II – Model Parameters
Average operator response time: 5 minutes
Average restart time: 5 minutes
Fault MTTF MTTR
Link down 6 months 3 minutes
Switch down 1 year 1 hour
SCSI timeout 1 year 1 hour
Node crash 2 weeks 3 minutes
Application crash 1 month 3 minutes
Sources: [Iyer99, Talagala99, Heath02]
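As a rough sanity check on these parameters (a deliberate simplification: it ignores the seven-stage model and the operator/restart stages, counting only MTTR/MTTF per fault type), the implied raw downtime fractions can be summed:

```python
# All durations in minutes; parameters taken from the table above.
MIN, HOUR, DAY = 1.0, 60.0, 24 * 60.0
MONTH, YEAR = 30 * DAY, 365 * DAY

FAULT_PARAMS = {            # fault: (MTTF, MTTR)
    "link down":    (6 * MONTH, 3 * MIN),
    "switch down":  (YEAR,      1 * HOUR),
    "scsi timeout": (YEAR,      1 * HOUR),
    "node crash":   (14 * DAY,  3 * MIN),
    "app crash":    (MONTH,     3 * MIN),
}

def downtime_fraction(mttf, mttr):
    # Steady-state fraction of time a fault of this type keeps the
    # affected component down (valid when MTTR << MTTF).
    return mttr / mttf

total = sum(downtime_fraction(*p) for p in FAULT_PARAMS.values())
# total lands at a few 1e-4, i.e., well above the 1e-5 unavailability of
# an "ideal" five-nines system even before performance loss is counted.
```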
Performance
[Figure: throughput (requests/sec, 0–7000) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS; annotation: +21%]
Unavailability
Unavailability by Component
[Figure: % unavailability (0–0.005) by component (application crash, node crash, SCSI timeout, internal switch, internal link) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS; annotation: +58%]
Performability
[Figure: performability (0–60) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS]
Performability with More Extensive Fault Model + FME
[Figure: performability (0–140) for I-PRESS and TCP-PRESS]
Design Tradeoffs
[Figure: performability (0–80) for each PRESS version under three configurations: Normal, 4-hour operator response, and RAID]
Discussion
Fault injection uncovered bugs
Modeling allowed quantification and analysis of different design decisions and parameters
Single fault can halt a cooperative service
Problem: cooperation disseminates the effect of faults
Solution: Early detection/exclusion and fault-model enforcement
TCP connection termination not good for fault detection; heartbeats not ideal either
Solution: More extensive infrastructure?
Mismatch between fault model and actual faults
Solution: Extend the PRESS fault model?
Related Work
Our work depends on studies of actual fault types and rates
Large body of work based on stochastic analysis; our model is much simpler
Easy application vs. a more limited domain?
Some similar methodologies and studies of fault-tolerant systems
Concentrated on fault-tolerance of redundant platform
Summary
Proposed a methodology for quantitative evaluation of cluster-based services
Quantify both performance and availability
Fault-injection infrastructure critical; used Mendosus
Will be available soon
Case study of PRESS: quantified performability of several versions
Studied performance vs availability tradeoff
Studied effect of operator coverage and RAID
Thank you! Questions?
http://vivo.cs.rutgers.edu
Impact of communication architecture [HPCA03]
Detailed study of TCP and VIA fault models
SW faults: application bugs, memory exhaustion
Fault Model Enforcement (FME) [EASY02]
Techniques for improving availability [Rutgers DCS-TR-517]
Extensive monitoring + FME improve availability 10x
Compiler-directed program-fault coverage [DSN03]
Support for testing of fault-detection and recovery code
Related Work
Empirical measurements of fault rates: difficult to extrapolate beyond observed behavior
Benchmarking methodologies: single-node robustness and availability
Difficult to extrapolate to overall availability and performability
Analytical modeling: stochastic models of availability and performability
Difficult to construct (and solve) such models, especially without fault injection
Do not consider a penalty for being away from the ideal
Availability and performability of cluster-based servers: no prior work; closest is availability of single-node Apache
Future Work
Study more complex systems: 3 tiers (front-end, application, and back-end servers)
General class of servers (e.g., Web store)
Possibly more complex dependencies
Validate methodology
Extend our infrastructure/methodology: data-integrity faults, session-loss faults, etc.
Metrics to capture user satisfaction (e.g., response time)
Future Work
Eliminate limitations of our modeling: account for concurrent and correlated failures
Improving availability and manageability: minimize “on-line” operator intervention
Design services for automatic recovery
Validate operator actions when they are necessary
Explore the full benefits of FME
Arbitrary software failures → fail-stop
Recovery procedures are complex, untested (buggy)
Average Availability: Details
AT = (1 − Σ_f Σ_s (D_fs / MTTF_f)) · Tn + Σ_f Σ_s ((D_fs / MTTF_f) · T_fs)
AA = AT / Tn
f ranges over faults, s over stages
D_fs – duration of stage s of fault f
T_fs – throughput during stage s of fault f
Tn – throughput under normal execution
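The formula can be evaluated directly; a minimal Python sketch, with fault data supplied as (MTTF_f, [(D_fs, T_fs), …]) tuples in consistent time units:

```python
def average_throughput(Tn, faults):
    """AT = (1 - sum_f sum_s D_fs/MTTF_f) * Tn
            + sum_f sum_s (D_fs/MTTF_f) * T_fs"""
    frac = sum(d / mttf for mttf, stages in faults for d, _ in stages)
    degraded = sum((d / mttf) * t for mttf, stages in faults for d, t in stages)
    return (1.0 - frac) * Tn + degraded

def average_availability(Tn, faults):
    """AA = AT / Tn"""
    return average_throughput(Tn, faults) / Tn

# One fault type (MTTF 1000 min) with a single 10-minute stage at half
# throughput: AT = 0.99*100 + 0.01*50 = 99.5, so AA = 0.995.
example = average_availability(100.0, [(1000.0, [(10.0, 50.0)])])
```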
Our Study
Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults
TCP vs. VIA: kernel-level vs. user-level communication
Mature vs. new technology
Differ in fault-model
Quantify performability (performance + availability)
Study systems under various fault scenarios
Sensitivity to fault rates and fault classes
Case study: High performance cluster-based Web server
Understand relation between high performance and high availability design choices
PRESS Versions Comparison
PRESS version / description / fault detection / general protocol characteristics:
TCP-PRESS: base version; connection-based fault detection over TCP
TCP-PRESS-HB: adds periodic heartbeats
TCP assumptions: very few permanent h/w faults, transient faults are common; robust to transient faults; OK to lose packets
VIA-PRESS-0: base version; connection-based fault detection over VIA
VIA-PRESS-3: RDMA for communication; same fault detection
VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); same fault detection
VIA assumptions: faults indicate serious problems; fail-stop model; lost packets are bad
Performance Comparison
VIA-based communication enables higher performance
Low latency, less software overhead
[Figure: throughput (requests/sec, 0–8000) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Performability Results
Identical fault load for all versions; application fault rate 1/month
All versions of VIA do better than TCP
[Figure: unavailability (left axis, 0–0.004) by fault category (internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize) and performability (right axis, 0–30) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
TCP Vs VIA: Program Robustness
VIA application fault rates: 1/day, 1/week, 1/month (programming complexity)
TCP application fault rate: 1/month
[Figure: performability (0–30) for TCP, TCP-HB, and the VIA versions at each application fault rate; a cross-over point is marked]
VIA under Stressful Fault Load
Additional fault load: transient packet drops 1/month, system failure 1/month
Application faults → 2/month
TCP-HB performs slightly better than 2 VIA versions
[Figure: performability (0–20) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Observations – Cluster Communication
Match the fault model of the network stack to the fabric: non-fatal behavior on transient faults
TCP is robust to packet drops
Fail-stop behavior on permanent faults
Protocol-level fault avoidance: preserve message boundaries
Reduce number of copies
Pre-allocate communication resources
Explicit fault reporting by all components in the “path”: end-to-end is necessary, but may not be sufficient
Reduces detection latency
Allows more accurate recovery actions
Related Work
Impact of faults on systems: robustness and availability studies
[Lee93, Liu99, Murphy95, Brown00, Asami00]
Protocol performance studies: congestion avoidance and control
[Jacobson88, Brakmo94, Hoe96]; back-off based algorithms
Interconnects in the cluster environment; SAN context: packet drops → serious failures
Evidence of faults [Wilkes92, Seitz94, Boden95]
Fault tolerant interconnects: Myrinet
Summary & Conclusion
Studied impact of communication architecture on service performability; surprisingly, the VIA versions delivered better availability
Comparison under varying fault loads; evaluated architecture maturity and complexity
Desirable cluster-based protocol characteristics: messaging, single-copy transfers, pre-allocated resources
Mendosus – Fault Injection
[Diagram: Mendosus architecture. A central controller sends fault events over a fast, reliable SAN to a per-node daemon. Kernel-level modules emulate SCSI and network faults in the network stack; user-level components (Mlib, comLib, glibc syscall wrappers) instrument applications such as PRESS.]
Communication Architecture
All operations by main thread are non-blocking
Separate send, receive and multiple disk helper threads
Filling up of queues could stall the entire node
Modeling Parameters
Five-minute duration for the operator-intervention (E) and restart (F) stages
Fault MTTF MTTR
Link down 6 months 3 minutes
Switch down 1 year 1 hour
Node crash 2 weeks 3 minutes
Node freeze 2 weeks 3 minutes
Application faults:
Process crash variable 3 minutes
Process hang variable 3 minutes
Bad parameters (off-by-N data pointer) variable 3 minutes
Bad parameters (off-by-N size) variable 3 minutes
Bad parameters (null pointer) variable 3 minutes
Sources: [Chillarege95, Sullivan91, Iyer99, Talagala99, Heath02, Trivedi00]
Pessimistic Fault Load for VIA
Faults due to immature technology: transient packet drops 1/month, system failure 1/month
Program robustness: application faults → 2/month
[Figure: unavailability (0–0.006) by component (internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize, transient n/w errors, bleeding-edge complexity) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Results - Performability
Varying application fault rates: 1/day, 1/month
VIA versions do better due to higher performance
[Figure: performability (0–30) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
TCP Vs VIA: Transient Packet Drops
VIA packet drop rates 1/day, 1/week, 1/month
TCP is modeled as no additional losses
[Figure: performability (0–25) for TCP, TCP-HB, and the VIA versions at each packet-drop rate]
TCP Vs VIA: Immature Technology
VIA complexity failures 1/day, 1/week, 1/month; modeled as total interconnect failure
TCP is modeled as no additional losses
[Figure: performability (0–25) for TCP, TCP-HB, and the VIA versions at each complexity-failure rate]
Scaling Results
Model can be used to scale results
Extrapolation to 8 nodes leads to same results as measurements, for constant memory
Related Work *
Approaches to fault tolerance & HA: improving component robustness
E.g., ECC in disks & memory, RAID
End-to-End Approaches
CRC, TCP- and RPC-like protocols: the RETRY approach
Replication and Failover
Redundancy: TMR, Tandem, Stratus, primary-secondary
N-version programming
Reactive and Proactive Techniques
Recovery-oriented computing [ROC01]
Software rejuvenation, Recursive restartability
Improving Availability
High Availability Techniques
Quantifying The Improvement
High Availability Techniques
Front-end and extra capacity (FE-X)
Widely used
Implementation: Linux Virtual Server + extra nodes
Robust group membership (MEM)
ReCOOP could not handle switch and link faults
Implementation: based on TRM [Cristian95]
Service monitoring (QMON)
Handle “lack of progress” scenarios
Implementation: Queue monitoring
Fault Model Enforcement(FME)
Fault Model Enforcement (FME)
Main idea: map unexpected run-time faults to expected ones in the fault model
E.g., if only node crashes are handled: on a fatal disk fault → crash the node
Designer can focus on smaller fault model
Enforces uniform view of system
PRESS: application hang → process crash (then restart)
Disk failures → node crash (or take the node off-line)
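The mappings above amount to a table from observed faults to actions in the reduced model; a sketch with illustrative names (these are not the actual PRESS/FME interfaces):

```python
# Unexpected run-time fault -> action the reduced fault model handles.
FME_MAP = {
    "application_hang": "crash_process",  # then restart the process
    "disk_failure":     "crash_node",     # or take the node off-line
}

def enforce(fault, actions):
    """Apply the enforcement action for an unexpected fault.
    `actions` maps action names to callbacks; returns the action taken,
    or None if the fault is already part of the reduced model."""
    action = FME_MAP.get(fault)
    if action is not None:
        actions[action](fault)
    return action
```

Faults already in the reduced model (e.g., a plain node crash) pass through untouched; only the unexpected ones get coerced.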
Quantifying Availability
Apply techniques to COOP (the “fault-tolerance-free” version)
Quantitative analysis after each enhancement; same fault load and environment parameters
Model extended for front-end failure
Quantitative Results
88% improvement in availability of FME over COOP
[Figure: unavailability (0–0.0064) by component (frontend failure, application hang, application crash, node freeze, node crash, SCSI timeout, internal switch, internal link) for COOP, FE-X, MEM, Q-MON, MQ, and FME]
Parameterizable Modeling
Flexible model allows “What if…?” scenarios: variable fault rates, additional components, operator times
With extended analysis, COOP PRESS achieves 0.9997 availability
[Figure: unavailability (0–0.0007) for modeled alternatives FME, C-MON, X-SW, and RAID]
Background – Internet Services
Internet growth has been explosive since the 90s: onset of the WWW, browsers, search engines
Commercialization of the Internet: on-line services
Internet Services - Popularity
Servers offer a variety of services: email, news, search, shopping
Popular ones service large volumes: Google handles ~30 million requests/day (Computeruser.com article, Jun 2000)
Internet Services - Infrastructure
Cluster-based solutions are popular [Brewer01]: incrementally scalable, cost-effective
Scalability and performance have been addressed; availability evaluation has received less attention
Approach
Guide designer through evaluation and improvement of availability
Observe the system under failures; measure service-level availability
Quantify “expected” system behavior: analytical modeling to predict behavior under various fault loads and “what if” scenarios
Improve upon problem areas: apply well-defined techniques
Front-end and Extra Capacity
Widely used; masks service failures from clients
Fail-over using Linux Virtual Server (LVS) as the front-end distributor
Round robin, IP tunneling
Monitor back-end nodes using “MON”
Requests forwarded to the “live” node set
Extra capacity: additional “live” nodes to soak up load
Increases the number of prospective fault sites!
Robust Group Membership(MEM)
Should handle realistic fault loads: node, link, switch faults and network partitions
Heartbeats in ReCOOP were insufficient
Up-to-date list of active nodes; allows dynamic reconfiguration of the list
Enables effective resource sharing
Detection of failures; fault model: reachability
UNREACHABLE → DOWN
Membership Implementation
Independent service; reports membership at an advertised memory segment
Based on Three Round Membership [Cristian95]; additions and removals follow 2-phase commit
No single point of failure
Coordinator for adds and leaves chosen dynamically
Heartbeat protocol for failure detection: nodes arranged in a logical ring, monitoring their neighbors
Loss of 3 consecutive heartbeats initiates removal
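The removal rule above (three consecutive missed heartbeats) can be sketched as a small per-neighbor state machine; this is an illustration, not the actual implementation:

```python
MISS_LIMIT = 3  # consecutive missed heartbeats before initiating removal

class NeighborMonitor:
    """Each node runs one of these for the neighbor it watches in the
    logical ring; actual removal then goes through the 2-phase commit."""
    def __init__(self):
        self.misses = 0

    def on_heartbeat(self):
        self.misses = 0          # any heartbeat resets the counter

    def on_interval_elapsed(self):
        """Called once per heartbeat period in which nothing arrived.
        Returns True when removal of the neighbor should be initiated."""
        self.misses += 1
        return self.misses >= MISS_LIMIT
```

Resetting on every heartbeat means only *consecutive* losses count, which keeps isolated drops from triggering spurious removals.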
Queue Monitoring (QMON)
Service-level monitoring: application hangs, problems with the disk
Queues as basic building blocks [Ninja, SEDA]
Run-time queue properties indicate progress of the associated components
E.g., buildup at a communication send queue → transient/permanent failures at the other end
Need to avoid false alarms!
Self-Monitoring Queue
In PRESS: N per-node communication queues; monitors progress at cooperating nodes
Failure triggers: queue length or a threshold of unanswered requests
Node removed from the “cooperation set”
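A minimal sketch of such a self-monitoring queue, with illustrative names (the real PRESS queues are not shown in the slides):

```python
from collections import deque

class MonitoredQueue:
    """Per-peer send queue whose buildup of unanswered requests is
    itself the failure signal."""
    def __init__(self, threshold=64):
        self.pending = deque()
        self.threshold = threshold

    def send(self, request):
        self.pending.append(request)

    def ack(self):
        self.pending.popleft()      # peer answered the oldest request

    def peer_suspect(self):
        # Trigger: more unanswered requests than the threshold; the
        # service would then drop the peer from its cooperation set.
        return len(self.pending) > self.threshold
```

The threshold is what guards against false alarms: short transient stalls drain before the queue ever crosses it.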
Interaction – MEM and QMON
Both membership and QMON monitor node-level activity
This can result in inconsistencies, e.g., on an application hang
Fault Model Enforcement (FME)
Enforce a reduced fault model at runtime; allow the service to perform the correct recovery action to regain full functionality
How to enforce a reduced fault model? Two ideas so far:
Map an unexpected fault to an expected fault, e.g., crash a node if the network link connecting it to the switch fails
Fail the outer component if a sub-component fails, e.g., crash a node if the disk fails
How is it different from fail-stop? It allows reasoning about failures at a desired abstraction
FME - Future Directions
How extensive should the fault model be? Determines programming complexity/effort
How to prevent FME from reducing availability? Bugs within enforcement?
When to declare a symptom a fault?
FME reduces human intervention; are humans better at deciding?
8–23% of recovery procedures are botched [Brown 2001]
FME Approach
Define a reduced abstract fault model
Components, faults, symptoms, component behavior during faults
Enforce this fault model at run-time
If an “unexpected” fault occurs, map to one that was planned for in the abstract model
“If the facts don’t fit the theory, change the facts.” - Albert Einstein
Allows the designer to concentrate on tolerating a well-defined, yet limited-in-complexity, set of faults
PRESS with FME
Recovery upon fault-model mismatch: restart 0, 1, or all nodes?
FME approach: reboot the appropriate node after a fault and its recovery have occurred
Link down: reboot the unreachable node
Switch down: reboot all nodes
Disk failure: reboot the node with the faulty disk
Node or application crash: do nothing
FME Implementation
FME daemon on each node; monitors service progress using an exported interface
HTTP requests for PRESS
Monitors the disk using the SCSI Generic Interface: low-level operations and error probing
Application hang → process crash, upon consecutive failures to service requests
Process restarted
Disk failures → node failures; detection: service no-progress + disk error
Node taken offline for maintenance
Modeling Results - Unavailability
Unavailability of INDEP is ~1/10 that of COOP
The heartbeat helps, but availability is still lower than INDEP
[Figure: unavailability (0–0.006) by component (application hang, application crash, node freeze, node crash, SCSI timeout, internal switch, internal link) for INDEP, COOP, and ReCOOP]
Results - Performability
COOP has lower performability than INDEP
ReCOOP gives a glimpse of the possibilities
[Figure: performability (0–45) for INDEP, COOP, and ReCOOP]
Future Work
Explore applicability to more structured systems; multi-tiered: front-end, application & back-end
E.g., Web store
Extend our infrastructure/methodology: metrics to capture users’ satisfaction
Fault model for data-integrity faults
Look at more high-availability components: a collection for a designer to pick from
Extend and validate FME for complex servers
Future Work
Considering the operator in the loop: gather information about the extent of participation (operations, durations, expertise)
Extend the fault model for human-induced faults