Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Internet Services
Kiran Nagaraja, Xiaoyan Li, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Dept. of Computer Science, Rutgers University
USITS’03
Thu D. Nguyen, Rutgers U., The Vivo Project
Motivation
Accumulating evidence that today’s services only achieve ~99-99.9% availability
[Gray 2001], [Patterson et al. 2002]
Compare to public telephone system: close to 99.999%
Unavailability is costly (downtime cost per hour):
Brokerage operations $6,450,000
Credit card authorization $2,600,000
Ebay (one 22-hour outage) $225,000
Amazon.com $180,000
Source: InternetWeek 4/3/2000
Motivation
Complexity of Internet services: large design space
Many software and hardware components → numerous fault points and types
Currently used ad-hoc techniques (e.g., unplugging cables) are not sufficient
Need a methodology to systematically quantify availability as well as performance; availability may conflict with performance
→ performability: a metric combining performance and availability
Contributions
Methodology for quantitative evaluation of cluster-based services
Availability and Performability
Mendosus: cluster-based fault injection and network emulation
Support injection of network faults such as switch failure
Capable of injecting multiple types of faults appropriate to cluster-based environments
Case study of a high-performance cluster-based web server: effect of faults on overall behavior
Tradeoff of performance against availability
Effects of design and environmental decisions
Methodology: Overview
Phase I – Fault injection experiments
Define set of fault types
Inject each fault (and subsequent recovery) into “live” system
Measure system behavior under each fault type
Case study: throughput under constant load
Phase II - Use analytical model to quantify overall service performability
Inputs:
Measured throughputs from phase I
MTTF and MTTR for each fault type
Environmental parameters: operator response time and server reset time
Outputs: average availability and average throughput
Assumed Platform: Clusters
Phase II: Per-Fault Seven-Stage Model
Phase II: Computing Average Throughput and Average Availability
Assume: faults arrive independently and do not overlap
Performability Metric
T – throughput under normal execution
A_I – availability of an “ideal” system (e.g., 0.99999)
A – average availability
U = 1 − A and U_I = 1 − A_I – the corresponding unavailabilities

Performability = T · log(U) / log(U_I)

T is the normal-performance component; the log ratio is the penalty component, equal to 1 when A = A_I and approaching 0 as availability drops
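One plausible reading of the slide’s formula is Performability = T · log(U)/log(U_I), with U = 1 − A and U_I = 1 − A_I; a minimal sketch under that assumption:

```python
import math

def performability(T, A, A_I=0.99999):
    """Throughput T scaled by a penalty for availability A falling
    short of the ideal availability A_I (log-unavailability ratio)."""
    U = 1.0 - A      # measured unavailability
    U_I = 1.0 - A_I  # ideal unavailability
    # Penalty factor: 1 when A == A_I, shrinking toward 0 as A drops.
    return T * math.log(U) / math.log(U_I)

# A server at 6000 req/s with 0.999 availability keeps only ~60% credit.
score = performability(6000, 0.999)
```

The log scale makes each lost “nine” of availability cost a fixed fraction of the credited throughput, rather than an imperceptible linear sliver.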
Current Limitations
Does not quantify the effect of correlated faults (insufficient data)
Sensitivity analysis in the future?
Explosion of fault-injection experiments?
Does not consider session and data-integrity faults; restricts the class of cluster-based servers
Only considers averages: does not capture the potential importance of variance in throughput
Does not capture resiliency to sudden changes in load
Case Study: PRESS Web Server
Cluster-based web server: nodes cooperate to globally manage memory and cache content; requests distributed based on locality and load balancing
Several versions developed over time for increasing performance:
VIA-PRESS: cooperative caching using VIA
VIA connection break for fault detection
Dynamic reconfiguration to tolerate node and application crashes
ReTCP-PRESS: cooperative caching using TCP; heartbeats for fault detection
Dynamic reconfiguration to tolerate node and application crashes
TCP-PRESS: TCP timeouts for fault detection; no dynamic reconfiguration
I-PRESS: independent servers
Question: did increased performance come at a cost in availability?
Phase I: Single-Fault Experiments
Setup: 4-PC cluster running at 90% utilization
800 MHz CPUs, 2 SCSI disks, 1 Gbps network
4 client nodes make HTTP requests
Discussion of scaling to larger clusters in paper
Fault Set
Link down
Switch down
SCSI timeout
Node crash
Application crash
All faults are modeled as fail-stop
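The Phase I loop can be sketched as follows; `inject`, `recover`, and `measure_throughput` are hypothetical callbacks standing in for the Mendosus controller and the client load generator, whose actual APIs the slides do not show:

```python
import time

# Fault set from the slide; all are injected as fail-stop faults.
FAULTS = ["link_down", "switch_down", "scsi_timeout", "node_crash", "app_crash"]

def run_phase1(inject, recover, measure_throughput, settle=5.0):
    """Inject each fault into the live system under constant load and
    record throughput before, during, and after recovery."""
    results = {}
    for fault in FAULTS:
        normal = measure_throughput()
        inject(fault)                 # e.g., ask the injector to fail a link
        time.sleep(settle)            # let the fault take effect
        degraded = measure_throughput()
        recover(fault)                # undo the fault / restart the component
        time.sleep(settle)
        results[fault] = (normal, degraded, measure_throughput())
    return results
```

The per-fault throughput triples are exactly the Phase I measurements that feed the Phase II model.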
Single Faults – Link Down
[Figure: throughput over time during a link-down fault, marking the operator-response and reset stages]
Phase II – Model Parameters
Average operator response time: 5 minutes
Average restart time: 5 minutes
Fault MTTF MTTR
Link down 6 months 3 minutes
Switch down 1 year 1 hour
SCSI timeout 1 year 1 hour
Node crash 2 weeks 3 minutes
Application crash 1 month 3 minutes
Sources: [Iyer99, Talagala99, Heath02]
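As a rough sanity check on these parameters (a deliberate simplification: it ignores the seven-stage model and the operator/restart stages, counting only MTTR/MTTF per fault type), the implied raw downtime fractions can be summed:

```python
# All durations in minutes; parameters taken from the table above.
MIN, HOUR, DAY = 1.0, 60.0, 24 * 60.0
MONTH, YEAR = 30 * DAY, 365 * DAY

FAULT_PARAMS = {            # fault: (MTTF, MTTR)
    "link down":    (6 * MONTH, 3 * MIN),
    "switch down":  (YEAR,      1 * HOUR),
    "scsi timeout": (YEAR,      1 * HOUR),
    "node crash":   (14 * DAY,  3 * MIN),
    "app crash":    (MONTH,     3 * MIN),
}

def downtime_fraction(mttf, mttr):
    # Steady-state fraction of time a fault of this type keeps the
    # affected component down (valid when MTTR << MTTF).
    return mttr / mttf

total = sum(downtime_fraction(*p) for p in FAULT_PARAMS.values())
# total lands at a few 1e-4, i.e., well above the 1e-5 unavailability of
# an "ideal" five-nines system even before performance loss is counted.
```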
Performance
[Figure: throughput (requests/sec, 0–7000) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS; annotation: +21%]
Unavailability
Unavailability by Component
[Figure: % unavailability (0–0.005) by component (application crash, node crash, SCSI timeout, internal switch, internal link) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS; annotation: +58%]
Performability
[Figure: performability (0–60) for I-PRESS, TCP-PRESS, ReTCP-PRESS, and VIA-PRESS]
Performability with More Extensive Fault Model + FME
[Figure: performability (0–140) for I-PRESS and TCP-PRESS]
Design Tradeoffs
[Figure: performability (0–80) for each PRESS version under three configurations: Normal, 4-hour operator response, and RAID]
Discussion
Fault injection uncovered bugs
Modeling allowed quantification and analysis of different design decisions and parameters
Single fault can halt a cooperative service
Problem: cooperation disseminates the effect of faults
Solution: Early detection/exclusion and fault-model enforcement
TCP connection termination not good for fault detection; heartbeats not ideal either
Solution: More extensive infrastructure?
Mismatch between fault model and actual faults
Solution: Extend the PRESS fault model?
Related Work
Our work depends on studies of actual fault types and rates
Large body of work based on stochastic analysis; our model is much simpler
Easy application vs. a more limited domain?
Some similar methodologies and studies of fault-tolerant systems
Concentrated on fault-tolerance of redundant platform
Summary
Proposed a methodology for quantitative evaluation of cluster-based services
Quantify both performance and availability
Fault-injection infrastructure critical; used Mendosus
Will be available soon
Case study of PRESS: quantified performability of several versions
Studied performance vs availability tradeoff
Studied effect of operator coverage and RAID
Thank you! Questions?
http://vivo.cs.rutgers.edu
Impact of communication architecture [HPCA03]
Detailed study of TCP and VIA fault models
SW faults: application bugs, memory exhaustion
Fault Model Enforcement (FME) [EASY02]
Techniques for improving availability [Rutgers DCS-TR-517]
Extensive monitoring + FME improve availability 10x
Compiler-directed program-fault coverage [DSN03]
Support for testing of fault-detection and recovery code
Related Work
Empirical measurements of fault rates: difficult to extrapolate beyond observed behavior
Benchmarking methodologies: single-node robustness and availability
Difficult to extrapolate to overall availability and performability
Analytical modeling: stochastic models of availability and performability
Difficult to construct (and solve) such models, especially without fault injection
Do not consider a penalty for being away from the ideal
Availability and performability of cluster-based servers: no prior work; closest is availability of single-node Apache
Future Work
Study more complex systems: 3 tiers (front-end, application, and back-end servers)
General class of servers (e.g., Web store)
Possibly more complex dependencies
Validate methodology
Extend our infrastructure/methodology: data-integrity faults, session-loss faults, etc.
Metrics to capture user satisfaction (e.g., response time)
Future Work
Eliminate limitations of our modeling: account for concurrent and correlated failures
Improving availability and manageability: minimize “on-line” operator intervention
Design services for automatic recovery
Validate operator actions when they are necessary
Explore the full benefits of FME
Arbitrary software failures → fail-stop
Recovery procedures are complex, untested (buggy)
Average Availability: Details
AT = (1 − Σ_f Σ_s (D_fs / MTTF_f)) · Tn + Σ_f Σ_s ((D_fs / MTTF_f) · T_fs)
AA = AT / Tn
f ranges over faults, s over stages
D_fs – duration of stage s of fault f
T_fs – throughput during stage s of fault f
Tn – throughput under normal execution
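The formula can be evaluated directly; a minimal Python sketch, with fault data supplied as (MTTF_f, [(D_fs, T_fs), …]) tuples in consistent time units:

```python
def average_throughput(Tn, faults):
    """AT = (1 - sum_f sum_s D_fs/MTTF_f) * Tn
            + sum_f sum_s (D_fs/MTTF_f) * T_fs"""
    frac = sum(d / mttf for mttf, stages in faults for d, _ in stages)
    degraded = sum((d / mttf) * t for mttf, stages in faults for d, t in stages)
    return (1.0 - frac) * Tn + degraded

def average_availability(Tn, faults):
    """AA = AT / Tn"""
    return average_throughput(Tn, faults) / Tn

# One fault type (MTTF 1000 min) with a single 10-minute stage at half
# throughput: AT = 0.99*100 + 0.01*50 = 99.5, so AA = 0.995.
example = average_availability(100.0, [(1000.0, [(10.0, 50.0)])])
```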
Our Study
Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults
TCP vs. VIA: kernel-level vs. user-level communication
Mature vs. new technology
Differ in fault-model
Quantify performability (performance + availability)
Study systems under various fault scenarios
Sensitivity to fault rates and fault classes
Case study: High performance cluster-based Web server
Understand relation between high performance and high availability design choices
PRESS Versions Comparison
PRESS version / description / fault detection / general protocol characteristics:
TCP-PRESS: base version; connection-based fault detection over TCP
TCP-PRESS-HB: adds periodic heartbeats
TCP assumptions: very few permanent h/w faults, transient faults are common; robust to transient faults; OK to lose packets
VIA-PRESS-0: base version; connection-based fault detection over VIA
VIA-PRESS-3: RDMA for communication; same fault detection
VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); same fault detection
VIA assumptions: faults indicate serious problems; fail-stop model; lost packets are bad
Performance Comparison
VIA-based communication enables higher performance
Low latency, less software overhead
[Figure: throughput (requests/sec, 0–8000) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Performability Results
Identical fault load for all versions; application fault rate 1/month
All versions of VIA do better than TCP
[Figure: unavailability (left axis, 0–0.004) by fault category (internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize) and performability (right axis, 0–30) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
TCP Vs VIA: Program Robustness
VIA application fault rates: 1/day, 1/week, 1/month (programming complexity)
TCP application fault rate: 1/month
[Figure: performability (0–30) for TCP, TCP-HB, and the VIA versions at each application fault rate; a cross-over point is marked]
VIA under Stressful Fault Load
Additional fault load: transient packet drops 1/month, system failure 1/month
Application faults → 2/month
TCP-HB performs slightly better than 2 VIA versions
[Figure: performability (0–20) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Observations – Cluster Communication
Match the fault model of the network stack to the fabric: non-fatal behavior on transient faults
TCP is robust to packet drops
Fail-stop behavior on permanent faults
Protocol-level fault avoidance: preserve message boundaries
Reduce number of copies
Pre-allocate communication resources
Explicit fault reporting by all components in the “path”: end-to-end is necessary, but may not be sufficient
Reduces detection latency
Allows more accurate recovery actions
Related Work
Impact of faults on systems: robustness and availability studies
[Lee93, Liu99, Murphy95, Brown00, Asami00]
Protocol performance studies: congestion avoidance and control
[Jacobson88, Brakmo94, Hoe96]; back-off based algorithms
Interconnects in the cluster environment; SAN context: packet drops → serious failures
Evidence of faults [Wilkes92, Seitz94, Boden95]
Fault tolerant interconnects: Myrinet
Summary & Conclusion
Studied impact of communication architecture on service performability; surprisingly, the VIA versions delivered better availability
Comparison under varying fault loads; evaluated architecture maturity and complexity
Desirable cluster-based protocol characteristics: messaging, single-copy transfers, pre-allocated resources
Mendosus – Fault Injection
[Diagram: Mendosus architecture. A central controller sends fault events over a fast, reliable SAN to a per-node daemon. Kernel-level modules emulate SCSI and network faults in the network stack; user-level components (Mlib, comLib, glibc syscall wrappers) instrument applications such as PRESS.]
Communication Architecture
All operations by main thread are non-blocking
Separate send, receive and multiple disk helper threads
Filling up of queues could stall the entire node
Modeling Parameters
Five-minute duration for the operator-intervention (E) and restart (F) stages
Fault MTTF MTTR
Link down 6 months 3 minutes
Switch down 1 year 1 hour
Node crash 2 weeks 3 minutes
Node freeze 2 weeks 3 minutes
Application faults:
Process crash variable 3 minutes
Process hang variable 3 minutes
Bad parameters (off-by-N data pointer) variable 3 minutes
Bad parameters (off-by-N size) variable 3 minutes
Bad parameters (null pointer) variable 3 minutes
Sources: [Chillarege95, Sullivan91, Iyer99, Talagala99, Heath02, Trivedi00]
Pessimistic Fault Load for VIA
Faults due to immature technology: transient packet drops 1/month, system failure 1/month
Program robustness: application faults → 2/month
[Figure: unavailability (0–0.006) by component (internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize, transient n/w errors, bleeding-edge complexity) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
Results - Performability
Varying application fault rates: 1/day, 1/month
VIA versions do better due to higher performance
[Figure: performability (0–30) for TCP, TCP-HB, VIA-0, VIA-3, and VIA-5]
TCP Vs VIA: Transient Packet Drops
VIA packet drop rates 1/day, 1/week, 1/month
TCP is modeled as no additional losses
[Figure: performability (0–25) for TCP, TCP-HB, and the VIA versions at each packet-drop rate]
TCP Vs VIA: Immature Technology
VIA complexity failures 1/day, 1/week, 1/month; modeled as total interconnect failure
TCP is modeled as no additional losses
[Figure: performability (0–25) for TCP, TCP-HB, and the VIA versions at each complexity-failure rate]
Scaling Results
Model can be used to scale results
Extrapolation to 8 nodes leads to same results as measurements, for constant memory
Related Work *
Approaches to fault tolerance & HA: improving component robustness
E.g., ECC in disks & memory, RAID
End-to-End Approaches
CRC, TCP- and RPC-like protocols: the RETRY approach
Replication and Failover
Redundancy: TMR, Tandem, Stratus, primary-secondary
N-version programming
Reactive and Proactive Techniques
Recovery-oriented computing [ROC01]
Software rejuvenation, Recursive restartability
Improving Availability
High Availability Techniques
Quantifying The Improvement
High Availability Techniques
Front-end and extra capacity (FE-X)
Widely used
Implementation: Linux Virtual Server + extra nodes
Robust group membership (MEM)
ReCOOP could not handle switch and link faults
Implementation: based on TRM [Cristian95]
Service monitoring (QMON)
Handle “lack of progress” scenarios
Implementation: Queue monitoring
Fault Model Enforcement(FME)
Fault Model Enforcement (FME)
Main idea: map unexpected run-time faults to expected ones in the fault model
E.g., if only node crashes are handled: on a fatal disk fault → crash the node
Designer can focus on smaller fault model
Enforces uniform view of system
PRESS: application hang → process crash (then restart)
Disk failures → node crash (or take the node off-line)
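The mappings above amount to a table from observed faults to actions in the reduced model; a sketch with illustrative names (these are not the actual PRESS/FME interfaces):

```python
# Unexpected run-time fault -> action the reduced fault model handles.
FME_MAP = {
    "application_hang": "crash_process",  # then restart the process
    "disk_failure":     "crash_node",     # or take the node off-line
}

def enforce(fault, actions):
    """Apply the enforcement action for an unexpected fault.
    `actions` maps action names to callbacks; returns the action taken,
    or None if the fault is already part of the reduced model."""
    action = FME_MAP.get(fault)
    if action is not None:
        actions[action](fault)
    return action
```

Faults already in the reduced model (e.g., a plain node crash) pass through untouched; only the unexpected ones get coerced.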
Quantifying Availability
Apply techniques to COOP (the “fault-tolerance-free” version)
Quantitative analysis after each enhancement; same fault load and environment parameters
Model extended for front-end failure
Quantitative Results
88% improvement in availability of FME over COOP
[Figure: unavailability (0–0.0064) by component (frontend failure, application hang, application crash, node freeze, node crash, SCSI timeout, internal switch, internal link) for COOP, FE-X, MEM, Q-MON, MQ, and FME]
Parameterizable Modeling
Flexible model allows “What if…?” scenarios: variable fault rates, additional components, operator times
With extended analysis, COOP PRESS achieves 0.9997 availability
[Figure: unavailability (0–0.0007) for modeled alternatives FME, C-MON, X-SW, and RAID]
Background – Internet Services
Internet growth has been explosive since the 90s: onset of the WWW, browsers, search engines
Commercialization of the Internet: on-line services
Internet Services - Popularity
Servers offer a variety of services: email, news, search, shopping
Popular ones service large volumes: Google handles ~30 million requests/day (Computeruser.com article, Jun 2000)
Internet Services - Infrastructure
Cluster-based solutions are popular [Brewer01]: incrementally scalable, cost-effective
Scalability and performance have been addressed; availability evaluation has received less attention
Approach
Guide designer through evaluation and improvement of availability
Observe the system under failures; measure service-level availability
Quantify “expected” system behavior: analytical modeling to predict behavior under various fault loads and “what if” scenarios
Improve upon problem areas: apply well-defined techniques
Front-end and Extra Capacity
Widely used; masks service failures from clients
Fail-over using Linux Virtual Server (LVS) as the front-end distributor
Round robin, IP tunneling
Monitor back-end nodes using “MON”
Requests forwarded to the “live” node set
Extra capacity: additional “live” nodes to soak up load
Increases the number of prospective fault sites!
Robust Group Membership(MEM)
Should handle realistic fault loads: node, link, switch faults and network partitions
Heartbeats in ReCOOP were insufficient
Up-to-date list of active nodes; allows dynamic reconfiguration of the list
Enables effective resource sharing
Detection of failures; fault model: reachability
UNREACHABLE → DOWN
Membership Implementation
Independent service; reports membership at an advertised memory segment
Based on Three Round Membership [Cristian95]; additions and removals follow 2-phase commit
No single point of failure
Coordinator for adds and leaves chosen dynamically
Heartbeat protocol for failure detection: nodes arranged in a logical ring, monitoring their neighbors
Loss of 3 consecutive heartbeats initiates removal
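The removal rule above (three consecutive missed heartbeats) can be sketched as a small per-neighbor state machine; this is an illustration, not the actual implementation:

```python
MISS_LIMIT = 3  # consecutive missed heartbeats before initiating removal

class NeighborMonitor:
    """Each node runs one of these for the neighbor it watches in the
    logical ring; actual removal then goes through the 2-phase commit."""
    def __init__(self):
        self.misses = 0

    def on_heartbeat(self):
        self.misses = 0          # any heartbeat resets the counter

    def on_interval_elapsed(self):
        """Called once per heartbeat period in which nothing arrived.
        Returns True when removal of the neighbor should be initiated."""
        self.misses += 1
        return self.misses >= MISS_LIMIT
```

Resetting on every heartbeat means only *consecutive* losses count, which keeps isolated drops from triggering spurious removals.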
Queue Monitoring (QMON)
Service-level monitoring: application hangs, problems with the disk
Queues as basic building blocks [Ninja, SEDA]
Run-time queue properties indicate progress of the associated components
E.g., buildup at a communication send queue → transient/permanent failures at the other end
Need to avoid false alarms!
Self-Monitoring Queue
In PRESS: N per-node communication queues; monitors progress at cooperating nodes
Failure triggers: queue length or a threshold of unanswered requests
Node removed from the “cooperation set”
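A minimal sketch of such a self-monitoring queue, with illustrative names (the real PRESS queues are not shown in the slides):

```python
from collections import deque

class MonitoredQueue:
    """Per-peer send queue whose buildup of unanswered requests is
    itself the failure signal."""
    def __init__(self, threshold=64):
        self.pending = deque()
        self.threshold = threshold

    def send(self, request):
        self.pending.append(request)

    def ack(self):
        self.pending.popleft()      # peer answered the oldest request

    def peer_suspect(self):
        # Trigger: more unanswered requests than the threshold; the
        # service would then drop the peer from its cooperation set.
        return len(self.pending) > self.threshold
```

The threshold is what guards against false alarms: short transient stalls drain before the queue ever crosses it.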
Interaction – MEM and QMON
Both membership and QMON monitor node-level activity
This can result in inconsistencies, e.g., on an application hang
Fault Model Enforcement (FME)
Enforce a reduced fault model at runtime; allow the service to perform the correct recovery action to regain full functionality
How to enforce a reduced fault model? Two ideas so far:
Map an unexpected fault to an expected fault, e.g., crash a node if the network link connecting it to the switch fails
Fail the outer component if a sub-component fails, e.g., crash a node if the disk fails
How is it different from fail-stop? It allows reasoning about failures at a desired abstraction
FME - Future Directions
How extensive should the fault model be? Determines programming complexity/effort
How to prevent FME from reducing availability? Bugs within enforcement?
When to declare a symptom a fault?
FME reduces human intervention; are humans better at deciding?
8–23% of recovery procedures are botched [Brown 2001]
FME Approach
Define a reduced abstract fault model
Components, faults, symptoms, component behavior during faults
Enforce this fault model at run-time
If an “unexpected” fault occurs, map to one that was planned for in the abstract model
“If the facts don’t fit the theory, change the facts.” - Albert Einstein
Allows the designer to concentrate on tolerating a well-defined, yet limited-in-complexity, set of faults
PRESS with FME
Recovery upon fault-model mismatch: restart 0, 1, or all nodes?
FME approach: reboot the appropriate node after a fault and its recovery have occurred
Link down: reboot the unreachable node
Switch down: reboot all nodes
Disk failure: reboot the node with the faulty disk
Node or application crash: do nothing
FME Implementation
FME daemon on each node; monitors service progress using an exported interface
HTTP requests for PRESS
Monitors the disk using the SCSI Generic Interface: low-level operations and error probing
Application hang → process crash, upon consecutive failures to service requests
Process restarted
Disk failures → node failures; detection: service no-progress + disk error
Node taken offline for maintenance
Modeling Results - Unavailability
Unavailability of INDEP is ~1/10 that of COOP
The heartbeat helps, but availability is still lower than INDEP
[Figure: unavailability (0–0.006) by component (application hang, application crash, node freeze, node crash, SCSI timeout, internal switch, internal link) for INDEP, COOP, and ReCOOP]
Results - Performability
COOP has lower performability than INDEP
ReCOOP gives a glimpse of the possibilities
[Figure: performability (0–45) for INDEP, COOP, and ReCOOP]
Future Work
Explore applicability to more structured systems; multi-tiered: front-end, application & back-end
E.g., Web store
Extend our infrastructure/methodology: metrics to capture users’ satisfaction
Fault model for data-integrity faults
Look at more high-availability components: a collection for a designer to pick from
Extend and validate FME for complex servers
Future Work
Considering the operator in the loop: gather information about the extent of participation (operations, durations, expertise)
Extend the fault model for human-induced faults