Evaluating Undo: Human-Aware Recovery Benchmarks
Aaron Brown with Leonard Chung, Calvin Ling, and William Kakes
January 2004 ROC Retreat
Slide 2
Recap: ROC Undo
• We have developed & built a ROC Undo Tool
– a recovery tool for human operators
– lets operators take a system back in time to undo damage, while preserving end-user work
• We have evaluated its feasibility via performance and overhead benchmarks
• Now we must answer the key question:
– does Undo-based recovery improve dependability?
Slide 3
Approach: Recovery Benchmarks
• Recovery benchmarks measure the dependability impact of recovery
– behavior of system during recovery period
– speed of recovery
[Figure: performability plotted over time, marking normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, and recovery complete]
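The two quantities in the figure, recovery time and performability impact, can be computed from a sampled performability trace. The following is a minimal sketch, not the benchmark's actual tooling: the function name, the 5% recovery tolerance, and the sample trace are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecoveryMetrics:
    recovery_time: Optional[float]  # None if the trace ends before recovery
    performability_loss: float      # area between baseline and observed curve


def recovery_metrics(samples: List[Tuple[float, float]],
                     fault_time: float,
                     baseline: float,
                     tolerance: float = 0.05) -> RecoveryMetrics:
    """samples: (time, performability) pairs sorted by time.

    The tolerance (assumed, not from the slides) defines how close to the
    baseline the metric must be to count as recovered.
    """
    threshold = baseline * (1.0 - tolerance)

    # Walk backwards: recovery is complete at the earliest post-fault sample
    # from which the metric stays at or above the threshold until the end.
    recovered_at = None
    for t, value in reversed(samples):
        if t < fault_time or value < threshold:
            break
        recovered_at = t

    # Integrate the shortfall below baseline (trapezoidal rule) from the fault on.
    post = [(t, v) for t, v in samples if t >= fault_time]
    loss = 0.0
    for (t0, v0), (t1, v1) in zip(post, post[1:]):
        d0 = max(baseline - v0, 0.0)
        d1 = max(baseline - v1, 0.0)
        loss += 0.5 * (d0 + d1) * (t1 - t0)

    recovery_time = None if recovered_at is None else recovered_at - fault_time
    return RecoveryMetrics(recovery_time, loss)


# Hypothetical trace: fault injected at t=5, service back to baseline at t=12.
trace = [(0, 100.0), (5, 100.0), (6, 40.0), (10, 40.0), (12, 100.0), (15, 100.0)]
metrics = recovery_metrics(trace, fault_time=5, baseline=100.0)
```

Reporting both numbers matters because a fast recovery with a deep performability dip and a slow recovery with a shallow one can have similar overall impact.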
Slide 4
What About the People?
• Existing recovery/dependability benchmarks ignore the human operator
– inappropriate for undo, where human drives recovery
• To measure Undo, we need benchmarks that capture human-driven recovery
– by including people in the benchmarking process
Slide 5
Outline
• Introduction
• Methodology
– overview
– faultload development
– managing human subjects
• Evaluation of Undo
• Discussion and conclusions
Slide 6
Methodology
• Combine traditional recovery benchmarks with human user studies
– apply workload and faultload
– measure system behavior during recovery from faults
– run multiple trials with a pool of human subjects acting as system operators
• Benchmark measures system, not humans
– indirectly captures human aspects of recovery
» quality of situational awareness, applicability of tools, usability & error-proneness of recovery procedures
Slide 7
Human-Aware Recovery Benchmarks
• Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery tasks/tools
[Figure: performability plotted over time, marking normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, and recovery complete]
Slide 8
Developing the Faultload
• ROC approach combines surveys and cognitive walkthrough
– surveys to establish common failure modes, symptoms, and error-prone administrative tasks
» domain-specific, system-independent
– cognitive walkthrough to translate to system-specific faultload
• Faultload specifies generic errors and events
– provides system-independence, broader applicability
– cognitive walkthrough maps to system-specific faults
Slide 9
Example: E-mail Service Faultload
• Web-based survey of e-mail admins
– core questions:
» “Describe any incidents in the past 3 months where data was lost or the service was unavailable.”
» “Describe any administrative tasks you performed in the past 3 months that were particularly challenging.”
– cost: 4 x $50 gift certificate to amazon.com
» raffled off as incentive for participation
– response: 68 respondents from SAGE mailing list
Slide 10
E-mail Survey Results
• Results
[Figure: three charts, Common Tasks (151 total), Challenging Tasks (68 total), and Lost e-mail problems (12 total), each broken down into configuration, deployment/upgrade, other undoable, and non-undoable. Segment labels in extraction order: 50%, 56%, 25%, 26%, 17%, 25%, 18%, 31%, 33%, 12%, 1%, 6%]
– results dominated by
» configuration errors (e.g., mail filters)
» botched software/platform upgrades
» hardware & environmental failures
– Undo potentially useful for majority of problems
Slide 11
From Survey to Faultload
• Cognitive walkthrough example: SW upgrade
– platform: sendmail on Linux
– task: upgrade from sendmail-8.2.9 to sendmail-8.2.10
– approach:
1. configure/locate existing sendmail-linux system
2. clone system to test machine (or use virtual machine)
3. attempt upgrade, identifying possible failure points
» benchmarker must understand system to do this
4. simulate failures and select those that match symptom report from task survey
– sample result: simulate failed upgrade that disables spam filtering by omitting -DMILTER compile-time flag
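Step 4 of the walkthrough amounts to filtering candidate failures by the symptom reported in the survey. The sketch below shows one way to record that mapping from a generic faultload entry to system-specific faults. The class names, field names, and the second candidate failure are hypothetical; only the -DMILTER example comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GenericError:
    """System-independent faultload entry derived from the survey."""
    name: str
    symptom: str


@dataclass(frozen=True)
class SystemFault:
    """System-specific failure found during the cognitive walkthrough."""
    platform: str
    injection: str
    symptom: str


def select_faults(generic: GenericError,
                  candidates: List[SystemFault]) -> List[SystemFault]:
    """Keep only simulated failures whose symptom matches the survey report."""
    return [fault for fault in candidates if fault.symptom == generic.symptom]


failed_upgrade = GenericError(name="failed software upgrade",
                              symptom="spam filtering disabled")

candidates = [
    # From the slide: omitting the compile-time flag disables spam filtering.
    SystemFault(platform="sendmail on Linux",
                injection="build without -DMILTER",
                symptom="spam filtering disabled"),
    # Hypothetical second failure point, included only to show the filter.
    SystemFault(platform="sendmail on Linux",
                injection="hypothetical: configuration file not migrated",
                symptom="mail rejected"),
]

faultload = select_faults(failed_upgrade, candidates)
```

Keeping the generic entry separate from the system-specific fault is what lets the same survey-derived faultload be reused on a different mail platform.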
Slide 12
Human-Aware Recovery Benchmarks
• Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery tasks/tools
Slide 13
Human Subject Protocol
• Benchmarks structured as human trials
• Protocol
– human subject plays the role of system operator
– subjects complete multiple sessions
– in each session:
» apply workload to test system
» select random scenario and simulate problem
» give human subject 30 minutes to complete recovery
• Results reflect statistical average across subjects
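The protocol above can be sketched as a session planner plus an averaging step. This is an illustrative outline only: the function names, the dictionary layout, the fixed random seed, and the sample numbers are assumptions, and the real benchmark infrastructure is not described in the slides.

```python
import random
from statistics import mean
from typing import Dict, List

TIME_LIMIT_MINUTES = 30  # from the slide: 30 minutes to complete recovery


def plan_sessions(subjects: List[str],
                  scenarios: List[str],
                  sessions_per_subject: int,
                  seed: int = 0) -> List[dict]:
    """Assign a randomly selected scenario to every session of every subject."""
    rng = random.Random(seed)  # seeded so a plan can be reproduced
    plan = []
    for subject in subjects:
        for session in range(1, sessions_per_subject + 1):
            plan.append({
                "subject": subject,
                "session": session,
                "scenario": rng.choice(scenarios),
                "time_limit_minutes": TIME_LIMIT_MINUTES,
            })
    return plan


def average_across_subjects(results: Dict[str, List[float]]) -> float:
    """Average each subject's sessions first, then average across subjects."""
    return mean(mean(values) for values in results.values())


plan = plan_sessions(["s1", "s2", "s3"],
                     ["filter misconfiguration", "failed upgrade", "crash"],
                     sessions_per_subject=2)

# Hypothetical per-session metric values (e.g. incorrectly handled messages).
overall = average_across_subjects({"s1": [10, 20], "s2": [30]})
```

Averaging per subject before averaging across subjects keeps a subject who completed more sessions from dominating the result.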
Slide 14
The Variability Challenge
• Must control human variability to get reproducible, meaningful results
• Techniques
– subject pool selection
– screening
– training
– self-comparison
» each subject faces same recovery scenario on all systems
» system’s score determined by fraction of subjects with better recovery behavior
» powerful, but only works for comparison benchmarks
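The self-comparison score can be illustrated with a short sketch: each subject is paired with themselves, and the system's score is the fraction of subjects who did better on it. The exact one-sided sign test shown alongside is an illustrative choice for judging such paired results; the slides do not say which statistical test was used. All names and data below are hypothetical.

```python
from math import comb
from typing import Dict, Tuple


def self_comparison_score(with_tool: Dict[str, float],
                          without_tool: Dict[str, float],
                          lower_is_better: bool = True) -> Tuple[int, int, float]:
    """Return (better, worse, fraction of all subjects better with the tool)."""
    better = worse = 0
    for subject, a in with_tool.items():
        b = without_tool[subject]
        if a == b:
            continue  # ties count toward neither side
        if (a < b) == lower_is_better:
            better += 1
        else:
            worse += 1
    return better, worse, better / len(with_tool)


def sign_test_p_value(better: int, worse: int) -> float:
    """One-sided exact sign test: P(X >= better) for X ~ Binomial(n, 1/2)."""
    n = better + worse
    return sum(comb(n, k) for k in range(better, n + 1)) / 2 ** n


# Hypothetical per-subject counts of incorrectly handled messages.
with_undo = {"s1": 5, "s2": 10, "s3": 0, "s4": 7}
without_undo = {"s1": 50, "s2": 40, "s3": 20, "s4": 3}

better, worse, fraction = self_comparison_score(with_undo, without_undo)
p_value = sign_test_p_value(better, worse)
```

Because each subject serves as their own control, differences in skill between subjects cancel out, which is why the approach only applies when two systems are being compared.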
Slide 15
Outline
• Introduction
• Methodology
• Evaluation of Undo
– setup
– per-subject results
– aggregate results
• Discussion and conclusions
Slide 16
Evaluating Undo: Setup
• Faultload scenarios
1. SPAM filter configuration error
2. failed e-mail server upgrade
3. simple software crash (undo not useful here)
• Subject pool (after screening)
– 12 UCB Computer Science graduate students
• Self-comparison protocol
– each subject given same scenario in each of 2 sessions
» undo available in first session only
» imposes learning bias against undo, but lowers variability
Slide 17
Sample Single User Result
• Undo significantly improves correctness
– with some (partially-avoidable) availability cost
[Figure: two panels, Without Undo and With Undo, each plotting Correctness, SMTP Availability, and IMAP Availability (0 to 1) against Time (minutes, 0 to 30), with the failure and recovery period marked]
Slide 18
Overall Evaluation
• Undo significantly improves correctness
– and reduces variance across operators
– statistically-justified, p-value 0.045
• Undo hurts IMAP availability
– several possible workarounds exist
• Overall, Undo has a positive impact on dependability
[Figure: three bar charts, Incorrectly-handled messages, Failed SMTP Connections, and Failed IMAP Connections (each 0 to 200), by Failure Scenario (1 1 1 2 2 2 2), comparing With Undo (session 1) against Without Undo (session 2); shows sessions where Undo was used]
Outline
• Introduction
• Methodology
• Evaluation of Undo
• Discussion and conclusions
Slide 20
Discussion
• Undo-based recovery improves dependability
– reduces incorrectly-handled mail in common failure cases
• More can still be done
– tweaks to Undo implementation will reduce availability impact
• Benchmark methodology is effective at controlling human variability
– self-comparison protocol gives statistically-justified results with 9 subjects (vs 15+ for random design)
Slide 21
Future Directions: Controlling Cost
• Human subject experiments are still costly
– recruiting and compensating participants
– extra time spent on training, multiple benchmark runs
– extra demands on benchmark infrastructure
– less than a user study, more than a perf. benchmark
• A necessary price to pay!
• Techniques for cost reduction
– best-case results using best-of-breed operator
– remote web-based participation
– avoid human trials: extended cognitive walkthrough
Evaluating Undo: Human-Aware Recovery Benchmarks
• For more info:
– [email protected]
– http://roc.cs.berkeley.edu/
– paper: A. Brown, L. Chung et al. “Dependability Benchmarking of Human-Assisted Recovery Processes.” Submitted to DSN 2004, June 2004.
Backup Slides
Slide 24
Example: E-mail Service Faultload
• Results of e-mail task survey
– Lost E-mail (12 reports)
» Configuration problems (25%)
» Upgrade-related (17%)
» Hardware/Env’t (17%)
» Operator error (8%)
» User error (8%)
» External resource (8%)
» Software error (8%)
» Unknown (8%)
– Challenging Tasks (68 total)
» Filter Installation (37%)
» Platform Change/Upgrade (26%)
» Config. (13%)
» Architecture Changes (7%)
» Tool Dev. (6%)
» Other (6%)
» User Ed. (4%)
Slide 25
Full Summary Dataset
[Figure: three bar charts, Incorrectly-handled messages (0 to 250), Failed SMTP Connections (0 to 125), and Failed IMAP Connections (0 to 30), by Failure Scenario (1 1 1 2 2 2 2 3 3 3 1 2), comparing Session 1 (undo tool available) against Session 2 (baseline); results are grouped into "Undo used (in Session 1)" and "Undo not used or completed"]