Evaluating Undo: Human-Aware Recovery Benchmarks Aaron Brown with Leonard Chung, Calvin Ling, and...

Evaluating Undo: Human-Aware Recovery Benchmarks

Aaron Brown with Leonard Chung, Calvin Ling, and William Kakes

January 2004 ROC Retreat

Recap: ROC Undo

• We have developed & built a ROC Undo Tool
– a recovery tool for human operators
– lets operators take a system back in time to undo damage, while preserving end-user work

• We have evaluated its feasibility via performance and overhead benchmarks

• Now we must answer the key question:
– does Undo-based recovery improve dependability?

Approach: Recovery Benchmarks

• Recovery benchmarks measure the dependability impact of recovery
– behavior of system during recovery period
– speed of recovery

[Figure: performability plotted over time, marking normal behavior, fault/error injection, performability impact (performance, correctness), recovery time, and recovery complete]
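The two quantities named on this slide, recovery time and performability impact, can be computed from a sampled performability trace. The following is a minimal sketch, not the benchmark's actual tooling: the function name `recovery_metrics`, the 5% tolerance, and the sample trace are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def recovery_metrics(
    samples: List[Tuple[float, float]],
    fault_time: float,
    baseline: float,
    tolerance: float = 0.05,
) -> Tuple[Optional[float], float]:
    """Return (recovery_time, performability_loss) for one trial.

    samples: (time, performability) pairs sorted by time.
    baseline: performability during normal behavior.
    tolerance: assumed slack; the system counts as recovered once it
               stays within this fraction of the baseline.
    """
    threshold = baseline * (1.0 - tolerance)
    post = [(t, p) for t, p in samples if t >= fault_time]
    if not post:
        raise ValueError("no samples at or after the fault")

    # Walk backwards: recovery completes at the earliest sample after
    # which performability never drops below the threshold again.
    recovered_at = None
    for t, p in reversed(post):
        if p < threshold:
            break
        recovered_at = t

    # Integrate the shortfall below baseline (trapezoid rule) from the
    # fault until recovery completes, or until the trace ends.
    end = recovered_at if recovered_at is not None else post[-1][0]
    loss = 0.0
    for (t0, p0), (t1, p1) in zip(post, post[1:]):
        if t1 > end:
            break
        d0 = max(baseline - p0, 0.0)
        d1 = max(baseline - p1, 0.0)
        loss += 0.5 * (d0 + d1) * (t1 - t0)

    recovery_time = None if recovered_at is None else recovered_at - fault_time
    return recovery_time, loss


# Hypothetical trace: fault injected at t=10, service restored by t=20.
trace = [(0, 1.0), (5, 1.0), (10, 0.2), (15, 0.2), (20, 1.0), (25, 1.0), (30, 1.0)]
recovery_time, loss = recovery_metrics(trace, fault_time=10, baseline=1.0)
```

A `recovery_time` of `None` means the trace ended before the system returned to normal behavior, which matters for trials that hit a session time limit.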

What About the People?

• Existing recovery/dependability benchmarks ignore the human operator
– inappropriate for undo, where human drives recovery

• To measure Undo, we need benchmarks that capture human-driven recovery
– by including people in the benchmarking process

(Note from Aaron Brown: existing recovery/dependability benchmarks are few and far between, for that matter.)

Outline

• Introduction

• Methodology
– overview
– faultload development
– managing human subjects

• Evaluation of Undo

• Discussion and conclusions

Methodology

• Combine traditional recovery benchmarks with human user studies
– apply workload and faultload
– measure system behavior during recovery from faults
– run multiple trials with a pool of human subjects acting as system operators

• Benchmark measures system, not humans
– indirectly captures human aspects of recovery
» quality of situational awareness, applicability of tools, usability & error-proneness of recovery procedures

Human-Aware Recovery Benchmarks

• Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery (recovery tasks/tools)

[Figure: performability plotted over time, marking normal behavior, recovery time, and recovery complete]

Developing the Faultload

• ROC approach combines surveys and cognitive walkthrough
– surveys to establish common failure modes, symptoms, and error-prone administrative tasks
» domain-specific, system-independent
– cognitive walkthrough to translate to system-specific faultload

• Faultload specifies generic errors and events
– provides system-independence, broader applicability
– cognitive walkthrough maps to system-specific faults

Example: E-mail Service Faultload

• Web-based survey of e-mail admins
– core questions:
» “Describe any incidents in the past 3 months where data was lost or the service was unavailable.”
» “Describe any administrative tasks you performed in the past 3 months that were particularly challenging.”
– cost: 4 x $50 gift certificate to amazon.com
» raffled off as incentive for participation
– response: 68 respondents from SAGE mailing list

E-mail Survey Results

• Results

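[Figure: breakdown of survey responses for Common Tasks (151 total), Challenging Tasks (68 total), and Lost e-mail problems (12 total). Legend: configuration, deployment/upgrade, other; undoable, non-undoable. Percentages as extracted (their assignment to individual bars is not recoverable): 50%, 56%, 25%, 26%, 17%, 25%, 18%, 31%, 33%, 12%, 1%, 6%]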

– results dominated by
» configuration errors (e.g., mail filters)
» botched software/platform upgrades
» hardware & environmental failures
– Undo potentially useful for majority of problems

From Survey to Faultload

• Cognitive walkthrough example: SW upgrade
– platform: sendmail on Linux
– task: upgrade from sendmail-8.2.9 to sendmail-8.2.10
– approach:
1. configure/locate existing sendmail-linux system
2. clone system to test machine (or use virtual machine)
3. attempt upgrade, identifying possible failure points
» benchmarker must understand system to do this
4. simulate failures and select those that match symptom report from task survey
– sample result: simulate failed upgrade that disables spam filtering by omitting -DMILTER compile-time flag


Human Subject Protocol

• Benchmarks structured as human trials

• Protocol
– human subject plays the role of system operator
– subjects complete multiple sessions
– in each session:
» apply workload to test system
» select random scenario and simulate problem
» give human subject 30 minutes to complete recovery

• Results reflect statistical average across subjects
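The per-session steps above can be laid out as a trial schedule before any subject arrives. This is a hedged sketch only: `plan_sessions`, the record fields, and the fixed seed are hypothetical names and choices, and the slides do not say how scenarios were actually randomized.

```python
import random
from typing import Dict, List

SESSION_LIMIT_MINUTES = 30  # per-session recovery time limit from the protocol


def plan_sessions(
    subjects: List[str],
    scenarios: List[str],
    sessions_per_subject: int,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Build one record per (subject, session) with a randomly chosen scenario.

    A seeded generator keeps the schedule reproducible across reruns.
    """
    rng = random.Random(seed)
    plan = []
    for subject in subjects:
        for session in range(1, sessions_per_subject + 1):
            plan.append(
                {
                    "subject": subject,
                    "session": session,
                    "scenario": rng.choice(scenarios),
                    "time_limit_min": SESSION_LIMIT_MINUTES,
                }
            )
    return plan


# Hypothetical subject IDs and scenario labels, for illustration only.
schedule = plan_sessions(
    subjects=["s01", "s02", "s03"],
    scenarios=["filter-config-error", "failed-upgrade", "software-crash"],
    sessions_per_subject=2,
)
```

Fixing the schedule in advance keeps scenario selection independent of anything the experimenter observes about a subject during the session.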

The Variability Challenge

• Must control human variability to get reproducible, meaningful results

• Techniques
– subject pool selection
– screening
– training
– self-comparison
» each subject faces same recovery scenario on all systems
» system’s score determined by fraction of subjects with better recovery behavior
» powerful, but only works for comparison benchmarks
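The self-comparison score described above can be sketched as follows. This assumes "better recovery behavior" is reduced to a single per-subject number where lower is better (for example, a count of incorrectly-handled messages); the function name and the sample data are hypothetical.

```python
from typing import Dict


def self_comparison_score(
    system_a: Dict[str, float],
    system_b: Dict[str, float],
) -> float:
    """Fraction of subjects whose result on system A beats their own result on B.

    Each dict maps subject ID -> error measure (lower is better).
    Only subjects measured on both systems are compared; ties do not
    count as a win for A.
    """
    common = system_a.keys() & system_b.keys()
    if not common:
        raise ValueError("no subject was measured on both systems")
    better = sum(1 for s in common if system_a[s] < system_b[s])
    return better / len(common)


# Hypothetical per-subject error counts, for illustration only.
with_tool = {"s01": 3, "s02": 0, "s03": 12, "s04": 5}
baseline = {"s01": 40, "s02": 7, "s03": 9, "s04": 5}
score = self_comparison_score(with_tool, baseline)
```

Because each subject is compared only against themselves, differences in skill between subjects largely cancel out, which is why the technique applies to comparisons between systems and not to absolute scores.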

Outline

• Introduction

• Methodology

• Evaluation of Undo
– setup
– per-subject results
– aggregate results


Evaluating Undo: Setup

• Faultload scenarios
1. SPAM filter configuration error
2. failed e-mail server upgrade
3. simple software crash (undo not useful here)

• Subject pool (after screening)
– 12 UCB Computer Science graduate students

• Self-comparison protocol
– each subject given same scenario in each of 2 sessions
» undo available in first session only
» imposes learning bias against undo, but lowers variability

Sample Single User Result

• Undo significantly improves correctness
– with some (partially-avoidable) availability cost

[Figure: two panels for a single subject, Without Undo and With Undo. Each panel plots Correctness, SMTP Availability, and IMAP Availability (each from 0 to 1) against Time (0 to 30 minutes), with the failure and the recovery period marked]

[Figure: incorrectly-handled messages, failed SMTP connections, and failed IMAP connections (each axis 0 to 200) per subject, grouped by failure scenario (1 1 1 2 2 2 2), comparing With Undo (session 1) against Without Undo (session 2)]

Overall Evaluation

• Undo significantly improves correctness
– and reduces variance across operators
– statistically-justified, p-value 0.045

• Undo hurts IMAP availability
– several possible workarounds exist

• Overall, Undo has a positive impact on dependability

(Chart shows sessions where Undo used)
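The slides do not say which statistical test produced the p-value of 0.045. As an illustration of how a paired, self-comparison design can be tested, here is a sketch of a one-sided exact sign test; it is an assumed stand-in, not the authors' analysis, and the data below are hypothetical.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(
    with_undo: Sequence[float],
    without_undo: Sequence[float],
) -> float:
    """One-sided exact sign test on paired per-subject error measures.

    Null hypothesis: a subject is equally likely to do better in either
    session. Returns the probability of seeing at least the observed
    number of 'better with undo' pairs under that null. Ties are dropped.
    """
    if len(with_undo) != len(without_undo):
        raise ValueError("paired samples must have equal length")
    wins = sum(1 for a, b in zip(with_undo, without_undo) if a < b)
    losses = sum(1 for a, b in zip(with_undo, without_undo) if a > b)
    n = wins + losses
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) upper tail: P(X >= wins)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical incorrectly-handled message counts per subject.
errors_with_undo = [0, 2, 1, 0, 3]
errors_without_undo = [5, 9, 4, 7, 3]
p = sign_test_p_value(errors_with_undo, errors_without_undo)
```

The sign test uses only the direction of each subject's difference, so it makes few assumptions about the data; a test that also uses the size of the differences, such as a paired t-test or Wilcoxon signed-rank test, would typically need fewer subjects to reach the same significance.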

Outline

• Introduction

• Methodology

• Evaluation of Undo


Discussion

• Undo-based recovery improves dependability
– reduces incorrectly-handled mail in common failure cases

• More can still be done
– tweaks to Undo implementation will reduce availability impact

• Benchmark methodology is effective at controlling human variability
– self-comparison protocol gives statistically-justified results with 9 subjects (vs 15+ for random design)

Future Directions: Controlling Cost

• Human subject experiments are still costly
– recruiting and compensating participants
– extra time spent on training, multiple benchmark runs
– extra demands on benchmark infrastructure
– less than a user study, more than a perf. benchmark

• A necessary price to pay!

• Techniques for cost reduction
– best-case results using best-of-breed operator
– remote web-based participation
– avoid human trials: extended cognitive walkthrough

Evaluating Undo: Human-Aware Recovery Benchmarks

• For more info:
– [email protected]
– http://roc.cs.berkeley.edu/
– paper: A. Brown, L. Chung et al. “Dependability Benchmarking of Human-Assisted Recovery Processes.” Submitted to DSN 2004, June 2004.

Backup Slides

Example: E-mail Service Faultload

• Results of e-mail task survey

Lost E-mail (12 reports)
– Configuration problems (25%)
– Upgrade-related (17%)
– Hardware/Env’t (17%)
– Operator error (8%)
– User error (8%)
– External resource (8%)
– Software error (8%)
– Unknown (8%)

Challenging Tasks (68 total)
– Filter Installation (37%)
– Platform Change/Upgrade (26%)
– Config. (13%)
– Architecture Changes (7%)
– Tool Dev. (6%)
– Other (6%)
– User Ed. (4%)

Full Summary Dataset

[Figure: incorrectly-handled messages (axis 0 to 250), failed SMTP connections (axis 0 to 125), and failed IMAP connections (axis 0 to 30) per subject, grouped by failure scenario (1 1 1 2 2 2 2 3 3 3 1 2). Bars compare Session 1 (undo tool available) with Session 2 (baseline); subjects are split into "Undo used (in Session 1)" and "Undo not used or completed"]
