Exploring Failure Transparency and the Limits of Generic Recovery
Dave Lowell, Compaq Western Research Lab
Subhachandra Chandra and Peter M. Chen, University of Michigan
2
Introduction
Failure transparency: abstraction of failure-free operation
OS recovers app after hardware, OS, and application failures
– No programmer help
– No slowdown
Will explore theory, performance, and limitations
3
Consistent recovery
Visible output equivalent to failure-free run
– equivalence: allows duplicates
– avoids “exactly once” problem
Failure transparency: consistent recovery with generic techniques
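The equivalence above can be illustrated with a minimal sketch. It assumes that every visible event in the failure-free run carries a unique identifier (for example a sequence number), so a repeated identifier in the recovered run can only be a duplicate. The function name and the event encoding are hypothetical and not taken from the talk.

```python
def equivalent_modulo_duplicates(recovered, failure_free):
    """Return True if the recovered output equals the failure-free
    output once repeated visible events are dropped.

    Assumption of this sketch: visible events are hashable and unique
    within the failure-free run.
    """
    seen = set()
    deduped = []
    for event in recovered:
        if event in seen:
            continue  # a repeat of an earlier visible event is tolerated
        seen.add(event)
        deduped.append(event)
    return deduped == list(failure_free)


# Recovery re-emitted "a" and "b" after rolling back: still equivalent.
ok = equivalent_modulo_duplicates(["a", "b", "a", "b", "c"], ["a", "b", "c"])
# An event is missing from the recovered run: not equivalent.
bad = equivalent_modulo_duplicates(["a", "c"], ["a", "b", "c"])
```

Tolerating duplicates is what lets a generic recovery system re-execute from its last commit without having to solve the "exactly once" problem.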
4
Guaranteeing consistent recovery
Key players: non-deterministic events, visible events, commit events
Save-work invariant (simplified):
– There’s a commit after each non-deterministic event that happens-before a visible event.
– Full theorem handles liveness, distinguishes causality and ordering
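The simplified invariant can be checked on a recorded event trace. The sketch below makes several assumptions that are not in the slides: a single process, program order standing in for happens-before, a hypothetical string encoding of events ("nd", "commit", "visible"), and the commit falling between the non-deterministic event and the visible event it precedes.

```python
def upholds_save_work(trace):
    """Check the simplified Save-work invariant on a single-process trace.

    `trace` is a list of event kinds in program order: "nd" (a
    non-deterministic event), "commit", "visible", or anything else.
    Assumption of this sketch: program order is the happens-before relation.
    """
    uncommitted_nd = False
    for kind in trace:
        if kind == "nd":
            uncommitted_nd = True       # this event must be committed
        elif kind == "commit":
            uncommitted_nd = False      # everything so far is now saved
        elif kind == "visible" and uncommitted_nd:
            return False                # output depends on unsaved ND event
    return True


committed_first = upholds_save_work(["nd", "commit", "visible"])
committed_late = upholds_save_work(["nd", "visible", "commit"])
```

Here `committed_first` is True and `committed_late` is False, because in the second trace the visible event is emitted while a preceding non-deterministic event is still uncommitted.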
6
[Figure: protocol space. Axes: effort to identify/convert ND events; effort to commit only visible events. Protocols plotted: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG.]
7
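[Figure: protocol space with existing systems. Axes: effort to identify/convert ND events; effort to commit only visible events. Protocols plotted: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG. Existing systems placed in the space: Coord. Checkpointing, Manetho, Optimistic Logging, Targon/32, SBL, Hypervisor.]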
8
[Figure: trends across the protocol space. Axes: effort to identify/convert ND events; effort to commit only visible events. Annotations: increasing recovery time; application failure recovery; increasing simplicity; increasing performance.]
9
Performance study
Discount Checking: fast checkpoints to reliable memory (Rio)
– Logging and two-phase commit
– Disk version
Mostly interactive applications
– Localized and distributed
10
Nvi Text Editor (two percentages per protocol)
[Figure axes: effort to identify/convert ND events; effort to commit only visible events]
CAND          1%    43%
CAND-LOG      0%    13%
CPVS          1%    44%
CBNDVS        1%    42%
CBNDVS-LOG    0%    12%
11
TreadMarks Barnes-Hut (two percentages per protocol)
[Figure axes: effort to identify/convert ND events; effort to commit only visible events]
CAND          199%   11499%
CAND-LOG      126%    7700%
CPVS          129%    7346%
CPV-2PC        12%     319%
CBNDVS        101%    5743%
CBNDV-2PC      12%     252%
CBNDVS-LOG     73%    4973%
12
Have only considered “stop” failures
Committing everything is okay
– Save-work: when we must commit
Some failures affect application state
– Can we commit too much?
15
Lose-work invariant
To recover from propagation failure, never commit on a “dangerous path”.
Save-work and Lose-work conflict!
– Visible event on dangerous path
– Can’t guarantee consistent recovery from propagation failures
Do we see this conflict in practice?
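A minimal sketch of a Lose-work check follows. It assumes a much cruder model than the talk's: the dangerous path is taken to be everything between a hypothetical "fault" marker (the point where the fault activates) and the "crash" event, on a single-process trace. The event names and the function are illustrative only.

```python
def violates_lose_work(trace):
    """Return True if a commit occurs on the dangerous path.

    Assumption of this sketch: the dangerous path runs from the
    hypothetical "fault" marker up to the "crash" event.
    """
    on_dangerous_path = False
    for kind in trace:
        if kind == "fault":
            on_dangerous_path = True
        elif kind == "crash":
            return False                # reached the crash without committing
        elif kind == "commit" and on_dangerous_path:
            return True                 # corrupted state was made durable
    return False


# A visible event on the dangerous path follows an ND event, so Save-work
# forces a commit exactly where Lose-work forbids one.
conflict = violates_lose_work(["nd", "fault", "commit", "visible", "crash"])
# The commit happened before the fault activated, so recovery can roll back
# past the fault.
safe = violates_lose_work(["nd", "commit", "fault", "crash"])
```

The first trace shows the conflict named on the slide: honoring Save-work before the visible event means committing state that the fault has already affected.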
16
Measuring Lose-work violations
Fault-injection study: OS crashes
– injected faults into running kernel
– induced 350 OS crashes
– recovered nvi and postgres using Discount Checking
Results
– nvi: 15% crashes violate Lose-work
– postgres: 3% crashes violate Lose-work
17
Application crashes
Fault-injection study: ND bugs
– nvi: 37% violate Lose-work
– postgres: 33% violate Lose-work
Published bug distributions: 85-95% of application bugs are deterministic
– intrinsically violate Lose-work
Perhaps > 90% app crashes violate Lose-work!
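One way to arrive at an estimate of this size is sketched below. The combination rule is an assumption, not something stated on the slide: every deterministic bug is counted as a violation, and the remaining non-deterministic bugs violate Lose-work at the measured fault-injection rates (33% to 37%).

```python
def violating_fraction(deterministic_share, nd_violation_rate):
    """Estimated share of application crashes that violate Lose-work.

    Assumption of this sketch: deterministic bugs always violate, and
    non-deterministic bugs violate at `nd_violation_rate`.
    """
    nd_share = 1.0 - deterministic_share
    return deterministic_share + nd_share * nd_violation_rate


# Lower end: 85% deterministic, 33% of ND bugs violate.
low = violating_fraction(0.85, 0.33)    # 0.85 + 0.15 * 0.33 = 0.8995
# Upper end: 95% deterministic, 37% of ND bugs violate.
high = violating_fraction(0.95, 0.37)   # 0.95 + 0.05 * 0.37 = 0.9685
```

Under these assumptions the estimate ranges from roughly 90% to 97%, which is in line with the figure on the slide.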
18
Conclusions
Save-work and Lose-work invariants
Save-work protocol space
Invariants fundamentally conflict
Failure transparency performance:
– 0-12% overhead on reliable memory
– 13-40% overhead on disk (interactive apps)
> 90% application failures violate Lose-work