Dr dasen brajkovic md board certified psychiatrist and medical professional
Colin Scott, Aurojit Panda, Vjekoslav Brajkovic, George Necula ...
Transcript of Colin Scott, Aurojit Panda, Vjekoslav Brajkovic, George Necula ...
Minimizing Faulty Executions of Distributed Systems
Colin Scott, Aurojit Panda, Vjekoslav Brajkovic, George Necula, Arvind Krishnamurthy, Scott Shenker
1 LaToza, Venolia, DeLine, ICSE’ 06
49% of developers’ time spent on debugging!1
Understanding How Bug Is Triggered
Fixing Problematic Code
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
OutlineIntroductionBackground
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
msg dst: b
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: bmsg
dst: a
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
External events (events outside
system’s control): Crash-recovery Process creation External message
Randomized Testing with DEMiApp
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
App
RPC lib
OSAspectJ
timer dst: b
msg dst: a
message delivery
crash recovery
Invariant Checking
An invariant is a predicate P over the state of all processes.
a b c d e
{ ✔ ✗
✗
A faulty execution is one that ends in an invariant violation.
e1 i1 i2 i3 i4e2
OutlineIntroductionBackground
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
Formal Problem Statement
Find: locally minimal reproducing sequence τ’:
τ’ violates P, |τ’| ≤ |τ|
τ’ contains a subsequence of the external events of τ
if we remove any external event e from τ’,
¬∃ τ’’ containing same external events - e, s.t. τ’’ violates P
Given: schedule τ that results in violation of P
Formal Problem Statement
After finding τ’, minimize internal events:
remove extraneous message deliveries from τ’
Minimization
τ :Given
Straightforward approach: Enumerate all schedules |τ’| ≤ |τ|, Pick shortest sequence that reproduces ✗
τ Schedule Space
… ✗e1 i1 i2 i4e2 enim
Minimization
τ :Given
Straightforward approach: Enumerate all schedules |τ’| ≤ |τ|, Pick shortest sequence that reproduces ✗
τ Schedule Space
… ✗e1 i1 i2 i4e2 enim
Observation #1: many schedules are commutative
Adopt DPOR: Dynamic Partial Order Reduction
C. Flanagan, P. Godefroid, “Dynamic Partial-Order Reduction for Model Checking Software”, POPL ‘05
Approach: prioritize schedule space exploration
Assume: fixed time budget Objective: quickly find small failing schedules
x
τ :
ene3ext: e5e1 e2 e4
… ✗e1 i1 i2 i4e2 enim
(Apply Delta Debugging1)
1A Zeller, R. Hildebrandt, “Simplifying and Isolating Failure-Inducing Input”, IEEE ‘02
Observation #2: selectively mask original events
τ :
ene3ext: e5
sub1:
e1 e2 e4
… ✗e1 i1 i2 i4e2 enim
e4 e5 en…
(Apply Delta Debugging1)
1A Zeller, R. Hildebrandt, “Simplifying and Isolating Failure-Inducing Input”, IEEE ‘02
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
… e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 … e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 … e5e4 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 … e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 … e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4e1 e2 e3
foreach i in τ: if i is pending: deliver i # ignore unexpected
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
ene5e4
i1 i4 ✗… e5e4 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2:
ene5e4
i1 i4 ✗… e5e4 enim
e5 en
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2: i1 i4 …
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2: i1 i4 ✔…
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
τ :
ext:
sub1:
… ✗e1 i1 i2 i4e2 enim
sub2:
…. . .
i1 i4 ✔…Explore backtrack points until (i) ✗ or (ii) time budget for sub2 expired
ene5e4
i1 i4 ✗… e5e4 enim
e5 enim
Observation #2: selectively mask original events
Observation #4: shrink external message contents
Observation #1: many schedules are commutative
Approach: prioritize schedule space exploration
Goal: find minimal schedule that produces violation
Minimize internal events after externals minimized
Observation #2: selectively mask original events
Observation #3: some contents should be masked
OutlineIntroductionBackground
Node 1 Node N
Test Coordinator
QA Testbed
Software Under Test
Minimization
Evaluation Conclusion
Randomized Testing with
DEMi
How well does DEMi work?To
tal E
vent
s
0
300
600
900
1200
1500
1800
2100
2400
2700
3000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
1114407718040226823523 300
600
1000
400
17101500
2850
2380
1250
2160
Initial ExecutionAfter Minimization
How well does DEMi work?To
tal E
vent
s
0
300
600
900
1200
1500
1800
2100
2400
2700
3000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
1114407718040226823523 300
600
1000
400
17101500
2850
2380
1250
2160
Initial ExecutionAfter Minimization
80% - 97% Reduction!
How well does DEMi work?To
tal E
vent
s
0306090
120150180210240270300
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294spark-3150spark-9256
11112529392851
212322 111440
77
180
40
226
82
3523
After MinimizationSmallest Manual Trace
How well does DEMi work?To
tal E
vent
s
0306090
120150180210240270300
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294spark-3150spark-9256
11112529392851
212322 111440
77
180
40
226
82
3523
After MinimizationSmallest Manual Trace
Factor of 1x - 5x from hand-crafted
170 69
How quickly does DEMi work?Ru
ntim
e in
Sec
onds
0
400
800
1200
1600
2000
2400
2800
3200
3600
4000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
210245427348
10676
69
43482
2132
282170
Overall Minimization(~12 hours) (~3 hours)
(~35 minutes)
170 69
How quickly does DEMi work?Ru
ntim
e in
Sec
onds
0
400
800
1200
1600
2000
2400
2800
3200
3600
4000
Case Studyraft-45 raft-46 raft-56 raft-58a raft-58b raft-42 raft-66 spark-2294 spark-3150 spark-9256
210245427348
10676
69
43482
2132
282170
Overall Minimization
<10 minutes except 3 cases
(~12 hours) (~3 hours)
(~35 minutes)
See the paper for… How we handle non-determinism Handling multithreaded processes Supporting other RPC libraries Sketch for minimizing production traces More in-depth evaluation Related work …
Conclusion
Open source tool: github.com/NetSys/demi
Read our paper! eecs.berkeley.edu/~rcs/research/nsdi16.pdf
Optimistic that these techniques can be successfully applied more broadly
Thanks for your time!Contact me! [email protected]
AttributionsInspiration for slide design: Jay Lorch’s IronFleet slides
Graphic Icons: thenounproject.org logfile: mantisshrimpdesign magnifying glass: Ricardo Moreira disk: Anton Outkine hook: Seb Cornelius bug report: Lemon Liu devil: Mourad Mokrane Putin: Remi Mercier
Production TracesModel: feed partially ordered log into
single machine DEMi
Require: - Partial ordering of all message deliveries - All crash-recoveries logged to disk
Related WorkThread Schedule Minimization
•Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02. •A Trace Simplification Technique for Effective Debugging of
Concurrent Programs. FSE ’10. Program Flow Analysis.
•Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07.
•Toward Generating Reducible Replay Logs. PLDI ’11. Best-Effort Replay of Field Failures
•A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07.
•Triage: Diagnosing Production Run Failures at the User’s Site. SOSP ’07.
Dealing With ThreadsIf you’re lucky: threads are largely independent (Spark)
If you’re unlucky: key insight: A write to shared memory is equivalent to a message delivery
Approach: •interpose on virtual memory, thread scheduler •pause a thread whenever it writes to shared memory / disk
Cf. “Enabling Tracing Of Long-Running Multithreaded Programs Via Dynamic Execution Reduction”, ISSTA ‘07
Dealing With Non-DeterminismInterpose on:
- Timers - Random number generators - Unordered hash values - ID allocation
Stop-gap: replay each schedule multiple times