Orchestrated Chaos: Applying Failure Testing Research at Scale.
Transcript of Orchestrated Chaos: Applying Failure Testing Research at Scale.
What could possibly go wrong?
Consider computation involving 100 services
Search space: 2^100 executions
Reflections
1. Managing complexity can be a zero-sum game
2. Productivity trumps purity
3. Chaos results… and gives rise to a new order
A cunning malevolent sentience?
A fault injection framework (e.g. FIT)
Call graph tracing (e.g. Zipkin)
Lineage-driven Fault Injection
A fault injection framework (e.g. FIT)
LDFI
Call graph tracing (e.g. Zipkin)
But how do we know redundancy when we see it?
Hard question: “Could a bad thing ever happen?”
Easier: “Exactly why did a good thing happen?”
“What could have gone wrong?”
Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
[Figure: lineage graph. “The write is stable” is supported by “Stored on RepA” and “Stored on RepB”, which are in turn supported by Bcast1 and Bcast2 from the Client.]
Lineage-driven fault injection
Why did a good thing happen?
Consider its lineage.
What could have gone wrong?
Faults are cuts in the lineage graph.
Is there a cut that breaks all supports?
What would have to go wrong?
Every support path in the lineage graph must be cut, so each path adds one clause:
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
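The formula above can be checked mechanically. The following is a minimal sketch, assuming each support path is represented as the set of components it depends on; the names `SUPPORT_PATHS`, `breaks_all`, and `minimal_cuts` are illustrative and do not come from the talk. It brute-forces every minimal fault set that cuts all four paths, which is only practical for a toy graph like this one.

```python
from itertools import combinations

# Hypothetical encoding of the slide's lineage graph: one set per support
# path, i.e. one clause of the formula (RepA OR Bcast1) AND ...
SUPPORT_PATHS = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]

def breaks_all(faults, paths):
    """True if the fault set intersects (cuts) every support path."""
    return all(faults & path for path in paths)

def minimal_cuts(paths):
    """Brute-force all minimal fault sets that satisfy the formula."""
    components = sorted(set().union(*paths))
    found = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            candidate = set(combo)
            # Skip supersets of a smaller cut that was already found.
            if any(cut <= candidate for cut in found):
                continue
            if breaks_all(candidate, paths):
                found.append(candidate)
    return found

for cut in minimal_cuts(SUPPORT_PATHS):
    print(sorted(cut))
```

For this graph the sketch reports two minimal cuts: losing both broadcasts, or losing both replicas.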
Lineage-driven fault injection
Hypothesis: {Bcast1, Bcast2}
Lineage-driven fault injection
[Figure: the lineage graph with a third broadcast, Bcast3, from the Client.]
(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)
AND (RepA OR Bcast3)
AND (RepB OR Bcast3)
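A small sketch of why the two extra clauses matter, assuming the same set-per-clause encoding as the formula above (the names `satisfies`, `FIRST_RUN`, and `SECOND_RUN` are illustrative only): the earlier hypothesis {Bcast1, Bcast2} satisfies the four-clause formula but no longer satisfies the six-clause one, so the solver has to propose something else.

```python
def satisfies(faults, clauses):
    """True if the fault set contains at least one member of every clause."""
    return all(faults & clause for clause in clauses)

# Clauses from the first lineage graph (two broadcasts).
FIRST_RUN = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]
# Clauses after Bcast3 appears in the lineage.
SECOND_RUN = FIRST_RUN + [
    {"RepA", "Bcast3"},
    {"RepB", "Bcast3"},
]

old_hypothesis = {"Bcast1", "Bcast2"}
print(satisfies(old_hypothesis, FIRST_RUN))    # hypothesis cut every known path
print(satisfies(old_hypothesis, SECOND_RUN))   # Bcast3 paths are left uncut
print(satisfies({"RepA", "RepB"}, SECOND_RUN)) # both replicas still cut everything
```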
Lineage-driven Fault Injection
Recipe:
1. Start with a successful outcome. Work backwards.
2. Ask why it happened: Lineage
3. Convert lineage to a boolean formula and solve
4. Lather, rinse, repeat
[Diagram: 1. Success → (Why?) → 2. Lineage → (Encode) → 3. CNF → (Solve) → Fail → 4. REPEAT]
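The recipe can be sketched as a loop. Everything below is a toy under stated assumptions, not the implementation from the talk: `run_system` is a hypothetical stand-in for a real system with tracing and fault injection (it models a client that retries with Bcast3 when its first two broadcasts are dropped), `next_hypothesis` replaces a real solver with brute force, and `max_faults` is an assumed bound on how many components may fail at once.

```python
from itertools import combinations

def run_system(faults):
    """Toy system (hypothetical): returns the lineage of a successful write
    as a list of support paths, or None if the write was lost."""
    broadcasts = ["Bcast1", "Bcast2"]
    if all(b in faults for b in broadcasts):
        broadcasts.append("Bcast3")  # the client retries once
    paths = [
        frozenset({replica, bcast})
        for replica in ("RepA", "RepB") if replica not in faults
        for bcast in broadcasts if bcast not in faults
    ]
    return paths or None

def next_hypothesis(clauses, tried, max_faults):
    """Step 3: smallest untried fault set that hits every clause."""
    components = sorted(set().union(*clauses))
    for size in range(1, max_faults + 1):
        for combo in combinations(components, size):
            candidate = frozenset(combo)
            if candidate not in tried and all(candidate & c for c in clauses):
                return candidate
    return None

def ldfi(run, max_faults):
    """Returns a fault set that breaks the outcome, or None if none is found."""
    lineage = run(frozenset())           # step 1: a successful outcome
    if lineage is None:
        raise ValueError("the fault-free run must succeed")
    clauses = set(lineage)               # step 2: its lineage
    tried = set()
    while True:                          # step 4: repeat
        hypothesis = next_hypothesis(clauses, tried, max_faults)
        if hypothesis is None:
            return None                  # no hypothesis left within the bound
        tried.add(hypothesis)
        lineage = run(hypothesis)        # inject the faults and replay
        if lineage is None:
            return hypothesis            # the good outcome did not happen
        clauses |= set(lineage)          # new lineage adds new clauses

print(ldfi(run_system, max_faults=2))
```

In this toy, the first hypothesis (dropping both broadcasts) is survived thanks to the retry, and the loop then moves on to crashing both replicas. Whether such a result counts as a bug depends on the fault model the system is supposed to tolerate.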
Minimal requirements
1. Fault injection infrastructure
2. Mechanism for collecting lineage
3. Ability to replay interactions
Case study: “Netflix AppBoot”
Services: ~100
Search space (executions): 2^100 (roughly 1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 11
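One way to read the 2^100 figure is that each of the ~100 services either fails or does not fail in a given execution. A quick sketch of the arithmetic, assuming exactly 100 services and independent on/off faults (both simplifications for illustration, not figures from the case study):

```python
from fractions import Fraction

SERVICES = 100      # rounded from the "~100" in the case study
EXPERIMENTS = 200   # experiments performed in the case study

search_space = 2 ** SERVICES       # every subset of services may fail
digits = len(str(search_space))    # number of decimal digits
explored = Fraction(EXPERIMENTS, search_space)

print(f"search space has {digits} decimal digits")
print(f"fraction explored: about {float(explored):.1e}")
```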
Search prioritization: minimal hitting sets
(A ∨ B ∨ C ) ∧ (C ∨ D ∨ E ∨ F) ∧ (D ∨ E ∨ F ∨ G) ∧ (H ∨ I)
(A, B, C), (C, D, E, F), (D, E, F, G), (H, I)
⇩
e.g. (C, E, H) ✔
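A minimal sketch of enumerating minimal hitting sets for the example above. It processes one clause at a time, extending any partial set that misses the clause and then discarding non-minimal sets. The function name and the frozenset encoding are assumptions for illustration; the talk does not say which algorithm was used.

```python
def minimal_hitting_sets(clauses):
    """Return all minimal sets that share an element with every clause."""
    partial = [frozenset()]
    for clause in clauses:
        extended = set()
        for s in partial:
            if s & clause:
                extended.add(s)  # already hits this clause
            else:
                extended.update(s | {x} for x in clause)
        # Keep only sets that have no proper subset among the candidates.
        partial = [s for s in extended if not any(t < s for t in extended)]
    # Smallest sets first, so cheaper experiments are tried first.
    return sorted(partial, key=lambda s: (len(s), sorted(s)))

CLAUSES = [
    frozenset("ABC"),
    frozenset("CDEF"),
    frozenset("DEFG"),
    frozenset("HI"),
]

hitting_sets = minimal_hitting_sets(CLAUSES)
print(len(hitting_sets))
print(frozenset("CEH") in hitting_sets)
```

Sorting by size is one simple prioritization: a smaller hitting set means fewer faults to inject per experiment.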
Where we’re headed
A fault injection framework (e.g. FIT)
Lineage-driven fault injection
Call graph tracing (e.g. Zipkin)
References
● ‘Automating Failure Testing at Internet Scale’ [ACM SoCC’16] https://people.ucsc.edu/~palvaro/fit-ldfi.pdf
● ‘Lineage Driven Fault Injection’ [ACM SIGMOD’15] http://people.ucsc.edu/~palvaro/molly.pdf
● Netflix Tech Blog on ‘Automated Failure Testing’ http://techblog.netflix.com/2016/01/automated-failure-testing.html
True Silicon Valley Stories
1. Crazy legwork
2. The “what the hell does our site do” project
3. Offsite => online