How Did I Get Here? Building Confidence in a Distributed Stream Processor
Sean T. Allen
T
Experience Report
Stream Processor
Prototype: Started January 2016
Production: Started April 2016
America is all about speed.
Hot, nasty, bad-ass speed. — Eleanor Roosevelt
Buffy: Goals
• High Throughput
• Low Latency
• Less Hardware
America is all about data quality.
Quiet, demure data quality. — Andrew Jackson
Buffy: Goals
• High Fidelity
Stream Processing
Message at a time
Never ending
Failure
Machine Failure
Slow Machine
Segfaulting Process
GC Pause
Network Error
Failure Happens
Delivery Guarantees
At-Most-Once: Best Effort
At-Least-Once: ACK or resend
Exactly-Once: At-Least-Once + Idempotence
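The "At-Least-Once + Idempotence" recipe can be sketched in a few lines. This is a minimal illustration, not Buffy's implementation: the `IdempotentReceiver` class, the `lose_ack` hook, and the use of a message ID for de-duplication are all assumptions made for the example.

```python
class IdempotentReceiver:
    """Applies each message at most once, keyed by a unique message ID."""

    def __init__(self):
        self.seen = set()   # IDs that have already been applied
        self.total = 0      # state that must not double count

    def deliver(self, msg_id, value):
        """Returns True as an ACK; a duplicate is ACKed but not re-applied."""
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.total += value
        return True


def send_at_least_once(receiver, messages, lose_ack):
    """Resends each message until an ACK gets through.

    `lose_ack(msg_id, attempt)` is a hypothetical hook simulating a lost ACK.
    """
    for msg_id, value in messages:
        attempt = 0
        while True:
            acked = receiver.deliver(msg_id, value)
            if acked and not lose_ack(msg_id, attempt):
                break
            attempt += 1


receiver = IdempotentReceiver()
# Lose the first ACK of every message, which forces one resend each.
send_at_least_once(receiver, [(1, 10), (2, 20), (3, 30)], lambda m, a: a == 0)
```

Every message is delivered twice here, yet the receiver's state reflects each value once, which is the effect "Exactly-Once" is after.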
Confidence
Black Box Testing
System Under Test
Input Source
Output Receiver
because Unit Testing isn't enough
because Integration Testing isn't enough
because composed components have interesting new failure modes
Test The Entire System end to end
and verify your expectations
Wesley: Expectation verification for Buffy
Input
Output
Input Source
Records sent data: 1,2,3,4
Output Receiver
Records received data: 2,4,6,8
Analyze!
It Works!
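The analyze step can be pictured as a comparison between what the input source recorded and what the output receiver recorded. The sketch below is only an illustration of that idea; the `verify` function and its report format are hypothetical, and the doubling function is inferred from the 1,2,3,4 to 2,4,6,8 example.

```python
from collections import Counter


def verify(sent, received, expected_fn):
    """Compares the receiver's record against what the sent data implies."""
    expected = [expected_fn(x) for x in sent]
    missing = Counter(expected) - Counter(received)
    unexpected = Counter(received) - Counter(expected)
    return {
        "passed": expected == received,
        "missing": sorted(missing.elements()),
        "unexpected": sorted(unexpected.elements()),
    }


def double(x):
    return x * 2


ok = verify([1, 2, 3, 4], [2, 4, 6, 8], double)
bad = verify([1, 2, 3, 4], [4, 6, 8], double)
```

In the second call one result never arrived, so the report fails and names the missing value.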
Spike: Fault injection for Buffy
Fault Injection
Lineage-driven fault injection
Spike: LDFI
Start from a good result
Input
Output
Figure out what can go wrong
Each "wrong" is a possible Nemesis
Our first nemesis: The Network
Determinism is key
Spike
Repeated runs with different results == Mostly Useless
Inject failures as informed by TCP
TCP Guarantees:
• Per connection in order delivery
• Per connection duplicate detection
• Per connection retransmission of lost data
TCP in Pony: Event Driven
Useless Notifier
Nemesis #1: Dropped Connections
Spike: Drop Connection
• Incoming connection accepted
• Attempting outgoing connection
• Connection established
• Data sent
• Data received
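The five events above are the points where a connection drop can be injected. A rough Python sketch of the idea follows; it is not Spike's actual code. The event names, `FakeConnection`, `RecordingNotifier`, and the "drop on the nth occurrence" rule are all hypothetical, chosen so that every run with the same settings fails in the same place.

```python
class RecordingNotifier:
    """Stand-in for application logic: records every event it is told about."""

    def __init__(self):
        self.events = []

    def on_event(self, name, conn, data=None):
        self.events.append(name)


class FakeConnection:
    """In-memory stand-in for a TCP connection."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class DropConnection:
    """Wraps a notifier and drops the connection at a chosen point."""

    def __init__(self, wrapped, target_event, nth=1):
        self.wrapped = wrapped
        self.target_event = target_event
        self.nth = nth          # drop on the nth occurrence of target_event
        self.count = 0

    def on_event(self, name, conn, data=None):
        if not conn.open:
            return              # nothing is delivered after the drop
        if name == self.target_event:
            self.count += 1
            if self.count == self.nth:
                conn.close()    # inject the fault instead of delivering
                self.wrapped.on_event("closed", conn)
                return
        self.wrapped.on_event(name, conn, data)


def run(target_event, nth):
    app, conn = RecordingNotifier(), FakeConnection()
    spike = DropConnection(app, target_event, nth)
    for name in ["accepted", "connecting", "connected",
                 "sent", "received", "sent", "received"]:
        spike.on_event(name, conn, b"x")
    return app.events
```

Calling `run("received", 2)` twice yields the same event list both times, with the connection closing in place of the second receive.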
Integrating Spike: "Double and Halve" app
• Easy to verify
• Messages cross process boundary
• Messages cross network boundary
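One reading of why this app is "easy to verify": if every value is doubled and then halved, the output should equal the input. The sketch below assumes that reading and assumes integer inputs; the real app runs its steps in separate processes across a network, which is not modelled here.

```python
def double(x):
    return x * 2


def halve(x):
    return x // 2


def double_and_halve(values):
    """Runs every value through both steps in order."""
    return [halve(double(v)) for v in values]


sent = [1, 2, 3, 4]
received = double_and_halve(sent)
passed = received == sent
```

Any lost, duplicated, or reordered message shows up as a plain list mismatch.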
Integrating Spike
• Double and Halve App
• No Spiking
• Test, Test, Test
• Wesley: It passes! It passes! It passes!
Integrating Spike
• Double and Halve App
• Spike with “drop connection”
• Test, Test, Test
• Wesley: It fails! It fails! It fails!
Integrating Spike
== Session Recovery!
Integrating Spike
• Double and Halve App
• Spike with “drop connection”
• Test, Test, Test
• Wesley: It passes! It passes! It passes!
Spike
Repeated runs with different results == Mostly Useless
Determinism & Spike
It's easy to get wrong
TCP delivery is not deterministic
TCP guarantees:
• Per connection in order delivery
• Per connection duplicate detection
• Per connection retransmission of lost data
but it doesn't guarantee determinism
Per method call Spiking won't work unless we make it work…
Determinism & Spike: TCP message framing
Determinism & Spike: Expect in action
Expect makes received deterministic
Received gets called with
then…
and then another…
and finally…
Same number of notifier method calls no matter how the data arrives
Drop Connection & Expect: fast deterministic friends
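Why framing plus an "expect" count gives the same number of `received` calls however TCP chunks the bytes can be shown with a small Python imitation. This is a sketch of the idea only, not Pony's API: `ExpectFramer`, the 4-byte big-endian length header, and the helper names are assumptions.

```python
import struct


class ExpectFramer:
    """Buffers bytes and calls `received` only with exactly the expected amount."""

    HEADER_SIZE = 4  # assumed framing: 4-byte big-endian payload length

    def __init__(self, received):
        self.received = received           # callback taking bytes
        self.buffer = bytearray()
        self.expected = self.HEADER_SIZE   # start by expecting a header
        self.in_header = True

    def feed(self, chunk):
        """Accepts a chunk of any size, as TCP might deliver it."""
        self.buffer.extend(chunk)
        while len(self.buffer) >= self.expected:
            piece = bytes(self.buffer[:self.expected])
            del self.buffer[:self.expected]
            self.received(piece)
            if self.in_header:
                # The header says how many payload bytes to expect next.
                (self.expected,) = struct.unpack(">I", piece)
            else:
                self.expected = self.HEADER_SIZE
            self.in_header = not self.in_header


def frame(payload):
    return struct.pack(">I", len(payload)) + payload


def calls_for(stream, chunk_size):
    calls = []
    framer = ExpectFramer(calls.append)
    for i in range(0, len(stream), chunk_size):
        framer.feed(stream[i:i + chunk_size])
    return calls


stream = frame(b"hello") + frame(b"stream")
```

Feeding `stream` one byte at a time, three bytes at a time, or all at once produces the same four calls: header, payload, header, payload. That fixed call count is what makes per-method-call fault injection repeatable.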
Nemesis #2: Slow Connections
Spike: Delay
Delay overrides expect and controls the flow of bytes to maintain determinism
[Diagram: Spike and TCP]
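One way to picture "controls the flow of bytes to maintain determinism" is a wrapper that holds bytes back and releases them in sizes taken from a seeded generator, so the downstream sees the same sequence whatever the arrival pattern. The sketch below is a guess at the shape of the idea, not Spike's Delay: the `Delay` class, the seed, and `max_release` are hypothetical, and real waiting time is not modelled.

```python
import random


class Delay:
    """Holds bytes back and releases them in sizes chosen by a seeded generator."""

    def __init__(self, downstream, seed, max_release=4):
        self.downstream = downstream       # callback taking bytes
        self.rng = random.Random(seed)     # same seed, same release sizes
        self.max_release = max_release
        self.buffer = bytearray()
        self.next_size = self.rng.randint(1, max_release)

    def feed(self, chunk):
        self.buffer.extend(chunk)
        while len(self.buffer) >= self.next_size:
            self._release(self.next_size)
            self.next_size = self.rng.randint(1, self.max_release)

    def finish(self):
        """Flushes whatever is left when the stream ends."""
        if self.buffer:
            self._release(len(self.buffer))

    def _release(self, n):
        piece = bytes(self.buffer[:n])
        del self.buffer[:n]
        self.downstream(piece)


def releases(stream, chunk_size, seed):
    out = []
    delay = Delay(out.append, seed)
    for i in range(0, len(stream), chunk_size):
        delay.feed(stream[i:i + chunk_size])
    delay.finish()
    return out
```

With a fixed seed, `releases(data, 1, seed)` and `releases(data, 5, seed)` return identical lists, because release sizes depend only on the seed and the byte stream, not on how the bytes arrived.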
Early Results
Found…
• Bugs in Session Recovery
• Bug in Pony standard library
• Bugs in Spike
• And more bugs…
Determinism is key but hard to achieve
Data Lineage
WARNING!!! Vaporware ahead
Output: How did I get here?
Data Lineage
Input: 1,2,3
Expect: 2,4,6
Get: 4,6
How did we get here? These are not our beautiful results
Data Lineage
Input: 1,2,3
Expect: 2,4,6
Get: 2,6,12
¯\_(ツ)_/¯
Data Lineage to the Rescue!
Externally verify determinism: is it REALLY deterministic?
Find incorrect executions: bugs in Buffy
Input: 1
Expected: 2
Got: 4
¯\_(ツ)_/¯
Execution path was…
when it should have been…
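The "Input: 1, Expected: 2, Got: 4" case can be illustrated by having each value carry a record of the steps applied to it. Since the slides flag data lineage as vaporware, this is purely a hypothetical sketch: the `Traced` class, the step names, and the "step applied twice" bug are invented for the example.

```python
class Traced:
    """A value that carries the list of steps applied to it."""

    def __init__(self, value, lineage=()):
        self.value = value
        self.lineage = tuple(lineage)

    def apply(self, name, fn):
        return Traced(fn(self.value), self.lineage + (name,))


def run_pipeline(steps, x):
    msg = Traced(x)
    for name, fn in steps:
        msg = msg.apply(name, fn)
    return msg


double = ("double", lambda v: v * 2)
expected_path = ("double",)

# Hypothetical bug: the step runs twice, for example after a resend.
buggy = run_pipeline([double, double], 1)
good = run_pipeline([double], 1)
```

The wrong output alone says only that 4 is not 2; comparing `buggy.lineage` with `expected_path` shows where the execution diverged.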
Data Lineage
Useful outside of development
Production Debugging: how did I get here?
Audit Log: why did you do that?
Hindsight Machine
Building Confidence is difficult
and frustrating
Don't be this dog
Be this dog
Peter Alvaro (@palvaro)
Lineage-driven Fault Injection: http://www.cs.berkeley.edu/~palvaro/molly.pdf
Outwards from the Middle of the Maze: https://www.youtube.com/watch?v=ggCffvKEJmQ
Will Wilson
Testing Distributed Systems w/ Deterministic Simulation: https://www.youtube.com/watch?v=4fFDFbi3toc
Caitie McCaffrey (@caitie)
The Verification of a Distributed System: A practitioner's guide to increasing confidence in system correctness
http://queue.acm.org/detail.cfm?ref=rss&id=2889274
The Verification of a Distributed System: 2:55 PM Tomorrow in Salon E
Inés Sombra (@randommood)
Testing in a Distributed World: https://www.youtube.com/watch?v=KSdNYi55kjg
Chaos Engineering
Principles of Chaos Engineering: http://principlesofchaos.org
Thanks
Peter Alvaro, Sylvan Clebsch, Zeeshan Lakhani, John Mumm, Rob Roland, Andrew Turley
@SeanTAllen
Note: The 'T' is very important