The Case for Chaos
The Case for Chaos – AWS Pop-up Loft
Bruce Wong – Engineering Manager – Chaos Engineering, Netflix
Who am I?
Bruce Wong
Netflix since 2010
Computer Science
Builds Engineering Teams
5 different teams so far
@bruce_m_wong
Agenda
Why?
Case Studies
How you can start chaos testing
Future chaos
Failure is Unavoidable
Disks Fail
Power outages. And your generator fails
Software bugs
Human Error
What about the cloud?
Cloud Case Study
XSA-108 Security Vulnerability
~10% of EC2 instances rebooted
Spread over 5 days
One Availability Zone at a time
Chaos Validated + Public Cloud Validated
Netflix & Micro-Services
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Graceful Degradation
Product + Engineering Decision
Designing for Failure
Infrastructure Failure
Instance terminations – single points of failure
Latency
Availability Zone
Regional
Application Failure
Graceful degradation
Software Bugs
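Graceful degradation can be sketched as a fallback wrapper around a failing dependency. This is a minimal illustration, not Netflix's implementation; `with_fallback`, `POPULAR_TITLES`, and the two recommendation functions are hypothetical names.

```python
from typing import Callable, List

# Hypothetical static fallback: a generic, unpersonalized list.
POPULAR_TITLES: List[str] = ["Title A", "Title B", "Title C"]


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or the fallback if the primary call fails."""
    try:
        return primary()
    except Exception:
        # Degrade gracefully: serve a less personalized response, not an error.
        return fallback


def broken_recommendations() -> List[str]:
    """Stand-in for a dependency that is currently failing."""
    raise ConnectionError("recommendation service unavailable")


def healthy_recommendations() -> List[str]:
    """Stand-in for the same dependency when it is working."""
    return ["Personalized 1", "Personalized 2"]
```

What the fallback should contain is the product + engineering decision: the code only makes room for it.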
Testing
Unit testing
Integration testing
Functional testing
Regression testing
Chaos Testing
Finding bugs earlier
Resilience needs to be tested
Testing is hard
Large and growing data sets
Internet-scale traffic
Innovation and New features
Change is constant
Resilience needs to be tested
Validate resilience design
Don’t wait for the next outage
Uncontrolled
Unpredictable
Hope is not a strategy
Types of Chaos
Instances Fail
Lessons
• Be as stateless as possible
• Autoscaling groups are good
• Invest in automation to rebuild state when necessary
• Running Chaos Monkey on C* (Cassandra)
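The instance-failure experiment can be sketched against an in-memory stand-in for an autoscaling group. This is an assumption-laden toy, not the real Chaos Monkey and not an AWS API; `FakeGroup` and `unleash_monkey` are hypothetical names.

```python
import random
from typing import List, Optional


class FakeGroup:
    """Hypothetical in-memory stand-in for an autoscaling group."""

    def __init__(self, name: str, desired: int) -> None:
        self.name = name
        self.desired = desired
        self._next_id = 0
        self.instances: List[str] = []
        self.reconcile()

    def reconcile(self) -> None:
        """Launch replacements until the group is back at its desired size."""
        while len(self.instances) < self.desired:
            self.instances.append(f"{self.name}-{self._next_id}")
            self._next_id += 1

    def terminate(self, instance_id: str) -> None:
        """Remove one instance, as a termination would."""
        self.instances.remove(instance_id)


def unleash_monkey(group: FakeGroup, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen instance and return its id."""
    if not group.instances:
        return None
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim
```

A stateless service survives this with nothing more than `reconcile()`; anything that kept state on the terminated instance needs the rebuild automation from the lessons above.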
Types of Chaos
Many Instances can Fail
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management tools can be a bottleneck
Types of Chaos
Natural Disasters Happen
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management can be a bottleneck
• Smaller Blast-Radius Benefits
• Traffic + Capacity orchestration is hard
Types of Chaos
Latency
Still Learning
• Functional fallbacks don’t account for system limitations
  • Thread pools
  • Connection pools
• Slow can be hard to find
• Slow can be hard to contain
• Unbounded Queues are BAD
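A small sketch of why a fallback alone does not contain latency, using Python's standard `concurrent.futures`. The names `slow_dependency` and `call_with_timeout` and the delays are illustrative assumptions. The caller times out and falls back, but the worker thread stays occupied until the slow call returns.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_dependency(delay_s: float) -> str:
    """Stand-in for a remote call with injected latency."""
    time.sleep(delay_s)
    return "fresh"


def call_with_timeout(pool: ThreadPoolExecutor, delay_s: float, timeout_s: float) -> str:
    """Call the dependency on the pool; fall back if it exceeds the timeout."""
    future = pool.submit(slow_dependency, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller moves on, but the worker keeps running the slow call.
        return "fallback"


if __name__ == "__main__":
    pool = ThreadPoolExecutor(max_workers=2)
    print(call_with_timeout(pool, 0.5, 0.05))  # slow call, times out
    print(call_with_timeout(pool, 0.5, 0.05))  # slow call, times out
    # Both workers are still busy, so even an instant call waits in the
    # pool's queue and times out as well.
    print(call_with_timeout(pool, 0.0, 0.05))
    pool.shutdown(wait=True)
```

In this sketch the third call has no latency of its own. It falls back only because the thread pool is saturated, which is the kind of system limitation a purely functional fallback misses.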
Unbounded Queues
Come in many forms, to name a few
Threads
Memory
Disk
Bounded by physical limitations
VERY difficult to find
Elastic is not Infinite
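One way to make the limit explicit is to bound the queue and reject work when it is full. A minimal sketch using Python's standard `queue` module; `try_enqueue` is a hypothetical helper and the capacity of 2 is arbitrary.

```python
import queue


def try_enqueue(q: queue.Queue, item: str) -> bool:
    """Add an item without blocking; return False if the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Shed load at a known limit instead of growing until memory runs out.
        return False


if __name__ == "__main__":
    q = queue.Queue(maxsize=2)  # explicit bound; maxsize=0 would be unbounded
    for name in ("a", "b", "c"):
        print(name, try_enqueue(q, name))
```

A rejected item is a visible, countable event, whereas an unbounded queue only shows up once a physical limit is hit.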
For Example: Memory and Data
Data is important
In-Memory Queue grows and shrinks
Failure Mode #1 – Out of memory
NOT A MEMORY LEAK!
For Example: Memory and Data
Data is important
If Queue gets to size X
Write to disk
Flush later
Failure Mode #2
Disk Full
File Descriptors Saturated
For Example: Memory and Data
Data is important
…
But not as important as uptime
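If uptime wins over data, one possible policy is a fixed-size buffer that evicts the oldest entries and counts what it dropped. The slides do not say which policy was chosen; `LossyBuffer` is a hypothetical sketch built on the standard `collections.deque`.

```python
from collections import deque
from typing import Any, List


class LossyBuffer:
    """Fixed-capacity buffer that drops the oldest item when full."""

    def __init__(self, capacity: int) -> None:
        self._items: deque = deque(maxlen=capacity)
        self.dropped = 0  # how much data was sacrificed for uptime

    def append(self, item: Any) -> None:
        """Add an item, evicting the oldest one if the buffer is full."""
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the oldest item is about to be evicted
        self._items.append(item)

    def drain(self) -> List[Any]:
        """Return all buffered items, oldest first, and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items
```

Memory use stays constant under any load, and the `dropped` counter makes the data loss measurable.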
Starting Chaos
Start small, very small.
Start simple, stateless systems
Start manually and coordinated
Failure Injection Fridays
Build confidence
Outages are opportunities
Chaos takes time
2010
2012
2014
Aspirational Chaos
Increase Frequency & Intensity
Reduces chance of drift
Infrastructure
Continuous Latency injection
Chaos Gorilla random AZ weekly
Latency Gorilla
CPU, Memory, Disk
Application
Continuous Validation of fallbacks
Startup dependency failure injection
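Continuous validation of fallbacks could look like a check that injects dependency failures and asserts the fallback is served. This is a sketch under stated assumptions, not a description of Netflix tooling; `FlakyDependency`, `fetch_with_fallback`, and `validate_fallback` are hypothetical names.

```python
import random
from typing import Callable


class FlakyDependency:
    """Wraps a callable and fails a configurable fraction of calls."""

    def __init__(self, func: Callable[[str], str], failure_rate: float,
                 rng: random.Random) -> None:
        self._func = func
        self._failure_rate = failure_rate
        self._rng = rng

    def __call__(self, key: str) -> str:
        # rng.random() is in [0.0, 1.0): a rate of 1.0 always fails, 0.0 never.
        if self._rng.random() < self._failure_rate:
            raise RuntimeError("injected failure")
        return self._func(key)


def fetch_with_fallback(dep: Callable[[str], str], key: str, default: str) -> str:
    """Return the dependency's answer, or the default if the call fails."""
    try:
        return dep(key)
    except Exception:
        return default


def validate_fallback(trials: int = 100) -> bool:
    """Inject failures on every call and confirm the fallback always answers."""
    failing = FlakyDependency(str.upper, failure_rate=1.0, rng=random.Random(0))
    return all(
        fetch_with_fallback(failing, "x", "default") == "default"
        for _ in range(trials)
    )
```

Running a check like this on a schedule, rather than once, is what guards against drift.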
Questions