CHAOS PATTERNS
Architecting for failure in distributed systems
Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans
http://www.soponderando.com.br/
http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
How to measure everything
Architecting in AWS for resilience & cost
www.slideshare.net/jiboumans/aws-architecting-for-resilience-cost-at-scale
http://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead
VP of Operations & Infrastructure
http://www.krux.com/
3 Billion Users
ABOUT BRUCE
2010 2015
Software Engineer
Insight Engineering
Senior Engineering Manager
Chaos Engineering
Prosumers Consumers Enterprise
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
A LOT OF TRAFFIC
http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
REAL WORLD FAILURES
SEPTEMBER 20TH, 2015
Also: April 21, 2011 - June 29, 2012 - October 22, 2012 - December 24, 2012 - August 26, 2013 <out of space>
https://twitter.com/iamDeveloper/status/645659734767329281 https://aws.amazon.com/message/5467D2/
ISOLATION & CONTAINMENT
Ideally limit failure to a single service
Stop it from spreading
http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
Amazon Cloud Major Outage - Issues Categories
# of Issues
Software: 8
Automation: 4
Process: 14
https://steamcommunity.com/app/620/ http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg
AWS Root Cause Analysis over time
http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
Humans, Software, Processes
All likely causes of failure
Isolation Unlikely
2 - 4x Yearly frequency of catastrophic failure
THERE ARE DOWNSIDES
http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
Complex Systems
Difficult to model, not feasible to simulate at scale
Software is Iterative
testing, code coverage, “agile”
Resilience Design is also Iterative
…unlike software, complexity makes testing difficult
Rich Search Experience
Many optional enhancements
http://usa.streetsblog.org/category/issues-campaigns/air-quality/
NAVIGATING THE CHAOS
FALLBACK PATTERNS
“Expect the Unexpected”
http://blabitcanada.com/category/twitter-2/
BASIC API CALL
3 potential points of failure
FALLBACK PATTERNS
The cost of resilience should be accuracy or latency
http://redis.io/ http://memcached.org/
http://varnish-cache.org/
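One way to pay for resilience with accuracy is to serve the last known value when the backend call fails. The sketch below is only an illustration under stated assumptions: the class name `StaleCacheFallback` and its parameters are hypothetical, and a plain in-process dictionary stands in for a real cache such as Redis, Memcached, or Varnish.

```python
import time
from typing import Callable, Dict, Tuple


class StaleCacheFallback:
    """Wrap a backend call; on failure, serve the last known value.

    Hypothetical helper for illustration, not taken from the slides.
    """

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 60.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        # key -> (value, time stored); stands in for an external cache.
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale)."""
        cached = self._cache.get(key)
        now = time.monotonic()
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0], False  # fresh hit, no backend call
        try:
            value = self._fetch(key)
        except Exception:
            if cached is not None:
                # Backend is down: give up accuracy to stay available.
                return cached[0], True
            raise  # nothing to fall back to
        self._cache[key] = (value, now)
        return value, False
```

Returning the `is_stale` flag lets the caller decide whether to tell the user that the data may be out of date.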
ENSURING DATA ACCESS
https://www.flickr.com/photos/ichijo2009/8501266124
CAP THEOREM APPLIES
Your choice: sacrifice availability or consistency. Orange is a lie.
RDBMS BigTable Based
Master / Slave based
CouchDB Dynamo Based
http://ferd.ca/beating-the-cap-theorem-checklist.html
SPLIT OUT YOUR CONTROL PLANE
http://paul-barford.blogspot.com/2015/01/sappho-pap-obbink-further-painting-into.html
EC2 EMR RDS
Dynamo
Cloudfront CDN
Route53 DNS
Cloudwatch Monitoring
Control plane Separate from workload
DNS & CDN Your best friends
Latency or Accuracy
Pick one to sacrifice for resilience
USER EXPERIENCE
My tweet got posted
http://mclaughlindrums.com/wp-content/uploads/2013/04/Relativity-by-Escher.jpg
ORDERED CHAOS
Nation’s Business, 1977
CHAOS DEFINED
Intentionally introducing failure into a system with the purpose of validating resilience design.
http://www.cnbc.com/id/102394893
BREAKING THE SYSTEM
How Confident are you?
-Next week?
-Next month?
-After that “quick patch”?
CHAOS VS OUTAGE
Chaos
• Controlled
• Planned
• Intentional
• Microscopic user impact
Outages
• Uncontrolled
• Unpredictable
• Unintended
• Large impact
Single Point of Failure
Discover - Fix - Validate
CHAOS MONKEY
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
https://github.com/Netflix/SimianArmy
9am-5pm Mon-Fri Don’t upset your on-call
1 Instance Per group / per day
Detect SPOF Intentionally
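The scheduling rules above (business hours only, one instance per group per day) can be sketched as a small selection function. This is a minimal illustration, not the SimianArmy implementation: `in_business_hours`, `pick_victims`, and the `groups` mapping are hypothetical names, and the "per day" limit is assumed to be enforced by running the function once a day.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_business_hours(now: datetime) -> bool:
    """True on Mon-Fri between 09:00 and 17:00 (in the timezone of `now`)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(groups: Dict[str, List[str]], now: datetime,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Pick at most one instance per group, and only during business hours.

    `groups` maps a group name to its instance ids (hypothetical shape).
    The caller is expected to run this once per day and terminate the
    returned instances; nothing is terminated here.
    """
    if not in_business_hours(now):
        return {}  # do not upset the on-call outside office hours
    rng = rng or random.Random()
    return {name: rng.choice(instances)
            for name, instances in groups.items() if instances}
```

Keeping the selection separate from the termination call makes the policy easy to test without touching real infrastructure.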
Slow is Hard
Product + Business + Engineering Decisions
https://pragprog.com/book/mnee/release-it
Custom Fallback
accuracy or latency
Fail Silent For optional data
Fail Fast to keep servers healthy
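Fail silent and fail fast can be sketched in a few lines. The code below is an assumption-laden illustration: the slides do not name a mechanism, so a circuit breaker is used here as one common way to fail fast, and `fail_silent`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fail_silent(call: Callable[[], T], default: T) -> T:
    """For optional data: swallow the error and return a default."""
    try:
        return call()
    except Exception:
        return default


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Fail fast: after `max_failures` consecutive errors, reject calls
    immediately for `reset_seconds` instead of tying up the server."""

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_seconds:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down is over: allow one trial call. A single further
            # failure re-opens the breaker.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

A caller might combine the two, for example `fail_silent(lambda: breaker.call(fetch_recommendations), default=[])`, where `fetch_recommendations` is a placeholder for any optional enhancement.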
LATENCY MONKEY
other frameworks
http://www.infoq.com/presentations/failure-as-a-service-netflix
http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
HTTP 5xx 1 minute duration
10-100ms Sleep during request
1-100% Of requests
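The injection parameters above (an HTTP 5xx response or a 10-100 ms sleep, applied to 1-100% of requests) can be sketched as a wrapper around a request handler. This is a hypothetical illustration rather than the Latency Monkey or FIT API: `inject_faults`, the `Handler` type, and the choice of status 503 are assumptions, and limiting an experiment to a fixed duration such as one minute is left to the caller.

```python
import random
import time
from typing import Callable, Optional, Tuple

Handler = Callable[[str], Tuple[int, str]]  # path -> (status, body)


def inject_faults(handler: Handler, percent: float, mode: str,
                  rng: Optional[random.Random] = None,
                  sleep: Callable[[float], None] = time.sleep) -> Handler:
    """Wrap `handler` so that `percent` (1-100) of requests get a fault.

    mode="error"   -> respond with HTTP 503 without calling the handler
    mode="latency" -> sleep 10-100 ms, then call the handler normally
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if mode not in ("error", "latency"):
        raise ValueError("mode must be 'error' or 'latency'")
    rng = rng or random.Random()

    def wrapped(path: str) -> Tuple[int, str]:
        if rng.random() * 100 < percent:
            if mode == "error":
                return 503, "injected failure"
            sleep(rng.uniform(0.010, 0.100))
        return handler(path)

    return wrapped
```

Passing `sleep` and `rng` as parameters keeps the wrapper deterministic under test while defaulting to real delays in use.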
Prevent Propagation
to avoid cascading failure
CHAOS KONG
because regions fail
http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html
GeoDNS fallback to LatencyDNS
Proxy Cross-Region communication
Capacity Cost-Benefit Decision
"ONCE IN A BLUE MOON"
Happens at least a few times a year...
https://whisperofangels.wordpress.com/2013/08/20/once-in-a-blue-moon/
TAKE AWAY
Go found chaos engineering at your company RIGHT NOW
Most enterprises hire people to fix things. Netflix hires people to break things….
…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.
http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone
Q & A
http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
@bruce_m_wong / @jiboumans
Slides - https://www.linkedin.com/in/brucemwong