Antifragility and testing for distributed systems failure

Antifragility and testing distributed systems: Approaches for testing and improving resiliency

Transcript of Antifragility and testing for distributed systems failure




Failure
It’s inevitable


Microservice Architectures

■ Bounded contexts
■ Deterministic in nature
■ Simple behaviour
■ Independently testable (e.g. Pact)


Distributed Architectures

Conversely…

■ Unbounded context
■ Non-determinism
■ Exhibit chaotic behaviour
■ Emergent behaviour
■ Complex testing


Problems with traditional approaches

■ Integration test hell
■ Need to get by without E2E environments
■ Learnings are non-representative anyway
■ Slower
■ Costly (effort + $$)


Alternative?

Create an isolated, simulated environment

■ Run locally or in a CI environment
■ Fast - no need to set up complex test data, scenarios etc.
■ Enables single-variable hypothesis testing
■ Automatable


Lab Testing with Docker Compose
Hypothesis testing in simulated environments


Docker Compose

■ Docker container orchestration tool
■ Run locally or remotely
■ Works across platforms (Windows, Mac, *nix)
■ Easy to use (a minimal sketch follows)
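
As a minimal sketch of what this looks like in practice (the service and image names here are hypothetical, not from the talk), a single Compose file describes the system under test together with its simulated dependencies:

```yaml
# docker-compose.yml - a minimal, hypothetical simulated environment.
# One file describes the service under test and its dependencies, so
# the whole stack can be brought up locally or on CI in one command.
version: "2"
services:
  api:
    build: .                # the service under test
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:9.5     # a throwaway, containerised dependency
    environment:
      POSTGRES_DB: api_test
```

Running docker-compose up starts the environment and docker-compose down throws it away, which keeps every experiment repeatable.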


Nginx

Let’s take a practical, real-world example: Nginx as an API Proxy.
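
A minimal sketch of such a proxy configuration (the hostnames and timeout values are illustrative assumptions, not the talk's actual Production config):

```nginx
# nginx.conf (snippet): proxy API traffic to an upstream service.
upstream backend_api {
    server api.example.internal:8080;   # hypothetical upstream host
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://backend_api;
        proxy_connect_timeout 1s;   # fail fast on an unreachable upstream
        proxy_read_timeout 2s;      # bound the wait on a slow response
    }
}
```

The interesting question is how settings like these behave when the upstream degrades - which is exactly what a simulated environment lets us test.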


Simulating failure with Muxy

“A tool to help simulate distributed systems failures”
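
Muxy sits between a client and the real destination as a proxy, with middleware that tampers with traffic in transit. A hedged sketch of a configuration (the field names below approximate the Muxy README; treat them as assumptions and check the repo for the exact schema):

```yaml
# muxy.yml - illustrative only; see https://github.com/mefellows/muxy
# for the authoritative configuration schema.
proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0
      port: 8181                    # clients connect here...
      proxy_host: api.example.com   # ...and traffic is forwarded here
      proxy_port: 80

middleware:
  - name: delay                     # inject latency into requests
    config:
      request_delay: 1000           # assumed to be milliseconds
```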


Hypothesis testing

Our job is to hypothesise, test, learn, change, and repeat


Nginx Testing
H0 = Introducing network latency does not cause errors

Test setup:

● Nginx running locally, with Production configuration
● DNSMasq used to resolve production URLs to other Docker containers
● Muxy container setup, proxying the API
● A test harness to hit the API via Nginx n times, expecting 0 failures (a possible wiring is sketched below)
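
Wired together with Docker Compose, the setup might look roughly like this (image names, file paths and the DNS plumbing are assumptions; the onegeek.com.au article in the references has the working version):

```yaml
# docker-compose.yml - hypothetical wiring for the Nginx experiment.
version: "2"
services:
  nginx:
    image: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # the Production config
    ports:
      - "8080:80"
    # Nginx is pointed at the dnsmasq container for DNS, so that
    # production hostnames resolve to the muxy container below.
  dnsmasq:
    image: andyshinn/dnsmasq        # assumed image choice
  muxy:
    image: mefellows/muxy           # assumed image name
    volumes:
      - ./muxy.yml:/etc/muxy/conf.yml   # assumed config path
  api:
    image: example/api              # stand-in for the real downstream API
```

The harness hits Nginx n times and counts failures. Because the only variable is the fault Muxy injects, any change in the failure count can be attributed to it - single-variable hypothesis testing.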


Demo

Fingers crossed...


Knobs and Levers

We now have a number of levers to pull. What if we...

● Want to improve on our SLA?
● Want to see how it performs if the API is hard down?
● ...


Antifragility
Failure is inevitable, let’s make it normal


Titanic Architectures


“Titanic architectures are architectures that are good in theory, but haven’t been put into practice”


Anti-titanic architectures?

“What doesn’t kill you makes you stronger”


Antifragility

“The resilient resists shocks and stays the same; the antifragile gets better” - Nassim Taleb


Chaos Engineering

● We expect our teams to build resilient applications
  ○ Fault tolerance across and within service boundaries
● We expect servers and dependent services to fail
● Let’s make that normal
● Production is a playground
● Levelling up


Chaos Engineering - Principles

1. Build a hypothesis around Steady State Behavior
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously

Requires the ability to measure - you need metrics!!

http://www.principlesofchaos.org/
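
As a toy illustration of principle 1 and why metrics matter (the endpoint, JSON shape and thresholds here are all hypothetical; substitute queries against your real monitoring system), a steady-state hypothesis can be written down as an executable check:

```go
// steadystate.go - a hypothetical steady-state check.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type Metrics struct {
	RequestsPerSec float64 `json:"rps"`
	ErrorRate      float64 `json:"error_rate"`
}

// steadyState encodes the hypothesis: during the experiment the error
// rate stays under 0.1% and traffic remains within normal bounds.
func steadyState(m Metrics) bool {
	return m.ErrorRate < 0.001 && m.RequestsPerSec > 100
}

func main() {
	resp, err := http.Get("http://metrics.internal/current") // hypothetical endpoint
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var m Metrics
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		panic(err)
	}
	fmt.Println("steady state held:", steadyState(m))
}
```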


Production Hypothesis Testing

H0 = Loss of an AWS region does not result in errors

Test setup:

● Multi-region application setup for the video playing API
● Apply Chaos Kong to us-west-2
● Measure aggregate production traffic for ‘normal’ levels


Kill an AWS region

http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html


Go/Hystrix API Demo

H0 = Introducing network latency does not cause API errors

Test setup:

● API1 running with a Hystrix circuit breaker, which trips if API2 does not respond within its SLA
● Muxy container setup, proxying upstream API2
● A test harness to hit API1 n times, expecting 0 failures (a sketch follows)
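
A minimal sketch of API1's side of this, using the hystrix-go library (the command name, URL and SLA values are illustrative; the muxy repo linked in the references holds the actual demo code):

```go
// api1.go - API1 calls API2 behind a Hystrix circuit breaker.
package main

import (
	"io/ioutil"
	"net/http"

	"github.com/afex/hystrix-go/hystrix"
)

func main() {
	// Trip the breaker when API2 breaches its (assumed) 500ms SLA.
	hystrix.ConfigureCommand("api2", hystrix.CommandConfig{
		Timeout:               500, // milliseconds
		MaxConcurrentRequests: 100,
		ErrorPercentThreshold: 25,
	})

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		output := make(chan []byte, 1)
		errs := hystrix.Go("api2", func() error {
			// This call goes via the Muxy proxy, which injects latency.
			resp, err := http.Get("http://api2:8081/")
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			body, err := ioutil.ReadAll(resp.Body)
			if err != nil {
				return err
			}
			output <- body
			return nil
		}, func(err error) error {
			// Fallback: degrade gracefully rather than surface an error.
			output <- []byte(`{"status": "degraded"}`)
			return nil
		})

		select {
		case body := <-output:
			w.Write(body)
		case err := <-errs:
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
		}
	})

	http.ListenAndServe(":8080", nil)
}
```

When Muxy injects latency beyond the timeout, the fallback serves a degraded response instead of an error, so the harness still sees n successful responses - which is exactly what H0 asserts.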


Human Factors
Technology is only part of the problem; can we test that too?


Chernobyl

● Worst nuclear disaster of all time (1986)
● Public information sketchy
● Estimated > 3M Ukrainians affected
● Radioactive clouds sent over Europe
● Combination of system + human errors
● Series of seemingly logical steps -> catastrophe


What we know about human factors

● Accidents happen
● 1am - 8am = higher incidence of human errors
● Humans will ignore directions
  ○ They sometimes need to (e.g. override)
  ○ Other times they think they need to (mistake)
● Computers are better at following processes


Translation

Let’s use a Production deployment as a key example:

● CI -> CD pipeline used to deploy
● Production incident occurs 6 hours later (2am)
● ...what do we do?
● We trust the build pipeline and avoid non-standard actions

These events help us understand and improve our systems



Game Day Exercises

“A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”



Prerequisites:

● A game plan
● All team members and affected staff aware of it
● Close collaboration between Dev, Ops, Test, Product people etc.
● An open mind
● Hypotheses
● Metrics
● Bravery



● Get the entire team together
● Make a simple diagram of the system on a whiteboard
● Come up with ~5 failure scenarios
● Write down hypotheses for each scenario
● Back up any data you can’t lose
● Induce each failure and observe the results


https://stripe.com/blog/game-day-exercises-at-stripe


Examples of things that fail:

● Application dies
● Hard disk fails
● Machine dies < AZ < Region…
● GitHub/source control goes down
● Build server dies
● Loss of / degraded network connectivity
● Loss of a dependent API
● ...



Wrapping up
I hope I didn’t fail


■ Apply the scientific method
■ Use metrics to learn and make decisions
■ Docker Compose + Muxy to automate failure testing
■ Build resilience into software & architecture
■ Regularly test Production resilience until it’s normal
■ Production outages are opportunities to learn
■ Start small!



Thank you

PRESENTED BY:

@matthewfellows


References

■ Antifragility (https://en.wikipedia.org/wiki/Antifragile)
■ Chaos Engineering (http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html)
■ Principles of Chaos (http://www.principlesofchaos.org/)
■ Human factors in large-scale technological systems’ accidents: Three Mile Island, Bhopal, Chernobyl (http://oae.sagepub.com/content/5/2/133.abstract)


Code / Tool References

■ Docker Compose (https://www.docker.com/docker-compose)
■ Muxy (https://github.com/mefellows/muxy)
■ Nginx resilience testing with Docker Compose (www.onegeek.com.au/articles/resilience-testing-nginx-with-docker-dnsmasq-and-muxy)
■ Golang + Hystrix resilience testing with Docker Compose (https://github.com/mefellows/muxy/tree/mst-meetup-demo/examples/hystrix)