Expect the unexpected: Prepare for failures in microservices

Expect the unexpected: Anticipate and prepare for failures in microservices

Bhakti Mehta

@bhakti_mehta

Introduction

•  Senior Software Engineer at Blue Jeans Network
•  Worked at Sun Microsystems/Oracle for 13 years
•  Committer to numerous open source projects including GlassFish Application Server

My recent book

Previous book

Blue Jeans Network

•  Video conferencing in the cloud
•  Customers in all segments
•  Millions of users
•  Interoperable
•  Video sharing, content sharing
•  Mobile friendly
•  Solutions for large scale events

What you will learn

•  Microservices architecture
•  Challenges at scale
•  Lessons learned, tips and practices to prevent cascading failures
•  Resilience planning at various stages
•  Real world examples

Top level architecture

[Architecture diagram. Labels: Customer A, Customer B, Internet, SIP / H.323, HTTP / HTTPS, Media Node, Web Server, Proxy layer, Connector Node, Middleware services, Cache, Service discovery, Messaging, DB]

Microservices architecture

Path to microservices

•  Advantages
    – Simplicity
    – Isolation of problems
    – Scale up and scale down
    – Easy deployment
    – Clear separation of concerns
    – Heterogeneity and polyglotism

Microservices

•  Disadvantages
    – Not a free lunch!
    – Distributed systems are prone to failures
    – Eventual consistency
    – More effort in terms of deployments and release management
    – Challenges in testing the various services evolving independently, regression tests, etc.

Monoliths to microservices

Resilient system

•  Processes transactions, even when there are transient impulses or persistent stresses
•  Functions even when there are component failures disrupting normal processing
•  Accepts that failures will happen
•  Designs for crumple zones

Kinds of failures

•  Challenges at scale
•  Integration point failures
    – Network errors
    – Semantic errors
    – Slow responses
    – Outright hang
    – GC issues

Anticipate failures at scale

•  Anticipate growth
•  Design for the next order of magnitude
•  Design for 10x, plan to rewrite for 100x

Resiliency planning Stage 1

•  When developing code
    – Avoiding cascading failures
        •  Circuit breaker
        •  Timeouts
        •  Retry
        •  Bulkhead
        •  Cache optimizations
    – Avoid malicious clients
        •  Rate limiting
•  Planning for dealing with failures before deploy
    – Load test
    – A/B test
    – Longevity
•  Watching out for failures after deploy
    – Health check
    – Metrics

Cascading failures

Caused by chain reactions. For example: one node in a load-balanced group fails, the others need to pick up its work, and eventually performance can degenerate.

Cascading failures with aggregation

Timeouts

•  Clients may prefer a response
    – failure
    – success
    – job queued for later
•  All aggregation requests to microservices should have reasonable timeouts set

Types of Timeouts

•  Connection timeout
    – Max time allowed to establish a connection before an error is raised
•  Socket timeout
    – Max time of inactivity between two packets once the connection is established

Timeouts pattern

•  Timeouts and retries go together
•  Transient failures can be remedied with fast retries
•  However, problems in the network can last for a while, so there is a real probability that the retries fail as well

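Timeouts in code

In JAX-RS (Jersey client properties; values are in milliseconds):

    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import org.glassfish.jersey.client.ClientProperties;

    Client client = ClientBuilder.newClient();
    // Fail if the connection cannot be established within 5 seconds
    client.property(ClientProperties.CONNECT_TIMEOUT, 5000);
    // Fail if no data is read within 5 seconds
    client.property(ClientProperties.READ_TIMEOUT, 5000);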

Retry pattern

•  Retry on network failures, timeouts or server errors
•  Helps with transient network errors such as dropped connections or server failover
•  If one of the services is slow or malfunctioning and other services keep retrying, the problem becomes worse
•  Solution
    – Exponential backoff
    – Circuit breaker pattern
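Exponential backoff can be sketched in a few lines. The class below is a hypothetical helper, not code from the talk: the name RetryWithBackoff and the parameters (maxAttempts, baseDelayMs, maxDelayMs) are assumptions chosen for illustration. The delay doubles after each failed attempt and is capped at a maximum.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a call, doubling the wait after each failure. */
final class RetryWithBackoff {
    private RetryWithBackoff() {}

    /** Delay before the retry that follows failed attempt number `attempt` (0-based). */
    static long backoffDelayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        // base * 2^attempt; the shift is capped so the value cannot overflow for sane inputs
        long delay = baseDelayMs << Math.min(attempt, 20);
        return Math.min(delay, maxDelayMs);
    }

    /** Runs the task up to maxAttempts times and rethrows the last failure. */
    static <T> T call(Callable<T> task, int maxAttempts, long baseDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs));
                }
            }
        }
        throw last;
    }
}
```

A production version would usually add random jitter to the delay and retry only on errors known to be transient; both are left out here to keep the sketch short.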

Circuit breaker pattern

A circuit breaker is an electrical device used in an electrical panel that monitors and controls the amount of amperes (amps) being sent through the electrical wiring.

•  Safety device
•  If a power surge occurs in the electrical wiring, the breaker will trip
•  Flips from "On" to "Off" and shuts off electrical power from that breaker

Circuit breaker

•  Netflix Hystrix follows the circuit breaker pattern
•  If a service's error rate exceeds a threshold, it trips the circuit breaker and blocks requests for a specific period of time
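The trip-and-block behaviour can be illustrated with a small sketch. This is not Hystrix and does not use its API: SimpleCircuitBreaker is a hypothetical class, and for brevity it trips on a count of consecutive failures rather than on an error rate. The clock is passed in so the cool-down period can be tested without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker; trips on consecutive failures. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // returns the current time in milliseconds
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True when closed, or when the cool-down has elapsed and a trial call is allowed. */
    synchronized boolean allowsRequests() {
        return !open || clock.getAsLong() - openedAt >= openMillis;
    }

    /** Runs the protected call, or returns the fallback while the breaker is open. */
    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (!allowsRequests()) {
            return fallback.get(); // fail fast without touching the struggling service
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true; // trip, or re-trip after a failed trial call
            openedAt = clock.getAsLong();
        }
    }
}
```

In use, the clock would typically be System::currentTimeMillis, and the fallback might return cached data or a default response.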

Bulkhead

•  Avoiding chain reactions by isolating failures
•  Helps prevent cascading failures
•  An example of a bulkhead is isolating the database dependencies per service
•  Similarly, other infrastructure components, such as cache infrastructure, can be isolated
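The slides describe bulkheads at the infrastructure level (a database or cache per service). The same idea can be applied inside a single process by giving each dependency its own fixed budget of concurrent calls, so that one slow dependency cannot consume every thread. The sketch below is an assumption-laden illustration of that in-process variant; the class name Bulkhead and its methods are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical in-process bulkhead: caps concurrent calls to one dependency. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the task if a permit is free; otherwise returns the rejection value at once. */
    <T> T call(Supplier<T> task, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment is full; do not queue behind slow calls
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

One instance would be created per downstream dependency, for example one for the database and another for the cache, each sized independently.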

Rate limiting

•  Restricting the number of requests that can be made by a client
•  Clients can be identified by the access token used
•  Additionally, clients can be identified by IP address
•  With JAX-RS, rate limiting can be implemented as a filter
•  The filter checks the access count for a client and accepts the request if it is within the limit
•  Otherwise, it returns a 429 (Too Many Requests) error
•  Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
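The counting logic that such a filter would consult can be sketched independently of JAX-RS. The class below is a hypothetical fixed-window counter, not the code from the linked repository; the names FixedWindowRateLimiter and tryAcquire are assumptions. The client id could be an access token or an IP address, and a filter would answer with status 429 whenever tryAcquire returns false.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fixed-window limiter: at most `limit` requests per client per window. */
final class FixedWindowRateLimiter {
    private static final class Window {
        long start;
        int count;
    }

    private final int limit;
    private final long windowMillis;
    private final Map<String, Window> windows = new HashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request is within the limit and records it. */
    synchronized boolean tryAcquire(String clientId, long nowMillis) {
        Window w = windows.computeIfAbsent(clientId, id -> {
            Window fresh = new Window();
            fresh.start = nowMillis;
            return fresh;
        });
        if (nowMillis - w.start >= windowMillis) {
            // The previous window has expired; start a new one
            w.start = nowMillis;
            w.count = 0;
        }
        if (w.count >= limit) {
            return false;
        }
        w.count++;
        return true;
    }
}
```

A fixed window is the simplest scheme; it allows short bursts at window boundaries, which token-bucket or sliding-window limiters avoid.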

Cache optimizations

•  Stores response information related to requests in temporary storage for a specific period of time
•  Ensures that the server is not burdened with processing those requests again when the responses can be served from the cache

Cache optimizations

•  Getting from first level cache
•  Getting from second level cache
•  Getting from the DB
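The lookup order first level cache, second level cache, DB can be sketched as a read-through chain. Everything here is illustrative: TwoLevelCache is a hypothetical class, a plain Map stands in for the shared second level cache, a Function stands in for the database query, and expiry of entries is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical read-through lookup: local cache, then shared cache, then database. */
final class TwoLevelCache<K, V> {
    private final Map<K, V> firstLevel = new ConcurrentHashMap<>(); // in-process cache
    private final Map<K, V> secondLevel;   // stand-in for a shared cache
    private final Function<K, V> database; // stand-in for the database query

    TwoLevelCache(Map<K, V> secondLevel, Function<K, V> database) {
        this.secondLevel = secondLevel;
        this.database = database;
    }

    V get(K key) {
        V value = firstLevel.get(key);
        if (value != null) {
            return value; // served from the first level cache
        }
        value = secondLevel.get(key);
        if (value == null) {
            value = database.apply(key); // last resort: hit the database
            if (value == null) {
                return null; // nothing to cache
            }
            secondLevel.put(key, value);
        }
        firstLevel.put(key, value); // populate the faster level for the next request
        return value;
    }
}
```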

Dealing with latencies in response

•  Have a timeout for the aggregation service
•  Dispatch requests in parallel and collect responses
•  Associate a priority with all the responses collected
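Dispatching in parallel under one overall timeout might look like the sketch below. It is a hypothetical helper (Aggregator.collect is not from the talk) that uses only Java 8 APIs: each call is started with CompletableFuture.supplyAsync, all calls share a single deadline, and any call that fails or misses the deadline is skipped so that partial results are returned. Prioritising the collected responses is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical aggregator: parallel dispatch with one shared deadline. */
final class Aggregator {
    private Aggregator() {}

    static <T> List<T> collect(List<Supplier<T>> calls, ExecutorService executor,
                               long timeoutMillis) {
        // Start every call before waiting on any of them
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (Supplier<T> call : calls) {
            futures.add(CompletableFuture.supplyAsync(call, executor));
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            long remaining = Math.max(deadline - System.nanoTime(), 0);
            try {
                results.add(future.get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true); // slow or failed: leave it out of the partial result
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return results;
    }
}
```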

Handling partial failures: best practices

•  One service calls another, which can be slow or unavailable
•  Never block indefinitely waiting for the service
•  Try to return partial results
•  Provide a caching layer and return cached data

Asynchronous patterns

•  Pattern to deal with long-running jobs
•  Some resources may take a longer time to provide results
•  The client does not need to wait for the response

Reactive programming model

•  Use reactive programming constructs such as CompletableFuture in Java 8, ListenableFuture
•  RxJava
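As a small illustration of the CompletableFuture style, the sketch below starts two calls without blocking, substitutes a default if one of them fails, and combines the results when both are ready. The class AsyncProfile and the two supplier parameters (userService, meetingService) are hypothetical stand-ins for remote calls.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical example of composing two asynchronous calls with Java 8 APIs. */
final class AsyncProfile {
    private AsyncProfile() {}

    static CompletableFuture<String> load(Supplier<String> userService,
                                          Supplier<String> meetingService) {
        // Both calls run on the common pool; neither blocks the caller
        CompletableFuture<String> user = CompletableFuture.supplyAsync(userService);
        CompletableFuture<String> meetings = CompletableFuture.supplyAsync(meetingService)
                // Degrade gracefully if the second service fails
                .exceptionally(error -> "meetings unavailable");
        // Combine once both futures have completed
        return user.thenCombine(meetings, (u, m) -> u + ": " + m);
    }
}
```

The caller receives a future immediately and can attach further stages with thenApply or thenAccept instead of waiting on the result.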

Asynchronous API

•  Reactive patterns
•  Message passing
    – Akka actor model
•  Message queues
    – Communication between services via shared message queues
    – WebSockets

Logging

•  Complex distributed systems introduce many points of failure
•  Logging helps link events and transactions between the various components that make up an application or a business service
•  ELK stack
•  Splunk, syslog
•  Loggly
•  LogEntries

Logging best practices

•  Use a detailed, consistent pattern across service logs
•  Obfuscate sensitive data
•  Identify the caller or initiator as part of the logs
•  Do not log payloads by default

Best practices when designing APIs for mobile clients

    – Avoid chattiness
    – Use the aggregator pattern

Resilience planning Stage 2

•  Before deploy
    – Load testing
    – Longevity testing
    – Capacity planning

Load testing

•  Ensure that you test for load on APIs
    – JMeter
•  Plan for longevity testing

Capacity Planning

•  Anticipate growth
•  Design for handling exponential growth

Resilience planning Stage 3

•  After deploy
    – Health check
    – Metrics and monitoring
    – Phased rollout of features

Health Check

•  Memory
•  CPU
•  Threads
•  Error rate
•  If any of the checks exceeds a threshold, send an alert
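A threshold-based health check can be sketched with the JDK's own management beans. The class and metric names below (HealthCheck, heapUsedFraction, liveThreads, systemLoadAverage) are hypothetical, and the alert is simply returned as a string; wiring it to email or a pager is outside the sketch. Error rate would have to come from application counters and is not shown.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical health check: compare readings against thresholds and list alerts. */
final class HealthCheck {
    private HealthCheck() {}

    /** Reads a few JVM and OS figures from the standard management beans. */
    static Map<String, Double> snapshot() {
        Runtime runtime = Runtime.getRuntime();
        double usedHeap = runtime.totalMemory() - runtime.freeMemory();
        Map<String, Double> readings = new LinkedHashMap<>();
        readings.put("heapUsedFraction", usedHeap / runtime.maxMemory());
        readings.put("liveThreads",
                (double) ManagementFactory.getThreadMXBean().getThreadCount());
        // getSystemLoadAverage() returns a negative value where it is unsupported
        readings.put("systemLoadAverage",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        return readings;
    }

    /** Returns one message per reading that exceeds its threshold. */
    static List<String> alerts(Map<String, Double> readings, Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, Double> reading : readings.entrySet()) {
            Double limit = thresholds.get(reading.getKey());
            if (limit != null && reading.getValue() > limit) {
                alerts.add(reading.getKey() + " = " + reading.getValue()
                        + " exceeds " + limit);
            }
        }
        return alerts;
    }
}
```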

Metrics

•  Response times, throughput
    – Identify slow-running DB queries
•  GC rate and pause duration
    – Garbage collection can cause slow responses
•  Monitor unusual activity
•  Third-party library metrics
    – For example, Couchbase hits
    – atop
•  Load average
•  Uptime
•  Log sizes

Monitoring

[Diagram labels: Monitoring server, Production environment, Checks, Alerts, Email]

Monitoring stack

•  Application: log aggregation framework
•  OS / application code: software analytics tool that monitors performance
•  Network, server: Collectd / Graphite

Rollout of new features

•  Phasing rollout of new features
•  Have a way to turn features off if they are not behaving as expected

•  Alerts and more alerts!
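One way to phase a rollout and keep an off switch is a percentage-based feature flag. The sketch below is a hypothetical in-memory version (the class FeatureFlags and its methods are assumptions): each user is hashed into one of 100 buckets, so raising the percentage widens the audience gradually and setting it to 0 turns the feature off.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical feature flags with percentage rollout and an off switch. */
final class FeatureFlags {
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    /** 0 turns the feature off for everyone; 100 turns it on for everyone. */
    void setRolloutPercent(String feature, int percent) {
        rolloutPercent.put(feature, Math.max(0, Math.min(100, percent)));
    }

    /** Unknown features are off. The same user always lands in the same bucket. */
    boolean isEnabled(String feature, String userId) {
        int percent = rolloutPercent.getOrDefault(feature, 0);
        int bucket = Math.floorMod((feature + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```

In a real deployment the percentages would live in shared configuration so that every instance sees a change at the same time.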

Real time examples

•  Netflix's Simian Army induces failures of services and even datacenters during the working day to test both the application's resilience and monitoring
•  Latency Monkey to simulate slow-running requests
•  WireMock to mock services
•  Saboteur to create deliberate network mayhem

Takeaway

•  Inevitability of failures
    – Expect that systems will fail
    – Failure prevention

References

•  https://commons.wikimedia.org/wiki/File:Bulkhead_PSF.png
•  https://en.wikipedia.org/wiki/Circuit_breaker#/media/File:Four_1_pole_circuit_breakers_fitted_in_a_meter_box.jpg
•  https://www.flickr.com/photos/skynoir/ (Beer in hand: skynoir/Flickr/Creative Commons License)

Questions

•  Twitter: @bhakti_mehta
•  Email: bhakti@bluejeans.com
