Expect the unexpected: Prepare for failures in microservices
Uploaded by: bhakti-mehta
Category: Engineering
Expect the unexpected: Anticipate and prepare for failures in microservices
Bhakti Mehta
@bhakti_mehta
Introduction
• Senior Software Engineer at Blue Jeans Network
• Worked at Sun Microsystems/Oracle for 13 years
• Committer to numerous open source projects, including GlassFish Application Server
My recent book
Previous book
Blue Jeans Network
• Video conferencing in the cloud
• Customers in all segments
• Millions of users
• Interoperable
• Video sharing, content sharing
• Mobile friendly
• Solutions for large scale events
What you will learn
• Microservices architecture
• Challenges at scale
• Lessons learned, tips and practices to prevent cascading failures
• Resilience planning at various stages
• Real world examples
Top level architecture
[Architecture diagram. Labels: Customer A, Customer B, Internet, SIP / H.323, HTTP / HTTPS, Proxy layer, Media Node, Connector Node, Web Server, Middleware services, Cache, Service discovery, Messaging, DB]
Microservices architecture
Path to microservices
• Advantages
  – Simplicity
  – Isolation of problems
  – Scale up and scale down
  – Easy deployment
  – Clear separation of concerns
  – Heterogeneity and polyglotism
Microservices
• Disadvantages
  – Not a free lunch!
  – Distributed systems are prone to failures
  – Eventual consistency
  – More effort in terms of deployments and release management
  – Challenges in testing the various services as they evolve independently (regression tests etc.)
Monoliths to microservices
Resilient system
• Processes transactions, even when there are transient impulses or persistent stresses
• Functions even when there are component failures disrupting normal processing
• Accepts that failures will happen
• Designs for crumple zones
Kinds of failures
• Challenges at scale
• Integration point failures
  – Network errors
  – Semantic errors
  – Slow responses
  – Outright hang
  – GC issues
Anticipate failures at scale
• Anticipate growth
• Design for the next order of magnitude
• Design for 10x, plan to rewrite for 100x
Resiliency planning Stage 1
• When developing code
  – Avoid cascading failures
    • Circuit breaker
    • Timeouts
    • Retry
    • Bulkhead
    • Cache optimizations
  – Avoid malicious clients
    • Rate limiting
Resiliency planning Stage 2
• Planning for dealing with failures before deploy
  – Load test
  – A/B test
  – Longevity
Resiliency planning Stage 3
• Watching out for failures after deploy
  – Health check
  – Metrics
Cascading failures
• Caused by chain reactions
• For example:
  – One node in a load balance group fails
  – Others need to pick up its work
  – Eventually performance can degenerate
Cascading failures with aggregation
Timeouts
• Clients may prefer a response
  – Failure
  – Success
  – Job queued for later
• All aggregation requests to microservices should have reasonable timeouts set
Types of timeouts
• Connection timeout
  – Maximum time allowed for a connection to be established before an error is raised
• Socket timeout
  – Maximum time of inactivity between two packets once the connection is established
Timeouts pattern
• Timeouts and retries go together
• Transient failures can be remedied with fast retries
• However, network problems can last for a while, so there is a real probability that the retries fail too
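Timeouts in code
In JAX-RS (Jersey client properties):
    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import org.glassfish.jersey.client.ClientProperties;

    Client client = ClientBuilder.newClient();
    client.property(ClientProperties.CONNECT_TIMEOUT, 5000); // milliseconds
    client.property(ClientProperties.READ_TIMEOUT, 5000);    // milliseconds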
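The same two timeouts can be seen without any framework. The following is a minimal sketch using plain `java.net.Socket`; the class name `SocketTimeouts`, the method name and the returned strings are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper: shows where each kind of timeout is applied.
class SocketTimeouts {
    /** Returns "data", "closed" or "timeout" depending on what the first read observed. */
    static String readWithTimeouts(String host, int port, int connectMs, int readMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Connection timeout: give up if the connection is not established in time.
            socket.connect(new InetSocketAddress(host, port), connectMs);
            // Socket (read) timeout: give up if no data arrives for readMs milliseconds.
            socket.setSoTimeout(readMs);
            InputStream in = socket.getInputStream();
            int firstByte = in.read();
            return firstByte == -1 ? "closed" : "data";
        } catch (SocketTimeoutException e) {
            // Raised by connect() or read() when the respective limit is exceeded.
            return "timeout";
        }
    }
}
```

Returning a string keeps the sketch short; real code would more likely propagate the exception or map it to an error response.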
Retry pattern
• Retry in case of network failures, timeouts or server errors
• Helps with transient network errors such as dropped connections or server failover
• If one of the services is slow or malfunctioning and other services keep retrying, the problem becomes worse
• Solution
  – Exponential backoff
  – Circuit breaker pattern
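Retrying with exponential backoff can be sketched as follows. This is an illustration under stated assumptions, not code from the talk: the class `RetryWithBackoff` and its parameters are hypothetical, and the delay simply doubles after each failed attempt (no jitter, no cap).

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a call, doubling the wait after every failure.
class RetryWithBackoff {
    static <T> T call(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                Thread.sleep(delay); // back off before the next attempt
                delay *= 2;          // exponential growth: d, 2d, 4d, ...
            }
        }
    }
}
```

A call such as `RetryWithBackoff.call(() -> fetch(), 5, 100)` (with `fetch` standing in for any remote call) would wait 100 ms, 200 ms, 400 ms and 800 ms between its five attempts. Production code usually adds random jitter so that many clients do not retry in lockstep.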
Circuit breaker pattern
• A circuit breaker is an electrical device used in an electrical panel that monitors and controls the amount of amperes (amps) being sent through the wiring
• Safety device
• If a power surge occurs in the electrical wiring, the breaker will trip
• Flips from “On” to “Off” and shuts off electrical power from that breaker
Circuit breaker
• Netflix Hystrix follows the circuit breaker pattern
• If a service’s error rate exceeds a threshold, it will trip the circuit breaker and block the requests for a specific period of time
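To make the trip-and-block behaviour concrete, here is a minimal hand-rolled sketch. It is not the Hystrix API. As a simplifying assumption it counts consecutive failures instead of measuring an error rate, and the class `SimpleCircuitBreaker`, its parameters and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical breaker: opens after N consecutive failures, stays open for openMillis.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the behaviour can be tested
    private int consecutiveFailures = 0;
    private long openedAt = -1;       // -1 means the breaker has not tripped

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    synchronized <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // open: fail fast without touching the service
        }
        try {
            T result = remote.get(); // closed, or a trial call after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip (or re-trip) the breaker
            }
            return fallback.get();
        }
    }
}
```

Holding the lock during the remote call keeps the sketch short but serializes callers; a real implementation would track state without blocking concurrent requests.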
Bulkhead
• Avoids chain reactions by isolating failures
• Helps prevent cascading failures
• An example of a bulkhead could be isolating the database dependencies per service
• Similarly, other infrastructure components can be isolated, such as cache infrastructure
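The slides describe bulkheads at the infrastructure level. Inside a single service the same idea is often applied by capping the number of concurrent calls per dependency, so that one slow dependency cannot use up every thread. The sketch below assumes that variant; the class `Bulkhead` is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical bulkhead: at most maxConcurrentCalls run against one dependency at a time.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> task, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // compartment is full: reject instead of queueing
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```

Giving each dependency its own instance (for example one for the database and one for the cache) keeps a stall in one from exhausting capacity needed by the other.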
Rate limiting
• Restricts the number of requests that can be made by a client
• A client can be identified based on the access token used
• Additionally, clients can be identified based on IP address
• With JAX-RS, rate limiting can be implemented as a filter
• This filter can check the access count for a client and, if it is within the limit, accept the request
• Otherwise, respond with a 429 error
• Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
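The counting logic such a filter would delegate to can be sketched independently of JAX-RS. This is not the code from the linked repository: the class `FixedWindowRateLimiter`, the fixed-window strategy and the injected clock are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical limiter: allows `limit` requests per client key in each time window.
class FixedWindowRateLimiter {
    static final int OK = 200;
    static final int TOO_MANY_REQUESTS = 429;

    private static final class Window {
        long start;
        int count;
        Window(long start) { this.start = start; }
    }

    private final int limit;
    private final long windowMillis;
    private final LongSupplier clock;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    FixedWindowRateLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** clientKey would be the access token or the IP address; returns the HTTP status to use. */
    int check(String clientKey) {
        long now = clock.getAsLong();
        Window w = windows.computeIfAbsent(clientKey, k -> new Window(now));
        synchronized (w) {
            if (now - w.start >= windowMillis) { // window expired: start a new one
                w.start = now;
                w.count = 0;
            }
            if (w.count >= limit) {
                return TOO_MANY_REQUESTS;
            }
            w.count++;
            return OK;
        }
    }
}
```

A request filter would call `check(...)` with the caller's token and abort the request with status 429 when that is the returned value.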
Cache optimizations
• Stores response information related to requests in temporary storage for a specific period of time
• Ensures that the server is not burdened with processing those requests again when responses can be fulfilled from the cache
• Getting from the first level cache
• Getting from the second level cache
• Getting from the DB
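The lookup order on this slide (first level cache, then second level cache, then the DB) can be sketched as below. The class `TwoLevelCache` is hypothetical, the second level is represented by a plain `Map` that stands in for a shared cache, and expiry after "a specific period of time" is left out to keep the example short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical read-through cache with a local level and a shared level.
class TwoLevelCache<K, V> {
    private final Map<K, V> firstLevel = new ConcurrentHashMap<>(); // in-process
    private final Map<K, V> secondLevel;                            // stand-in for a shared cache
    private final Function<K, V> database;                          // stand-in for the DB query

    TwoLevelCache(Map<K, V> secondLevel, Function<K, V> database) {
        this.secondLevel = secondLevel;
        this.database = database;
    }

    V get(K key) {
        V value = firstLevel.get(key);
        if (value != null) {
            return value;                 // 1. first level hit
        }
        value = secondLevel.get(key);     // 2. second level
        if (value == null) {
            value = database.apply(key);  // 3. fall through to the DB
            if (value == null) {
                return null;              // nothing to cache
            }
            secondLevel.put(key, value);
        }
        firstLevel.put(key, value);       // populate the faster level on the way back
        return value;
    }
}
```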
Dealing with latencies in response
• Have a timeout for the aggregation service
• Dispatch requests in parallel and collect responses
• Associate a priority with all the responses collected
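Parallel dispatch with a per-call timeout can be sketched with `CompletableFuture`. The sketch assumes Java 9 or later (for `completeOnTimeout`); the class `Aggregator` and its signature are hypothetical, and the priority step is not covered. Slow or failing services are simply left out of the result, so the caller receives partial data instead of waiting.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical aggregator: calls every service in parallel and keeps what arrives in time.
class Aggregator {
    static Map<String, String> aggregate(Map<String, Supplier<String>> services,
                                         long timeoutMs, Executor executor) {
        Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
        // Dispatch all requests first so that they run in parallel.
        services.forEach((name, call) -> pending.put(name,
            CompletableFuture.supplyAsync(call, executor)
                .completeOnTimeout(null, timeoutMs, TimeUnit.MILLISECONDS) // too slow: no value
                .exceptionally(e -> null)));                               // failed: no value
        // Collect responses; services without a value are omitted.
        Map<String, String> results = new LinkedHashMap<>();
        pending.forEach((name, future) -> {
            String value = future.join();
            if (value != null) {
                results.put(name, value);
            }
        });
        return results;
    }
}
```

Because each future gives up after `timeoutMs`, the whole aggregation takes roughly that long at most, provided the executor has enough threads to start every call immediately.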
Handling partial failures: best practices
• One service calls another, which can be slow or unavailable
• Never block indefinitely waiting for the service
• Try to return partial results
• Provide a caching layer and return cached data
Asynchronous patterns
• Pattern to deal with long running jobs
• Some resources may take a longer time to provide results
• The client does not need to wait for the response
Reactive programming model
• Use reactive programming constructs such as CompletableFuture in Java 8, or ListenableFuture
• RxJava
Asynchronous API
• Reactive patterns
• Message passing
  – Akka actor model
• Message queues
  – Communication between services via shared message queues
  – WebSockets
Logging
• Complex distributed systems introduce many points of failure
• Logging helps link events/transactions between the various components that make up an application or a business service
• ELK stack
• Splunk, syslog
• Loggly
• LogEntries
Logging best practices
• Use a detailed, consistent pattern across service logs
• Obfuscate sensitive data
• Identify the caller or initiator as part of the logs
• Do not log payloads by default
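As an illustration of these practices, the sketch below builds a log line with a fixed key=value layout, names the caller, and masks e-mail addresses before the message is written. The class `LogFormat`, the field names and the choice of e-mail addresses as the sensitive data are assumptions for the example only.

```java
import java.util.regex.Pattern;

// Hypothetical formatter: consistent layout, caller identified, sensitive data masked.
class LogFormat {
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Replaces every e-mail address in the message with a fixed mask. */
    static String obfuscate(String message) {
        return EMAIL.matcher(message).replaceAll("***");
    }

    /** Same fields in the same order for every service, so logs can be correlated. */
    static String line(String service, String requestId, String caller, String message) {
        return String.format("service=%s requestId=%s caller=%s msg=\"%s\"",
            service, requestId, caller, obfuscate(message));
    }
}
```

Passing the same `requestId` through every service that handles a request is what allows a log aggregation tool to link the events of one transaction.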
Best practices when designing APIs for mobile clients
  – Avoid chattiness
  – Use the aggregator pattern
Resilience planning Stage 2
• Before deploy
  – Load testing
  – Longevity testing
  – Capacity planning
Load testing
• Ensure that you test for load on APIs
  – JMeter
• Plan for longevity testing
Capacity planning
• Anticipate growth
• Design for handling exponential growth
Resilience planning Stage 3
• After deploy
  – Health check
  – Metrics and monitoring
  – Phased rollout of features
Health check
• Memory
• CPU
• Threads
• Error rate
• If any of the checks exceeds a threshold, send an alert
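The threshold rule can be sketched as a comparison of current readings against configured limits. The class `HealthCheck`, the metric names and the message format are hypothetical; `sampleJvm` shows how two of the readings could be taken from the running JVM.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical health check: one alert for every reading above its threshold.
class HealthCheck {
    static List<String> alerts(Map<String, Double> readings, Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        thresholds.forEach((name, limit) -> {
            Double value = readings.get(name);
            if (value != null && value > limit) {
                alerts.add(name + " " + value + " exceeds " + limit);
            }
        });
        return alerts;
    }

    /** Samples heap usage (as a fraction of the maximum) and the active thread count. */
    static Map<String, Double> sampleJvm() {
        Runtime rt = Runtime.getRuntime();
        Map<String, Double> readings = new LinkedHashMap<>();
        readings.put("memoryUsedFraction",
            (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory());
        readings.put("threads", (double) Thread.activeCount());
        return readings;
    }
}
```

In practice the returned alerts would be handed to whatever alerting channel the monitoring setup uses.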
Metrics
• Response times, throughput
  – Identify slow running DB queries
• GC rate and pause duration
  – Garbage collection can cause slow responses
• Monitor unusual activity
• Third party library metrics
  – For example, Couchbase hits
  – atop
• Load average
• Uptime
• Log sizes
Monitoring
[Diagram labels: Monitoring server, Production environment, Checks, Alerts]
Monitoring stack
• Log aggregation framework (application)
• Software analytics tool that monitors performance (OS / application code)
• Collectd / Graphite (network, server)
Rollout of new features
• Phase the rollout of new features
• Have a way to turn features off if they are not behaving as expected
• Alerts and more alerts!
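One common way to get both a phased rollout and an off switch is a percentage-based feature toggle. The sketch below is an assumption about how that could look, not the mechanism used in the talk; the class `FeatureRollout` and its methods are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical toggle: each feature is enabled for a percentage of users.
class FeatureRollout {
    private final Map<String, Integer> percentByFeature = new ConcurrentHashMap<>();

    /** Sets the share of users (0 to 100) that should see the feature. */
    void setPercent(String feature, int percent) {
        percentByFeature.put(feature, Math.max(0, Math.min(100, percent)));
    }

    /** The off switch: nobody sees the feature any more. */
    void turnOff(String feature) {
        setPercent(feature, 0);
    }

    boolean isEnabled(String feature, String userId) {
        int percent = percentByFeature.getOrDefault(feature, 0);
        // Stable bucket in [0, 100): the same user always gets the same answer.
        int bucket = Math.floorMod((feature + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```

Raising the percentage step by step while watching the alerts gives the phased rollout; calling `turnOff` is the quick way back if the feature misbehaves.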
Real time examples
• Netflix's Simian Army induces failures of services and even datacenters during the working day to test both the application's resilience and monitoring
• Latency Monkey to simulate slow running requests
• WireMock to mock services
• Saboteur to create deliberate network mayhem
Takeaway
• Inevitability of failures
  – Expect that systems will fail
  – Failure prevention
References
• https://commons.wikimedia.org/wiki/File:Bulkhead_PSF.png
• https://en.wikipedia.org/wiki/Circuit_breaker#/media/File:Four_1_pole_circuit_breakers_fitted_in_a_meter_box.jpg
• https://www.flickr.com/photos/skynoir/ Beer in hand: skynoir/Flickr/Creative Commons License
Questions
• Twitter: @bhakti_mehta
• Email: bhakti@bluejeans.com