Devoxx France: Fault tolerant microservices on the JVM with Cassandra

Post on 18-Jul-2015





Fault tolerant microservices on the JVM

Christopher Batey DataStax @chbatey


Who am I?

• DataStax

- Technical Evangelist / Software Engineer

- Builds an enterprise-ready version of Apache Cassandra

• Sky: Building next generation Internet TV platform

• Lots of time working on a test double for Apache Cassandra



• Setting the scene

- What do we mean by a fault?

- What is a micro(ish)service?

- Monolith application vs the micro(ish)service

• A worked example

- Identify an issue

- Reproduce/test it

- Show how to deal with the issue


So… what do applications look like?



Small, horizontally scalable services

• Move to small, independently deployed services

- Login service

- Device service

- etc

• Move to a horizontally scalable database that can run active-active in multiple data centres




So... what do systems look like now?


Example: Movie player service

[Diagram: a Play Movie request reaches the Movie Player, which depends on the User Service, Device Service and Pin Service]


Time for an example...

• All examples are on GitHub

• Technologies used:

- Spring Boot






Testing microservices

• You don’t know a service is fault tolerant if you don’t test faults


The test double

• Wiremock for HTTP integration

• Stubbed Cassandra for the database

• Kafka Unit


Isolated service tests

[Diagram: a Play Movie request reaches the Movie service under test; the User, Device and Pin services are replaced by mocks]






Fault tolerance

1. Don’t take forever - Timeouts

2. Don’t try if you can’t succeed

3. Fail gracefully

4. Know if it’s your fault

5. Don’t whack a dead horse

6. Turn broken stuff off


1 - Don’t take forever

• If at first you don’t succeed, don’t take forever to tell someone

• Timeout and fail fast


Which timeouts?

• Socket connection timeout

• Socket read timeout


Your service hung for 30 seconds :(


You :(


Which timeouts?

• Socket connection timeout

• Socket read timeout

• Resource acquisition
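The first two timeouts map directly onto the JDK socket API. The following is a minimal sketch, not code from the talk's examples: the class `SocketTimeouts`, its helper methods and the "silent server" that accepts connections but never replies are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SocketTimeouts {

    /**
     * Connects with a connection timeout, then reads one byte with a read timeout.
     * Returns "TIMEOUT" if either timeout expires and "ERROR" on other I/O failures.
     */
    static String readFirstByte(InetSocketAddress address, int connectMillis, int readMillis) {
        try (Socket socket = new Socket()) {
            // Socket connection timeout: bounds the TCP handshake.
            socket.connect(address, connectMillis);
            // Socket read timeout: bounds each blocking read, not the whole response.
            socket.setSoTimeout(readMillis);
            int firstByte = socket.getInputStream().read();
            return firstByte == -1 ? "EOF" : "DATA";
        } catch (SocketTimeoutException e) {
            return "TIMEOUT";
        } catch (IOException e) {
            return "ERROR";
        }
    }

    /** Hypothetical helper: opens a local server that never writes, then reads from it. */
    static String readFromSilentServer(int readMillis) {
        try (ServerSocket silent = new ServerSocket(0)) {
            InetSocketAddress address =
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), silent.getLocalPort());
            return readFirstByte(address, 1000, readMillis);
        } catch (IOException e) {
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(readFromSilentServer(200));
    }
}
```

Because the read timeout restarts on every read, a dependency that trickles out one byte at a time can keep the caller waiting far longer than the configured value.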


Your service hung for 10 minutes :(


Let’s think about this


A little more detail


Adding an automated test

• Vagrant - launches + provisions local VMs

• Saboteur - uses tc, iptables to simulate network issues

• Wiremock - used to mock HTTP dependencies

• Cucumber - acceptance tests


I can write an automated test for that?

[Diagram: the test primes Saboteur to drop traffic. Components shown: Wiremock (User Service, Device Service, Pin Service), Saboteur, a Vagrant + VirtualBox VM, and the Movie Service]



Implementing reliable timeouts

• Protect the container thread!

• Homemade: Worker Queue + Thread pool (executor)

• Hystrix

• Spring Cloud Netflix
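The "homemade" option can be sketched with nothing but `java.util.concurrent`. This is an illustrative sketch under assumptions, not the talk's implementation: `TimeLimitedCaller`, its parameters and the fallback value are hypothetical names. The calling (container) thread hands the work to a worker pool and waits no longer than the timeout.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeLimitedCaller {
    private final ExecutorService workers;

    TimeLimitedCaller(int threads, int queueSize) {
        // Worker queue + thread pool, both bounded.
        this.workers = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Runs the call on a worker thread; the caller waits at most timeoutMillis. */
    <T> T call(Callable<T> dependencyCall, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = workers.submit(dependencyCall);
        } catch (RejectedExecutionException e) {
            return fallback; // all workers busy and the queue is full
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can be reused
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the dependency call itself failed
        }
    }

    void shutdown() {
        workers.shutdownNow();
    }

    public static void main(String[] args) {
        TimeLimitedCaller caller = new TimeLimitedCaller(1, 1);
        String result = caller.call(() -> {
            Thread.sleep(5000); // stands in for a slow dependency
            return "Slow Scary String";
        }, 100, "fallback");
        System.out.println(result);
        caller.shutdown();
    }
}
```

Cancelling with an interrupt only helps if the dependency call responds to interruption; blocking socket I/O generally does not, which is one reason to prefer a library that handles these cases.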


A simple Spring RestController

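import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Resource {
    private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

    @Autowired
    private ScaryDependency scaryDependency;

    @RequestMapping("/scary")
    public String callTheScaryDependency() {
        LOGGER.info("Resource later: I wonder which thread I am on!");
        return scaryDependency.getScaryString();
    }
}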


Scary dependency

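import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                // Simulate a slow dependency
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "Slow Scary String";
        }
    }
}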


All on the tomcat thread

13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats?


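Scary dependency

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    // The only change: the method now runs as a Hystrix command
    @HystrixCommand()
    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                // Simulate a slow dependency
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "Slow Scary String";
        }
    }
}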


What an annotation can do...

13:51:21.513 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:51:21.614 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats? :P


Async libraries are your friend

• DataStax Java Driver

- Guava ListenableFuture


Timeouts take home

• You can’t use network level timeouts for SLAs

• Test your SLAs - if someone says you can’t, hit them with a stick

• Scary things happen without network issues



2 - Don’t try if you can’t succeed



“When an application grows in complexity it will eventually start sending emails”



“When an application grows in complexity it will eventually start using queues and thread pools”



Don’t try if you can’t succeed

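• Executors factory methods hide unbounded resources :(

- newFixedThreadPool (unbounded queue)

- newSingleThreadExecutor (unbounded queue)

- newCachedThreadPool (unbounded number of threads)

• Bound your queues and threads

• Fail quickly when the queue size / maxPoolSize limit is reached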

• Know your drivers
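To make the contrast concrete, here is a small sketch using only the JDK. `BoundedPools` and its method names are hypothetical, and the cast in `fixedPoolQueueCapacity` assumes that `Executors.newFixedThreadPool` returns a plain `ThreadPoolExecutor`, which holds for OpenJDK but is an implementation detail.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPools {

    /** Fixed number of threads, a bounded queue, and rejection instead of waiting. */
    static ThreadPoolExecutor bounded(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,                      // corePoolSize == maximumPoolSize
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),   // bounded work queue
                new ThreadPoolExecutor.AbortPolicy()); // throws RejectedExecutionException when full
    }

    /** Remaining queue capacity of a pool created by Executors.newFixedThreadPool. */
    static int fixedPoolQueueCapacity() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // Assumption: OpenJDK returns a ThreadPoolExecutor here.
            return ((ThreadPoolExecutor) pool).getQueue().remainingCapacity();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("newFixedThreadPool queue capacity: " + fixedPoolQueueCapacity());
        System.out.println("bounded(4, 10) queue capacity: "
                + bounded(4, 10).getQueue().remainingCapacity());
    }
}
```

With the bounded pool, a caller finds out immediately that the service is saturated instead of joining a queue that may never drain.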


This is a functional requirement

• Set the timeout very high

• Use Wiremock to add a large delay to the requests

• Set queue size and thread pool size to 1

• Send in 2 requests to use the thread and fill the queue

• What happens on the 3rd request?
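The scenario above can be tried in isolation against a plain `ThreadPoolExecutor`. In this sketch a `CountDownLatch` stands in for the Wiremock delay, and `ThirdRequestTest` and `rejectedOutOf` are hypothetical names, so treat it as an illustration of the idea rather than the talk's acceptance test.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class ThirdRequestTest {

    /** Submits the given number of blocked requests and returns how many were rejected. */
    static int rejectedOutOf(int requests) {
        // Thread pool size 1 and queue size 1, as in the scenario.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1); // stands in for the delayed dependency
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // block until the test is over
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // failed fast: no thread free and no room in the queue
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("Rejected out of 3: " + rejectedOutOf(3));
    }
}
```

The first request occupies the only thread, the second fills the queue, and the third has nowhere to go, so it should be rejected straight away rather than wait.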



3 - Fail gracefully


Expect rubbish

• Expect invalid HTTP

• Expect malformed response bodies

• Expect connection failures

• Expect huge / tiny responses
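One way to act on this list is to validate every response before trusting it. The sketch below assumes a dependency that should answer with a body of `true` or `false`; the class `PinCheckParser`, the size limit and the use of `Optional` are illustrative choices, not taken from the talk.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class PinCheckParser {
    // Assumption: a legitimate "true"/"false" body never needs more than this.
    private static final int MAX_BODY_BYTES = 16;

    /** Parses a response; an empty Optional means "the dependency sent rubbish". */
    static Optional<Boolean> parse(int statusCode, byte[] body) {
        if (statusCode != 200 || body == null) {
            return Optional.empty();
        }
        if (body.length == 0 || body.length > MAX_BODY_BYTES) {
            return Optional.empty(); // tiny or huge response
        }
        String text = new String(body, StandardCharsets.UTF_8).trim();
        if (text.equalsIgnoreCase("true")) {
            return Optional.of(Boolean.TRUE);
        }
        if (text.equalsIgnoreCase("false")) {
            return Optional.of(Boolean.FALSE);
        }
        // Boolean.valueOf(text) would silently turn any garbage into false.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse(200, "true".getBytes(StandardCharsets.UTF_8)));
        System.out.println(parse(200, "<html>oops</html>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Returning an explicit "could not parse" result lets the caller decide between failing the request and using a fallback.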


Testing with Wiremock

stubFor(get(urlEqualTo("/dependencyPath"))



{
  "request": { "method": "GET", "url": "/fault" },
  "response": { "fault": "RANDOM_DATA_THEN_CLOSE" }
}

{
  "request": { "method": "GET", "url": "/fault" },
  "response": { "fault": "EMPTY_RESPONSE" }
}


Stubbed Cassandra



4 - Know if it’s your fault


Record stuff

• Metrics:

- Timings

- Errors

- Concurrent incoming requests

- Thread pool statistics

- Connection pool statistics

• Logging: Boundary logging, ElasticSearch / Logstash

• Request identifiers
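As a rough illustration of the first three metrics (timings, errors, concurrent requests), the sketch below wraps a call and records them with JDK atomics. `CallMetrics` and its methods are hypothetical; a real service would more likely use a metrics library and a reporter.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class CallMetrics {
    private final LongAdder calls = new LongAdder();
    private final LongAdder errors = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    /** Runs the call and records its duration, outcome and concurrency. */
    <T> T record(Supplier<T> call) {
        int current = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(current, Math::max);
        long start = System.nanoTime();
        try {
            return call.get();
        } catch (RuntimeException e) {
            errors.increment();
            throw e; // metrics must not swallow the failure
        } finally {
            totalNanos.add(System.nanoTime() - start);
            calls.increment();
            inFlight.decrementAndGet();
        }
    }

    long calls() { return calls.sum(); }

    long errors() { return errors.sum(); }

    int inFlight() { return inFlight.get(); }

    int maxInFlight() { return maxInFlight.get(); }

    /** Mean call duration in milliseconds, or 0 if nothing has been recorded. */
    double meanMillis() {
        long count = calls.sum();
        return count == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / count;
    }

    public static void main(String[] args) {
        CallMetrics metrics = new CallMetrics();
        metrics.record(() -> "ok");
        System.out.println("calls=" + metrics.calls() + " errors=" + metrics.errors());
    }
}
```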


Zipkin from Twitter


Graphite + Codahale


Response times


Separate resource pools

• Don’t flood your dependencies

• Be able to answer the questions:

- How many connections will you make to dependency X?

- Are you getting close to your max connections?
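A per-dependency limit makes both questions answerable in code. This sketch uses one `Semaphore` per dependency; the name `DependencyLimiter` and the idea of returning a fallback when the limit is hit are assumptions for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class DependencyLimiter {
    private final int maxConcurrent;
    private final Semaphore permits;

    /** maxConcurrent answers: how many connections will you make to dependency X? */
    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Calls the dependency if a permit is free; otherwise returns the fallback at once. */
    <T> T call(Supplier<T> dependencyCall, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // do not queue up behind a struggling dependency
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release();
        }
    }

    /** Answers: are you getting close to your max connections? */
    int inUse() {
        return maxConcurrent - permits.availablePermits();
    }

    int max() {
        return maxConcurrent;
    }

    public static void main(String[] args) {
        DependencyLimiter pinService = new DependencyLimiter(10);
        String result = pinService.call(() -> "pin ok", "pin service busy");
        System.out.println(result + " (" + pinService.inUse() + "/" + pinService.max() + " in use)");
    }
}
```

Giving each dependency its own limiter means a slow Pin Service can exhaust only its own permits, not the capacity needed for the User or Device services.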


So easy with Dropwizard + Hystrix

metrics:
  reporters:
    - type: graphite
      host:
      port: 2003
      prefix: shiny_app

@Override
public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
    // Publish Hystrix metrics through the application's metric registry
    HystrixCodaHaleMetricsPublisher metricsPublisher =
            new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
    HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
}



5 - Don’t whack a dead horse

[Diagram: a Play Movie request reaches the Movie Player, which depends on the User Service, Device Service and Pin Service]


What to do…

• Yes this will happen…

• Mandatory dependency - fail *really* fast

• Throttling

• Fallbacks


Circuit breaker pattern


Implementation with Hystrix

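import com.codahale.metrics.annotation.Timed;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Path("integrate")
public class IntegrationResource {
    private static final Logger LOGGER = LoggerFactory.getLogger(IntegrationResource.class);

    // userService, deviceService and pinService are fields not shown on the slide

    @GET
    @Timed
    public String integrate() {
        LOGGER.info("integrate");
        // Each dependency call is wrapped in its own Hystrix command
        String user = new UserServiceDependency(userService).execute();
        String device = new DeviceServiceDependency(deviceService).execute();
        Boolean pinCheck = new PinCheckDependency(pinService).execute();
        return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n",
                user, device, pinCheck);
    }
}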



Implementation with Hystrix

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class PinCheckDependency extends HystrixCommand<Boolean> {
    private HttpClient httpClient;

    public PinCheckDependency(HttpClient httpClient) {
        super(HystrixCommandGroupKey.Factory.asKey("PinCheckService"));
        this.httpClient = httpClient;
    }

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        int statusCode = pinCheckResponse.getStatusLine().getStatusCode();
        if (statusCode != 200) {
            throw new RuntimeException("Oh dear no pin check, status code " + statusCode);
        }
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }

    // Used when run() fails, times out, is rejected, or the circuit is open
    @Override
    public Boolean getFallback() {
        return true;
    }
}


Triggering the fallback

• Error threshold percentage

• Bucket of time for the percentage

• Minimum number of requests to trigger

• Time before trying a request again

• Disable

• Per instance statistics
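To show how the first four settings interact, here is a deliberately simplified circuit breaker. It is a sketch under assumptions, not Hystrix's algorithm: it uses one fixed time window instead of rolling buckets, and `SimpleCircuitBreaker`, its constructor parameters and the injectable clock are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // error threshold percentage
    private final int minimumRequests;       // minimum number of requests to trigger
    private final long windowMillis;         // bucket of time for the percentage
    private final long sleepMillis;          // time before trying a request again
    private final LongSupplier clock;        // injectable so tests can control time

    private long windowStart;
    private int requests;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long windowMillis, long sleepMillis, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.windowMillis = windowMillis;
        this.sleepMillis = sleepMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean allowRequest() {
        long now = clock.getAsLong();
        if (open) {
            if (now - openedAt < sleepMillis) {
                return false;
            }
            openedAt = now; // let one trial request through per sleep window
            return true;
        }
        if (now - windowStart >= windowMillis) {
            resetWindow(now);
        }
        return true;
    }

    private synchronized void recordSuccess() {
        if (open) {
            open = false; // the trial request worked: close the circuit
            resetWindow(clock.getAsLong());
        } else {
            requests++;
        }
    }

    private synchronized void recordFailure() {
        if (open) {
            return; // the trial request failed: stay open for another sleep window
        }
        requests++;
        failures++;
        if (requests >= minimumRequests && failures * 100 >= errorThresholdPercent * requests) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    private void resetWindow(long now) {
        windowStart = now;
        requests = 0;
        failures = 0;
    }

    /** Calls the dependency unless the circuit is open; falls back on failure. */
    <T> T call(Supplier<T> dependencyCall, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();
        }
        try {
            T result = dependencyCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open the dependency is not called at all, which is the point of the pattern: the caller fails fast and the struggling service gets room to recover.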



6 - Turn off broken stuff

• The kill switch
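At its simplest, a kill switch is a flag that is checked before a feature runs. The slide does not say how the switch is flipped, so this sketch leaves that out; `KillSwitch` and its methods are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    /** Turns the feature off, for example from an operator-triggered action. */
    void turnOff() {
        enabled.set(false);
    }

    void turnOn() {
        enabled.set(true);
    }

    boolean isEnabled() {
        return enabled.get();
    }

    /** Runs the feature only while the switch is on; otherwise returns whenOff. */
    <T> T call(Supplier<T> feature, T whenOff) {
        return enabled.get() ? feature.get() : whenOff;
    }

    public static void main(String[] args) {
        KillSwitch pinCheck = new KillSwitch();
        pinCheck.turnOff();
        // With the switch off, the broken dependency is never called.
        System.out.println(pinCheck.call(() -> "called pin service", "pin check skipped"));
    }
}
```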


To recap

1. Don’t take forever - Timeouts

2. Don’t try if you can’t succeed

3. Fail gracefully

4. Know if it’s your fault

5. Don’t whack a dead horse

6. Turn broken stuff off



• Examples:




• Tech:







Thanks for listening!

Questions: @chbatey