How microservices fail, and what to do about it

How Microservices Fail… and what to do about it. Richard Rodger @rjrodger


github.com/rjrodger/nodezoo

https://aws.amazon.com/message/5467D2/

Pattern Matching

Service discovery is an anti-pattern. Instead, make messages first-class citizens. Use message data to define patterns, and these patterns define a language!

Transport Independence

Services should not know about each other, or how to send messages. Services are fully defined by the message patterns that they recognise and the message patterns that they emit.

// a search message
{
  "role": "search",  // a namespace
  "cmd": "search",   // this is a command
  "query": "ldap"    // some data
}

// the pattern to match
role:search,cmd:search

// some nodezoo message patterns
role:search,cmd:search  // do a search
role:search,cmd:insert  // insert into index
role:info,cmd:get       // get module info
role:npm,cmd:get        // get npm data
role:npm,info:change    // module changed!
role:info,req:part      // need module info
role:info,res:part      // here's module info
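
To make the idea of routing by pattern concrete, here is a minimal sketch of a pattern router. It is not Seneca's implementation; `parsePattern`, `createRouter`, and the `name` field are hypothetical names chosen for illustration, and the "most matching keys wins" rule is an assumption.

```javascript
// Hypothetical minimal pattern router (illustration only).

// "role:search,cmd:search" -> { role: 'search', cmd: 'search' }
function parsePattern(text) {
  const pattern = {}
  for (const pair of text.split(',')) {
    const [key, value] = pair.split(':')
    pattern[key.trim()] = value.trim()
  }
  return pattern
}

function createRouter() {
  const routes = []
  return {
    add(patternText, handler) {
      routes.push({ pattern: parsePattern(patternText), handler })
    },
    // Return the handler whose pattern matches the message on the most keys,
    // or null when no pattern matches.
    find(msg) {
      let best = null
      for (const route of routes) {
        const keys = Object.keys(route.pattern)
        const matches = keys.every(
          (k) => k in msg && String(msg[k]) === route.pattern[k]
        )
        if (matches && (best === null || keys.length > best.size)) {
          best = { size: keys.length, handler: route.handler }
        }
      }
      return best ? best.handler : null
    },
  }
}

const router = createRouter()
router.add('role:search,cmd:search', (msg) => 'searching for ' + msg.query)
router.add('role:info,cmd:get', (msg) => 'info for ' + msg.name)

// The sender only builds a message; it never names the receiving service.
const message = { role: 'search', cmd: 'search', query: 'ldap' }
const result = router.find(message)(message)
```

The sender and the handler share nothing except the shape of the message, which is the point of the two principles above.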

role:search,cmd:search

role:info,cmd:get

synchronous request/response

role:npm,info:change

asynchronous "winner-take-all" (actor)

role:info,req:part

asynchronous "fire-and-forget" (publish/subscribe)

role:info,res:part

asynchronous "fire-and-forget" (publish/subscribe)

synchronous request/response

asynchronous "winner-take-all" (actor)

synchronous "sidewinder" (side effects!)

synchronous / asynchronous

consumed / observed

senecajs.org

code (branch: msdub201509)

kintsugi 金継ぎ (golden joinery)

asynchronous "fire-and-forget" (publish/subscribe)

synchronous request/response

asynchronous "winner-take-all" (actor)

synchronous "sidewinder" (side effects!)

How do these break?

failure mode

"Slow downstream": B's responses are getting slower, consuming A's resources.

mitigation

Drop B. A should consider B dead. Or: the transport should handle this.
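
One common way to "consider B dead" is a circuit breaker on A's side. The following is a minimal sketch under stated assumptions, not code from nodezoo: `createBreaker`, its option names, and the thresholds are all hypothetical, and the clock is injectable only to keep the example deterministic.

```javascript
// Hypothetical circuit breaker: A stops calling B after repeated slow responses.
function createBreaker({
  maxSlow = 3,        // consecutive slow responses before B is considered dead
  slowMs = 500,       // a response at least this slow counts against B
  coolOffMs = 10000,  // how long to leave B alone before probing again
  now = Date.now,     // injectable clock
} = {}) {
  let slowCount = 0
  let openedAt = null // null means the circuit is closed (B is in use)

  return {
    // Should A send to B right now?
    allow() {
      if (openedAt === null) return true
      if (now() - openedAt >= coolOffMs) {
        // Cool-off elapsed: close the circuit and let traffic probe B again.
        openedAt = null
        slowCount = 0
        return true
      }
      return false
    },
    // Record how long B took to respond (or the timeout, if it never did).
    record(durationMs) {
      if (durationMs >= slowMs) {
        slowCount += 1
        if (slowCount >= maxSlow) openedAt = now()
      } else {
        slowCount = 0
      }
    },
  }
}

// Usage with a fake clock.
let clock = 0
const breaker = createBreaker({ maxSlow: 2, slowMs: 100, coolOffMs: 1000, now: () => clock })
breaker.record(150)
breaker.record(200)
const allowedWhileOpen = breaker.allow()
clock = 1000
const allowedAfterCoolOff = breaker.allow()
```

Whether this logic lives in A or in the transport is a design choice; putting it in the transport keeps A's business logic unaware of B, in line with transport independence.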

failure mode

"Upstream overload": A is sending messages to B at a higher rate than B can handle.

mitigation

Back-pressure from B. A should accept back-pressure notifications and scale back. This assumes B is doing more work than A.
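
A rough sketch of back-pressure, assuming B exposes a bounded inbox and A halves its send rate whenever a message is refused. `createInbox`, `createSender`, and the rate numbers are hypothetical; a real transport would carry the notification over the network.

```javascript
// Hypothetical bounded inbox for B: refusing a message is the back-pressure signal.
function createInbox(capacity) {
  const queue = []
  return {
    // Returns false when full, telling the sender to slow down.
    offer(msg) {
      if (queue.length >= capacity) return false
      queue.push(msg)
      return true
    },
    take() {
      return queue.shift()
    },
  }
}

// Hypothetical sender for A: halve the rate on refusal, recover slowly otherwise.
function createSender(inbox, { maxRate = 100, minRate = 1 } = {}) {
  let rate = maxRate // messages per second A allows itself
  return {
    send(msg) {
      const accepted = inbox.offer(msg)
      rate = accepted
        ? Math.min(maxRate, rate + 1)
        : Math.max(minRate, Math.floor(rate / 2))
      return accepted
    },
    rate() {
      return rate
    },
  }
}

const inbox = createInbox(2)
const sender = createSender(inbox)
sender.send({ n: 1 })
sender.send({ n: 2 })
const third = sender.send({ n: 3 })     // inbox is full, so this is refused
const rateAfterPushBack = sender.rate()
inbox.take()                            // B catches up
const fourth = sender.send({ n: 4 })
```

Halving on refusal and adding one on success is just one possible policy; the essential part is that A reacts to B's signal instead of retrying at full speed.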

failure mode

"Lost Actions": A is sending messages to B, and C is listening. But perhaps the latest version of C is broken?

mitigation

Measure message flow rates. Do the flow ratios match the business rules?
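
Measuring flow ratios can be sketched as a pair of counters keyed by pattern. `createFlowMeter` is a hypothetical helper, and the rule "at least 0.9 responses per request" is an invented threshold used only to show the check.

```javascript
// Hypothetical flow meter: count messages per pattern and compare ratios.
function createFlowMeter() {
  const counts = new Map()
  return {
    count(pattern) {
      counts.set(pattern, (counts.get(pattern) || 0) + 1)
    },
    // Ratio of two pattern counts, or null when the denominator is zero.
    ratio(numerator, denominator) {
      const d = counts.get(denominator) || 0
      if (d === 0) return null
      return (counts.get(numerator) || 0) / d
    },
  }
}

const meter = createFlowMeter()
for (let i = 0; i < 4; i++) meter.count('role:info,req:part')
for (let i = 0; i < 2; i++) meter.count('role:info,res:part')

// Assumed business rule: at least 0.9 responses per request.
const ratio = meter.ratio('role:info,res:part', 'role:info,req:part')
const healthy = ratio !== null && ratio >= 0.9
```

A falling ratio after a deployment suggests that the new version of a listener is dropping messages, even if nothing has crashed.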

failure mode

"Broken Contracts": A and B are using a newer message schema, but you forgot about C.

mitigation

Don't use contracts! Message schemas are a net negative and hinder multi-version deployment.

failure mode

"Poison Message": A is sending messages that crash B. B keeps restarting and trying to handle the poison message. Now nothing works!

mitigation

B should drop out-of-date messages on the floor. B should maintain a list of recently seen messages and ignore duplicates. B sends bad messages to the "Dead Letter" log.
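
The three mitigations above can be combined in one guard around B's handler. This is a sketch under assumptions: the message fields `id` and `sentAt`, the helper `createGuard`, and the in-memory dead-letter array are all hypothetical. In practice the seen list would have to survive a restart (for example on disk), otherwise a message that kills the process is retried anyway.

```javascript
// Hypothetical guard around B's handler.
function createGuard(handler, { maxAgeMs = 60000, maxSeen = 1000, now = Date.now } = {}) {
  const seen = new Set()   // recently seen message ids
  const deadLetters = []   // stand-in for the "Dead Letter" log

  function handle(msg) {
    // 1. Drop out-of-date messages on the floor.
    if (now() - msg.sentAt > maxAgeMs) return 'stale'

    // 2. Ignore duplicates. The id is recorded before handling,
    //    so a message that fails is not tried a second time.
    if (seen.has(msg.id)) return 'duplicate'
    seen.add(msg.id)
    if (seen.size > maxSeen) {
      // Sets iterate in insertion order, so this evicts the oldest id.
      seen.delete(seen.values().next().value)
    }

    // 3. Send bad messages to the dead-letter log instead of crashing.
    try {
      handler(msg)
      return 'handled'
    } catch (err) {
      deadLetters.push({ msg, error: String(err) })
      return 'dead-letter'
    }
  }

  return { handle, deadLetters }
}

// Usage with a fixed clock and a handler that fails on a poison message.
const guard = createGuard(
  (msg) => { if (msg.poison) throw new Error('cannot handle') },
  { now: () => 100000 }
)
const first = guard.handle({ id: 'a', sentAt: 100000, poison: true })
const retry = guard.handle({ id: 'a', sentAt: 100000, poison: true })
const old = guard.handle({ id: 'b', sentAt: 0 })
const ok = guard.handle({ id: 'c', sentAt: 99000 })
```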

failure mode

"Guaranteed Delivery ... ain't": B expects at-most-once, exactly-once, or at-least-once delivery of unique messages. This is not possible.

mitigation

Idempotency. Where possible, duplicate messages should have no bad effects.
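
As an illustration of idempotency, consider a handler for a pattern like role:search,cmd:insert. If the index is keyed by module name, applying the same message twice leaves the same state as applying it once. `createIndex` and the message fields `name` and `version` are hypothetical, not taken from nodezoo.

```javascript
// Hypothetical search index with an idempotent insert.
function createIndex() {
  const modules = new Map()
  return {
    // Upsert keyed by module name: repeating the message changes nothing further.
    insert(msg) {
      modules.set(msg.name, { name: msg.name, version: msg.version })
    },
    get(name) {
      return modules.get(name)
    },
    size() {
      return modules.size
    },
  }
}

const index = createIndex()
const insertMsg = { role: 'search', cmd: 'insert', name: 'ldapjs', version: '1.0.0' }
index.insert(insertMsg)
index.insert(insertMsg) // duplicate delivery
const sizeAfterDuplicate = index.size()
```

An append-only list would grow on every duplicate; keying the write by a natural identifier is what makes the repeat harmless.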

failure mode

"Emergent Behaviour": Strange loops and unexplained message paths make it hard to understand what the system is doing.

mitigation

Correlate messages. Attach correlation identifiers to messages so that you can trace the flow of causality.
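
A minimal sketch of correlation identifiers, assuming each message carries a `meta` object with its own id, the id of the message that caused it, and a correlation id shared by the whole chain. The `meta` layout and the helpers `originate`, `follow`, and `trace` are hypothetical.

```javascript
const crypto = require('crypto')

const newId = () => crypto.randomBytes(8).toString('hex')

// A message that starts a new chain of causality.
function originate(fields) {
  const id = newId()
  return { ...fields, meta: { id, correlationId: id, causedBy: null } }
}

// A message emitted while handling `cause`: it inherits the correlation id.
function follow(cause, fields) {
  return {
    ...fields,
    meta: {
      id: newId(),
      correlationId: cause.meta.correlationId,
      causedBy: cause.meta.id,
    },
  }
}

// Rebuild the causal chain that led to `msg` from a log of messages.
function trace(log, msg) {
  const byId = new Map(log.map((m) => [m.meta.id, m]))
  const chain = []
  for (let m = msg; m; m = byId.get(m.meta.causedBy)) chain.unshift(m)
  return chain
}

const changed = originate({ role: 'npm', info: 'change' })
const request = follow(changed, { role: 'info', req: 'part' })
const response = follow(request, { role: 'info', res: 'part' })
const chain = trace([changed, request, response], response)
```

Filtering a message log by `correlationId` then shows every message that one originating event set in motion, which makes strange loops visible.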

failure mode

"Catastrophic Collapse": You've introduced feedback that grows exponentially. And you've no idea how to fix it.

mitigation

Have a kill switch. Microservices aren't a silver bullet. Sometimes you need to selectively reboot.
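
One possible shape for a kill switch is a per-pattern flag that handlers consult before doing any work. This is a sketch under assumptions: `createKillSwitch` is a hypothetical helper, and in a real deployment the flag would be flipped through some shared control channel rather than a local function call.

```javascript
// Hypothetical per-pattern kill switch.
function createKillSwitch() {
  const killed = new Set()
  return {
    kill(pattern) {
      killed.add(pattern)
    },
    revive(pattern) {
      killed.delete(pattern)
    },
    // Wrap a handler so that it drops messages while its pattern is killed.
    guard(pattern, handler) {
      return (msg) => (killed.has(pattern) ? undefined : handler(msg))
    },
  }
}

const killSwitch = createKillSwitch()
const handledNames = []
const onChange = killSwitch.guard('role:npm,info:change', (msg) => {
  handledNames.push(msg.name)
})

onChange({ name: 'a' })
killSwitch.kill('role:npm,info:change')   // break the feedback loop
onChange({ name: 'b' })                   // dropped
killSwitch.revive('role:npm,info:change')
onChange({ name: 'c' })
```

Dropping messages for one pattern stops that path from feeding itself while the rest of the system keeps running.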

P(success) = 1  vs.  P(failure) < ε

// apoptosis: the process exits cleanly at a random time within the next hour
setTimeout(function () {
  process.exit(0)
}, 60 * 60 * 1000 * Math.random())

Thanks!

Richard Rodger @rjrodger