Circuit breakers @ API World 2016

Post on 13-Apr-2017

174 views 0 download

Transcript of Circuit breakers @ API World 2016

Handle API Failure gracefully with Circuit Breakers

Scott Triglia

1

Jim’s

2

Yelp’s Mission:Connecting people with great

local businesses.

Yelp StatsAs of Q2 2016

92M monthly

users

32 countries

72% of searches from

mobile

108M reviews

Scott Triglia

4

@scott_triglia

Work with the $$$

Let’s talk Circuit Breakers

5

6

7

8Photo by zerial, CC BY-NC 2.0

9

10

Our goals today: introduce a basic circuit breaker

11

Our goals today: a modular circuit breaker

12

Our goals today: test it out on several scenarios

13

14

15

1616

1

2

34

the fundamental rule: your systems will fail

what’s your response?

17

1818

1

2

1919

1

2

3

2020

1

2

34

21

Nygard’s circuit breaker

22

23

24

25

26

Circuit Breaker States: * Healthy (or “closed”) * Recovering (or “half-open”) * Unhealthy (or “open”)

27

28

Recovery:

* Wait for recovery_timeout seconds* Send a trial request, trust its results

29

Before a circuit breaker

30

Assume the kitchen gets slow

31

Kitchen’s backlog grows

32

Diners wait much longer to get food

33

New diners make issues worse

34

And your entire system collapses

35

And your entire system collapses

36

With a circuit breaker

37

CB sees backlog, stops orders

38

Fewer frustrated diners

39

Reduced load on the kitchen

40

A well defined failure mode

41

How can we do better?

42

Improvements

43

1) Detecting Unhealthiness

44

2) Mitigating downtime

45

3) Recovering Effectively

46

Detecting Unhealthiness

Component 1:

47

48

def signal_overload(cb): if len(jobs) > THRESH: cb.mark_unhealthy()

49

New Behavior:

* CB gets signals from anywhere * Signal combining logic

50

* Allows many (many) new signals

* Must combine signals * Adds complexity to system

51

Mitigating Downtime

Component 2:

52

53

54

55

56

New Behavior: * Code can check in advance about healthiness of system * Automatic monitoring!

57

* Build features on top of system health status

* Requires a single source of truth?

58

Recovering Effectively

Component 3:

59

60

Dark launch:

* Reject but process normally * Dangerous with side effects

Block User Request Try to process anyway!

61

Synthetic:

* Dark launching with fake requests * Not necessarily representative

Block User Request Process fake requests

62

New Behavior:

* Traffic determines health * Removal of recovery timeouts

63

* Faster(?) recovery * No timeout tuning required * Dark launching not always possible * Synthetic can be unrepresentative

64

in summary

65

Your system will fail, have a plan!

66

The basic CB is better than nothing

67

Questions to ask:

* What is “unhealthy” for my system? * How should I react to unhealthiness? * How do we recover?

68

Questions to ask:

* What is “unhealthy” for my system? * How should I react to unhealthiness? * How do we recover?

69

Questions to ask:

* What is “unhealthy” for my system? * How should I react to unhealthiness? * How do we recover?

70

Questions to ask:

* What is “unhealthy” for my system? * How should I react to unhealthiness? * How do we recover?

71

…and much more!

Much comes down to your use case

72

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp

striglia@yelp.com @scott_triglia

75

Can’t we do better than rejecting requests?

http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

77

How do I safely test out a new circuit breaker?

https://engineering.heroku.com/blogs/2015-06-30-improved-production-stability-with-circuit-breakers/