Velocity 2015 Amsterdam: Alerts overload

Page 1: Velocity 2015 Amsterdam: Alerts overload

Alerts Overload

How to adopt a microservices architecture without being overwhelmed with noise

Sarah Wells

@sarahjwells

Page 3: Velocity 2015 Amsterdam: Alerts overload

Microservices make it worse

Page 4: Velocity 2015 Amsterdam: Alerts overload

microservices (n, pl): an efficient device for transforming business problems into distributed transaction problems

@drsnooks

Page 5: Velocity 2015 Amsterdam: Alerts overload

You have a lot more systems

Page 10: Velocity 2015 Amsterdam: Alerts overload

45 microservices

3 environments

2 instances for each service

20 checks per service

running every 5 minutes

Page 11: Velocity 2015 Amsterdam: Alerts overload

> 1,500,000 system checks per day
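The figure follows from multiplying out the numbers on the previous slide. The sketch below works through that arithmetic; it assumes every instance of every service runs all 20 checks on each 5-minute cycle, which is the reading that reproduces the quoted total. The class and method names are illustrative only.

```java
// Worked arithmetic for the check volume quoted on the slides.
final class CheckVolume {

    // Total checks per day when every instance runs every check on a fixed interval.
    static long checksPerDay(int services, int environments, int instancesPerService,
                             int checksPerService, int intervalMinutes) {
        long checksPerRun =
                (long) services * environments * instancesPerService * checksPerService;
        long runsPerDay = (24 * 60) / intervalMinutes;
        return checksPerRun * runsPerDay;
    }

    public static void main(String[] args) {
        // 45 services x 3 environments x 2 instances x 20 checks = 5,400 checks per run.
        // 1,440 minutes per day / 5 minutes = 288 runs per day.
        System.out.println(checksPerDay(45, 3, 2, 20, 5)); // prints 1555200
    }
}
```

Even a failure rate of a fraction of a percent at this volume produces hundreds of alerts a day.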

Page 13: Velocity 2015 Amsterdam: Alerts overload

Over 19,000 system monitoring alerts in 50 days

An average of 380 per day

Page 14: Velocity 2015 Amsterdam: Alerts overload

Functional monitoring is also an issue

Page 15: Velocity 2015 Amsterdam: Alerts overload

12,745 response time/error alerts in 50 days

An average of 255 per day

Page 17: Velocity 2015 Amsterdam: Alerts overload

Why so many?

Page 22: Velocity 2015 Amsterdam: Alerts overload

http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

Page 23: Velocity 2015 Amsterdam: Alerts overload

How can you make it better?

Page 24: Velocity 2015 Amsterdam: Alerts overload

Quick starts: attack your problem

See our EngineRoom blog for more:

http://bit.ly/1PP7uQQ

Page 26: Velocity 2015 Amsterdam: Alerts overload

1. Think about monitoring from the start

Page 27: Velocity 2015 Amsterdam: Alerts overload

It's the business functionality you care about

Page 34: Velocity 2015 Amsterdam: Alerts overload

We care about whether published content made it to us

Page 35: Velocity 2015 Amsterdam: Alerts overload

When people call our APIs, we care about speed

Page 36: Velocity 2015 Amsterdam: Alerts overload

… we also care about errors

Page 37: Velocity 2015 Amsterdam: Alerts overload

But it's the end-to-end that matters

https://www.flickr.com/photos/robef/16537786315/

Page 38: Velocity 2015 Amsterdam: Alerts overload

You only want an alert where you need to take action

Page 39: Velocity 2015 Amsterdam: Alerts overload

If you just want information, create a dashboard or report

Page 40: Velocity 2015 Amsterdam: Alerts overload

Make sure you can't miss an alert

Page 41: Velocity 2015 Amsterdam: Alerts overload

Make the alert great

http://www.thestickerfactory.co.uk/

Page 42: Velocity 2015 Amsterdam: Alerts overload

Build your system with support in mind

Page 43: Velocity 2015 Amsterdam: Alerts overload

Transaction ids tie all microservices together
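A minimal sketch of how a transaction id might be carried across services, under stated assumptions: the header name `X-Request-Id` is hypothetical (the slides do not name one), and the generated format simply imitates the `tid_pbueyqnsqe` value that appears in the alert example later in the deck. The idea is that a service reuses the caller's id when one arrives, creates one otherwise, forwards it on every downstream call, and writes it into every log line.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TransactionIds {

    // Hypothetical header name; the slides do not say which header carries the id.
    static final String HEADER = "X-Request-Id";

    // Reuse the caller's id when present so one id follows the request end to end.
    static String resolve(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        if (existing != null && !existing.trim().isEmpty()) {
            return existing;
        }
        return generate();
    }

    // "tid_" plus 10 random lowercase letters, imitating the id shown in the alert example.
    static String generate() {
        StringBuilder id = new StringBuilder("tid_");
        for (int i = 0; i < 10; i++) {
            id.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return id.toString();
    }

    // Headers to attach to every downstream call made while handling this request.
    static Map<String, String> outgoingHeaders(String transactionId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(HEADER, transactionId);
        return headers;
    }

    // Prefix each log line with the id so a log aggregator can group by transaction.
    static String logLine(String transactionId, String message) {
        return "transaction_id=" + transactionId + " " + message;
    }

    public static void main(String[] args) {
        String tid = resolve(new HashMap<>());
        System.out.println(logLine(tid, "Publish request received"));
        System.out.println(outgoingHeaders(tid));
    }
}
```

With every service logging the same id, a single search in the log aggregation tool returns the full path of one request.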

Page 47: Velocity 2015 Amsterdam: Alerts overload

Healthchecks tell you whether a service is OK

GET http://{service}/__health

returns 200 if the service can run the healthcheck

each check will return "ok": true or "ok": false
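The contract above can be sketched as follows. This is an illustration, not the FT's implementation: only the `__health` path, the 200 status, and the per-check `"ok"` flag come from the slides. The `checks` and `name` fields, and the class itself, are assumptions made for the example, and check names are assumed to be plain ASCII so no JSON escaping is needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class HealthcheckSketch {

    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    void register(String name, BooleanSupplier check) {
        checks.put(name, check);
    }

    // Status for GET /__health: 200 whenever the healthcheck itself could run.
    // Individual failures are reported in the body, not through the status code.
    int status() {
        return 200;
    }

    // Body listing each check with "ok": true or "ok": false.
    String body() {
        StringBuilder json = new StringBuilder("{\"checks\":[");
        boolean first = true;
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a check that throws counts as failed
            }
            if (!first) {
                json.append(',');
            }
            json.append("{\"name\":\"").append(entry.getKey())
                .append("\",\"ok\":").append(ok).append('}');
            first = false;
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        HealthcheckSketch health = new HealthcheckSketch();
        health.register("database", () -> true);
        health.register("queue", () -> false);
        System.out.println(health.status() + " " + health.body());
    }
}
```

Keeping the status at 200 lets a monitoring tool distinguish "the service is down" from "the service is up but one of its dependencies is failing".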

Page 50: Velocity 2015 Amsterdam: Alerts overload

Synthetic requests tell you about problems early

https://www.flickr.com/photos/jted/5448635109

Page 51: Velocity 2015 Amsterdam: Alerts overload

2. Use the right tools for the job

Page 52: Velocity 2015 Amsterdam: Alerts overload

There are basic tools you need

Page 53: Velocity 2015 Amsterdam: Alerts overload

FT Platform: An internal PaaS

Page 54: Velocity 2015 Amsterdam: Alerts overload

Service monitoring (e.g. Nagios)

Page 55: Velocity 2015 Amsterdam: Alerts overload

Log aggregation (e.g. Splunk)

Page 56: Velocity 2015 Amsterdam: Alerts overload

Graphing (e.g. Graphite/Grafana)

Page 57: Velocity 2015 Amsterdam: Alerts overload

metrics:
  reporters:
    - type: graphite
      frequency: 1 minute
      durationUnit: milliseconds
      rateUnit: seconds
      host: <%= @graphite.host %>
      port: 2003
      prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>

Page 60: Velocity 2015 Amsterdam: Alerts overload

Real time error analysis (e.g. Sentry)

Page 61: Velocity 2015 Amsterdam: Alerts overload

Build other tools to support you

Page 62: Velocity 2015 Amsterdam: Alerts overload

SAWS

Built by Silvano Dossan

See our Engine room blog: http://bit.ly/1GATHLy

Page 63: Velocity 2015 Amsterdam: Alerts overload

"I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"

Page 65: Velocity 2015 Amsterdam: Alerts overload

"Our screens have a viewing angle of about 10 degrees"

"It never seems to show the page I want"

Page 66: Velocity 2015 Amsterdam: Alerts overload

Code at: https://github.com/muce/SAWS

Page 67: Velocity 2015 Amsterdam: Alerts overload

Dashing

Page 69: Velocity 2015 Amsterdam: Alerts overload

Nagios chart

Built by Simon Gibbs

@simonjgibbs

Page 74: Velocity 2015 Amsterdam: Alerts overload

Use the right communication channel

Page 75: Velocity 2015 Amsterdam: Alerts overload

It's not email

Page 76: Velocity 2015 Amsterdam: Alerts overload

Slack integration

Page 78: Velocity 2015 Amsterdam: Alerts overload

Radiators everywhere

Page 79: Velocity 2015 Amsterdam: Alerts overload

3. Cultivate your alerts

Page 80: Velocity 2015 Amsterdam: Alerts overload

Review the alerts you get

Page 81: Velocity 2015 Amsterdam: Alerts overload

If it isn't helpful, make sure you don't get sent it again

Page 82: Velocity 2015 Amsterdam: Alerts overload

See if you can improve it

www.workcompass.com/

Page 83: Velocity 2015 Amsterdam: Alerts overload

Splunk Alert: PROD - MethodeAPIResponseTime5MAlert

Business Impact

The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.

...

Page 85: Velocity 2015 Amsterdam: Alerts overload

Technical Impact

The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.

Page 86: Velocity 2015 Amsterdam: Alerts overload

Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below.

Please see the run book for more information.

_time                     transaction_id  uuid
Mon Oct 12 07:43:54 2015  tid_pbueyqnsqe  a56a2698-6e90-11e5-8608-a0853fb4e1fe

Page 89: Velocity 2015 Amsterdam: Alerts overload

When you didn't get an alert

Page 90: Velocity 2015 Amsterdam: Alerts overload

What would have told you about this?

Page 92: Velocity 2015 Amsterdam: Alerts overload

Setting up an alert is part of fixing the problem

✔ code

✔ test

alerts

Page 93: Velocity 2015 Amsterdam: Alerts overload

System boundaries are more difficult

Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Page 94: Velocity 2015 Amsterdam: Alerts overload

Make sure you would know if an alert stopped working

Page 95: Velocity 2015 Amsterdam: Alerts overload

Add a unit test

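public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
    // Assert that the publish-failure log message still contains the words the Splunk alert searches for.
}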
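A sketch of what such a test might check, written in plain Java so it needs no test framework. The trigger words, the log message format, and the helper names are all hypothetical; the slides give only the test name. The point is that the alert search depends on specific words in a log line, so a test should fail if someone rewords that line.

```java
final class PublishFailureLogging {

    // Hypothetical trigger words; the real Splunk search terms are not shown in the slides.
    static final String[] ALERT_TRIGGER_WORDS = {"Publish", "failed"};

    // Hypothetical log message written when a publish fails.
    static String publishFailureMessage(String transactionId, String uuid) {
        return "Publish failed: transaction_id=" + transactionId + " uuid=" + uuid;
    }

    static boolean containsAllTriggerWords(String message) {
        for (String word : ALERT_TRIGGER_WORDS) {
            if (!message.contains(word)) {
                return false;
            }
        }
        return true;
    }

    // Plain-Java stand-in for the test named on the slide.
    static void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
        String message = publishFailureMessage(
                "tid_pbueyqnsqe", "a56a2698-6e90-11e5-8608-a0853fb4e1fe");
        if (!containsAllTriggerWords(message)) {
            throw new AssertionError(
                    "Log message no longer matches the alert search: " + message);
        }
    }

    public static void main(String[] args) {
        shouldIncludeTriggerWordsForPublishFailureAlertInSplunk();
        System.out.println("trigger words present");
    }
}
```

Sharing the trigger words as a constant between the logging code and the test keeps the two from drifting apart, although the alert definition itself still has to be kept in step by hand.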

Page 96: Velocity 2015 Amsterdam: Alerts overload

Deliberately break things

Page 97: Velocity 2015 Amsterdam: Alerts overload

Chaos snail

Page 98: Velocity 2015 Amsterdam: Alerts overload

The thing that sends you alerts needs to be up and running

https://www.flickr.com/photos/davidmasters/2564786205/
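One common way to notice that the alerting system itself has stopped is a heartbeat, sometimes called a dead man's switch: the alerter reports in regularly, and a separate watcher raises the alarm when it goes quiet. The slides do not describe how the FT handled this, so the sketch below is only an illustration of the general technique, with hypothetical names and thresholds.

```java
import java.time.Duration;
import java.time.Instant;

final class AlerterHeartbeat {

    private final Duration maxSilence;
    private Instant lastBeat;

    AlerterHeartbeat(Duration maxSilence, Instant startedAt) {
        this.maxSilence = maxSilence;
        this.lastBeat = startedAt;
    }

    // Called by the alerting system each time it completes a cycle.
    void beat(Instant now) {
        lastBeat = now;
    }

    // Called by an independent watcher; true means the alerter has been silent too long.
    boolean isStale(Instant now) {
        return Duration.between(lastBeat, now).compareTo(maxSilence) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2015-10-12T07:00:00Z");
        AlerterHeartbeat heartbeat = new AlerterHeartbeat(Duration.ofMinutes(10), start);
        System.out.println(heartbeat.isStale(start.plusSeconds(300))); // prints false
        System.out.println(heartbeat.isStale(start.plusSeconds(601))); // prints true
    }
}
```

The watcher has to run somewhere other than the alerting system it watches, otherwise both fail together.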

Page 99: Velocity 2015 Amsterdam: Alerts overload

What's happened to our alerts?

Page 100: Velocity 2015 Amsterdam: Alerts overload

We turned off ALL emails from system monitoring

Page 101: Velocity 2015 Amsterdam: Alerts overload

Our two most important alerts come in via our team Slack channel

Page 102: Velocity 2015 Amsterdam: Alerts overload

We have dashboards for our read APIs in Grafana

Page 103: Velocity 2015 Amsterdam: Alerts overload

To summarise...

Page 104: Velocity 2015 Amsterdam: Alerts overload

Build microservices

Page 106: Velocity 2015 Amsterdam: Alerts overload

About technology at the FT:

Look us up on Stack Overflow

http://bit.ly/1H3eXVe

Read our blog

http://engineroom.ft.com/

Page 107: Velocity 2015 Amsterdam: Alerts overload

The FT on github

https://github.com/Financial-Times/

https://github.com/ftlabs

Page 108: Velocity 2015 Amsterdam: Alerts overload

Questions?