Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t...

182
Iterative dashboards & monitors CARMEL HINKS | SOFTWARE ENGINEER | ATLASSIAN

Transcript of Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t...

Page 1: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Iterative dashboards & monitors

CARMEL HINKS | SOF TWARE ENGINEER | ATLASSIAN

Page 2: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

You build it, you run it

Page 3: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

I addressed all operational concerns

You build it, you run it

Nice! We are finished forever!

Page 4: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

I addressed all operational concerns

Nice! We are finished forever!

Past Present

Page 5: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

For now…

Past Present

I addressed all operational concerns

Page 6: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 7: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 8: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,
Page 9: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Multi-tenant

Page 10: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,
Page 11: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Page 12: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Page 13: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 14: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard

Page 15: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Page 16: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Page 17: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Page 18: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Page 19: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 20: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Servic

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Page 21: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Page 22: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 23: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 24: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 25: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Europe Australia USA

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Page 26: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Europe Australia USA

Page 27: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Database

App App

Queue

App

Page 28: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 29: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Metrics from the shards, about the shards

Page 30: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 31: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 32: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 33: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 34: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 35: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 36: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 37: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

It is a capital mistake to theorise before one has data

SHERLOCK HOLMES

Page 38: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

It is a capital mistake to theorise before one has data

SHERLOCK HOLMES (DEVOPS ADVOCATE)

Page 39: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure nothing

Page 40: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure nothingMetrics aren’t verified before going live

Page 41: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure nothingMetrics aren’t verified before going live

First incident is going to SUCK

Page 42: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure nothingMetrics aren’t verified before going live

First incident is going to SUCK

This isn’t a solution, it’s a deferral

Page 43: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure everything

Page 44: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure everythingExpensive (time, money & resources)

Page 45: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure everything

Lots of noise

Expensive (time, money & resources)

Page 46: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure everythingExpensive (time, money & resources)

Lots of noise

Does not scale

Page 47: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Measure stuff from out the box

Page 48: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SHARD SERVICE DASHBOARD

Page 49: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SHARD SERVICE DASHBOARD

Page 50: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SHARD SERVICE DASHBOARD

Page 51: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SHARD SERVICE DASHBOARD

Page 52: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

IF CPU > 80% for over 5 minutes THEN page

Page 53: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Think of idea

Design service

Build service (MVP)

Reach operational maturity

Release

Iterate service

Iterate service

Page 54: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Think of idea

Design service

Build service (MVP)

Reach operational maturity

Release

Iterate service

Iterate service

Page 55: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Page 56: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Page 57: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Page 58: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Page 59: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sign me up to Jira!

Page 60: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What questions do we want to be able to answer with our operational resources?

TAKING A STEP BACK

Page 61: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard ServicePerforms the selection of a suitable shard based on geographical location and dynamic capacity metrics

Page 62: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard ServicePerforms the selection of a suitable shard based on geographical location and dynamic capacity metrics

Synchronous, http-facing

Page 63: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Page 64: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Page 65: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

Page 66: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Page 67: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

Page 68: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Page 69: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Page 70: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Infrastructure health Useful metrics tied to components in your techstack.

Page 71: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Infrastructure health Useful metrics tied to components in your techstack.

Application health Useful metrics tied to the domain of your application

Page 72: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Application metricsInfrastructure metrics + =

Page 73: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Application metricsInfrastructure metrics +

Latency

Memory utilisation

Load balancer errors

=

Page 74: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Application metricsInfrastructure metrics

Shard capacity

Errors logged

Shard selection reason

Latency

Memory utilisation

Load balancer errors

+ =

Page 75: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Application metricsInfrastructure metrics

Shard capacity

Errors logged

Shard selection reason

Latency

Memory utilisation

Load balancer errorsMetrics about the shards, from Shard Service

+ =

Page 76: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

`

Page 77: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Page 78: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SHARD SERVICE INFRASTRUCTURE

Page 79: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Region capacity exhausted

Monitors

Surge in errors logged

Shard capacity exhausted

Page 80: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

How can you…

Page 81: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

How can you… Figure out what to measure?

Page 82: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What questions do you want to answer?

Page 83: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What questions do you want to answer?

Why does your service exist (what are its roles and responsibilities)?

What does it look like for those roles and responsibilities to degrade?

How can you verify whether or not such a degradation is occurring?

Page 84: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 85: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 86: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Everything is right.

Page 87: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Fine… or exploding Never checked operational health unless it was on fire

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Page 88: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Elastic load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

As time went on…

Page 89: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Elastic load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

LatencyLoad balancer errors

Healthy hosts

Page 90: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Elastic load balancerApplication load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

Page 91: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Application load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

LatencyLoad balancer errors

Healthy hosts

As time went on…

Page 92: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Page 93: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Page 94: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Our team

Team who could actually fix the problem

Page 95: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Page 96: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Noisy alerts Frequent & un-actionable

As time went on…

Fine… or exploding Never checked operational health unless it was on fire

Things changed Because, you know, agile

Page 97: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What level of service you can commit to offer

SERVICE LEVEL OBJECTIVE

Page 98: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What level of service you can commit to offer

SERVICE LEVEL OBJECTIVE

E.g. 99.99% requests should succeed

Page 99: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

We were not alone

Page 100: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Process dedicated to regularly reviewing, discussing and iterating on operational health

TECHOPS

Page 101: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Page 102: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Page 103: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Page 104: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Page 105: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Page 106: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

TechOps for everyone!

Page 107: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Goal

TechOps for everyone!

Page 108: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

GoalReduce the number of noisy alerts

Page 109: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

DataReduce the number of noisy alerts

Page 110: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

DataAlerts received in the past weekReduce the number of noisy alerts

87

Page 111: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

87Alerts received in the past week

Page 112: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

87Low priority alerts

Page 113: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Reduce the number of noisy alerts

87

Report

Page 114: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

ReportAlerts, dashboard screenshots, incidents…Reduce the number of noisy alerts

87

Page 115: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Reduce the number of noisy alertsMeet & discuss

Page 116: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Page 117: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Page 118: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Page 119: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Page 120: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

ALERTS (ALL SERVICES, STAGING + PRODUCTION)

0

25

50

75

100

Week 1 Week 3 Week 5 Week 7 Week 9 Week 11 Week 13 Week 15 Week 17 Week 19 Week 21

Total High priority Low priority

Page 121: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

CASE #2 - RELIABILITY INCREASETotal High priority Low priority

Page 122: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

CASE #3 - ALERT REDUCTION

Page 123: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

How can you…

Page 124: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Verify you’re measuring the right things?How can you…

Page 125: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Review your operational resources!

Page 126: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

…frequently

Review your operational resources!

Page 127: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 128: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 129: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

Database

App App

Queue

App

Page 130: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 131: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard Service

70% 20% 90% 5%

Shard A

Page 132: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Page 133: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Shard Sevice

Page 134: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Shard Sevice

Page 135: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Shard Sevice

Page 136: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Shard Sevice

Page 137: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Shard A

Database

App

Queue

App App

Shard Sevice

Page 138: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

What’s the big deal?

Page 139: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Page 140: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Page 141: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow

Page 142: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow, error prone

Page 143: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow, error prone, forgettable

Page 144: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Confluence Apdex Count metric: current vs target (by shard)Jira Apdex Count metric: current vs target (by shard)Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

{ }Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Page 145: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

{ }

{ }Application

{ }Test

Shard Service repository

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

Page 146: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

{ }

Shard Service repository Dashboard tool

Marge’s Service

Homer’s Service

Shard Service

{ }Application

{ }Test

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

Page 147: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

{ }

Shard Service repository Dashboard tool

Marge’s Service

Homer’s Service

Shard Service

{ }Application

{ }Test

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

Page 148: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Discoverable

Operational resources as code

Front of mind

Version control

Page 149: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Discoverable

Operational resources as code

Front of mind

Version control

Page 150: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Discoverable

Operational resources as code

Front of mind

Version control

Page 151: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Going a step further…

286Operational resources as code

Page 152: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

286

Page 153: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

286Dashboards

Page 154: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

JIRA SHARD DASHBOARD (SUBSET)

286

Page 155: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Cue, templates

Page 156: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

This can be solved at the platform level

SERGEJS SINICA, ATLASSIAN SENIOR DEVELOPER

Cue, templates

Page 157: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Introducing, Sauron

Page 158: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

The “all seeing eye” for dashboards & monitors

Introducing, Sauron

Page 159: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Dashboard tool

Shard Service

Sauron

Marge’s Service

Homer’s Service

Page 160: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SauronExport my dashboard!

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 161: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SauronExport my dashboard!

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 162: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SauronExport my dashboard!

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Page 163: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SauronExport my dashboard!

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Page 164: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Page 165: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Page 166: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 167: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

Application repository

{ }{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 168: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

Application repository

{ }

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 169: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Page 170: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Monitors

Dashboards

Screenboards

shard-service operations

Page 171: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

> 50

Page 172: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

> 50Services adopted Sauron

Page 173: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

How can you…

Page 174: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Help your team keep up to date with change?

How can you…

Page 175: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Define operational resources in code

Page 176: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 177: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Page 178: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,
Page 179: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

SelectLearn what questions you want to answer

Page 180: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Select

Verify

Learn what questions you want to answer

Review, review, review

Page 181: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Select

VerifyKeep up

Learn what questions you want to answer

Review, review, reviewDefine all the things in code

Page 182: Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t verified before going live First incident is going to SUCK This isn’t a solution,

Thank you!