Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t...

Post on 18-Jul-2020

2 views 0 download

Transcript of Iterative dashboards & monitors...First incident is going to SUCK. Measure nothing Metrics aren’t...

Iterative dashboards & monitors

CARMEL HINKS | SOF TWARE ENGINEER | ATLASSIAN

You build it, you run it

I addressed all operational concerns

You build it, you run it

Nice! We are finished forever!

I addressed all operational concerns

Nice! We are finished forever!

Past Present

For now…

Past Present

I addressed all operational concerns

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Multi-tenant

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Sign me up to Jira!

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Provisioning pipeline

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Servic

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Europe Australia USA

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Database

App App App

Queue

Shard Service

Europe Australia USA

Shard Service

Database

App App

Queue

App

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Metrics from the shards, about the shards

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

It is a capital mistake to theorise before one has data

SHERLOCK HOLMES

It is a capital mistake to theorise before one has data

SHERLOCK HOLMES (DEVOPS ADVOCATE)

Measure nothing

Measure nothingMetrics aren’t verified before going live

Measure nothingMetrics aren’t verified before going live

First incident is going to SUCK

Measure nothingMetrics aren’t verified before going live

First incident is going to SUCK

This isn’t a solution, it’s a deferral

Measure everything

Measure everythingExpensive (time, money & resources)

Measure everything

Lots of noise

Expensive (time, money & resources)

Measure everythingExpensive (time, money & resources)

Lots of noise

Does not scale

Measure stuff from out the box

SHARD SERVICE DASHBOARD

SHARD SERVICE DASHBOARD

SHARD SERVICE DASHBOARD

SHARD SERVICE DASHBOARD

IF CPU > 80% for over 5 minutes THEN page

Think of idea

Design service

Build service (MVP)

Reach operational maturity

Release

Iterate service

Iterate service

Think of idea

Design service

Build service (MVP)

Reach operational maturity

Release

Iterate service

Iterate service

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Sign me up to Jira!

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Database

Nod Nod Nod

Queue

Sign me up to Jira!

What questions do we want to be able to answer with our operational resources?

TAKING A STEP BACK

Shard ServicePerforms the selection of a suitable shard based on geographical location and dynamic capacity metrics

Shard ServicePerforms the selection of a suitable shard based on geographical location and dynamic capacity metrics

Synchronous, http-facing

Shard Service

Requests slow down significantly

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Infrastructure health Useful metrics tied to components in your techstack.

Shard Service

Requests slow down significantly

Requests are accepted, but then fail

Requests start being rejected

There are no suitable shards

Incorrect shards were selected

There is insufficient data to make decisions

Infrastructure metrics

Application metrics

Infrastructure health Useful metrics tied to components in your techstack.

Application health Useful metrics tied to the domain of your application

Application metricsInfrastructure metrics + =

Application metricsInfrastructure metrics +

Latency

Memory utilisation

Load balancer errors

=

Application metricsInfrastructure metrics

Shard capacity

Errors logged

Shard selection reason

Latency

Memory utilisation

Load balancer errors

+ =

Application metricsInfrastructure metrics

Shard capacity

Errors logged

Shard selection reason

Latency

Memory utilisation

Load balancer errorsMetrics about the shards, from Shard Service

+ =

`

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

SHARD SERVICE INFRASTRUCTURE

Region capacity exhausted

Monitors

Surge in errors logged

Shard capacity exhausted

How can you…

How can you… Figure out what to measure?

What questions do you want to answer?

What questions do you want to answer?

Why does your service exist (what are its roles and responsibilities)?

What does it look like for those roles and responsibilities to degrade?

How can you verify whether or not such a degradation is occurring?

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Everything is right.

Fine… or exploding Never checked operational health unless it was on fire

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Elastic load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

As time went on…

Elastic load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

LatencyLoad balancer errors

Healthy hosts

Elastic load balancerApplication load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

Application load balancer

Shard Service Node 1

Shard Service Node 2

Shard Service Node 3

LatencyLoad balancer errors

Healthy hosts

As time went on…

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Our team

Team who could actually fix the problem

Noisy alerts Frequent & un-actionable

As time went on…

Things changed Because, you know, agile

Fine… or exploding Never checked operational health unless it was on fire

Noisy alerts Frequent & un-actionable

As time went on…

Fine… or exploding Never checked operational health unless it was on fire

Things changed Because, you know, agile

What level of service you can commit to offer

SERVICE LEVEL OBJECTIVE

What level of service you can commit to offer

SERVICE LEVEL OBJECTIVE

E.g. 99.99% requests should succeed

We were not alone

Process dedicated to regularly reviewing, discussing and iterating on operational health

TECHOPS

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

Develop measurable goalsTechOpsCollect data

Prepare a report

Meet and discuss

Repeat and iterate

TechOps for everyone!

Goal

TechOps for everyone!

GoalReduce the number of noisy alerts

DataReduce the number of noisy alerts

DataAlerts received in the past weekReduce the number of noisy alerts

87

87Alerts received in the past week

87Low priority alerts

Reduce the number of noisy alerts

87

Report

ReportAlerts, dashboard screenshots, incidents…Reduce the number of noisy alerts

87

Reduce the number of noisy alertsMeet & discuss

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

Meet & discussActionable? Discoverable? Useful?Reduce the number of noisy alerts

ALERTS (ALL SERVICES, STAGING + PRODUCTION)

0

25

50

75

100

Week 1 Week 3 Week 5 Week 7 Week 9 Week 11 Week 13 Week 15 Week 17 Week 19 Week 21

Total High priority Low priority

CASE #2 - RELIABILITY INCREASETotal High priority Low priority

CASE #3 - ALERT REDUCTION

How can you…

Verify you’re measuring the right things?How can you…

Review your operational resources!

…frequently

Review your operational resources!

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Shard Service

Database

App App

Queue

App

Shard Service

70% 20% 90% 5%

Shard A

Shard Service

70% 20% 90% 5%

Shard A

Shard A

Database

App

Queue

App App

Shard A

Database

App

Queue

App App

Shard Sevice

Shard A

Database

App

Queue

App App

Shard Sevice

Shard A

Database

App

Queue

App App

Shard Sevice

Shard A

Database

App

Queue

App App

Shard Sevice

Shard A

Database

App

Queue

App App

Shard Sevice

What’s the big deal?

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow, error prone

Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

Panel per metric

Slow, error prone, forgettable

Confluence Apdex Count metric: current vs target (by shard)Jira Apdex Count metric: current vs target (by shard)Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

{ }Confluence Apdex Count metric: current vs target (by shard) Jira Apdex Count metric: current vs target (by shard) Database utilisation metric: current vs target (by shard)

Provisioning failures metric: current vs target (by shard) Average remaining capacity by shard (Jira Apdex) Average remaining capacity by shard (Confluence Apdex)

Top selected region (internal) Top selected region (AWS) Top selection reasons

Top selected shards

{ }

{ }Application

{ }Test

Shard Service repository

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

{ }

Shard Service repository Dashboard tool

Marge’s Service

Homer’s Service

Shard Service

{ }Application

{ }Test

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

{ }

Shard Service repository Dashboard tool

Marge’s Service

Homer’s Service

Shard Service

{ }Application

{ }Test

ConfluenJira Databas

Provisioni Average Average

Top Top Top

To

ConfluencJira Databas

Provision Average Average

Top Top Top

To

Discoverable

Operational resources as code

Front of mind

Version control

Discoverable

Operational resources as code

Front of mind

Version control

Discoverable

Operational resources as code

Front of mind

Version control

Going a step further…

286Operational resources as code

286

286Dashboards

JIRA SHARD DASHBOARD (SUBSET)

286

Cue, templates

This can be solved at the platform level

SERGEJS SINICA, ATLASSIAN SENIOR DEVELOPER

Cue, templates

Introducing, Sauron

The “all seeing eye” for dashboards & monitors

Introducing, Sauron

Dashboard tool

Shard Service

Sauron

Marge’s Service

Homer’s Service

SauronExport my dashboard!

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

SauronExport my dashboard!

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

SauronExport my dashboard!

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

SauronExport my dashboard!

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Sauron

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

ConfJira DatProvi Aver Avera

ToToT T

ConflJira DatProvi Aver Aver

ToTT T

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Sauron

Application repository

{ }{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Sauron

Application repository

{ }

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Sauron

Application repository

{ }

Dashboard tool

Shard Service

Marge’s Service

Homer’s Service

Monitors

Dashboards

Screenboards

shard-service operations

> 50

> 50Services adopted Sauron

How can you…

Help your team keep up to date with change?

How can you…

Define operational resources in code

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

Agenda

Setting some context

Deciding what to measure

Verifying your metrics

Keeping up with change

Summary

SelectLearn what questions you want to answer

Select

Verify

Learn what questions you want to answer

Review, review, review

Select

VerifyKeep up

Learn what questions you want to answer

Review, review, reviewDefine all the things in code

Thank you!