Why Visibility into Your Stack Matters

Post on 14-Aug-2015

Why visibility into your stack matters

or, Do you see it all?

Mike Fiedler

Operations

Datadog.com

Twitter: @mikefiedler

GitHub: @miketheman

OpsSchool.org

Chef Community

Roller Derby Referee

Skydiver

©Alex Erde

–CEO calling your cellphone at 03:00

“The site is slow.”

What?

• typical monitoring implementation story

• an alternative approach

(CC BY 2.0) http://www.gotcredit.com/ https://flic.kr/p/6439SA

[Diagram: User, LB, Web, Data]

(CC BY 2.0) www.futurealpha.com https://flic.kr/p/8PhF4g

(CC BY 2.0) Aristocrats-hat https://flic.kr/p/6qdTC1

–W. Edwards Deming (attributed), as quoted in The Elements of Statistical Learning

“In God we trust; all others bring data.”

You want more?

• graphite

• ganglia

• mongodb

• mysql

• influxdb

• socket.io

• datadog

• …

from bottle import route
import pymongo
import json

db = pymongo.Connection('mongodb://...

@route('/insert/:name')
def insert(name):
    doc = {'name': name}
    db.words.update(
        doc, {"$inc": {"count": 1}}, upsert=True)
    return json.dumps(doc, default=default)

from bottle import route
import pymongo
import json
from statsd import statsd

db = pymongo.Connection('mongodb://...

@route('/insert/:name')
def insert(name):
    # Count each request; increment() is a plain call, not a decorator.
    statsd.increment('wordcount.insert')
    doc = {'name': name}
    db.words.update(
        doc, {"$inc": {"count": 1}}, upsert=True)
    return json.dumps(doc, default=default)
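For a sense of how little a call like statsd.increment('wordcount.insert') costs the application, here is a minimal sketch, assuming the plain StatsD text protocol over UDP (`name:value|c` for a counter) and an agent listening on 127.0.0.1:8125. The helper names `format_counter` and `send_counter` are hypothetical and not part of any library.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter datagram, e.g. 'wordcount.insert:1|c'."""
    return "{}:{}|c".format(name, value)


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Fire-and-forget one counter increment over UDP.

    The host and port are assumptions for illustration only.
    """
    payload = format_counter(name, value).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is connectionless, so the send does not wait for a reply
        # from the agent; the request handler is not held up by metrics.
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

The application only formats a short string and hands it to the network stack; aggregation happens in the agent, outside the request path.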

Time is a Cruel Master

(CC BY-SA 2.0) https://www.flickr.com/theilr/ https://flic.kr/p/8MC5YM

Have

• systems

• applications

• services

• developers

• operators

• customers

Polyglot Platforms

Complex Systems

Disparate Locations

Information Overload

–CEO calling your cellphone at 03:00

“The site is slow.”

(CC BY 2.0) www.futurealpha.com https://flic.kr/p/8PhF4g

Does this matter?

Top-down

• work metrics

• resource metrics

• events

Work Metrics: throughput (rps), success/error, performance (latency)

Resource Metrics: utilization (%busy), saturation (queued), errors, availability

Events: change/build/deploy, alerts, anything notable
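The three work metrics named above can be derived from a window of request records. This is a rough sketch, not the speaker's implementation: the `Request` record, the `work_metrics` function, and the choice of a nearest-rank 95th percentile for latency are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical request record: duration in seconds and success flag.
Request = namedtuple("Request", ["duration_s", "ok"])


def work_metrics(requests, window_s):
    """Summarize throughput, error rate, and latency for one time window."""
    total = len(requests)
    if total == 0:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_latency_s": None}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: rank = ceil(0.95 * total), in integer math.
    rank = -(-95 * total // 100)
    return {
        "throughput_rps": total / float(window_s),
        "error_rate": errors / float(total),
        "p95_latency_s": durations[rank - 1],
    }


sample = [
    Request(0.1, True),
    Request(0.2, True),
    Request(0.3, False),
    Request(0.4, True),
]
summary = work_metrics(sample, window_s=2)
```

Resource metrics (utilization, saturation) would come from the hosts and services rather than from request records, so they are left out of this sketch.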

Trend resource metrics, notify on changes

Wake people up when work metrics go awry
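One way to read the two rules above is as a routing policy: the same kind of deviation gets a different response depending on whether it shows up in a work metric or a resource metric. The sketch below assumes a simple relative-change check against a baseline; the function names and the 50% default threshold are hypothetical.

```python
def relative_change(baseline, current):
    """Absolute change of current relative to baseline (0.25 means 25%)."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def route_alert(kind, baseline, current, max_change=0.5):
    """Return 'page', 'notify', or 'none' for one metric observation.

    kind is 'work' (user-facing symptom) or 'resource' (possible cause).
    """
    if kind not in ("work", "resource"):
        raise ValueError("unknown metric kind: %r" % (kind,))
    if relative_change(baseline, current) <= max_change:
        return "none"
    # Work metrics going awry wake someone up; resource changes are
    # recorded and sent as a low-urgency notification.
    return "page" if kind == "work" else "notify"
```

For example, latency jumping from 100 ms to 400 ms would page, while disk utilization moving from 40% to 90% would only notify.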

Slice and Dice: exploration and aggregation
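Slicing and dicing depends on metrics carrying tags that can be grouped on later. A minimal sketch, assuming each data point is a `(value, tags)` pair with a dictionary of tags; the `aggregate_by` helper and the tag names are hypothetical.

```python
from collections import defaultdict


def aggregate_by(points, tag_key):
    """Sum point values grouped by the value of one tag."""
    totals = defaultdict(float)
    for value, tags in points:
        # Points missing the tag are grouped under 'untagged'.
        totals[tags.get(tag_key, "untagged")] += value
    return dict(totals)


points = [
    (3, {"host": "web1", "region": "us"}),
    (5, {"host": "web2", "region": "us"}),
    (2, {"host": "web3", "region": "eu"}),
]
by_region = aggregate_by(points, "region")
by_host = aggregate_by(points, "host")
```

The same points answer different questions depending on the tag chosen, with no need to decide the grouping at collection time.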

Set-and-Forget

Just-In-Time

Information

Does it scale?

Customer Stats

• AdRoll, ~2M transactions/second

• SimpleReach, ~7B measurements/day

• MercadoLibre, ~18k hosts monitored

• Airbnb, 3000+ monitors defined

–CEO calling your cellphone at 03:00

“The site is slow.”

–You

“Thanks. We know, and are already investigating.”

–You, because you never got that call in the first place due to proactive data collection and alerting.

“[silence]”

Questions?

–M. Fiedler, Twitter: @mikefiedler

“If you don’t measure, you won’t know.”