Data Driven Monitoring

Daniel Schauenberg

[email protected]

@mrtazz

Item by TheBackPackShoppe

http://www.flickr.com/photos/brianglanz/1095706242

How comfortable are you deploying a change right now?

“If this is your first day at Etsy, you deploy the site”

Ganglia

• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes

Rainbow Graphs!

StatsD

Graphite

• Application level metrics
• 96G RAM, 20 Cores, 7.3T SSD RAID 10
• 525k metrics/minute
• Mirrored Primary/Primary Setup
• Functionally sharded relays
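
As a rough illustration of what a single application-level data point looks like on its way into Graphite, the sketch below formats one line in Carbon's plaintext format (`path value timestamp`). The function name and the metric name are hypothetical and not from the talk, and the slides do not say which protocol the relays described above actually receive.

```python
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one data point as 'path value timestamp' followed by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


# Hypothetical metric name, for illustration only.
line = graphite_line("app.checkout.response_ms", 42.5, timestamp=1400000000)
```

If every data point were one such line, the 525k metrics/minute quoted above would work out to roughly 8,750 lines per second.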

nagios

<3 nagios

Nagios

• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call

github.com/lozzd/nagdash

Much more…

• Syslog-ng
• Logstash
• Logster
• Supergrep
• Eventinator

Information Overload

Image by http://jasoncasteel.deviantart.com/

Alert Fatigue

We have the data

We can make it better

Item by PicksFromThePast

nagios-herald

Diagram: Failed Check -> nagios-herald (Formatter; Helpers: Graphite, Ganglia, Logstash) -> Message
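
The flow in this diagram (a failed check goes in, a formatter pulls context through helpers, an enriched message comes out) can be sketched roughly as follows. nagios-herald itself is written in Ruby; this is a Python illustration of the idea, not its API, and every name here (`FailedCheck`, `format_message`, the stand-in helpers, the host and service) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FailedCheck:
    host: str
    service: str
    output: str


# A helper takes the failed check and returns a piece of context as text.
Helper = Callable[[FailedCheck], str]


def format_message(check: FailedCheck, helpers: Dict[str, Helper]) -> str:
    """Build a notification body: the raw check output plus context from each helper."""
    lines: List[str] = [f"{check.service} on {check.host}: {check.output}"]
    for name, helper in helpers.items():
        lines.append(f"[{name}] {helper(check)}")
    return "\n".join(lines)


# Stand-in helpers; real ones would query Graphite, Ganglia, or Logstash.
helpers = {
    "graphite": lambda c: f"graph link for {c.host} (placeholder)",
    "logstash": lambda c: f"recent log lines for {c.service} (placeholder)",
}

check = FailedCheck(host="web01", service="disk_space", output="DISK CRITICAL - 95% used")
message = format_message(check, helpers)
```

Keeping helpers as plain callables keyed by name means a formatter for one check can pick only the context sources that are useful for that alert.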

github.com/etsy/nagios-herald

opsweekly

Alert categorization
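
One way to picture what categorizing alerts buys you: once each alert from an on-call shift carries a label, a simple tally shows how much of the paging was noise. The sketch below is only an illustration; the category labels, the alert names, and the `summarize` function are made up and are not opsweekly's own.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize(alerts: Iterable[Tuple[str, str]]) -> Counter:
    """Count alerts per category; each alert is a (service, category) pair."""
    return Counter(category for _service, category in alerts)


# Made-up week of on-call alerts with made-up category labels.
week = [
    ("disk_space", "action taken"),
    ("disk_space", "no action needed"),
    ("http_check", "no action needed"),
    ("load", "no action needed"),
]
counts = summarize(week)
noisy_share = counts["no action needed"] / len(week)
```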

Wearables!

Item by JennysTrinketShoppe

Sleep tracking

github.com/etsy/opsweekly

Summary

• Set of trusted tools for monitoring
• Always experiment
• Always learn
• Always improve
• Use the data, Luke

Shout out to @lozzd and @Ryan_Frantz

codeascraft.com
etsy.com/codeascraft/talks
etsy.github.com
etsy.com/careers

Questions?
