Mo' Metrics, Mo' Problems

Post on 25-Jan-2017

MO’ METRICS, MO’ PROBLEMS

Erin Willingham, Infrastructure Engineer at Krux Digital

Twitter: GreenSilex

https://www.linkedin.com/in/erin-willingham-104082126

Krux

http://www.krux.com

GRAPHITE: THEN & NOW

What works, what doesn't, and why we did what we did

http://www.lowcountryafricana.com/wp-content/uploads/2015/10/Research-Plan-Chalkboard-Slate-1000px.jpg

GRAPHS

http://i.stack.imgur.com/WBsLg.png

<metric path> <metric value> <metric timestamp>

test.bash.stats.count_ps 50 1473048113

test/bash/stats/count_ps.wsp
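The three lines above show the metric line format, a sample metric, and the Whisper file it lands in. A minimal Python sketch of that mapping follows. The names `Metric`, `parse_line`, and `whisper_path` are hypothetical helpers for illustration, not part of Graphite, and the path is relative to whatever storage directory the cluster uses.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    path: str       # dotted metric path, e.g. test.bash.stats.count_ps
    value: float    # metric value
    timestamp: int  # Unix epoch seconds


def parse_line(line: str) -> Metric:
    """Split '<metric path> <metric value> <metric timestamp>' into its fields."""
    path, value, timestamp = line.split()
    return Metric(path, float(value), int(timestamp))


def whisper_path(metric_path: str) -> str:
    """Map a dotted metric path to its relative .wsp file (illustrative helper)."""
    return metric_path.replace(".", "/") + ".wsp"


metric = parse_line("test.bash.stats.count_ps 50 1473048113")
print(whisper_path(metric.path))
```

Each dot in the metric path becomes a directory level, which is why every distinct metric ends up as its own file on disk.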

statsd & collectd

relay

aggregator

graphite whisper

GRAPHITE 1.0 ARCHITECTURE

RULES, MERGING, EFFICIENCY & OPERATIONS

https://s-media-cache-ak0.pinimg.com/236x/21/ba/0f/21ba0fe48349a1d5382c261ac25cb6c6.jpg

Graphite

v1

Relays are aware of aggregation rules

Graphite Whisper merges metrics!

Graphite Aggregators are really efficient.

THREADING, SCALING, RELAY CPU, & STORAGE

http://i.dailymail.co.uk/i/pix/2012/06/30/article-2166781-13BCE32D000005DC-492_634x948.jpg

Graphite

v1

Python - single threaded

Relay is CPU intensive

Graphite Whisper - requires sharding and is very I/O intensive

http://obfuscurity.com/

Slow UI when using distributed remote backends

What are we trying to solve? What is forcing the change?

http://oakdome.com/k5/lesson-plans/photo-editing/wanted-poster/wanted-reward-poster-background.jpg

Storage!

Relay & Aggregator CPU usage high

Faster UI

KEEP COSTS LOW

http://3.bp.blogspot.com/-r9l7rltAjnM/Udq8kGlp65I/AAAAAAAAANo/VyQZN48nfMk/s1600/treasurepile.jpg

GRAPHITE ALTERNATIVES

Circonus: All the insights you ever wanted

Hosted Graphite

Zabbix: OSS self-hosted monitoring

CARBON-C-RELAY, KAFKA, SOCAT, CARBON-RELAY-NG, KAFKACAT

https://wtfbabe.files.wordpress.com/2015/06/kung-fury-23-wtf-watch-the-film-saint-pauly.jpeg

The Tools

Carbon-c-relay

https://github.com/grobian/carbon-c-relay

GRAPHITE 2.0 TOOLS

Carbon-relay-ng

https://github.com/graphite-ng/carbon-relay-ng

Kafka Producer: tcp-stream-kafka-producer

https://github.com/krux/tcp-stream-kafka-producer

kafkacat

https://github.com/edenhill/kafkacat

socat

socat "exec:/usr/bin/kafkacat -C -o end -b <kafka broker> -t <kafka topic>",pty,ctty,echo=0 tcp4-connect:localhost:<relay port>
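The socat invocation on this slide pipes the output of a kafkacat consumer into the relay's TCP port. As a rough illustration of what that bridge does, here is a Python sketch of the same idea. It assumes kafkacat emits one plaintext metric per line; `forward_lines` and `bridge` are hypothetical names, and this is not the tooling the talk used.

```python
import socket
import subprocess
from typing import Iterable


def forward_lines(lines: Iterable[bytes], sink: socket.socket) -> int:
    """Send each non-empty metric line to the relay socket; return lines sent."""
    sent = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of forwarding them
        sink.sendall(line + b"\n")
        sent += 1
    return sent


def bridge(broker: str, topic: str, relay_port: int) -> None:
    """Consume from the end of a Kafka topic and stream it to a local relay."""
    # Same kafkacat flags as the slide: consumer mode, start at the end offset.
    cmd = ["/usr/bin/kafkacat", "-C", "-o", "end", "-b", broker, "-t", topic]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc, \
            socket.create_connection(("localhost", relay_port)) as sink:
        forward_lines(proc.stdout, sink)
```

Calling `bridge("<kafka broker>", "<kafka topic>", <relay port>)` with real values would block and forward metrics until kafkacat exits. Unlike socat, this sketch does no reconnect handling.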

BACKEND - STORAGE

http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg

• Whisper

• Ceres

• InfluxDB

• Cyanite

• Riak

• KairosDB

• OpenTSDB

Graphite - Whisper

InfluxDB

KairosDB

GRAPHITE 2.0 ARCHITECTURE

GRAPHITE ARCHITECTURE - SCALABLE

http://www.dinopit.com/wp-content/uploads/2012/07/dinosaur-cowboy.jpg

Why?

LOAD TESTING THE PARTS AND THE PIPELINE

https://github.com/feangulo/graphite-stresser

All the Metrics!

Metrics / min

WHAT WORKED?

Pre-aggregated

Post Aggregated

MIRROR PRODUCTION DATA

https://c2.staticflickr.com/6/5278/5903002116_762783602c_b.jpg

UH OH! THE GRAPHS DON’T MATCH

Old Cluster

New Cluster

HOW DO WE FIX THIS?

http://www.startres.net/startresWP/wp-content/uploads/2013/06/3702A.jpg

TESTING CARBON-RELAY-NG

Carbon-relay-ng uses more than 2 CPUs!

FAILURE POINT FOR CARBON-RELAY-NG

Post Aggregated

Pre-aggregated

Carbon-relay-ng: room for improvement

• scale out aggregators horizontally

• monitor for metrics per second and scale out as needed

• pass metrics that don’t need to be aggregated directly to the backend
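The last bullet, sending only the metrics that need aggregation through the aggregators, could look roughly like the sketch below. `split_for_aggregation` and the `^stats\.` rule are hypothetical and only illustrate the routing idea; they are not carbon-relay-ng configuration or the rules used in the talk.

```python
import re
from typing import Iterable, List, Pattern, Tuple


def split_for_aggregation(
    lines: Iterable[str], rules: List[Pattern[str]]
) -> Tuple[List[str], List[str]]:
    """Return (lines for the aggregators, lines that go straight to the backend)."""
    to_aggregator: List[str] = []
    to_backend: List[str] = []
    for line in lines:
        path = line.split(" ", 1)[0]  # metric path is the first field
        if any(rule.search(path) for rule in rules):
            to_aggregator.append(line)
        else:
            to_backend.append(line)
    return to_aggregator, to_backend


# Hypothetical rule: only paths starting with "stats." need aggregation.
rules = [re.compile(r"^stats\.")]
aggregate, direct = split_for_aggregation(
    [
        "stats.web.requests 1 1473048113",
        "test.bash.stats.count_ps 50 1473048113",
    ],
    rules,
)
print(len(aggregate), len(direct))
```

Counting how many lines land in each list per second would also give the metrics-per-second signal that the second bullet suggests monitoring.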

https://github.com/edenhill/kafkacat

SOLUTION

QUESTIONS?