Scaling graphite to handle a zerg rush
December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform
Mission
The problem
The problem
No metrics across the board
●Hard to debug issues
●No intuitive way to measure efficiency, usage
●Capacity planning?
●Dashboards
The problem
No metrics across the board
●We knew Graphite
●We wanted statsd for application metrics
●And we heard that collectd is nice, so we installed it on 500 physical machines
Graphite
Graphite
Write throughput across our Hadoop fleet
Ingress traffic to our load balancing layer
"Store numeric time series data"
"Render graphs of this data on demand"
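For context, Graphite's plaintext protocol accepts one data point per line in the form `<metric path> <value> <timestamp>`. The snippet below is a minimal sketch of building such a line; the helper name, the metric name and the timestamp are hypothetical and chosen only for illustration.

```python
import time
from typing import Optional


def format_plaintext_point(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of the Graphite plaintext protocol: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        # Default to "now" as a Unix epoch in whole seconds.
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


# Hypothetical metric name and an arbitrary fixed timestamp.
line = format_plaintext_point("loadbalancer.ingress.bytes", 1234.0, 1481414400)
```

Each such line becomes one point in one time series, which is why the number of distinct metric paths matters so much for storage later on.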
Graphite Architecture
Image from: github.com/graphite-project/graphite-web/blob/master/README.md
So why did it crash?
Graphite
●First setup: 2x 1TB magnetic drives in RAID 1
●Volume peaked at ~300 IOPS
●Carbon-cache maxed out the CPU
Max IOPS reached
Single threaded
Graphite
●Why so many IOPS?
●Every metric is a separate file on the filesystem
/var/data/graphite/collectd/{hostname}/cpu/user.wsp
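To make the one-file-per-metric layout concrete, here is a minimal sketch of how a dotted metric name maps to a Whisper file: each dot becomes a directory separator and the last component gets a `.wsp` suffix. The root directory is taken from the path on the slide; the hostname `web01` and the helper name are hypothetical.

```python
import posixpath


def metric_to_whisper_path(metric: str, root: str = "/var/data/graphite") -> str:
    """Map a dotted metric name to its Whisper file: dots become directories."""
    return posixpath.join(root, *metric.split(".")) + ".wsp"


# "web01" is a hypothetical hostname.
path = metric_to_whisper_path("collectd.web01.cpu.user")
```

With hundreds of hosts each reporting many metrics, every reporting interval turns into a small write to a different file, which is a random-write workload for the disks.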
Solving the problem
Graphite + Clustering
https://grey-boundary.io/the-architecture-of-clustering-graphite/
This looks nice but do we really need moar machines?
Graphite + Remember the bottlenecks we had
●Carbon-cache reached 100% CPU on a single core (it's probably single threaded)
●Disks reached maximum IOPS capacity
carbon-cache
Graphite + Carbon-cache
●Persists metrics to disk and serves the hot cache to Graphite
●Python, single threaded
●So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister
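The Agent -> Cache -> Persister flow can be sketched roughly as follows. This is an illustration of the idea only, not go-carbon's or carbon-cache's actual code: the class, function and metric names are hypothetical, the "agent" is represented by plain `add` calls, and the persister is a callback.

```python
from collections import defaultdict, deque


class MetricCache:
    """In-memory cache: buffers incoming points per metric until they are persisted."""

    def __init__(self):
        self._points = defaultdict(deque)

    def add(self, metric, timestamp, value):
        # Called by the receiving side (the "agent") for every incoming point.
        self._points[metric].append((timestamp, value))

    def pop_largest(self):
        """Remove and return the metric with the most buffered points, or None if empty."""
        if not self._points:
            return None
        metric = max(self._points, key=lambda m: len(self._points[m]))
        return metric, list(self._points.pop(metric))


def persist_all(cache, write_batch):
    """Drain the cache, issuing one write per metric however many points it holds."""
    writes = 0
    while (item := cache.pop_largest()) is not None:
        metric, points = item
        write_batch(metric, points)
        writes += 1
    return writes


cache = MetricCache()
for ts in (0, 60, 120):
    cache.add("collectd.web01.cpu.user", ts, 0.5)
cache.add("collectd.web01.cpu.system", 0, 0.1)

written = {}
writes = persist_all(cache, lambda metric, points: written.update({metric: points}))
```

Four points end up as two writes here. Batching several points for the same file into one write is what lets the cache absorb bursts when the disk falls behind.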
Graphite + go-carbon
The result of replacing "carbon" with "go-carbon" on a server with a load of up to 900 thousand metrics per minute:
Reference: https://github.com/lomik/go-carbon
Graphite + go-carbon
Max IOPS reached
20% CPU
x500
Solving the IOPS bottleneck
Graphite + IOPS
RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway
SSD? Yes! But one wasn't enough :(
Hadoop inspiration! JBOD (no RAID)
Influx? No!
Graphite + We wanted this:
Load balancer: carbon-relay
Graphite + carbon-relay
●"Load balancer" between metric producers and go-carbon instances
●The same metric is always routed to the same go-carbon instance via a consistent hashing algorithm
●But… it is a single-threaded Python app, so your mileage may vary
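The routing idea can be sketched with a small consistent-hash ring. This is a generic illustration and not carbon-relay's exact ring implementation: the replica count, the use of MD5, and the instance names are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node is placed at several points on the ring."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric):
        """Return the node owning the first ring position at or after the metric's hash."""
        index = bisect.bisect_left(self._keys, self._hash(metric)) % len(self._keys)
        return self._ring[index][1]


# Hypothetical go-carbon instances, for example one per disk.
ring = HashRing(["go-carbon-a", "go-carbon-b", "go-carbon-c"])
target = ring.node_for("collectd.web01.cpu.user")
```

Because the mapping depends only on the metric name and the set of instances, every point for a given metric lands in the same instance, and therefore in the same file on the same disk.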
Graphite + IOPS
100% CPU :(
Graphite + carbon-relay
●We replaced it with carbon-c-relay: a very fast C implementation of carbon-relay (and much more)
Graphite + Carbon C relay
Graphite + (Some) Performance metrics
Go-carbon update operations
Stack CPU usage
What about statsd?
Graphite + statsd
Can we scale statsd out?
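Graphite + statsd
Who wins?
If we shard statsd, we end up with wrong data in Graphite: each shard flushes its own partial aggregate for the same metric, and the last write overwrites the others.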
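A small simulation of why sharding statsd gives wrong numbers: each instance aggregates only the events it saw, and all of them flush the same metric name for the same interval. The metric name and event counts are hypothetical, and the plain dict is a stand-in for Graphite storing one value per metric per time slot.

```python
from collections import Counter


def flush(events):
    """What one statsd instance would report for a flush interval: a count per metric."""
    return Counter(events)


# Six hypothetical increments of the same counter within one flush interval.
events = ["api.requests"] * 6

# One statsd instance sees every event.
single = flush(events)

# Two shards, events split round-robin: each shard sees only part of the stream.
shards = [flush(events[0::2]), flush(events[1::2])]

# One value per metric per time slot, so the last flush to arrive wins.
graphite = {}
for shard in shards:
    graphite.update(shard)
```

The single instance reports 6 for `api.requests`, while the sharded setup leaves 3 in storage, because the second shard's partial count overwrote the first instead of being added to it.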
Introducing statsite
C implementation of statsd (and much more)
Graphite + Statsite
●Wire compatible with statsd (drop-in replacement)
●Pure C with a tight event loop (very fast)
●Low memory footprint
●Supports quantiles, histograms and much more.
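"Wire compatible" refers to the statsd line format, `<name>:<value>|<type>` with an optional `|@<sample rate>` suffix. A minimal sketch of producing such lines follows; the helper and metric names are hypothetical, and actually sending the line over UDP or TCP is left out.

```python
def statsd_line(name: str, value, metric_type: str, sample_rate: float = 1.0) -> str:
    """Format one statsd line: '<name>:<value>|<type>[|@<sample_rate>]'."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # The sample rate tells the server how to scale the value back up.
        line += f"|@{sample_rate}"
    return line


counter = statsd_line("api.requests", 1, "c")      # counter increment
timer = statsd_line("api.latency", 320, "ms")      # timing in milliseconds
sampled = statsd_line("api.requests", 1, "c", sample_rate=0.1)
```

Since clients only emit these lines, swapping the server side should not require changing the applications that produce the metrics.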
Final setup
Graphite + Final setup
“Graphite box”
Don’t give up on Graphite!
Recap
Graphite + Pros
●Beast of a Graphite stack: peaked at 1M updates per minute, with room for more
●Very efficient: ~10% user-land CPU usage, which leaves more room for IRQs (disk, network)
●We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there.
Graphite + Cons
●SSDs are still expensive and wear out quickly under heavy random-write workloads - less relevant on AWS :-)
●Bugs - custom components are somewhat less field-tested.
●Data is not highly available with JBOD
●Doing metrics right is demanding - go SaaS!
Graphite + Some tuning tips
● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible
● High frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity
● High Carbon PPU (Points per update) signals I/O latency
● Tune go-carbon cache, especially if you alert on metrics
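On the PPU tip: points per update is the number of data points committed divided by the number of update (write) operations over the same period. A value near 1 means points are written almost as soon as they arrive; a rising value means points are piling up in the cache while waiting for the disk. The sketch below uses hypothetical one-minute samples and a hypothetical helper name.

```python
def points_per_update(committed_points: int, update_operations: int) -> float:
    """Average number of data points written per disk update operation."""
    if update_operations == 0:
        return 0.0
    return committed_points / update_operations


# Hypothetical one-minute samples.
healthy = points_per_update(committed_points=1_000_000, update_operations=1_000_000)
lagging = points_per_update(committed_points=1_000_000, update_operations=125_000)
```

In this example `healthy` is 1.0 and `lagging` is 8.0: in the second case each write had to carry eight queued points on average, which suggests the storage layer was not keeping up with the incoming rate.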
Links
● https://github.com/lomik/go-carbon
● https://github.com/grobian/carbon-c-relay
● https://github.com/statsite/statsite
● https://github.com/similarweb/puppet-go_carbon
● http://www.aosabook.org/en/graphite.html
We are hiring :-)
Thank You!