Scaling graphite to handle a zerg rush


December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform

danielb@similarweb.com


The problem

The problem: No metrics across the board

●Hard to debug issues

●No intuitive way to measure efficiency, usage

●Capacity planning?

●Dashboards

The problem: No metrics across the board

●We knew graphite

●We wanted statsd for applicative metrics

●And we heard that collectd is nice, so we installed it on 500 physical machines


Graphite


[Example graphs: write throughput across our Hadoop fleet; ingress traffic to our load balancing layer]

"Store numeric time series data"

"Render graphs of this data on demand"

Graphite: Architecture

Image from: github.com/graphite-project/graphite-web/blob/master/README.md


So why did it crash?

Max IOPS reached

Single Threaded

Graphite

●First setup - 2x 1TB magnetic drives @ RAID 1

●Volume peaked at ~300 IOPS

●Carbon-cache maxed the CPU


Graphite

●Why so many IOPS?

●Every metric is a separate file on the FS (see the arithmetic sketched below):

/var/data/graphite/collectd/{hostname}/cpu/user.wsp
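Back-of-the-envelope arithmetic shows why the first setup drowned. The per-host metric count and the 10-second collectd interval below are assumptions for illustration, not numbers from the talk:

    # One whisper (.wsp) file per metric means one small random write per metric
    # per flush - the classic "death by IOPS" pattern.
    hosts = 500                # physical machines running collectd (from the talk)
    metrics_per_host = 200     # assumption: a typical collectd plugin set
    interval_s = 10            # assumption: collectd's default reporting interval

    files = hosts * metrics_per_host
    writes_per_sec = files / interval_s    # worst case: one update per file per interval

    print(f"{files} whisper files, ~{writes_per_sec:.0f} random writes/sec")
    # -> 100000 whisper files, ~10000 random writes/sec,
    #    against a magnetic RAID 1 volume that peaked at ~300 IOPS.

Carbon-cache does coalesce several points per file into a single update when it falls behind, which is the only reason the box limped along at all, but the gap to ~300 IOPS is still enormous.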

Solving the problem

Graphite + Clustering

https://grey-boundary.io/the-architecture-of-clustering-graphite/


This looks nice but do we really need moar machines?

Graphite + Remember the bottlenecks we had

●Carbon-cache reached 100% CPU on a single core (it's probably single threaded)

●Disks reached maximum IOPS capacity


carbon-cache

Graphite + Carbon-cache

●Persists metrics to disk and serves not-yet-written points from its in-memory cache to graphite-web

●Python, single threaded

●So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister
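The Agent -> Cache -> Persister split is what keeps the disks sane: points accumulate in an in-memory cache keyed by metric name, and the persister writes each metric's whole backlog in one whisper update instead of one write per point. A toy sketch of the pattern (illustrative Python, not go-carbon's actual code):

    from collections import defaultdict, deque
    import threading
    import time

    cache = defaultdict(deque)     # metric name -> queued (timestamp, value) points
    lock = threading.Lock()

    def agent_receive(metric, timestamp, value):
        # Agent: accept a point and queue it - no disk I/O on the hot path.
        with lock:
            cache[metric].append((timestamp, value))

    def persister_loop(write_batch, interval=1.0):
        # Persister: drain each metric's queue and commit all of its points in a
        # single update (one random write per metric, not one per point).
        while True:
            with lock:
                batches = {m: list(q) for m, q in cache.items() if q}
                cache.clear()
            for metric, points in batches.items():
                write_batch(metric, points)    # e.g. whisper.update_many(path, points)
            time.sleep(interval)

go-carbon parallelizes the persister across several workers (the count is configurable), which is how the same workload spreads over all cores instead of saturating one.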


Graphite + go-carbon

The result of replacing "carbon" with "go-carbon" on a server with a load of up to 900 thousand metrics per minute:

Reference: https://github.com/lomik/go-carbon

Graphite + go-carbon

[Diagram: x500 metric producers -> go-carbon at ~20% CPU; disks still at max IOPS]


Solving the IOPS bottleneck

Graphite + IOPS

RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway

SSD? Yes! But one wasn't enough :(

Hadoop inspiration! JBOD (no RAID)

Influx? No!

Graphite + We wanted this:


Load balancer: carbon-relay

Graphite + carbon-relay

●"Load balancer" between metric producers and go-carbon instances

●The same metric is always routed to the same go-carbon instance via a consistent hashing algorithm (sketched below)

●But… it is a single-threaded Python app, so your mileage may vary
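The consistent-hashing step is the part worth understanding: the metric name alone decides which go-carbon instance (and therefore which disk) owns the series, so relays need no coordination. A toy hash ring in Python; the instance names are made up, and carbon/carbon-c-relay build their rings differently (carbon-c-relay's carbon_ch mode mimics carbon's), so treat this as the concept rather than the exact algorithm:

    import bisect
    import hashlib

    def build_ring(instances, replicas=100):
        # Give every instance many virtual points on the ring to smooth out the split.
        points = []
        for inst in instances:
            for i in range(replicas):
                h = int(hashlib.md5(f"{inst}:{i}".encode()).hexdigest(), 16)
                points.append((h, inst))
        return sorted(points)

    def route(ring, metric):
        # A metric is owned by the first instance clockwise from its own hash.
        h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
        idx = bisect.bisect(ring, (h,)) % len(ring)
        return ring[idx][1]

    ring = build_ring(["go-carbon-a:2003", "go-carbon-b:2003", "go-carbon-c:2003"])
    print(route(ring, "collectd.web-01.cpu.user"))   # same metric -> same instance, every time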

Graphite + IOPS

100% CPU :(

Graphite + carbon-relay

●We replaced it with carbon-c-relay: a very fast C implementation of carbon-relay (and much more)

Graphite + carbon-c-relay

Graphite + (Some) performance metrics

[Charts: go-carbon update operations; stack CPU usage]


What about statsd?

Graphite + statsd

Can we scale statsd out?

Graphite + statsd

Who wins?

If we shard statsd, we end up with wrong data in graphite.
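Concretely, the failure mode looks like this (the counts, metric name and timestamp are hypothetical): each statsd shard sees only part of the traffic for a counter, each flushes its own partial sum under the same Graphite path, and whisper keeps exactly one value per time slot, so the last shard to flush wins.

    # Two statsd shards counting the same metric independently during one flush window.
    shard_a = {"api.requests": 40}    # 40 increments happened to land on shard A
    shard_b = {"api.requests": 60}    # 60 increments landed on shard B

    timestamp = 1481457600            # both shards flush into the same time slot
    flushes = [
        ("stats.api.requests", shard_a["api.requests"], timestamp),
        ("stats.api.requests", shard_b["api.requests"], timestamp),
    ]

    # Whisper stores a single value per (metric, time slot): the later write overwrites
    # the earlier one, so Graphite reports 60 requests instead of the true 100.
    slot = {}
    for path, value, ts in flushes:
        slot[(path, ts)] = value
    print(slot)   # {('stats.api.requests', 1481457600): 60}

The usual ways out are to shard by metric name (so a given counter only ever reaches one aggregator) or to make a single aggregator fast enough that sharding is unnecessary - which is the route taken here.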


Introducing statsite: a C implementation of statsd (and much more)

Graphite + Statsite

●Wire compatible with statsd (drop-in replacement) - see the sketch after this list

●Pure C with a tight event loop (very fast)

●Low memory footprint

●Supports quantiles, histograms and much more.
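Wire compatibility means the application side does not change at all: clients keep firing plain statsd lines ("<name>:<value>|<type>") at the same UDP port, and only the daemon behind it is swapped. A minimal sketch; the hostname is an assumption and 8125 is just the conventional statsd port:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    statsite = ("statsite.example.com", 8125)    # wherever statsite is configured to listen

    # The same statsd wire format any existing client already emits:
    sock.sendto(b"api.requests:1|c", statsite)        # counter increment
    sock.sendto(b"api.latency_ms:23|ms", statsite)    # timer sample (feeds quantiles/histograms)
    sock.sendto(b"queue.depth:417|g", statsite)       # gauge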


Final setup

Graphite + Final setup

[Diagram of the final setup: the “Graphite box”]

Don’t give up on Graphite!

Recap

Graphite + Pros

●A beast of a graphite stack: peaked at 1M updates per minute, with room for more

●Very efficient: ~10% user-land CPU usage leaves more room for IRQs (disk, network)

●We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there.

Graphite + Cons

●SSDs are still expensive and wear out quickly under heavy random-write workloads - less relevant on AWS :-)

●Bugs - custom components are somewhat less field-tested.

●Data is not highly available with JBOD

●Doing metrics right is demanding - go SaaS!

Graphite + Some tuning tips

● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible

● High frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity

● High Carbon PPU (points per update) signals I/O latency (see the sketch after this list)

● Tune go-carbon cache, especially if you alert on metrics
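On the PPU tip: carbon-style daemons publish their own throughput counters, and dividing committed points by update operations over the same window gives the points-per-update figure. A sketch that pulls the two series back out of graphite-web; the carbon.agents.* metric paths below are assumptions - check what your go-carbon instances actually publish:

    import json
    import urllib.request

    RENDER = "http://graphite.example.com/render?format=json&from=-10min&target={}"

    def last_value(target):
        # Fetch one series from graphite-web and return its most recent non-null point.
        series = json.load(urllib.request.urlopen(RENDER.format(target)))
        return next(v for v, ts in reversed(series[0]["datapoints"]) if v is not None)

    # Assumed self-metric paths - adjust to whatever your carbon/go-carbon exposes.
    points  = last_value("carbon.agents.graphite-box-a.committedPoints")
    updates = last_value("carbon.agents.graphite-box-a.updateOperations")

    print(f"points per update: {points / updates:.1f}")
    # A climbing PPU means the cache is coalescing more and more points per write,
    # i.e. the disks are no longer keeping up with the incoming metric rate.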

Links

● https://github.com/lomik/go-carbon

● https://github.com/grobian/carbon-c-relay

● https://github.com/statsite/statsite

● https://github.com/similarweb/puppet-go_carbon

● http://www.aosabook.org/en/graphite.html

We are hiring :-)

Thank You!