Scaling Pinterest's Monitoring

Transcript of Scaling Pinterest's Monitoring

Brian Overstreet - Visibility Software Engineer

Monitorama Agenda

• What is Pinterest?

• Starting from Scratch

• Scaling the Monitoring System
  • Focused on time series metrics
  • Challenges faced

• The Missing Element

• Lessons Learned

• Summary


75+ Billion Ideas categorized by people into more than 1 Billion Boards

[Chart: Pinterest Unique Visitors (millions), Jan 2011 to Jan 2013. Source: comScore]

Tools

• Ganglia (system metrics)

• No application metrics

• Up/Down Checks

Early 2012


From Bad to Worse
Lots of Outages

Monitoring* Timeline
Time Series Tools

Pinterest Launched

Graphite Deployed

Ganglia for system metrics

2010 2011 2012 2013 2014 2015 2016

*The action of observing and checking the behavior and outputs of a system and its components over time.

First Graphite Architecture
Single Box — Early 2012

[Diagram: the application sends metrics over the statsd UDP protocol to a single Metrics Box running statsd-server, carbon-cache, and graphite-web]
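The statsd UDP protocol in this architecture is a simple text format. Below is a minimal sketch of a client in Python; the host and port are hypothetical placeholders, not Pinterest's actual endpoint:

```python
import socket

# Hypothetical endpoint standing in for the single Metrics Box.
METRICS_HOST, METRICS_PORT = "metrics-box.example.com", 8125

def format_metric(name, value, metric_type, sample_rate=1.0):
    """Encode one metric in the statsd line protocol: <name>:<value>|<type>[|@<rate>]."""
    line = "%s:%s|%s" % (name, value, metric_type)
    if sample_rate < 1.0:
        line += "|@%g" % sample_rate
    return line

def send_metric(sock, line, host=METRICS_HOST, port=METRICS_PORT):
    # Fire-and-forget UDP: there is no acknowledgment, so under load
    # packets are silently dropped (the packet loss described later).
    sock.sendto(line.encode("ascii"), (host, port))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
```

For example, `format_metric("web.requests", 1, "c")` produces the wire line `web.requests:1|c`.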

Second Graphite Architecture
Clustered — Early 2013

[Diagram: applications send metrics through haproxy to the statsd server, which forwards through two carbon-relays to three storage nodes, each running four carbon-cache instances plus graphite-web; a second haproxy balances read queries across the graphite-web instances]

Option #1: Put StatsD Everywhere

• Pros
  • Fixed packet loss
  • Unique metric names per host

• Cons
  • Unique metric names per host
  • Latency only calculated per host

statsd for everyone

[Diagram: a statsd instance runs alongside each application; all instances forward through haproxy to the carbon-relays]

Option #2: Sharded Statsd

• Pros
  • Metric name not needed to be unique by host
  • Fixed most packet loss issues for some time

• Cons
  • Shard mapping in client
  • Some statsd servers still would have packet loss
  • Shard mapping updating

statsd for different names

[Diagram: each application routes metric.a, metric.b, and metric.c to different statsd shards, which forward through haproxy to the carbon-relays]
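The "shard mapping in client" con can be made concrete: every client hashes each metric name to a statsd shard, and that mapping must stay in sync on every host. A sketch of the idea with assumed shard names (the real mapping and hash were Pinterest-internal):

```python
import hashlib

# Assumed shard list; in practice this has to be updated on every client
# whenever shards are added or removed ("shard mapping updating").
STATSD_SHARDS = ["statsd-1:8125", "statsd-2:8125", "statsd-3:8125"]

def shard_for(metric_name, shards=STATSD_SHARDS):
    # A stable hash means every application host routes the same metric
    # name to the same statsd server, so names can be shared across hosts.
    digest = hashlib.md5(metric_name.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

Because the mapping lives in every client, adding a shard reshuffles metric placement unless a consistent-hashing scheme is used instead.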

Multiple Graphite Clusters
everybody gets a cluster (mid 2013)

[Diagram: Application (python) → Statsd Servers (python) → Graphite Cluster (Python app); Application (java) → Statsd Servers (java) → Graphite Cluster (Java app)]

User Quote

• “Graphite isn't powerful enough to handle two globs in a request, so ‘obelix.pin.prod.*.*.metrics.coll.p99’ doesn't return anything most of the time. With just one glob it usually works, but it can be very slow.”

on querying metrics in Graphite


Monitoring* Timeline
Time Series Tools

Pinterest Launched

Graphite Deployed

Ganglia for system metrics

2010 2011 2012 2013 2014 2015 2016

OpenTSDB Deployed


User Quote

• “… convinced me to try out OpenTSDB, and I am VERY GLAD they did. The interface isn't perfect, but it does let you construct queries quickly, and the data is all there, easy to slice by tag and *fast*. I couldn't be happier, and it has saved me hours of frustration and confusion over the last few days while tracking down latency issues in our search clusters.”

on using OpenTSDB


Statsd still broken
never fixed real issue

Graphs are Just Wrong
too many metrics dropped

User Quotes

• “At this point I would give just about anything for a time-series database that I could trust. The numbers coming out of graphite from the client and server sides don't match, and neither of them match with the ganglia numbers.”

• “I don't know which to trust; even the shapes are different, so I'm no longer convinced that the relative changes are right. That makes it hard for me to tell if my theories are wrong, or the numbers are wrong, or both.”

on time series metrics


Replace Statsd Server

• Local metrics-agent

• Kafka

• Storm

by adding 3 new components


Metrics-agent

• Gatekeeper for time series data

• Interface for OpenTSDB and StatsD
  • different ports

• Sends metrics to Kafka

• Needed to convert to the Kafka pipeline with no downtime
  • Double write to existing StatsD servers and Kafka

everybody gets an agent

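The no-downtime cutover amounts to double-writing: the agent mirrors every point to both the legacy StatsD path and Kafka until the new pipeline is trusted. A minimal sketch with stand-in sink callables (not Pinterest's actual agent code):

```python
def make_double_writer(write_to_statsd, write_to_kafka):
    """Return a writer that mirrors every point to both pipelines."""
    def write(metric, value, timestamp):
        write_to_statsd(metric, value, timestamp)  # legacy path keeps serving
        write_to_kafka(metric, value, timestamp)   # new path fills in parallel
    return write

# Example: lists stand in for the real StatsD and Kafka sinks.
legacy, kafka = [], []
write = make_double_writer(
    lambda *point: legacy.append(point),
    lambda *point: kafka.append(point),
)
write("web.requests", 1, 1430000000)
```

Once the Kafka-fed graphs match the legacy ones, the StatsD write can be dropped without clients noticing.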

New Metrics Pipeline
lambda architecture (2015)

[Diagram: a local metrics-agent on each application host writes to Kafka; a Storm topology and a batch job both consume from Kafka and write to the Graphite clusters and OpenTSDB clusters]

Fixed Graphs
no more packet loss

Current Write Throughput

• Graphite
  • 120,000 points/second

• OpenTSDB
  • 1.5 million points/second

Graphite and OpenTSDB

Statsboard

• Integrates Graphite, Ganglia, OpenTSDB metrics

• Adds Graphite-like functions to OpenTSDB
  • asPercent
  • diffSeries
  • integral
  • sumSeries
  • etc.

Time Series Dashboards and Alerts
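As an illustration of what a Graphite-like function over OpenTSDB data means, here is a sketch of an asPercent-style transform over already-aligned series; Statsboard's real implementation is not public, and the data shape here is an assumption:

```python
def as_percent(series, total):
    """series, total: lists of (timestamp, value) pairs aligned on timestamps.
    Return each point of `series` as a percentage of the matching total."""
    return [
        (t, 100.0 * v / tv if tv else 0.0)  # guard against a zero total
        for (t, v), (_, tv) in zip(series, total)
    ]
```

For example, `as_percent([(0, 1), (60, 3)], [(0, 4), (60, 4)])` yields `[(0, 25.0), (60, 75.0)]`.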

Statsboard Config

• Dashboards

    - "Outbound QPS Metrics":
        - title: "Outbound QPS (by %)"
          metrics:
            - stat: metric_name_1

• Alerts

    Alert Name:
      threshold: metric > 100
      pagerduty: service_name

Yet Another YAML Config Format

The Missing Element
The users

User Quotes on Graphite

• “I'm not saying Graphite isn't evil. It's evil. I'm just saying that if you spend a week staring at it hard enough you can make some sense out of the madness :)”

• “I do not believe graphite is 'evil' since this is how RRD datasets have worked since 1999.”

• “I don't think anyone is complaining about rrdtool, which is as much at fault for Graphite as the Linux OS on which it runs. The problem is that you have to know a lot of things to get correct results from a Graphite plot, and none of those things are easy to find out (as John says, none of them appear on the data plot).”

Graphite is Evil?


What about OpenTSDB?
I thought users were happy.

OpenTSDB Aggregation

• “Something is wrong with OpenTSDB. My lines are often unnaturally straight. Can you fix it?”

What exactly is getting aggregated?


Graphite User Education

• What RRDs are and how to normalize across intervals

• Metric summarization into next interval

• Getting requests/second from a timer

• Difference between stats and stats_counts

• Should I use hitcount or integral to calculate totals?

Train Users on System


OpenTSDB User Education

• Getting data from continually incrementing counters

• Interpolation of data points

• How aggregation works

• Query Optimization

Train Users on System — OpenTSDB

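The first OpenTSDB lesson, getting data from continually incrementing counters, comes down to converting raw counter samples into per-second rates while handling resets. A sketch of the idea (not OpenTSDB's actual implementation):

```python
def counter_rates(samples):
    """samples: list of (timestamp_seconds, counter_value) in time order.
    Return per-second rates between consecutive samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter reset (e.g. process restart): assume it restarted
            # from zero, so the new value is the whole delta.
            delta = v1
        rates.append(delta / float(t1 - t0))
    return rates
```

For example, `counter_rates([(0, 0), (10, 100), (20, 150)])` gives `[10.0, 5.0]`; without the reset check, a restart would plot as a huge negative spike.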

What else have we learned?
Besides system architecture and doing user education

Protect System from Clients

• Alert on unique metrics

• Block metrics using Zookeeper

Must control incoming metrics


[Diagram: metrics-agents count incoming metrics by common prefix; an Alert on Prefix Count pages the on-call engineer, who updates a prefix block list in Zookeeper that the agents check before writing to OpenTSDB]
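The guard described above can be sketched as counting metrics by common prefix and dropping anything on a block list; the prefix depth and the in-memory set standing in for Zookeeper are both assumptions:

```python
from collections import Counter

PREFIX_DEPTH = 2  # assumed: first two dotted components form the common prefix

def prefix_of(metric_name):
    return ".".join(metric_name.split(".")[:PREFIX_DEPTH])

def filter_and_count(metric_names, block_list, counts):
    """Drop blocked metrics; tally prefixes so an alert can fire on explosions."""
    accepted = []
    for name in metric_names:
        prefix = prefix_of(name)
        counts[prefix] += 1  # the "alert on prefix count" check reads this
        if prefix not in block_list:
            accepted.append(name)
    return accepted
```

A misbehaving client that suddenly emits millions of unique names under one prefix trips the count alert, and adding that prefix to the block list stops the flood without redeploying agents.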

Trusting the Data

• Cannot control how users use the data
  • Do not want business decisions made off of wrong data

• Measuring data accuracy is hard
  • Count metrics generated vs. metrics written at every phase
  • There are many places a metric can be lost without anyone knowing it was lost

Need to measure data points lost

Lessen Aggregator Overhead

• StatsD performs network call to update a metric

• Manually tune sample rate to lessen overhead (time consuming)

• Java uses the Ostrich library for in-process aggregation

Ideally In Process

[Diagram: a Java application aggregates in process via Ostrich before sending to its local metrics-agent, while a StatsD client sends each update over the network to the metrics-agent]
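The difference between the two sides of the diagram is where aggregation happens. A sketch of the in-process idea, which Ostrich provides for Java (this class is illustrative, not Ostrich's API):

```python
import threading

class InProcessCounters:
    """Accumulate counters in memory and flush once per interval,
    replacing one network call per increment with one per flush."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, name, delta=1):
        with self._lock:
            self._counts[name] = self._counts.get(name, 0) + delta

    def flush(self):
        # Swap out the current counts; the caller sends this snapshot
        # to the local metrics-agent as a single batch.
        with self._lock:
            snapshot, self._counts = self._counts, {}
        return snapshot
```

This removes the need to hand-tune StatsD sample rates, since a hot counter costs an in-memory increment rather than a UDP packet per event.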

Lessen Operational Overhead

• More tools, more overhead

• Adding boxes to Graphite is hard

• Adding boxes to OpenTSDB is easy

• More monitoring systems, more monitoring of the monitoring system

• Removing a tool in production is hard • Ganglia, Graphite, and OpenTSDB all still running

• As the product gains more 9s of availability, the monitoring tools must too.

Fewer Tools?


Set User Expectations

• Data has a lifetime
  • Unless otherwise conveyed, most users expect data to exist indefinitely.

• Not magical data warehouse tools that return data instantly

• Not all metrics will be efficient

I didn’t expect this talk to go on so long


Summary

• Match the monitoring system to where the company is at

• User education is key to scale these tools organizationally

• Tools scale with the number of engineers, not the number of site users

Thanks for listening
