Graphite at CityGrid - LA DevOps April 2014
Graphite at CityGrid
if you can’t measure it, you can’t fix it
Wil Heitritter, Director, Tech Ops
Los Angeles DevOps, 2014/04/28
Magnum esse solem philosophus probabit, quantus sit mathematicus
(The philosopher will prove that the sun is large; the mathematician, how large it is)
- Seneca
Objectives
- Introduce Graphite to new users
- Show what we like, what we hate
- Present some interesting use-cases
- Generate discussion
Before Graphite
Ganglia
• Predictable interface
• Text “metrics” to store versions
• Slow
• Couldn’t pick and choose metrics to see
Why ganglia sucked
- Clusters had to be pre-configured
- Multicast vs. Unicast
- Data Retention
- Static Web Interface (can’t pick and choose)
- Static Host List
What did we think we wanted?
Ease of adding metrics
Ease of sending metrics
Powerful metric display
Retain ganglia-style cluster dashboards
Long-term configurable metric retention
Graphite!
What is Graphite?
a highly scalable real-time graphing system
which collects numeric time-series data
is managed by carbon
and stored as whisper files
and visualized through web interfaces
or queried via the API
http://graphite.wikidot.com/
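As a rough illustration of "queried via the API": graphite-web exposes a render endpoint that takes a target, a time range, and an output format. The sketch below only builds such a URL; the host name graphite.example.com and the helper name render_url are hypothetical.

```shell
# Build a render API URL that returns JSON for one target.
# Arguments: base URL of graphite-web, target, relative start time (default -1h).
render_url() {
    printf '%s/render?target=%s&from=%s&format=json\n' "$1" "$2" "${3:--1h}"
}

# Hypothetical host; the result could be fetched with:
#   curl -s "$(render_url http://graphite.example.com 'servers.rvw.*.LW_api_reviews_QPS' -6h)"
render_url http://graphite.example.com business.test.metric1
```

Targets that wrap metrics in functions contain parentheses and commas, so they are safer passed URL-encoded, for example with curl -G --data-urlencode "target=...".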
Graphite: what we like
Sending metrics is simple
Retrieving metrics is simple
Dashboard creation and sharing… is simple
Many functions()
120 million+ metric values received daily
Backfilling past metrics is simple
Expandable - different frontends
Graphite: what sucks
Dashboard ownership/promotion
No ganglia-like standard dashboard
Data retention… is NOT as simple as we thought
CityGrid’s Graphite
Implementation
Metric Naming
Business Metrics
- These metrics are not tied to any particular server
- Format: business.${hierarchical}.${path}.${here}.${metric}
- Example: business.ec2.testaccount.us-east-1a.OnDemand.running.m2.4xlarge
Metric Naming
Server Metrics
- These metrics are specific to a particular server (just like ganglia)
- Format: servers.${class}.${f_q_d_n}.${metric}
- Example: servers.rvw.aws1prdrvw1_subdom_cityg_com.LW_api_reviews_QPS
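The server example suggests that ${f_q_d_n} is the host's fully qualified domain name with its dots replaced by underscores, so the hostname stays a single path component. A minimal sketch of building such a name under that assumption; the function name and the dotted sample hostname are hypothetical.

```shell
# Build a server metric name: servers.<class>.<fqdn with dots as underscores>.<metric>
# Assumption: dots in the FQDN are replaced so that Graphite does not
# split the hostname into several path components.
server_metric_name() {
    class=$1
    fqdn=$2
    metric=$3
    safe_host=$(printf '%s' "$fqdn" | tr '.' '_')
    printf 'servers.%s.%s.%s\n' "$class" "$safe_host" "$metric"
}

server_metric_name rvw aws1prdrvw1.subdom.cityg.com LW_api_reviews_QPS
```

On many systems the second argument could come from hostname -f.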
Sending metrics
Sending directly from metric scripts
- /etc/graphite.conf
- May need to spread out sending if in volume
Collecting from gmond every minute
- Metrics are spread out to prevent spiking
- False data (gmond acts as a cache)
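The slides do not say how sending was spread out. One possible approach, sketched here as an assumption rather than as CityGrid's method, is to give each host a stable delay within the minute, derived from a checksum of its name. The helper name stagger_offset and the send_metrics command are hypothetical.

```shell
# Derive a stable per-host delay in [0, window) seconds from the hostname.
# cksum returns the same CRC for the same input, so each host keeps its slot.
stagger_offset() {
    host=$1
    window=${2:-60}
    crc=$(printf '%s' "$host" | cksum | cut -d ' ' -f 1)
    echo $(( crc % window ))
}

# In a cron-driven sender (hypothetical):
#   sleep "$(stagger_offset "$(hostname)")" && send_metrics
stagger_offset aws1prdrvw1.subdom.cityg.com
```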
Impact of staggered sending
Sending is simply...
echo $metric $value $timestamp | nc $relay $port
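Because each line carries its own timestamp, the same one-liner also covers backfilling, and several lines can share one connection. A small sketch under those assumptions; the helper names are hypothetical, and the constant value stands in for whatever historical data is being replayed.

```shell
# Format one plaintext-protocol line: "<metric path> <value> <unix timestamp>"
# The timestamp defaults to the current time.
graphite_line() {
    printf '%s %s %s\n' "$1" "$2" "${3:-$(date +%s)}"
}

# Emit one line per step for a past interval (e.g. backfilling).
# Arguments: metric, value, start timestamp, end timestamp, step in seconds.
backfill_lines() {
    ts=$3
    while [ "$ts" -le "$4" ]; do
        graphite_line "$1" "$2" "$ts"
        ts=$(( ts + $5 ))
    done
}

# All lines can be piped over a single connection:
#   backfill_lines business.test.metric1 1 1398643200 1398643500 60 | nc "$relay" "$port"
backfill_lines business.test.metric1 1 1398643200 1398643320 60
```

Whether old points are kept depends on the retention configured for the metric.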
Performance
carbon-cache/carbon-relay
SSD
replication within minutes
Maintenance
Changing retention
- whisper-auto-resize.py
Filling holes
- whisper-fill $source $destination
Backups
- Dashboards
- Metrics
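One reason retention changes need a resize tool is that a whisper file is preallocated for its full retention: every archive reserves a 12-byte slot per point, on top of a 16-byte file header and 12 bytes of header per archive. The sketch below estimates the file size for a retention layout; the layout used (one-minute points for 7 days, hourly points for a year) is a hypothetical example, not CityGrid's configuration.

```shell
# Estimate the on-disk size in bytes of one whisper file.
# Arguments: one or more archives as secondsPerPoint:retentionSeconds.
whisper_size() {
    total=16                                  # file metadata header
    for archive in "$@"; do
        spp=${archive%%:*}                    # seconds per point
        retention=${archive##*:}              # retention in seconds
        points=$(( retention / spp ))
        total=$(( total + 12 + points * 12 )) # archive header + point slots
    done
    echo "$total"
}

# Hypothetical layout: 60s points for 7 days, 1h points for 365 days
whisper_size 60:604800 3600:31536000
```

Multiplying the result by the number of metrics gives a first estimate of the disk space a retention change will need.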
Graphite Use-Cases
Single Metric
Combined Metrics
Key Metrics Dashboard
Examples of Key Metrics
- QPS
- Processing Time (Max/Mean/Distribution)
- Metrics about sub-requests
- Network usage
- CPU/load
Key Metrics Dashboard
Nagios Integration
check_graphite_target!highestMax(servers.mai.@[email protected]_map_return_code_5*_ratio, 1)!5!10
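The definition of check_graphite_target is not shown. Assuming the trailing !5!10 are warning and critical thresholds, the decision the check has to make can be sketched as below. The function name nagios_state and the choice of >= for the comparison are assumptions; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
# Map a metric value to a Nagios state and matching exit code.
# Assumption: the trailing !5!10 are warning and critical thresholds.
# awk does the comparison because POSIX shell arithmetic is integer-only.
nagios_state() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v >= c)      { print "CRITICAL"; exit 2 }
        else if (v >= w) { print "WARNING";  exit 1 }
        print "OK"; exit 0
    }'
}

nagios_state 7.5 5 10 || echo "exit code: $?"
```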
How about Pie Charts?
What NOT to do
Trying it out for yourself
Quick Setup
Install & Start
# pip install https://github.com/graphite-project/ceres/tarball/master
# pip install whisper
# pip install carbon
# pip install graphite-web
start it up...
send it a metric:
echo business.test.metric1 1 `date "+%s"` | nc localhost 2003
OK, it’s almost that easy...
Discussion