Metrics-Driven Engineering


Presented at Web 2.0 Expo, Oct. 13 2011


Mike Brittain @ mikebrittain

Director of Engineering, Infrastructure

Metrics-Driven Engineering

October 13, 2011

Tools and Process at Etsy

How do people use Etsy?

How many new visits?
How many listings created?
How many registrations?
How many convos sent?
How many purchases?
How many new shops?

What is the application doing?

Search indexing?
How fast are pages generating?
Async tasks currently in queue?
Developer API auth and rate limiting?
Images resized and stored?
Error and warning rates?

Are the servers in good shape?

Replication slave lag?
Memcache hits/misses?
Available connections?
Database queries per second?
Total outgoing bandwidth?
CPU, Memory, I/O?

Business Metrics

Application Metrics

System Metrics

Visibility EVERYWHERE

Constant Change

$314 Million GMS 2010

$180 Million GMS 2009

$87 Million GMS 2008

$26 Million GMS 2007

credit: pentarux (flickr)

25 Million Unique Visitors
1 Billion page views per month

credit: pentarux (flickr)

Engineering team grew 500% over 18 months

credit: martin_heigan (flickr)

Less talk, more do.

Always Be Shipping (even if it’s your first day)

credit: ibailemon (flickr)

90+ Engineers
40+ Deploys / day

credit: misswired (flickr)

credit: digidave (flickr)

Code Reviews

Automated Tests

Config Flags

$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);

Enable and disable features quickly

Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc.
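A minimal sketch of how a flag lookup with a percentage ramp-up might work, written in Python rather than the PHP on the slide. The `percent` state, the `new_checkout` flag, and the `bucket` and `feature_enabled` functions are assumptions for illustration, not Etsy's implementation.

```python
import hashlib

# Hypothetical flag table mirroring the $cfg array on the slide.
# The "percent" state and the "new_checkout" entry are assumed extensions.
CFG = {
    "checkout": {"enabled": "on"},
    "homepage": {"enabled": "on"},
    "profiles": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_checkout": {"enabled": "percent", "percent": 10},
}


def bucket(feature, user_id):
    """Map (feature, user) to a stable bucket in the range 0..99."""
    key = f"{feature}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def feature_enabled(feature, user_id=None, cfg=CFG):
    """Return True if the feature is switched on for this user."""
    flag = cfg.get(feature, {"enabled": "off"})  # unknown flags default to off
    state = flag["enabled"]
    if state == "on":
        return True
    if state == "percent" and user_id is not None:
        # The same user always lands in the same bucket, so a ramp-up
        # from 10 to 20 percent only adds users and never flips anyone off.
        return bucket(feature, user_id) < flag["percent"]
    return False
```

Hashing the feature name together with the user ID keeps each feature's ramp-up independent, so the same users are not always the first to see every new feature.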

Failure is not an option

Failure is inevitable!

Failure is a learning opportunity!

Failure is DETECTABLE!

Access

Detect problems quickly

CONFIDENCE

A: Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...

Engineers build the application

OPS

Logging
Graphing
Trending
Alerting

ENG

“Engineers are too busy writing features to build metrics.”

Metrics are part of every feature... and so are config flags

Dead Simple

Simple, open source tools

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Logging
Logster
StatsD

Ganglia

Cluster-oriented
Huge community contributed recipes
Custom metrics (gmetric)

Graphite

Single-instance
Create new metrics on-the-fly
Customize via URLs and display functions

Logging

It’s 2:48 PM.

Do you know where your logs are?

Logger::log_error("User login failed. Reason: $msg for $username", "login");

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
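To make the line format concrete, here is a small Python sketch that builds a line with the same bracketed fields: host, timestamp, level, namespace, and one more token. Treating that last token (`mk04gw1p71` in the sample) as a per-request identifier is an assumption, and `format_log_line` is a hypothetical helper, not the `Logger` class from the slide.

```python
import socket
from datetime import datetime


def format_log_line(level, namespace, message, request_id,
                    host=None, when=None):
    """Build one log line in the bracketed layout shown on the slide.

    request_id is assumed to be a per-request token; the slide does not
    say what the fourth bracketed field actually holds.
    """
    host = host or socket.gethostname()
    when = when or datetime.now()
    stamp = when.strftime("%a %b %d %H:%M:%S %Y")
    return f"{host} [{stamp}] [{level}] [{namespace}] [{request_id}] {message}"


if __name__ == "__main__":
    print(format_log_line("error", "login",
                          "User login failed. Reason: wrong password",
                          request_id="mk04gw1p71"))
```

Keeping the level and namespace in fixed positions is what lets a simple regular expression pick them out later.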

LogFormat "%h %l %u %t \"%r\" %>s %b" common

LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

apache_note()

grep "/listing/" access.log | \
  awk '{sum=sum+$(NF-2)} END {print sum/NR}'
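The same calculation can be sketched in Python. `$(NF-2)` is the third field from the end of each line, which would be `%{php_memory_usage_bytes}n` if the log follows the LogFormat above. The function name and its parameters are illustrative. Unlike the awk one-liner, this version skips lines where the field is not numeric instead of counting them as zero.

```python
def average_field(lines, needle="/listing/", offset_from_end=3):
    """Average a numeric field, counted from the end of each matching line.

    Counting from the end avoids trouble with earlier fields, such as the
    User-Agent, that contain spaces.
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) < offset_from_end:
            continue
        try:
            total += float(fields[-offset_from_end])
        except ValueError:
            continue  # e.g. "-" when the note was never set for the request
        count += 1
    return total / count if count else None


if __name__ == "__main__":
    with open("access.log") as handle:
        print(average_field(handle))
```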

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

[Graph: fatals, errors, warnings]

Logster

github.com/etsy

Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...

^.+? \[.+?\] \[(?P<log_level>.+?)\]

if (fields['log_level'] == "fatal"):
    self.fatals += 1
elif (fields['log_level'] == "error"):
    self.errors += 1
elif (fields['log_level'] == "warning"):
    self.warnings += 1

...

MetricObject("fatals", (self.fatals / self.duration), "per sec")
MetricObject("errors", (self.errors / self.duration), "per sec")
MetricObject("warning", (self.warnings / self.duration), "per sec")
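The fragments above can be pulled together into a standalone sketch. It does not import Logster; `ErrorLogCounter` and its method names are illustrative, and the rates come back as a plain dict rather than as `MetricObject` instances. The pattern uses non-greedy quantifiers so that the capture stops at the first bracketed field after the timestamp. With greedy `.+`, the sample line would capture a later bracketed field instead of the log level.

```python
import re

# Non-greedy form of the slide's pattern: host, [timestamp], [log_level].
LINE_RE = re.compile(r"^.+? \[.+?\] \[(?P<log_level>.+?)\]")


class ErrorLogCounter:
    """Count fatals, errors and warnings, then report per-second rates."""

    def __init__(self):
        self.fatals = 0
        self.errors = 0
        self.warnings = 0

    def parse_line(self, line):
        match = LINE_RE.match(line)
        if match is None:
            return  # ignore lines that do not fit the expected layout
        level = match.group("log_level")
        if level == "fatal":
            self.fatals += 1
        elif level == "error":
            self.errors += 1
        elif level == "warning":
            self.warnings += 1

    def get_state(self, duration):
        """Return rates over `duration` seconds since the last run."""
        return {
            "fatals": self.fatals / duration,
            "errors": self.errors / duration,
            "warning": self.warnings / duration,
        }
```

A cron-driven runner would feed only the lines added since the previous run and pass the elapsed seconds as `duration`.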

[Graph: fatals, errors, warnings]

StatsD

github.com/etsy

Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec

One line of code

StatsD::increment("logins.success");

[Graph: logins]

StatsD::timing("gearman.time", $msec);

[Graph: 90th pct, average, lower]
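What the two one-liners put on the wire is small. The sketch below is a minimal Python client, not the PHP `StatsD` class from the slides. It sends the StatsD text format over UDP: `name:value|c` for counters and `name:value|ms` for timers. The class name and the default host are assumptions. Port 8125 is the usual StatsD default.

```python
import socket


def format_counter(name, count=1):
    """StatsD counter datagram, e.g. 'logins.success:1|c'."""
    return f"{name}:{count}|c"


def format_timer(name, msec):
    """StatsD timer datagram, e.g. 'gearman.time:320|ms'."""
    return f"{name}:{int(round(msec))}|ms"


class StatsDClient:
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            # Fire and forget: UDP needs no connection and no reply.
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            pass  # a metrics failure should never break the request

    def increment(self, name, count=1):
        self._send(format_counter(name, count))

    def timing(self, name, msec):
        self._send(format_timer(name, msec))
```

Because the datagram is sent without waiting for an answer, instrumenting a code path costs almost nothing, which is part of why adding a metric can be a single line.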

Ad hoc

name value timestamp

echo "events.deploy.site 1 `date +%s`" \
  | nc graphite.etsycorp.com 2003
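The same ad hoc write can be sketched in Python: one line of `name value timestamp` sent over TCP to Graphite's plaintext listener (port 2003 in the command above). The function names are illustrative, and the host in the usage comment is a placeholder.

```python
import socket
import time


def graphite_line(name, value, timestamp=None):
    """Format one metric in Graphite's plaintext form: name value timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(name, value, host, port=2003, timestamp=None, timeout=2.0):
    """Open a TCP connection, send a single metric line, and close."""
    line = graphite_line(name, value, timestamp)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
    return line


# Example (placeholder host): record a deploy event at the current time.
# send_metric("events.deploy.site", 1, "graphite.example.com")
```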

Vertical Line Technology!

target=drawAsInfinite(events.deploy.site)

We could stare at graphs all day...

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
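The raw response is easy to read from a script: a header of `name,start,end,step`, a `|`, then comma-separated values, with `None` for points that have no data yet. A minimal parsing sketch follows. The function names are illustrative, and any threshold applied to the result would be an assumption.

```python
def parse_raw_data(text):
    """Parse Graphite rawData output into {name: {start, end, step, values}}."""
    series = {}
    for line in text.strip().splitlines():
        header, _, body = line.partition("|")
        # Split from the right: target names may themselves contain commas.
        name, start, end, step = header.rsplit(",", 3)
        values = [None if v == "None" else float(v) for v in body.split(",")]
        series[name] = {
            "start": int(start),
            "end": int(end),
            "step": int(step),
            "values": values,
        }
    return series


def latest_value(values):
    """Return the most recent point that actually has data."""
    for value in reversed(values):
        if value is not None:
            return value
    return None
```

A check script could call `latest_value` on a series and compare the result with a limit, which is one way to stop staring at graphs all day.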

Holt-Winters Confidence Bands

[Graph: lower, upper]

Holt-Winters Aberration

Business metrics
+ Confidence bands
_____________
Alertable metrics
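The idea of a band around an expected value can be sketched without the full Holt-Winters method. The code below is a deliberately simplified stand-in, not Holt-Winters: it uses single exponential smoothing with a smoothed absolute deviation and has no trend or seasonal terms. The function name and the `alpha`, `width`, and `warmup` parameters are assumptions for illustration.

```python
def confidence_bands(values, alpha=0.3, width=3.0, warmup=5):
    """Return (lower, upper, aberrant) for each point in the series.

    Simplified stand-in for Holt-Winters bands: the forecast and the
    deviation are exponentially smoothed, with no seasonality.
    """
    forecast = values[0]
    deviation = 0.0
    results = []
    for index, value in enumerate(values):
        lower = forecast - width * deviation
        upper = forecast + width * deviation
        # Skip the first few points while the band is still settling.
        aberrant = index >= warmup and not (lower <= value <= upper)
        results.append((lower, upper, aberrant))
        # Update the smoothed deviation and forecast with the new point.
        deviation = alpha * abs(value - forecast) + (1 - alpha) * deviation
        forecast = alpha * value + (1 - alpha) * forecast
    return results
```

Points flagged as aberrant are the ones worth alerting on. A seasonal method matters for business metrics with strong daily or weekly cycles, which this sketch would flag far too often.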

40,000+ metrics at Etsy
Systems, Applications, Business

Dashboards


<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Kind of Hard :-/

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);
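A helper like this mostly assembles the render URL so that nobody has to write it by hand. Below is a rough Python sketch of the idea. `GraphiteGraph`, its methods, and the example host are hypothetical and are not the PHP `Graphite` class shown above. Only the query parameters (`from`, `width`, `height`, `title`, `yMin`, `hideLegend`, `target`, `colorList`) are taken from the hand-written URL on the earlier slide.

```python
from html import escape
from urllib.parse import urlencode


class GraphiteGraph:
    """Hypothetical helper that builds Graphite render URLs and HTML."""

    def __init__(self, base_url, time_range="-1hours"):
        self.base_url = base_url.rstrip("/")
        self.time_range = time_range
        self.title = ""
        self.targets = []
        self.colors = []

    def set_title(self, title):
        self.title = title

    def add_metric(self, target, color):
        self.targets.append(target)
        self.colors.append(color)

    def add_event(self, target, color):
        # Events such as deploys are drawn as vertical lines.
        self.add_metric(f"drawAsInfinite({target})", color)

    def render_url(self, width, height, hide_legend=False):
        params = [("from", self.time_range), ("width", width),
                  ("height", height), ("title", self.title), ("yMin", 0)]
        if hide_legend:
            params.append(("hideLegend", 1))
        params += [("target", target) for target in self.targets]
        params.append(("colorList", ",".join(self.colors)))
        return f"{self.base_url}/render?{urlencode(params)}"

    def dashboard_html(self, width, height):
        """Small thumbnail that links to a larger version of the graph."""
        thumb = escape(self.render_url(width, height, hide_legend=True))
        full = escape(self.render_url(800, 600))
        return f'<a href="{full}"><img src="{thumb}"></a>'
```

Usage mirrors the PHP on the slide: create the graph, set a title, add a metric with a color, and print `dashboard_html(280, 220)`.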

Super Easy!

Metrics!
Metrics + Events
Metrics + Alerts
Metrics + Metrics

High-level, real-time visibility

Detect problems quickly

CONFIDENCE

Make them required features

Make them dead simple

Make them accessible

Make them!

Thank You

Homework

codeascraft.etsy.com
github.com/etsy

We’re always looking for people who are interested in this kind of stuff...

etsy.com/careers

Get in touch

mike @ etsy . com

@ mikebrittain