Metrics Driven Engineering
Metrics-Driven Engineering
Mike Brittain
ENGINEERING DIRECTOR, ETSY
@mikebrittain
PROCESS AND TOOLS
Supporting a culture of Continuous Deployment
How do people use Etsy?
• How many new visitors?
• How many listings created?
• How many registrations?
• How many messages sent?
• How many purchases?
• How many new shops?
How is the application behaving?
• Search indexing?
• How fast are pages generating?
• Async tasks currently in queue?
• Developer API auth and rate limiting?
• Images resized and stored?
• Error and warning rates?
Are the servers and network OK?
• Replication slave lag?
• Memcache hits/misses?
• Available connections?
• Database queries per second?
• Total outgoing bandwidth?
• CPU, Memory, I/O?
Business Metrics
Application Metrics
System Metrics
Visibility EVERYWHERE
Metrics help you identify goals... but also tell you when you’ve broken something.
Always Be Shipping
credit: ibailemon (flickr)
1st day: Put yourself on the web site.
2nd day: Complete tax, insurance, and benefits forms.
credit: ktpupp (flickr)
Dev Sandbox Trunk / master Production
You!
Test
7e9a814 -> 63a2bb3
Deploy to Production
50+ Deploys / day
200+ Committers
15 Product teams
8 Infrastructure teams
credit: misswired (flickr)
credit: digidave (flickr)
Peer Review
Code reviews, Architecture reviews, Operability reviews
Automated Tests
Static analysis, Unit tests, Integration tests, Functional tests
May 2013
$102.9 million in goods sold
1.37 billion page views
https://www.etsy.com/blog/news/2013/etsy-statistics-may-2013-weather-report/
Failure is not an option
Failure is inevitable
Failure is inevitable... and detectable!
Access
Q: Sounds like a lot of work, who’s going to build all of this?
A: Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...
Q: Sounds like a lot of work, who’s going to build all of this?
Engineers build the application
OPS
Logging, Graphing, Trending, Alerting
ENG
Metrics are part of every feature
(and so are config flags)
Make it DEAD SIMPLE
Simple, open-source tools
Ganglia (application, servers, network)
Graphite (application)
StatsD* (application)
Logster* (application, servers)
Cacti (network, SNMP)
FITB* (network)
Log formats (application, servers)
Nagios (alerting)
* github.com/etsy
Ganglia
Cluster-oriented
Huge collection of community-contributed recipes
Custom metrics (gmetric)
Graphite
Single-instance
Create new metrics on-the-fly
Customize via URLs and display functions
http://www.aosabook.org/en/graphite.html
Log Formats
Time, remote address, http method, request uri, referrer, user-agent, response size, response code, execution time, memory consumed, plus custom fields...
• Signed-in/out (user_id vs. “-”)
• Display mode (“desktop” vs. “mobile”)
• l10n/i18n (“en-US”)
• etc.
Access Logs
LogFormat "%l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{custom_field}n ..."
apache_note("custom_field", $whatever);
LogFormat "%{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{display_mode}n %{user_id}n %{php_bytes}n %{php_usec}n %D"
web0060 66.249.71.110 - - [11/May/2011:17:08:53 +0000] "GET /listing/12189259/tropical-etched-pair-of-lampwork-glass HTTP/1.1" 200 11034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" desktop - 13399576 505780 554876
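Once the custom fields are in the access log, pulling them back out is mostly a matter of one regular expression. The following is a minimal Python sketch, not Etsy's tooling: the field names follow the LogFormat above, the leading server name (web0060) is assumed to be prepended by whatever collects the logs, and parse_access_line is a hypothetical helper.

```python
import re

# Field order assumed from the LogFormat above. The leading server name is
# assumed to be added by log collection; it is not part of the LogFormat.
ACCESS_LINE = re.compile(
    r'^(?P<server>\S+) (?P<client_ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'(?P<display_mode>\S+) (?P<user_id>\S+) (?P<php_bytes>\d+) '
    r'(?P<php_usec>\d+) (?P<total_usec>\d+)$'
)

SAMPLE = (
    'web0060 66.249.71.110 - - [11/May/2011:17:08:53 +0000] '
    '"GET /listing/12189259/tropical-etched-pair-of-lampwork-glass HTTP/1.1" '
    '200 11034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)" desktop - 13399576 505780 554876'
)


def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ACCESS_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A user_id of "-" means the request came from a signed-out visitor.
    fields["signed_in"] = fields["user_id"] != "-"
    for key in ("status", "php_bytes", "php_usec", "total_usec"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    parsed = parse_access_line(SAMPLE)
    print(parsed["display_mode"], parsed["signed_in"], parsed["php_usec"])
```

With the fields parsed, it becomes straightforward to split page-generation time by display mode or by signed-in state.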
Logger::error("User login failed. Reason: $msg for $email_addr", "login");
Method name denotes log “level”: error, fatal, warning, notice, debug.
A “namespace” parameter is provided so we can aggregate log entries with similar concerns.
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for [email protected]
Fields, in order: server name, date and time, level, namespace, unique request ID
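To make the layout concrete, here is a small Python sketch that produces a line in the same shape as the sample above. It is an illustration only: format_log_line and new_request_id are hypothetical names, and the 10-character request ID is an assumption based on the IDs visible in the sample.

```python
import datetime
import socket
import uuid


def new_request_id():
    """Hypothetical request ID: 10 hex characters, generated once per request."""
    return uuid.uuid4().hex[:10]


def format_log_line(level, namespace, message, request_id,
                    server=None, now=None):
    """Build: server [date and time] [level] [namespace] [request id] message."""
    server = server or socket.gethostname()
    now = now or datetime.datetime.now()
    stamp = now.strftime("%a %b %d %H:%M:%S %Y")
    return f"{server} [{stamp}] [{level}] [{namespace}] [{request_id}] {message}"


if __name__ == "__main__":
    request_id = new_request_id()
    # Every line logged during one request carries the same request ID,
    # so entries from a single page view can be tied together later.
    print(format_log_line("warning", "login", "User login failed.", request_id))
    print(format_log_line("error", "login", "Session not created.", request_id))
```

Keeping level and namespace in fixed bracketed positions is what makes the lines cheap to count later.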
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] Invalid charset convertion
web0102 [Fri Mar 04 16:27:48 2011] [warning] [login] [47dd608551] User login failed. Reason
web0012 [Fri Mar 04 16:27:48 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0081 [Fri Mar 04 16:27:48 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0100 [Fri Mar 04 16:27:49 2011] [fatal] [register] [f9c2b23702] Invalid charset convertion
web0003 [Fri Mar 04 16:27:49 2011] [error] [register] [39e08e6692] Duplicate user ID encountered
web0050 [Fri Mar 04 16:27:49 2011] [error] [register] [2e468a9bb6] Duplicate user ID encountered
web0054 [Fri Mar 04 16:27:49 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:49 2011] [error] [login] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:49 2011] [error] [login] [47dd608551] Duplicate user ID encountered
web0012 [Fri Mar 04 16:27:49 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:49 2011] [fatal] [login] [mk04gw1p71] Invalid charset convertion
web0012 [Fri Mar 04 16:27:49 2011] [error] [login] [2f297b40a5] User login failed. Reason
web0025 [Fri Mar 04 16:27:49 2011] [warning] [register] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:49 2011] [warning] [register] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [warning] [register] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:50 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [2f297b40a5] User login failed. Reason
web0050 [Fri Mar 04 16:27:50 2011] [error] [login] [2e468a9bb6] User login failed. Reason
web0054 [Fri Mar 04 16:27:50 2011] [warning] [login] [mk04gw1p71] User login failed. Reason
web0200 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [f9c2b23702] User login failed. Reason
web0064 [Fri Mar 04 16:27:50 2011] [error] [subscribe] [47dd608551] Invalid charset convertion
web0012 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0041 [Fri Mar 04 16:27:50 2011] [fatal] [login] [mk04gw1p71] Invalid charset convertion
web0012 [Fri Mar 04 16:27:50 2011] [error] [register] [2f297b40a5] Duplicate user ID encountered
web0025 [Fri Mar 04 16:27:50 2011] [warning] [login] [32976da59c] User login failed. Reason
web0088 [Fri Mar 04 16:27:50 2011] [warning] [login] [2e468a9bb6] User login failed. Reason
web0050 [Fri Mar 04 16:27:51 2011] [warning] [login] [39e08e6692] User login failed. Reason
web0035 [Fri Mar 04 16:27:51 2011] [warning] [login] [2f297b40a5] User login failed. Reason
web0072 [Fri Mar 04 16:27:51 2011] [error] [login] [2f297b40a5] User login failed. Reason
FATALS ERRORS WARNINGS
Logster
github.com/etsy/logster
Run by cron (e.g. 1m intervals)
Keeps a cursor on your log file
Parse and aggregate values however you want
Output to Ganglia, Graphite, Amazon CloudWatch
Simple parsers
1. Pattern match on fields of interest
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password was submitted for [email protected]
^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]
2. Aggregate values (sum, average, percentile, etc.)
if fields['log_level'] == "fatal":
    self.fatals += 1
elif fields['log_level'] == "error":
    self.errors += 1
elif fields['log_level'] == "warning":
    self.warnings += 1
...
3. Send the values as “metric objects” to the collectors
MetricObject("fatals", (self.fatals / self.duration), "per sec")
MetricObject("errors", (self.errors / self.duration), "per sec")
MetricObject("warning", (self.warnings / self.duration), "per sec")
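The three steps fit in a few lines. Below is a self-contained Python sketch of the idea; it does not import Logster, the Metric tuple is a stand-in for a collector's metric object, and LevelCounter is a hypothetical name. The pattern uses negated character classes so that each bracketed field is matched on its own rather than greedily.

```python
import re
from collections import Counter, namedtuple

# Stand-in for the metric object a collector would accept (assumption).
Metric = namedtuple("Metric", "name value units")

# Step 1: match the server name, the bracketed date, then capture the level.
LINE = re.compile(r'^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]')


class LevelCounter:
    """Counts log lines per level and reports them as per-second rates."""

    def __init__(self):
        self.counts = Counter()

    def parse_line(self, line):
        match = LINE.match(line)
        if match:
            # Step 2: aggregate, here a simple count per level.
            self.counts[match.group("log_level")] += 1

    def get_state(self, duration):
        # Step 3: turn the counts into rates over the sampling interval.
        return [
            Metric("fatals", self.counts["fatal"] / duration, "per sec"),
            Metric("errors", self.counts["error"] / duration, "per sec"),
            Metric("warnings", self.counts["warning"] / duration, "per sec"),
        ]


if __name__ == "__main__":
    counter = LevelCounter()
    counter.parse_line(
        "web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] "
        "User login failed."
    )
    for metric in counter.get_state(duration=60):
        print(metric)
```

Dividing by the interval keeps the graphs comparable if the cron schedule ever changes.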
StatsD
github.com/etsy/statsd
Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code
StatsD::increment("logins.success");
Logins
StatsD::timing("profile.time", $msec);
90th pct
average
lower
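Under the hood, those one-liners send a tiny text datagram. As a rough Python sketch of what a client does, assuming the StatsD line format (name:value|c for counters, name:value|ms for timers) and the default port 8125; the function names here are hypothetical and this is not the PHP client from the slides:

```python
import socket


def counter_payload(name, delta=1):
    """Counter datagram, e.g. b'logins.success:1|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def timing_payload(name, msec):
    """Timer datagram, e.g. b'profile.time:320|ms'."""
    return f"{name}:{msec}|ms".encode("ascii")


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: no connection setup, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    send(counter_payload("logins.success"))
    send(timing_payload("profile.time", 320))
```

Because UDP is connectionless, a slow or missing StatsD daemon does not slow down the page that emits the metric, which is much of why instrumenting can stay a single line.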
Ad hoc
name value timestamp
echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
Vertical Line Technology!
target=drawAsInfinite(events.deploy.site)
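The same ad hoc event can be sent from a deploy script. This is a minimal Python sketch, assuming Graphite's plaintext listener on port 2003 as in the shell example above; send_event and deploy_marker_target are hypothetical helper names, and the host is whatever your Graphite server is called.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """One metric in 'name value timestamp' form, newline-terminated."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{name} {value} {ts}\n"


def send_event(name, host, port=2003, value=1, timestamp=None):
    """Open a TCP connection, write one line, and close."""
    line = plaintext_line(name, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


def deploy_marker_target(name):
    """Render target that draws each event as a vertical line."""
    return f"drawAsInfinite({name})"


if __name__ == "__main__":
    print(plaintext_line("events.deploy.site", 1), end="")
    print(deploy_marker_target("events.deploy.site"))
```

Calling send_event as the last step of a deploy is enough to get a vertical line on every graph that includes the marker target.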
User Logins
PHP Warnings
PHP Fatal Errors
250,000+ metrics at Etsy
Systems, Applications, Business
github.com/etsy/dashboard
Dashboards
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
Kind of Hard :-/
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);
Super Easy!
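What makes the helper easy is that it hides the URL assembly. Here is a Python sketch of that idea; it is not the code in github.com/etsy/dashboard. GraphSketch and its methods are hypothetical, the parameter names (from, width, height, title, yMin, hideLegend, target, colorList) are the ones visible in the hand-written URL above, and the base URL is a placeholder.

```python
from html import escape
from urllib.parse import urlencode


class GraphSketch:
    """Hypothetical analogue of the PHP Graphite helper shown above."""

    def __init__(self, base_url, time_range="-1hours"):
        self.base_url = base_url.rstrip("/")
        self.time_range = time_range
        self.title = ""
        self.targets = []
        self.colors = []

    def set_title(self, title):
        self.title = title

    def add_metric(self, target, color):
        self.targets.append(target)
        self.colors.append(color)

    def url(self, width, height, hide_legend=False):
        params = [("from", self.time_range), ("width", width),
                  ("height", height), ("title", self.title), ("yMin", 0)]
        if hide_legend:
            params.append(("hideLegend", 1))
        # Repeating the "target" key adds one series per metric.
        params += [("target", target) for target in self.targets]
        params.append(("colorList", ",".join(self.colors)))
        return f"{self.base_url}/render?{urlencode(params)}"

    def dashboard_html(self, width, height):
        # Small thumbnail that links to a larger version of the same graph.
        thumb = escape(self.url(width, height, hide_legend=True), quote=True)
        full = escape(self.url(800, 600), quote=True)
        return f'<a href="{full}"><img src="{thumb}"></a>'


if __name__ == "__main__":
    g = GraphSketch("http://graphite.example.com")
    g.set_title("File Not Found")
    g.add_metric("webs.errorLog.notExist", "#00cc00")
    print(g.dashboard_html(280, 220))
```

Deploy markers can be added the same way, by passing a drawAsInfinite(...) target to add_metric.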
But, you said...
“250,000+ metrics at Etsy”
Systems, Applications, Business
http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
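The raw response is easy to consume from a script. A minimal Python sketch, assuming the layout seen above (target, start, end, step, then a pipe and comma-separated values, with None for missing points); parse_raw_data and latest_value are hypothetical helpers:

```python
def parse_raw_data(text):
    """Parse rawData output into {target: {start, end, step, values}}."""
    series = {}
    for line in text.strip().splitlines():
        header, _, body = line.partition("|")
        # Split from the right: a target name may itself contain commas.
        name, start, end, step = header.rsplit(",", 3)
        values = [None if v == "None" else float(v) for v in body.split(",")]
        series[name] = {
            "start": int(start),
            "end": int(end),
            "step": int(step),
            "values": values,
        }
    return series


def latest_value(values):
    """Most recent data point that is not missing, or None."""
    for value in reversed(values):
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    raw = "webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,None"
    data = parse_raw_data(raw)["webs.errorLog.warning"]
    print(data["step"], latest_value(data["values"]))
```

A check script can then compare latest_value against a threshold and hand the result to the alerting system.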
Holt-Winters Confidence Bands
lower
upper
Holt-Winters Aberration
Business metrics
+ Confidence bands
_____________
Alertable metrics
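The sum above can be sketched in a few lines of Python. This does not compute Holt-Winters itself; it assumes the lower and upper bands have already been obtained (for example from Graphite) and only shows how a metric plus its bands becomes something alertable. The sign convention and the should_alert rule with its min_points parameter are assumptions for illustration.

```python
def aberration(values, lower, upper):
    """Distance outside the band per point: 0 inside, positive above, negative below."""
    deviations = []
    for value, low, high in zip(values, lower, upper):
        if value is None or low is None or high is None:
            deviations.append(None)  # missing data: no judgement possible
        elif value > high:
            deviations.append(value - high)
        elif value < low:
            deviations.append(value - low)
        else:
            deviations.append(0.0)
    return deviations


def should_alert(deviations, min_points=3):
    """Alert only if the last min_points points are all outside the band."""
    recent = [d for d in deviations[-min_points:] if d is not None]
    return len(recent) == min_points and all(d != 0 for d in recent)


if __name__ == "__main__":
    logins = [40, 42, 12, 9, 8]
    lower = [30, 30, 30, 30, 30]
    upper = [60, 60, 60, 60, 60]
    deviations = aberration(logins, lower, upper)
    print(deviations, should_alert(deviations))
```

Requiring several consecutive points outside the band is one simple way to avoid paging on a single noisy sample.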
Metrics!
Metrics + Events
Metrics + Alerts
Metrics + Metrics
High-level, real-time visibility
Detect problems early, and resolve them quickly.
Make them accessible
Make them required features
Make them dead simple
Merci!
These slides will be available at mikebrittain.com/talks
codeascraft.etsy.com
github.com/etsy
Say “Hello!”
[email protected]
@mikebrittain