Monitoring to the Nth (tier)...or, State of Distributed Tracing 2016
Dan Kuebrich, CTO, AppNeta, @dkuebric
Outline
● What is distributed tracing?
● Who’s doing it, and how?
● Challenges and future directions
● Frontend web app: PHP
● Text search: lucene-based, via thrift
● Pricing service: erlang, via thrift
● Content provider search: ruby, via thrift
● Spelling corrector: python bindings around xapian, via thrift
● ...
Thrift Shop
[Architecture diagram: perlbal front ends (fw1, fw2); Apache/PHP app servers (app1, ...); memcached caches; lucene search nodes; API search (ruby), pricing (erlang), and spelling (python) services; external APIs; MySQL databases (db1, db2)]
Q: Why do you remember this so well?
A: ops
“Close enough” architectural diagram
https://www.flickr.com/photos/clonedmilkmen/3604999084
Things we had
● Ganglia
● Nagios
● Thrift
○ Per-service status page
○ Service status page
● Logs
1. Hit refresh N times -- how many times were problematic?
2. Are any services outright down?
3. Systematically tail the logs of every service on every machine
4. Check mysql running processes
5. SSH in and poke around
6. Deploy debug logging
7. Pray
Sample debug workflow
X-Trace
Instrumentation points and request flow
[Diagram: load balancer, web servers, applications, database, service, cache, 3rd party API]
Spans
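A minimal sketch of what a span record might look like and how a request tree can be rebuilt from it. The field names (`trace_id`, `span_id`, `parent_id`) follow the general shape described in the Dapper paper; the sample values and the `render_tree` helper are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def render_tree(spans):
    """Return indented lines showing the parent/child structure of one trace."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append("  " * depth + f"{s.name} ({s.duration_ms:.0f} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative, made-up spans for a single request.
SPANS = [
    Span("t1", "a", None, "load balancer", 0, 120),
    Span("t1", "b", "a", "web server", 5, 115),
    Span("t1", "c", "b", "application", 10, 110),
    Span("t1", "e", "c", "database", 40, 90),
    Span("t1", "d", "c", "cache", 12, 15),
]
```

Calling `render_tree(SPANS)` yields one line per span, indented by depth, which is roughly the view a trace UI presents.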
Great minds… Distributed tracing based on ID propagation
● Google Dapper (200x? Published paper 2010)
● Twitter Zipkin (Open-sourced 2012)
● Etsy (2014ish)
● Others
Commercial APM -- some distributed tracing
● New Relic
● AppDynamics
● DynaTrace
Challenges: Instrumentation Points
def interesting_method():
    log_entry(...)
    _do_stuff()
    log_exit(...)
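Writing the entry/exit pair by hand at every instrumentation point gets old fast. One way it might be factored out is a decorator; this is only a sketch, and `traced`, `EVENTS`, and the event tuple layout are hypothetical names, not part of any tracing library.

```python
import functools
import time

# Hypothetical in-memory sink standing in for a real tracing backend.
EVENTS = []


def log_entry(name):
    EVENTS.append(("entry", name, time.monotonic()))


def log_exit(name):
    EVENTS.append(("exit", name, time.monotonic()))


def traced(func):
    """Wrap func so entry and exit events are recorded around every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log_entry(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            # Record the exit even if the wrapped function raises.
            log_exit(func.__name__)
    return wrapper


@traced
def interesting_method():
    return sum(range(1000))  # stand-in for _do_stuff()
```

The `try`/`finally` matters: without it, an exception would leave an entry with no matching exit.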
OpenTracing
● Problematic to tie instrumentation to tracing system
● There is no one system that’s perfect for everyone
● So instrumentation that ties you to a system is bad
● Either have it be automatically injected (industry)
● … or obey a common interface so it’s pluggable
● OpenTracing v1 goal: provide the interface for portable instrumentation
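To make "common interface, pluggable backend" concrete, here is a rough sketch. The `Tracer` protocol below is a hypothetical two-method interface invented for this example; it is not the actual OpenTracing API.

```python
from typing import Protocol


class Tracer(Protocol):
    """Hypothetical minimal interface that instrumentation codes against."""
    def log_entry(self, name: str) -> None: ...
    def log_exit(self, name: str) -> None: ...


class NoopTracer:
    """Default backend: instrumentation stays in place but does nothing."""
    def log_entry(self, name):
        pass

    def log_exit(self, name):
        pass


class RecordingTracer:
    """Example backend that keeps events in memory."""
    def __init__(self):
        self.events = []

    def log_entry(self, name):
        self.events.append(("entry", name))

    def log_exit(self, name):
        self.events.append(("exit", name))


_tracer = NoopTracer()


def set_tracer(tracer):
    """Swap the backend without touching any instrumented code."""
    global _tracer
    _tracer = tracer


def interesting_method():
    _tracer.log_entry("interesting_method")
    try:
        return 42  # stand-in for real work
    finally:
        _tracer.log_exit("interesting_method")
```

The instrumented function never names a concrete tracing system, so changing vendors means one `set_tracer(...)` call rather than re-instrumenting.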
Challenges: Trace ID Propagation
def http_rpc_call():
    log_entry(...)
    _do_get(modified_headers, ...)
    log_exit(...)

def interesting_method(trace_id):
    log_entry(trace_id, ...)
    _do_stuff()
    log_exit(trace_id, ...)
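Threading `trace_id` through every function signature is invasive. One possible alternative, sketched here under assumptions, keeps the ID in ambient context and injects it into outgoing headers. The header name `X-Trace-Id` and both helper functions are hypothetical; `contextvars` is the standard-library module.

```python
import contextvars
import uuid

# Hypothetical header name chosen for illustration.
TRACE_HEADER = "X-Trace-Id"

_current_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_or_continue_trace(incoming_headers):
    """Adopt the caller's trace ID if present, else start a new trace."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers(headers=None):
    """Return a copy of headers with the current trace ID injected."""
    modified = dict(headers or {})
    trace_id = _current_trace_id.get()
    if trace_id is not None:
        modified[TRACE_HEADER] = trace_id
    return modified
```

An RPC wrapper would call `outgoing_headers(...)` just before issuing the request, and the receiving tier would call `start_or_continue_trace(...)` on the way in.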
Challenges: Extracting Value
Distributed tracing “only”
● Follow request flow through application
● Understand end-to-end latency
● Associate backend load with frontend requests
● Provide errors with distributed context
But... as long as you’re in there...
● Latency of queries, RPC calls, in each tier
● Slow code
● Cache hit/miss ratio
● Errors and exceptions
● Custom tagging/categorization of data
● ...
Rich data set
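As a rough sketch of mining that data set: once spans carry a tier name, a duration, and optional tags, per-tier latency and cache hit ratio fall out of simple aggregation. The record layout and the sample values below are made up for illustration.

```python
from collections import defaultdict

# Illustrative, made-up span records; the keys are assumptions.
SPANS = [
    {"tier": "mysql", "duration_ms": 40.0},
    {"tier": "mysql", "duration_ms": 60.0},
    {"tier": "memcached", "duration_ms": 2.0, "cache_hit": True},
    {"tier": "memcached", "duration_ms": 4.0, "cache_hit": False},
    {"tier": "memcached", "duration_ms": 3.0, "cache_hit": True},
    {"tier": "thrift", "duration_ms": 25.0, "error": True},
]


def mean_latency_by_tier(spans):
    """Average span duration for each tier."""
    totals = defaultdict(lambda: [0.0, 0])  # tier -> [sum, count]
    for s in spans:
        entry = totals[s["tier"]]
        entry[0] += s["duration_ms"]
        entry[1] += 1
    return {tier: total / n for tier, (total, n) in totals.items()}


def cache_hit_ratio(spans):
    """Fraction of cache lookups that hit, or None if there were none."""
    lookups = [s for s in spans if "cache_hit" in s]
    if not lookups:
        return None
    return sum(s["cache_hit"] for s in lookups) / len(lookups)


def error_count(spans):
    """Number of spans tagged as errors."""
    return sum(1 for s in spans if s.get("error"))
```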
Context propagation: beyond performance
● Baggage
● Deadlines
● Auth/load attribution
● Flow control?
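A small sketch of how baggage and deadlines might combine: the deadline rides along as one baggage entry, and each tier checks it before doing work. The key name, helper functions, and exception are hypothetical, and the sketch assumes hosts have reasonably synchronized wall clocks.

```python
import time

DEADLINE_KEY = "deadline"  # hypothetical baggage key (absolute epoch seconds)


class DeadlineExceeded(Exception):
    pass


def with_deadline(baggage, timeout_s, now=None):
    """Return a copy of baggage whose deadline is at most timeout_s from now."""
    now = time.time() if now is None else now
    new = dict(baggage)
    deadline = now + timeout_s
    # Never extend a deadline that an upstream tier already set.
    if DEADLINE_KEY in new:
        deadline = min(deadline, new[DEADLINE_KEY])
    new[DEADLINE_KEY] = deadline
    return new


def remaining_s(baggage, now=None):
    """Seconds left before the deadline, or None if no deadline is set."""
    now = time.time() if now is None else now
    if DEADLINE_KEY not in baggage:
        return None
    return baggage[DEADLINE_KEY] - now


def check_deadline(baggage, now=None):
    """Raise DeadlineExceeded if the request has already run out of time."""
    left = remaining_s(baggage, now)
    if left is not None and left <= 0:
        raise DeadlineExceeded("request deadline has passed")
```

A downstream service can call `check_deadline(...)` before starting expensive work, so it does not compute a response nobody is waiting for.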
OFFICE HOURS
3pm
MORE INFO
Booth #713 & back of the room
@dkuebric
Thanks!