Monitoring to the Nth (tier)...or, State of Distributed Tracing 2016

Dan Kuebrich, CTO, AppNeta (@dkuebric)

Outline

● What is distributed tracing?

● Who’s doing it, and how?

● Challenges, and future directions?

Thrift Shop

● Frontend web app: PHP

● Text search: Lucene-based, via Thrift

● Pricing service: Erlang, via Thrift

● Content provider search: Ruby, via Thrift

● Spelling corrector: Python bindings around Xapian, via Thrift

● ...

[Diagram components: perlbal front ends (fw1, fw2, ...), Apache/PHP app servers (app1, ...), memcached caches, Lucene search nodes, API search (Ruby), pricing (Erlang), spelling (Python), external APIs, MySQL databases (db1, db2)]

Q: Why do you remember this so well?

A: ops

“Close enough” architectural diagram

https://www.flickr.com/photos/clonedmilkmen/3604999084

Things we had

● Ganglia

● Nagios

● Thrift

○ Per-service status page

○ Service status page

● Logs

Sample debug workflow

1. Hit refresh N times -- how many times were problematic?

2. Are any services outright down?

3. Systematically tail the logs of every service on every machine

4. Check MySQL running processes

5. SSH in and poke around

6. Deploy debug logging

7. Pray

X-Trace

Instrumentation points and request flow

[Diagram components: load balancer, web server and application (three pairs), database, service, cache, 3rd party API]

Great minds… Distributed tracing based on ID propagation

● Google Dapper (200x? Published paper 2010)

● Twitter Zipkin (open-sourced 2012)

● Etsy (2014ish)

● Others

Commercial APM -- some distributed tracing

● New Relic

● AppDynamics

● DynaTrace

Challenges: Instrumentation Points

def interesting_method():
    log_entry(...)
    _do_stuff()
    log_exit(...)
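
One way to avoid hand-writing the entry and exit calls in every method is a decorator. The following is a minimal sketch, not taken from the talk: `traced` and the in-memory `EVENTS` list are hypothetical names, and the list stands in for whatever collector a real tracing system would use.

```python
import functools
import time

EVENTS = []  # hypothetical in-memory sink standing in for a real trace collector


def log_entry(name):
    EVENTS.append(("entry", name, time.monotonic()))


def log_exit(name):
    EVENTS.append(("exit", name, time.monotonic()))


def traced(func):
    """Wrap func so entry and exit are recorded even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log_entry(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            log_exit(func.__name__)
    return wrapper


@traced
def interesting_method():
    return sum(range(10))  # placeholder for _do_stuff()
```

The `try`/`finally` matters: without it, an exception in the wrapped method would leave an entry event with no matching exit.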

OpenTracing

● Problematic to tie instrumentation to tracing system

● There is no one system that’s perfect for everyone

● So instrumentation that ties you to a system is bad

○ Either have it be automatically injected (industry)

○ … or obey a common interface so it’s pluggable

● OpenTracing v1 goal: provide the interface for portable instrumentation
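
The "common interface" idea can be sketched as follows. This is an illustration only and is not the actual OpenTracing API: the `Tracer` protocol, `NoopTracer`, and `RecordingTracer` are hypothetical names invented here. The point is that the instrumented code depends on the interface, so the backend can be swapped without touching it.

```python
from typing import Protocol


class Tracer(Protocol):
    """Hypothetical minimal interface; not the actual OpenTracing API."""

    def start_span(self, name: str) -> None: ...

    def finish_span(self, name: str) -> None: ...


class NoopTracer:
    """Backend that discards everything, e.g. when tracing is disabled."""

    def start_span(self, name):
        pass

    def finish_span(self, name):
        pass


class RecordingTracer:
    """Backend that keeps events in memory, standing in for a real system."""

    def __init__(self):
        self.events = []

    def start_span(self, name):
        self.events.append(("start", name))

    def finish_span(self, name):
        self.events.append(("finish", name))


def interesting_method(tracer: Tracer) -> str:
    # The instrumentation only knows about the interface, not the backend.
    tracer.start_span("interesting_method")
    try:
        return "done"
    finally:
        tracer.finish_span("interesting_method")
```

Calling `interesting_method(NoopTracer())` and `interesting_method(RecordingTracer())` runs the same instrumented code against two different backends.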

Challenges: Trace ID Propagation

def http_rpc_call():
    log_entry(...)
    _do_get(modified_headers, ...)
    log_exit(...)

def interesting_method(trace_id):
    log_entry(trace_id, ...)
    _do_stuff()
    log_exit(trace_id, ...)
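
A rough sketch of how the ID could travel in request headers, under stated assumptions: the header name `X-Trace-Id` and the helpers `inject`, `extract`, and `handle_request` are hypothetical, and a direct function call stands in for the real HTTP request. Real HTTP header names are case-insensitive, which this plain-dict version ignores.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration


def inject(headers, trace_id):
    """Client side: return a copy of the outgoing headers carrying the trace ID."""
    modified = dict(headers)
    modified[TRACE_HEADER] = trace_id
    return modified


def extract(headers):
    """Server side: reuse the caller's trace ID, or start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def handle_request(headers, log):
    trace_id = extract(headers)
    log.append((trace_id, "entry"))
    # ... the service's real work would happen here ...
    log.append((trace_id, "exit"))
    return trace_id


def http_rpc_call(trace_id, log):
    log.append((trace_id, "rpc entry"))
    modified_headers = inject({"Accept": "application/json"}, trace_id)
    # A direct function call stands in for the HTTP request in this sketch.
    handle_request(modified_headers, log)
    log.append((trace_id, "rpc exit"))
```

Because both tiers log the same ID, their events can later be joined into one trace even though they were recorded by different processes.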

Challenges: Extracting Value

Distributed tracing “only”

● Follow request flow through application

● Understand end-to-end latency

● Associate backend load with frontend requests

● Provide errors with distributed context
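
To make "follow request flow" and "end-to-end latency" concrete, here is a small sketch that rebuilds a request tree from span records sharing one trace. The record layout `(span_id, parent_id, tier, start_ms, end_ms)` and the sample values are assumptions made for illustration, not data from the talk.

```python
from collections import defaultdict

# Hypothetical span records: (span_id, parent_id, tier, start_ms, end_ms)
SPANS = [
    ("a", None, "web server", 0, 120),
    ("b", "a", "application", 5, 115),
    ("c", "b", "cache", 10, 12),
    ("d", "b", "database", 15, 95),
]


def build_tree(spans):
    """Return the root span and a parent_id -> children mapping."""
    children = defaultdict(list)
    root = None
    for span in spans:
        parent_id = span[1]
        if parent_id is None:
            root = span
        else:
            children[parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Render the tree as indented lines, children ordered by start time."""
    span_id, _, tier, start, end = span
    lines = [f"{'  ' * depth}{tier} ({end - start} ms)"]
    for child in sorted(children[span_id], key=lambda s: s[3]):
        lines.extend(render(child, children, depth + 1))
    return lines
```

The root span's duration is the end-to-end latency; the indented children show where that time went.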

But... as long as you’re in there...

● Latency of queries, RPC calls, in each tier

● Slow code

● Cache hit/miss ratio

● Errors and exceptions

● Custom tagging/categorization of data

● ...
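
Two of the metrics above fall out of trace data almost for free. The sketch below assumes events have already been collected as `(tier, duration_ms, tags)` tuples; that layout, the `hit` tag, and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical events already collected from traces: (tier, duration_ms, tags)
EVENTS = [
    ("database", 80, {}),
    ("database", 20, {}),
    ("cache", 2, {"hit": True}),
    ("cache", 3, {"hit": False}),
    ("cache", 1, {"hit": True}),
    ("cache", 2, {"hit": True}),
]


def latency_by_tier(events):
    """Mean duration in milliseconds for each tier."""
    durations = defaultdict(list)
    for tier, duration, _ in events:
        durations[tier].append(duration)
    return {tier: sum(d) / len(d) for tier, d in durations.items()}


def cache_hit_ratio(events):
    """Fraction of cache lookups that were hits, or None if there were none."""
    lookups = [tags["hit"] for tier, _, tags in events if tier == "cache"]
    return sum(lookups) / len(lookups) if lookups else None
```

A mean is used here only for brevity; percentiles are usually more informative for latency.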

Rich data set

Context propagation: beyond performance

● Baggage

● Deadlines

● Auth/load attribution

● Flow control?
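
Baggage and deadlines can ride on the same propagated context as the trace ID. The following sketch uses Python's `contextvars` to carry both within a process; `start_request`, `remaining_s`, and `call_downstream` are hypothetical names, and the cross-process hop is only indicated by a comment.

```python
import contextvars
import time

# Hypothetical request context: a deadline plus free-form baggage.
_deadline = contextvars.ContextVar("deadline", default=None)
_baggage = contextvars.ContextVar("baggage", default={})


def start_request(timeout_s, **baggage):
    """Record an absolute deadline and the baggage for the current request."""
    _deadline.set(time.monotonic() + timeout_s)
    _baggage.set(dict(baggage))


def remaining_s():
    """Seconds left before the deadline, or None if no deadline is set."""
    deadline = _deadline.get()
    return None if deadline is None else deadline - time.monotonic()


def call_downstream():
    left = remaining_s()
    if left is not None and left <= 0:
        raise TimeoutError("deadline exceeded before downstream call")
    # A real system would forward the remaining budget and baggage with the RPC.
    return {"timeout_s": left, "baggage": dict(_baggage.get())}
```

Forwarding the remaining budget rather than the original timeout lets each tier give up as soon as the caller no longer cares about the answer.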

OFFICE HOURS

3pm

MORE INFO

Booth #713 & back of the room

@dkuebric

Thanks!