(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Transcript of (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DVO205
The AdRoll Monitoring Evolution: From Flying Blind to Flying by Instrument
Brian Troutwine, AdRoll
Ilan Rabinovitch, Datadog
October 2015
Today’s speakers
Ilan Rabinovitch
Dir. Technical Community
Datadog
Brian Troutwine
Sr. Software Engineer
AdRoll
Quick Overview of
Datadog
• Monitoring for modern applications
• Dynamic Infrastructure
• Microservices
• Time series storage of metrics and events
• 100s of built-in integrations
• e.g., EC2, ELB, ECS, and more
CAMS
Culture
Automation
Metrics
Sharing
(Damon Edwards and John Willis)
CAMS
Culture
Automation
METRICS
SHARING
You’re in the cloud and it’s everything you dreamed of!
• Autoscaling
• Container orchestration
• Infinite storage
In cloud we trust.
But how do we verify health?
If it moves, monitor it.
How does our current monitoring fit in?
• Host-centric
• Static configurations tracking dynamic infrastructure
• Focused on resources, rather than work
• Difficult to pull together and compare data from
multiple sources
How does our current monitoring fit in?
Recurse until you find root cause
Query-based monitoring
• Aggregates matter because the underlying infrastructure is dynamic
• Express our monitors or alerts as queries on predicates:
• “avg response time for requests to hosts running nginx > 500 ms”
• “min # of hosts running nginx < 3”
• Mash up data sources for a 360-degree view of a problem
Query-based monitoring
“Show me iowait across nginx hosts, grouped by availability zone”
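A minimal sketch of what those predicate queries evaluate to, assuming a toy in-memory metric store (real systems such as Datadog express these in their own query language; the `metrics` list and `hosts` helper here are invented for illustration):

```python
# Toy illustration of query-based monitoring: monitors are predicates over
# tagged metric samples, not checks bound to individual hosts.
from statistics import mean

# Latest samples per host, tagged by role and availability zone (invented data).
metrics = [
    {"host": "i-01", "role": "nginx", "az": "us-east-1a", "resp_ms": 420},
    {"host": "i-02", "role": "nginx", "az": "us-east-1a", "resp_ms": 610},
    {"host": "i-03", "role": "nginx", "az": "us-east-1b", "resp_ms": 380},
]

def hosts(role):
    return [m for m in metrics if m["role"] == role]

# "avg response time for requests to hosts running nginx > 500 ms"
avg_alert = mean(m["resp_ms"] for m in hosts("nginx")) > 500

# "min # of hosts running nginx < 3"
count_alert = len(hosts("nginx")) < 3

# "Show me iowait across nginx hosts, grouped by availability zone"
# becomes a group-by over the same tags:
by_az = {}
for m in hosts("nginx"):
    by_az.setdefault(m["az"], []).append(m["resp_ms"])
```

Because the monitor is a query over tags, hosts can come and go under autoscaling without anyone editing a per-host check.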
Real-time bidding
The problem domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
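Those constraints can be sanity-checked with Little's law: in-flight work ≈ arrival rate × latency bound. A back-of-envelope sketch (the numbers are from the slide; the calculation is ours):

```python
# Little's law back-of-envelope: in-flight transactions ~= rate * latency.
peak_tps = 2_000_000   # ~2 million transactions per second at peak
deadline_s = 0.100     # firm 100 ms reply deadline

in_flight = peak_tps * deadline_s  # ~200,000 transactions in flight at once
```

Roughly 200,000 concurrent transactions at peak: at that scale, intuition alone cannot tell you whether the fleet is healthy.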
In the early days of the
AdRoll real-time bidding
(RTB) project, we could
use our intuition.
• The system was simple.
• The number of total requests was small.
• The impact of mistakes was minor.
We could be reasonably
confident that our mental
model of the system’s
behavior was accurate.
The trouble with a
complex system is that its
behavior in practice gets
away from you pretty fast.
Our first approach was
to batch process logs
generated by
individual bidders.
Batch processing
Pros:
• We were already doing this.
• It’s simple to implement.
• It’s straightforward to
conceive.
Batch processing
Cons:
• High update latency
• Catastrophic errors lose logs
• Denies impulse
experimentation
Our second approach
was to generate coarse
real-time metrics and
analyze those.
Coarse real-time metrics
Pros:
• Iterative step up from
batch processing
• Proves out the concept
• Simple to implement
Coarse real-time metrics
Cons:
• Still relied on intuition
• Bidder implementation was
sub-optimal
• Dashboards were one-size-
fits-all approach
By this point, the complexity of the
system and our ambitions were
growing.
• Two engineers were added to the
team.
• Tens more in the department.
• RTB became a central project.
We were making
decisions in a
knowledge void.
At this point, we have
AWS CloudWatch.
CloudWatch
reports the basic
health of your
system.
CloudWatch provides the total view of the AWS services you’re using.
What we don’t have at
this point is a detailed
view of our system.
What we don’t have is
the ability to explore the
information we have,
especially in high-
stress situations.
Exometer solves our
Erlang-side problem.
Detailed application-
level instrumentation is
cheap and easy.
Datadog solves
our aggregation,
visualization, and
alerting problems.
Datadog integrates with
CloudWatch. Our system-
specific metrics can be
correlated with the basic
health of the system.
This can be done in real time.
Correlation of system information
This can be done in
high-stress situations.
Correlation of system information
This can be done by
other departments of
the business.
Correlation of system information
A bid “times out” when we don’t reply to the exchange within 100 ms.
Timeout spikes
We didn’t realize this
was happening. It’s an
early win of our
sophisticated monitoring.
Timeout spikes
System load is normal.
There’s not a periodic
spike in bid request traffic.
Timeout spikes
There is a correlated
jump in network
traffic, however.
Timeout spikes
There were also
correlated spikes in
the Erlang VM’s
process run queue.
Timeout spikes
• VM scheduler threads are locked to CPUs
• A CPU-intensive background process kicks on every 20 minutes
• No CPU shield on the server
• A VM scheduler thread gets kicked from its assigned CPU, and its run queue backs up
Timeout spikes
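The classic mitigation for this class of problem is a CPU shield: confine the background job to a reserved core so it cannot evict the pinned scheduler threads. A minimal sketch, assuming Linux (`os.sched_setaffinity` exists only there); this illustrates the idea, not AdRoll's actual remediation:

```python
# Illustrative CPU shield: restrict a CPU-hungry background process to one
# reserved core so it cannot evict the Erlang VM's pinned scheduler threads.
import os

RESERVED_CORE = 0  # core set aside for background work (assumed layout)

def shield(pid, core=RESERVED_CORE):
    """Pin `pid` to a single core; pid 0 means the calling process."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return None
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)

mask = shield(0)
```

With the background job shielded, the VM's remaining cores are never contended for, and the periodic run-queue spikes disappear.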
Failure of bid-request traffic is an all-hands problem.
Traffic crash
Without traffic, the
bidders can do nothing.
Traffic crash
That’s a healthy couple
of days' worth of traffic.
It dips in the night, and
climbs in the day.
Traffic crash
This is a weekend’s
worth of traffic lost.
Traffic crash
• Confirmed with CloudWatch that networking to the machines was fine
• No changes had been made to the production system (it was a looser time)
• All detailed metrics from the Erlang VM were acceptable
Traffic crash
The exchange
confirmed a drop in
traffic from their system.
Traffic crash
Turns out, we hit an
implicit exchange
limitation.
Traffic crash
We also became more
conscientious about
alerting effectively.
Traffic crash
At high scale, it’s very easy to be over-provisioned for the system’s load.
Sophisticated autoscaling
Worse, it’s very easy to be under-provisioned for system load.
Sophisticated autoscaling
All CloudWatch alarms
on EC2 instances can
be pressed into service
for autoscaling.
Sophisticated autoscaling
Our first autoscaling
approach used
remaining idle CPU.
Sophisticated autoscaling
As traffic drops off at
the end of the day,
we need less CPU
time to process it.
Sophisticated autoscaling
This was great! We
immediately saved
loads of money.
Sophisticated autoscaling
Problem was, it’s an
indirect measurement.
There’s always some
nuance you’ll miss.
Sophisticated autoscaling
Co-resident subsystems
eat into the CPU time,
giving an inaccurate
impression.
Sophisticated autoscaling
CPU consumption
carries no information
about aberrant
system issues.
Sophisticated autoscaling
What can be done?
Sophisticated autoscaling
Distill the performance
capability of your
system into a single
signal.
Sophisticated autoscaling
The "metadata index"
tracks the load on the
bidders. It’s a weighted
sum of key metrics.
Sophisticated autoscaling
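The talk does not disclose the actual formula, so purely as a sketch, with invented metric names, peaks, and weights, a weighted-sum index might look like:

```python
# Hypothetical "metadata index": one load signal distilled from several
# bidder metrics. Metric names, peaks, and weights are all invented here.
WEIGHTS = {"bid_requests_per_sec": 0.5, "avg_response_ms": 0.3, "run_queue_depth": 0.2}
PEAKS = {"bid_requests_per_sec": 2_000_000, "avg_response_ms": 100.0, "run_queue_depth": 50}

def metadata_index(sample):
    # Normalize each metric against its expected peak, then weight and sum,
    # so the index stays roughly in [0, 1].
    return sum(w * min(sample[k] / PEAKS[k], 1.0) for k, w in WEIGHTS.items())

sample = {"bid_requests_per_sec": 900_000, "avg_response_ms": 40, "run_queue_depth": 5}
index = metadata_index(sample)
```

The point of the distillation is that one number, not a dashboard of ten, becomes the scaling signal.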
As traffic drops, the
metadata index
drops. Indirectly, idle
CPU increases.
Sophisticated autoscaling
We emit this
metadata index into
CloudWatch as a
custom metric.
Sophisticated autoscaling
As soon as it hits
CloudWatch, you
can autoscale on it.
Sophisticated autoscaling
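With boto3, emitting the index is one `put_metric_data` call. The sketch below injects the client so it runs without AWS credentials; the namespace and metric name are illustrative, not AdRoll's actual names:

```python
# Publish a custom metric in the shape CloudWatch's put_metric_data expects;
# an alarm on RTB/Bidders:MetadataIndex can then drive an Auto Scaling policy.
def publish_index(client, value, namespace="RTB/Bidders", name="MetadataIndex"):
    client.put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": name, "Value": value, "Unit": "None"}],
    )

class StubClient:
    """Stand-in for boto3.client("cloudwatch"); records calls for inspection."""
    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)

stub = StubClient()
publish_index(stub, 0.365)
```

In production the first argument would be `boto3.client("cloudwatch")`, and a CloudWatch alarm on the metric feeds the Auto Scaling group.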
This is twice as efficient
as the CPU idle scaling
signal. One-half the
number of machines.
Sophisticated autoscaling
There’s a lot of fraud in
the online advertising
industry.
Anti-fraud CookieBouncer
A certain kind of “hot
cookie” fraud
caused a tolerable
fault in the bidders.
Anti-fraud CookieBouncer
CookieBouncer blocks bidding on fraudulent “hot” cookies in real time.
Anti-fraud CookieBouncer
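One plausible shape for such a blocker, sketched here with invented thresholds (the talk says only that blocking happens in real time and that the settings must be tunable), is a per-cookie sliding-window rate limit:

```python
# Hypothetical CookieBouncer-style blocker: count bid requests per cookie
# in a sliding window, refuse to bid once a cookie exceeds the threshold.
from collections import defaultdict, deque

class CookieBouncer:
    def __init__(self, max_hits=100, window_s=60.0):
        self.max_hits = max_hits   # tunable: hits allowed per window
        self.window_s = window_s   # tunable: sliding window length (seconds)
        self._hits = defaultdict(deque)

    def allow_bid(self, cookie, now):
        hits = self._hits[cookie]
        # Drop hits that have aged out of the window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False  # "hot" cookie: likely fraud, don't bid
        hits.append(now)
        return True

bouncer = CookieBouncer(max_hits=3, window_s=60.0)
decisions = [bouncer.allow_bid("cookie-A", t) for t in (0, 1, 2, 3)]
# The fourth hit inside the window is blocked; the cookie recovers once
# its earlier hits age out of the window.
```

Keeping `max_hits` and `window_s` adjustable at runtime is exactly what made it safe to start conservative and tighten while watching the dashboards.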
Our concern was
blocking too much traffic,
in turn blocking
legitimate bids through
over-aggressive tuning.
Anti-fraud CookieBouncer
We built a new
CookieBouncer
dashboard and introduced
the ability to tune it in real
time on every bidder.
Anti-fraud CookieBouncer
We rolled CookieBouncer
out with conservative
settings and started
adjusting, keeping tabs
on the key indicators.
Anti-fraud CookieBouncer
We adjusted and were
very surprised at the total
number of blocked
cookies and the
percentage of total traffic.
Anti-fraud CookieBouncer
The instrumentation
speaks for itself.
Anti-fraud CookieBouncer
Learn more at…
DVO204 - Monitoring Strategies: Finding Signal in the Noise
Thursday, Oct 8, 11:00 AM - 12:00 PM
or http://bit.ly/1Qo4Zmy
Thank you!
Remember to complete
your evaluations!