(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Transcript of (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DVO205
The AdRoll Monitoring Evolution: From Flying Blind to Flying by Instrument
Brian Troutwine, AdRoll
Ilan Rabinovitch, Datadog
October 2015
Today’s speakers
Ilan Rabinovitch
Dir. Technical Community
Datadog
Brian Troutwine
Sr. Software Engineer
AdRoll
Quick Overview of
Datadog
• Monitoring for modern applications
• Dynamic Infrastructure
• Microservices
• Time series storage of metrics and events
• 100s of built-in integrations
• e.g., EC2, ELB, ECS, and more
CAMS
Culture
Automation
Metrics
Sharing
(Damon Edwards and John Willis)
CAMS
Culture
Automation
METRICS
SHARING
You’re in the cloud and it’s everything you dreamed of!
• Autoscaling
• Container orchestration
• Infinite storage
In cloud we trust.
But how do we verify health?
If it moves, monitor it.
How does our current monitoring fit in?
• Host-centric
• Static configurations tracking dynamic infrastructure
• Focused on resources, rather than work
• Difficult to pull together and compare data from
multiple sources
How does our current monitoring fit in?
Recurse until you find root cause
Query-based monitoring
• Aggregates matter because the underlying infrastructure is dynamic
• Express our monitors or alerts as queries on predicates:
• “avg response time for requests to hosts running nginx > 500 ms”
• “min # of hosts running nginx < 3”
• Mash up data sources for a 360-degree view of a problem
Query-based monitoring
“Show me iowait across nginx hosts, grouped by availability zone”
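A minimal sketch of what those predicate queries evaluate to, assuming a toy in-memory metric store (real systems such as Datadog express these in their own query language; the `metrics` list and `hosts` helper here are invented for illustration):

```python
# Toy illustration of query-based monitoring: monitors are predicates over
# tagged metric samples, not checks bound to individual hosts.
from statistics import mean

# Latest samples per host, tagged by role and availability zone (invented data).
metrics = [
    {"host": "i-01", "role": "nginx", "az": "us-east-1a", "resp_ms": 420},
    {"host": "i-02", "role": "nginx", "az": "us-east-1a", "resp_ms": 610},
    {"host": "i-03", "role": "nginx", "az": "us-east-1b", "resp_ms": 380},
]

def hosts(role):
    return [m for m in metrics if m["role"] == role]

# "avg response time for requests to hosts running nginx > 500 ms"
avg_alert = mean(m["resp_ms"] for m in hosts("nginx")) > 500

# "min # of hosts running nginx < 3"
count_alert = len(hosts("nginx")) < 3

# "Show me iowait across nginx hosts, grouped by availability zone"
# becomes a group-by over the same tags:
by_az = {}
for m in hosts("nginx"):
    by_az.setdefault(m["az"], []).append(m["resp_ms"])
```

Because the monitor is a query over tags, hosts can come and go under autoscaling without anyone editing a per-host check.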
Real-time bidding
The problem domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
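Those constraints can be sanity-checked with Little's law: in-flight work ≈ arrival rate × latency bound. A back-of-envelope sketch (the numbers are from the slide; the calculation is ours):

```python
# Little's law back-of-envelope: in-flight transactions ~= rate * latency.
peak_tps = 2_000_000   # ~2 million transactions per second at peak
deadline_s = 0.100     # firm 100 ms reply deadline

in_flight = peak_tps * deadline_s  # ~200,000 transactions in flight at once
```

Roughly 200,000 concurrent transactions at peak: at that scale, intuition alone cannot tell you whether the fleet is healthy.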
In the early days of the
AdRoll real-time bidding
(RTB) project, we could
use our intuition.
• The system was simple.
• The number of total requests was small.
• The impact of mistakes was minor.
We could be reasonably
confident that our mental
model of the system’s
behavior was accurate.
The trouble with a
complex system is that its
behavior in practice gets
away from you pretty fast.
Our first approach was
to batch process logs
generated by
individual bidders.
Batch processing
Pros:
• We were already doing this.
• It’s simple to implement.
• It’s straightforward to
conceive.
Batch processing
Cons:
• High update latency
• Catastrophic errors lose logs
• Denies impulse
experimentation
Our second approach
was to generate coarse
real-time metrics and
analyze those.
Coarse real-time metrics
Pros:
• Iterative step up from
batch processing
• Proves out the concept
• Simple to implement
Coarse real-time metrics
Cons:
• Still relied on intuition
• Bidder implementation was
sub-optimal
• Dashboards were one-size-
fits-all approach
By this point, the complexity of the
system and our ambitions were
growing.
• Two engineers were added to the
team.
• Tens more in the department.
• RTB became a central project.
We were making
decisions in a
knowledge void.
At this point, we have
AWS CloudWatch.
CloudWatch
reports the basic
health of your
system.
CloudWatch provides the total view of the AWS services you’re using.
What we don’t have at
this point is a detailed
view of our system.
What we don’t have is
the ability to explore the
information we have,
especially in high-
stress situations.
Exometer solves our
Erlang-side problem.
Detailed application-
level instrumentation is
cheap and easy.
Datadog solves
our aggregation,
visualization, and
alerting problems.
Datadog integrates with
CloudWatch. Our system-
specific metrics can be
correlated with the basic
health of the system.
This can be done in real time.
Correlation of system information
This can be done in
high-stress situations.
Correlation of system information
This can be done by
other departments of
the business.
Correlation of system information
A bid “times out” when we don’t reply to the exchange within 100 ms.
Timeout spikes
We didn’t realize this
was happening. It’s an
early win of our
sophisticated monitoring.
Timeout spikes
System load is normal.
There’s not a periodic
spike in bid request traffic.
Timeout spikes
There is a correlated
jump in network
traffic, however.
Timeout spikes
There were also
correlated spikes in
the Erlang VM’s
process run queue.
Timeout spikes
• VM scheduler threads are locked to CPUs
• A CPU-intensive background process kicks on every 20 minutes
• No CPU shield on the server
• A VM scheduler thread gets kicked from its assigned CPU, and its run queue backs up
Timeout spikes
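The classic mitigation for this class of problem is a CPU shield: confine the background job to a reserved core so it cannot evict the pinned scheduler threads. A minimal sketch, assuming Linux (`os.sched_setaffinity` exists only there); this illustrates the idea, not AdRoll's actual remediation:

```python
# Illustrative CPU shield: restrict a CPU-hungry background process to one
# reserved core so it cannot evict the Erlang VM's pinned scheduler threads.
import os

RESERVED_CORE = 0  # core set aside for background work (assumed layout)

def shield(pid, core=RESERVED_CORE):
    """Pin `pid` to a single core; pid 0 means the calling process."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return None
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)

mask = shield(0)
```

With the background job shielded, the VM's remaining cores are never contended for, and the periodic run-queue spikes disappear.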
Failure of bid-request traffic is an all-hands problem.
Traffic crash
Without traffic, the
bidders can do nothing.
Traffic crash
That’s a healthy couple
of days' worth of traffic.
It dips in the night, and
climbs in the day.
Traffic crash
This is a weekend’s
worth of traffic lost.
Traffic crash
• Confirmed with CloudWatch that networking to the machines was fine
• No changes had been made to the production system (it was a looser time)
• All detailed metrics from the Erlang VM were acceptable
Traffic crash
The exchange
confirmed a drop in
traffic from their system.
Traffic crash
Turns out, we hit an
implicit exchange
limitation.
Traffic crash
We also became more
conscientious about
alerting effectively.
Traffic crash
At high scale, it’s very easy to be over-provisioned for the system’s load.
Sophisticated autoscaling
Worse, it’s very easy to be under-provisioned for system load.
Sophisticated autoscaling
All CloudWatch alarms
on EC2 instances can
be pressed into service
for autoscaling.
Sophisticated autoscaling
Our first autoscaling
approach used
remaining idle CPU.
Sophisticated autoscaling
As traffic drops off at
the end of the day,
we need less CPU
time to process it.
Sophisticated autoscaling
This was great! We
immediately saved
loads of money.
Sophisticated autoscaling
Problem was, it’s an
indirect measurement.
There’s always some
nuance you’ll miss.
Sophisticated autoscaling
Co-resident subsystems
eat into the CPU time,
giving an inaccurate
impression.
Sophisticated autoscaling
CPU consumption
carries no information
about aberrant
system issues.
Sophisticated autoscaling
What can be done?
Sophisticated autoscaling
Distill the performance
capability of your
system into a single
signal.
Sophisticated autoscaling
The "metadata index"
tracks the load on the
bidders. It’s a weighted
sum of key metrics.
Sophisticated autoscaling
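The talk does not disclose the actual formula, so purely as a sketch, with invented metric names, peaks, and weights, a weighted-sum index might look like:

```python
# Hypothetical "metadata index": one load signal distilled from several
# bidder metrics. Metric names, peaks, and weights are all invented here.
WEIGHTS = {"bid_requests_per_sec": 0.5, "avg_response_ms": 0.3, "run_queue_depth": 0.2}
PEAKS = {"bid_requests_per_sec": 2_000_000, "avg_response_ms": 100.0, "run_queue_depth": 50}

def metadata_index(sample):
    # Normalize each metric against its expected peak, then weight and sum,
    # so the index stays roughly in [0, 1].
    return sum(w * min(sample[k] / PEAKS[k], 1.0) for k, w in WEIGHTS.items())

sample = {"bid_requests_per_sec": 900_000, "avg_response_ms": 40, "run_queue_depth": 5}
index = metadata_index(sample)
```

The point of the distillation is that one number, not a dashboard of ten, becomes the scaling signal.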
As traffic drops, the
metadata index
drops. Indirectly, idle
CPU increases.
Sophisticated autoscaling
We emit this
metadata index into
CloudWatch as a
custom metric.
Sophisticated autoscaling
As soon as it hits
CloudWatch, you
can autoscale on it.
Sophisticated autoscaling
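With boto3, emitting the index is one `put_metric_data` call. The sketch below injects the client so it runs without AWS credentials; the namespace and metric name are illustrative, not AdRoll's actual names:

```python
# Publish a custom metric in the shape CloudWatch's put_metric_data expects;
# an alarm on RTB/Bidders:MetadataIndex can then drive an Auto Scaling policy.
def publish_index(client, value, namespace="RTB/Bidders", name="MetadataIndex"):
    client.put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": name, "Value": value, "Unit": "None"}],
    )

class StubClient:
    """Stand-in for boto3.client("cloudwatch"); records calls for inspection."""
    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)

stub = StubClient()
publish_index(stub, 0.365)
```

In production the first argument would be `boto3.client("cloudwatch")`, and a CloudWatch alarm on the metric feeds the Auto Scaling group.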
This is twice as efficient
as the CPU idle scaling
signal. One-half the
number of machines.
Sophisticated autoscaling
There’s a lot of fraud in
the online advertising
industry.
Anti-fraud CookieBouncer
A certain kind of “hot
cookie” fraud
caused a tolerable
fault in the bidders.
Anti-fraud CookieBouncer
CookieBouncer blocks bidding on fraudulent “hot” cookies in real time.
Anti-fraud CookieBouncer
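One plausible shape for such a blocker, sketched here with invented thresholds (the talk says only that blocking happens in real time and that the settings must be tunable), is a per-cookie sliding-window rate limit:

```python
# Hypothetical CookieBouncer-style blocker: count bid requests per cookie
# in a sliding window, refuse to bid once a cookie exceeds the threshold.
from collections import defaultdict, deque

class CookieBouncer:
    def __init__(self, max_hits=100, window_s=60.0):
        self.max_hits = max_hits   # tunable: hits allowed per window
        self.window_s = window_s   # tunable: sliding window length (seconds)
        self._hits = defaultdict(deque)

    def allow_bid(self, cookie, now):
        hits = self._hits[cookie]
        # Drop hits that have aged out of the window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False  # "hot" cookie: likely fraud, don't bid
        hits.append(now)
        return True

bouncer = CookieBouncer(max_hits=3, window_s=60.0)
decisions = [bouncer.allow_bid("cookie-A", t) for t in (0, 1, 2, 3)]
# The fourth hit inside the window is blocked; the cookie recovers once
# its earlier hits age out of the window.
```

Keeping `max_hits` and `window_s` adjustable at runtime is exactly what made it safe to start conservative and tighten while watching the dashboards.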
Our concern was
blocking too much traffic,
in turn blocking
legitimate bids through
over-aggressive tuning.
Anti-fraud CookieBouncer
We built a new
CookieBouncer
dashboard and introduced
the ability to tune it in real
time on every bidder.
Anti-fraud CookieBouncer
We rolled CookieBouncer
out with conservative
settings and started
adjusting, keeping tabs
on the key indicators.
Anti-fraud CookieBouncer
We adjusted and were
very surprised at the total
number of blocked
cookies and the
percentage of total traffic.
Anti-fraud CookieBouncer
The instrumentation
speaks for itself.
Anti-fraud CookieBouncer
Learn more at…
DVO204 - Monitoring Strategies: Finding Signal in the Noise
Thursday, Oct 8, 11:00 AM - 12:00 PM
or http://bit.ly/1Qo4Zmy
Thank you!
Remember to complete
your evaluations!