Events and metrics the Lifeblood of Webops

Events & Metrics: The Lifeblood Of Webops. Alexis Lê-Quôc (Product Guy) at Datadog. NYCBUG, July 6th, 2011


Transcript of Events and metrics the Lifeblood of Webops

Page 1: Events and metrics the Lifeblood of Webops

Events & Metrics
The Lifeblood Of Webops

Alexis Lê-Quôc (Product Guy) at Datadog
NYCBUG

July 6th, 2011

Page 2: Events and metrics the Lifeblood of Webops

I <3 BSD
‣OpenBSD user since 2.8 (pf)
‣Love the documentation
‣m0n0wall/pfSense
‣ZFS-envy

Page 3: Events and metrics the Lifeblood of Webops

What I’m going to talk about
‣Briefly, what we do and for whom
‣Where we started
‣The kind of data we deal with
‣How it all fits together
‣A few things we learned along the way
‣Q+A

Page 4: Events and metrics the Lifeblood of Webops
Page 5: Events and metrics the Lifeblood of Webops

What We Do

SaaS Platform for Dev & Ops
‣Aggregation
‣Correlation
‣Collaboration

Page 6: Events and metrics the Lifeblood of Webops

Where We Started

Page 7: Events and metrics the Lifeblood of Webops

The Mess

[Diagram. Boxes: Dev team, Ops team, Servers and Devices, Monitoring, IAAS / PAAS, Cap. Planning, CDNs, Asset Mgmt, SDLC support, Applications, Usage Analytics, Issue Resolution, Hosting. Arrow labels: metrics, events, events + feedback, choices, changes, advice, billing, insight.]

‣Too many data streams, too many silos
‣Too many choices to make, too often
‣Only getting worse as SaaS silos multiply
‣Separate Dev and Ops teams, looking at separate data streams

Page 8: Events and metrics the Lifeblood of Webops

WHERE WE STARTED

Discourages exploration

Page 9: Events and metrics the Lifeblood of Webops

Very Specific View

Page 10: Events and metrics the Lifeblood of Webops

Different View
Same Reality

Page 11: Events and metrics the Lifeblood of Webops

Dev Interdiction
Part 1

Page 12: Events and metrics the Lifeblood of Webops

Where We Are

Page 13: Events and metrics the Lifeblood of Webops

In Action

https://app.datad0g.com/dash/host/8#/date_range/1309383780732-1309988580732

Page 14: Events and metrics the Lifeblood of Webops

TRENDS
Visible through Datadog and others

Data Exploration
‣d3, protovis

Context Matters
‣Ganglia Event API

Welcome developers
‣Graphite
‣statsd

Large Datasets
‣OpenTSDB
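Graphite and statsd, listed under "Welcome developers", both accept metrics as short text lines, which is part of why they are easy for developers to adopt. The sketch below formats a statsd counter, a statsd timer, and a Graphite plaintext line. The function names and metric names are hypothetical; the ports (UDP 8125 for statsd, TCP 2003 for Graphite) are the commonly used defaults and are assumptions about any particular setup.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    # statsd counter datagram: <name>:<value>|c
    return f"{name}:{value}|c".encode("ascii")


def statsd_timer(name: str, millis: int) -> bytes:
    # statsd timer datagram: <name>:<milliseconds>|ms
    return f"{name}:{millis}|ms".encode("ascii")


def graphite_line(path: str, value: float, timestamp: int) -> bytes:
    # Graphite plaintext protocol: "<path> <value> <unix timestamp>" plus newline
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP send; host and port are assumptions, not from the talk.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    print(statsd_counter("web.requests"))
    print(statsd_timer("web.transaction_duration", 320))
    print(graphite_line("web.load", 0.75, 1309988580))
```

The UDP transport means an application keeps running even when no collector is listening, at the cost of silently dropped samples.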

Page 15: Events and metrics the Lifeblood of Webops

Sides Of A Coin

Page 16: Events and metrics the Lifeblood of Webops

Metrics
‣Unique visitors
‣Load
‣Transaction duration
‣etc.

Events
‣User comments
‣Alert
‣Build
‣Batch job
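One way to picture the two sides of the coin is as two small record types: a metric is a numeric sample over time, while an event is a timestamped piece of text. The sketch below is illustrative only; the class names, fields, and the `events_near` helper are hypothetical and are not Datadog's data model. It shows the simplest form of correlation: listing the events that happened close to a given metric sample.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricPoint:
    name: str         # e.g. "load" or "transaction_duration"
    timestamp: float  # seconds since the epoch
    value: float


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the epoch
    kind: str         # e.g. "build", "alert", "batch_job", "comment"
    text: str


def events_near(events: List[Event], point: MetricPoint,
                window_s: float = 300.0) -> List[Event]:
    # Keep events whose timestamp is within window_s seconds of the sample.
    return [e for e in events if abs(e.timestamp - point.timestamp) <= window_s]


# Hypothetical data: a load spike shortly after a build.
spike = MetricPoint("load", 1309988580.0, 12.5)
timeline = [
    Event(1309988400.0, "build", "build finished"),
    Event(1309980000.0, "batch_job", "nightly batch job started"),
]
nearby = events_near(timeline, spike)
```

Here `nearby` contains only the build event, since the batch job started more than five minutes before the spike.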

Page 17: Events and metrics the Lifeblood of Webops

Aggregation, etc.

Page 18: Events and metrics the Lifeblood of Webops

Taxonomy

Page 19: Events and metrics the Lifeblood of Webops

CLASSICS

ACID: Atomicity, Consistency, Isolation, Durability
e.g. SQL DBs

http://en.wikipedia.org/wiki/Eventual_consistency

Page 20: Events and metrics the Lifeblood of Webops

CLASSICS

ACID: Atomicity, Consistency, Isolation, Durability
e.g. SQL DBs

BASE: Basically Available, Soft-state, Eventual consistency
e.g. DNS

http://en.wikipedia.org/wiki/Eventual_consistency

Page 21: Events and metrics the Lifeblood of Webops

NEWCOMER

DIRT: Data-Intensive Real-Time
e.g. real-time web

Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf

Page 22: Events and metrics the Lifeblood of Webops

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

Page 23: Events and metrics the Lifeblood of Webops

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

BASE

Page 24: Events and metrics the Lifeblood of Webops

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

BASE

DIRT

Page 25: Events and metrics the Lifeblood of Webops

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

BASE

BASE

DIRT

Page 26: Events and metrics the Lifeblood of Webops

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

BASE

DIRT

BASE

DIRT

Page 27: Events and metrics the Lifeblood of Webops

Datadog = DIRT + BASE + a tiny bit of ACID

Collaboration: Real-time updates, On-the-fly data analysis

Correlation: On-demand visualization, Background data analysis

Aggregation: Constant data influx, Large data sets

BASE

DIRT

BASE

DIRT

Page 28: Events and metrics the Lifeblood of Webops

How It All Fits Together

http://www.flickr.com/photos/tom-margie/1253798184/

Page 29: Events and metrics the Lifeblood of Webops

Architecture, Simplified

Page 30: Events and metrics the Lifeblood of Webops

Architecture, Simplified

BASE

Page 31: Events and metrics the Lifeblood of Webops

Architecture, Simplified

BASE

DIRT

Page 32: Events and metrics the Lifeblood of Webops

Architecture, Simplified

BASE

DIRT

ACID

Page 33: Events and metrics the Lifeblood of Webops

The Environment

Page 34: Events and metrics the Lifeblood of Webops

4 Dimensions: Compute, Storage, Network, Management

Page 35: Events and metrics the Lifeblood of Webops

ON-PREMISE TRAITS

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Page 36: Events and metrics the Lifeblood of Webops

ON-PREMISE TRAITS

Compute: Fast, Inelastic

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Page 37: Events and metrics the Lifeblood of Webops

ON-PREMISE TRAITS

Compute: Fast, Inelastic

Storage: Fast, Centralized, Redundant

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Page 38: Events and metrics the Lifeblood of Webops

ON-PREMISE TRAITS

Compute: Fast, Inelastic

Storage: Fast, Centralized, Redundant

Network: Fast, Localized

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Page 39: Events and metrics the Lifeblood of Webops

ON-PREMISE TRAITS

Compute: Fast, Inelastic

Storage: Fast, Centralized, Redundant

Management: People-based, Full access

Network: Fast, Localized

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Page 40: Events and metrics the Lifeblood of Webops

CLOUD TRAITS

Page 41: Events and metrics the Lifeblood of Webops

CLOUD TRAITS

Compute: Slow, Elastic

Page 42: Events and metrics the Lifeblood of Webops

CLOUD TRAITS

Compute: Slow, Elastic

Storage: Slow, Jittery, Maybe durable, Low memory

Page 43: Events and metrics the Lifeblood of Webops

CLOUD TRAITS

Compute: Slow, Elastic

Network: “Fast”, Geo-distributed

Storage: Slow, Jittery, Maybe durable, Low memory

Page 44: Events and metrics the Lifeblood of Webops

CLOUD TRAITS

Compute: Slow, Elastic

Network: “Fast”, Geo-distributed

Storage: Slow, Jittery, Maybe durable, Low memory

Management: No bare-metal, “Magic” API

Page 45: Events and metrics the Lifeblood of Webops

What We Have Found

Page 46: Events and metrics the Lifeblood of Webops

Network

Page 47: Events and metrics the Lifeblood of Webops

Network
‣Layer 2: Virtual Domain
‣Layer 3: Crude Edge Filtering
‣Layer 7: Crude Load Balancing
‣DNS
‣CDN

Page 48: Events and metrics the Lifeblood of Webops

Network
‣Layer 2: Virtual Domain
‣Layer 3: Crude Edge Filtering
‣Layer 7: Crude Load Balancing
‣DNS
‣CDN

It Works!

Page 49: Events and metrics the Lifeblood of Webops

Storage

Page 50: Events and metrics the Lifeblood of Webops

Storage

[Chart with axes Capacity and Latency]
‣BASE: Amazon S3
‣DIRT: Redis
‣ACID: PostgreSQL
‣BASE: Apache Cassandra

Page 51: Events and metrics the Lifeblood of Webops

Storage

[Chart with axes Capacity and Latency]
‣BASE: Amazon S3
‣DIRT: Redis
‣ACID: PostgreSQL
‣BASE: Apache Cassandra

Annotations: Low memory, Latency, Jittery, Limited throughput

Page 52: Events and metrics the Lifeblood of Webops

Low Memory

http://aws.amazon.com/ec2/#instance

Page 53: Events and metrics the Lifeblood of Webops

Jittery, Limited Throughput

https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016

Network Block Storage (EBS)

Page 54: Events and metrics the Lifeblood of Webops

Limited Throughput In Numbers
RAID 0 EBS Volumes, m1.large instances

             DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35

svctm: average service time in ms

await: average wait in ms

rd_sec/s: read throughput in sectors/s. Total: 368Mb/s
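The total on the slide can be approximately re-derived from the rd_sec/s column. The sketch below sums the four volumes and converts sectors to bits, assuming the 512-byte sector size that sysstat documents for this column; the variable names are hypothetical. It yields about 388 Mbit/s in decimal units, or about 370 in binary units, which is the same ballpark as the 368Mb/s quoted. The exact rounding behind the slide's figure is not stated.

```python
SECTOR_BYTES = 512  # assumption: sysstat reports 512-byte sectors

# rd_sec/s per volume, copied from the table on the slide
rd_sec_per_s = {
    "dev8-80": 23614.08,
    "dev8-96": 23749.65,
    "dev8-112": 23693.47,
    "dev8-128": 23721.57,
}

total_sectors = sum(rd_sec_per_s.values())   # sectors per second, all volumes
bytes_per_s = total_sectors * SECTOR_BYTES   # bytes per second
mbit_per_s = bytes_per_s * 8 / 1e6           # decimal megabits per second
mibit_per_s = bytes_per_s * 8 / 2**20        # binary megabits per second

if __name__ == "__main__":
    print(f"{total_sectors:.2f} sectors/s")
    print(f"{mbit_per_s:.0f} Mbit/s (decimal), {mibit_per_s:.0f} Mibit/s (binary)")
```

The same table also shows why the volumes feel slow: await is over 120 ms while svctm is a little over 1 ms, so requests spend nearly all their time queued.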

Page 55: Events and metrics the Lifeblood of Webops

Some Tricks

Page 56: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Page 57: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Limited by slowest volume
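A rough way to see why a stripe is limited by its slowest volume: in RAID 0 a large request is split evenly across all members and completes only when the last member finishes. Under that simplified model, aggregate throughput is the number of volumes times the slowest volume's throughput, not the sum. The function name and the per-volume numbers below are hypothetical.

```python
from typing import Sequence


def raid0_throughput(volume_mb_s: Sequence[float]) -> float:
    # Simplified model: every stripe waits for the slowest member,
    # so each volume effectively runs at the slowest volume's rate.
    if not volume_mb_s:
        raise ValueError("need at least one volume")
    return len(volume_mb_s) * min(volume_mb_s)


# Hypothetical per-volume throughput in MB/s, with one degraded volume.
volumes = [12.0, 12.0, 12.0, 6.0]
striped = raid0_throughput(volumes)  # 4 * 6.0
ideal = sum(volumes)                 # what independent volumes could deliver
```

With these numbers the stripe delivers 24.0 MB/s against an ideal 42.0 MB/s, so one jittery volume costs far more than its own shortfall.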

Page 58: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Streaming replication
S3 backups

Limited by slowest volume

Page 59: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Streaming replication
S3 backups

Ephemeral volumes
And Offsite backups

Limited by slowest volume

Page 60: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Streaming replication
S3 backups

Ephemeral volumes
And Offsite backups

Limited by slowest volume

Complexity
Recovery Time Objective
Recovery Point Objective

Page 61: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Streaming replication
S3 backups

Ephemeral volumes
And Offsite backups

Limited by slowest volume

Complexity
Recovery Time Objective
Recovery Point Objective

Database Service
MySQL/Oracle RDS

Page 62: Events and metrics the Lifeblood of Webops

Some Tricks

Software RAID
RAID 0
Offsite backups

Streaming replication
S3 backups

Ephemeral volumes
And Offsite backups

Limited by slowest volume

Complexity
Recovery Time Objective
Recovery Point Objective

Database Service
MySQL/Oracle RDS

Trust
RDS Outage 2 months ago

Page 63: Events and metrics the Lifeblood of Webops

Network Block Storage Is The Dark Side

Page 64: Events and metrics the Lifeblood of Webops

Network Block Storage Is The Dark Side

Bait For Enterprise Customers

Page 65: Events and metrics the Lifeblood of Webops

Network Block Storage Is The Dark Side

Hard Problem For Cloud Providers

Bait For Enterprise Customers

Page 66: Events and metrics the Lifeblood of Webops

Some Do’s And Don’ts

Don’t rely on networked block storage
‣Small data sets only if you have to

Do use S3 if you can
‣Object semantics a limitation
‣Slow but durable

Don’t trust data-at-rest
‣Copy, replicate, back up

Page 67: Events and metrics the Lifeblood of Webops

Compute

Page 68: Events and metrics the Lifeblood of Webops

Compute

[Chart with axes Number and “Performance”]
‣DIRT Nodes
‣ACID Nodes
‣BASE Nodes

Annotations: Add more, Scale up, Shard

Page 69: Events and metrics the Lifeblood of Webops

Some Do’s And Don’ts

Don’t rely on scale-ups
‣Low memory a hard limit for DBs
‣Noisy neighbors
‣Individual performance poor and jittery

Scale out
‣First scale up
‣Then Shard
‣Parallelize across machines
‣Vector-processing via GPUs
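As an illustration of the "Then Shard" step, the sketch below assigns each key to one of a fixed number of nodes by hashing it, so the same key always lands on the same node and the load spreads across machines. This is a generic hash-sharding sketch, not a description of Datadog's implementation; the function name, key names, and shard count are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(key: str, n_shards: int) -> int:
    # Use a stable hash so the mapping survives process restarts
    # (Python's built-in hash() is randomized per process for strings).
    if n_shards <= 0:
        raise ValueError("n_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def group_by_shard(keys: Iterable[str], n_shards: int) -> Dict[int, List[str]]:
    # Bucket keys by the shard that owns them.
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[shard_for(key, n_shards)].append(key)
    return dict(groups)


# Hypothetical metric keys spread over 4 nodes.
keys = [f"host{i}.load" for i in range(100)]
placement = group_by_shard(keys, 4)
```

Plain modulo hashing moves most keys whenever the shard count changes; schemes such as consistent hashing reduce that movement, at the cost of more bookkeeping.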

Page 70: Events and metrics the Lifeblood of Webops

Management

Page 71: Events and metrics the Lifeblood of Webops

An API for everything: Compute, Storage, Network, Management

Page 72: Events and metrics the Lifeblood of Webops

Questions!

http://datadoghq.com
twitter: @alq