Events and Metrics: The Lifeblood of Webops

Post on 20-Jun-2015


Events & Metrics: The Lifeblood Of Webops

Alexis Lê-Quôc (Product Guy) at Datadog

NYCBUG

July 6th, 2011

I <3 BSD
‣ OpenBSD user since 2.8 (pf)
‣ Love the documentation
‣ m0n0wall/pfSense
‣ ZFS-envy

What I’m going to talk about
‣ Briefly, what we do and for whom
‣ Where we started
‣ The kind of data we deal with
‣ How it all fits together
‣ A few things we learned along the way
‣ Q+A

What We Do

SaaS Platform for Dev & Ops
‣ Aggregation
‣ Correlation
‣ Collaboration

Where We Started

The Mess

[Diagram: the Dev team and the Ops team alongside their separate tools and data sources: Servers and Devices, Monitoring, IaaS/PaaS, Capacity Planning, CDNs, Asset Management, SDLC Support, Applications, Usage Analytics, Issue Resolution, Hosting. Arrows are labeled metrics, events, events + feedback, changes, choices, advice, billing, and insight.]

Too many data streams, too many silos

Too many choices to make, too often

Only getting worse as SaaS Silos multiply

Separate Dev and Ops teams, looking at separate data streams

WHERE WE STARTED

Discourages exploration

Very Specific View

Different View, Same Reality

Dev Interdiction, Part 1

Where We Are

In Action

https://app.datad0g.com/dash/host/8#/date_range/1309383780732-1309988580732

Data Exploration
‣ d3, protovis

Context Matters
‣ Ganglia Event API

Welcome developers
‣ Graphite
‣ statsd
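For readers who have not used these two tools: both accept simple plain-text metric lines. The sketch below only formats such lines, following the statsd `bucket:value|type` convention and the Graphite plaintext `path value timestamp` convention. The metric names in the usage note are made up for illustration, and nothing is sent over the network.

```python
import time
from typing import Optional, Union

Number = Union[int, float]


def statsd_line(bucket: str, value: Number, kind: str = "c") -> str:
    """Format one statsd payload: <bucket>:<value>|<type>.

    kind is "c" (counter), "g" (gauge) or "ms" (timer).
    """
    if kind not in ("c", "g", "ms"):
        raise ValueError(f"unsupported statsd type: {kind}")
    return f"{bucket}:{value}|{kind}"


def graphite_line(path: str, value: Number, timestamp: Optional[int] = None) -> str:
    """Format one Graphite plaintext line: <path> <value> <timestamp>."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"
```

For example, `statsd_line("web.requests", 1)` yields `web.requests:1|c`, a counter increment that a developer can emit from application code with a single UDP datagram.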

Large Datasets
‣ OpenTSDB

TRENDS

Visible through Datadog and others

Sides Of A Coin

Metrics
‣ Unique visitors
‣ Load
‣ Transaction duration
‣ etc.

Events
‣ User comments
‣ Alert
‣ Build
‣ Batch job

Aggregation

etc.
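One way to picture the two sides of the coin in code: a metric is a numeric sample on a named time series, while an event is a discrete, timestamped piece of text. The types and field names below are illustrative assumptions, not Datadog's actual data model. `rollup` sketches aggregation as averaging metric points into fixed-width time buckets, and `events_between` picks the events that would be overlaid on the same time window.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MetricPoint:
    name: str         # e.g. "web.load" (hypothetical metric name)
    timestamp: int    # seconds since the epoch
    value: float


@dataclass(frozen=True)
class Event:
    timestamp: int    # seconds since the epoch
    source: str       # e.g. "build", "alert", "batch job"
    text: str


def rollup(points: list[MetricPoint], bucket_seconds: int) -> dict[int, float]:
    """Average metric values per time bucket, keyed by bucket start time."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for p in points:
        start = p.timestamp - (p.timestamp % bucket_seconds)
        buckets[start].append(p.value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def events_between(events: list[Event], start: int, end: int) -> list[Event]:
    """Events whose timestamp falls in [start, end)."""
    return [e for e in events if start <= e.timestamp < end]
```

Keeping both kinds of record on a shared time axis is what lets a build or an alert be lined up against a change in load.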

Taxonomy

CLASSICS

http://en.wikipedia.org/wiki/Eventual_consistency

ACID
‣ Atomicity
‣ Consistency
‣ Isolation
‣ Durability

e.g. SQL DBs


BASE
‣ Basically Available
‣ Soft-state
‣ Eventual consistency

e.g. DNS

NEWCOMER

Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf

DIRT
‣ Data-Intensive
‣ Real-Time

e.g. real-time web

Collaboration
‣ Real-time updates
‣ On-the-fly data analysis

Correlation
‣ On-demand visualization
‣ Background data analysis

Aggregation
‣ Constant data influx
‣ Large data sets


Datadog = DIRT + BASE + a tiny bit of ACID

[Slide build: the Collaboration, Correlation, and Aggregation rows shown earlier, annotated with BASE and DIRT labels]

How It All Fits Together

http://www.flickr.com/photos/tom-margie/1253798184/

Architecture, Simplified

[Architecture diagram, built up in stages, with components labeled BASE, DIRT, and ACID]

The Environment

4 Dimensions
‣ Compute
‣ Storage
‣ Network
‣ Management

ON-PREMISE TRAITS

http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/

Compute
‣ Fast
‣ Inelastic

Storage
‣ Fast
‣ Centralized
‣ Redundant

Network
‣ Fast
‣ Localized

Management
‣ People-based
‣ Full access

CLOUD TRAITS

Compute
‣ Slow
‣ Elastic

Storage
‣ Slow
‣ Jittery
‣ Maybe durable
‣ Low memory

Network
‣ “Fast”
‣ Geo-distributed

Management
‣ No bare-metal
‣ “Magic” API

What We Have Found

Network
‣ Layer 2: Virtual Domain
‣ Layer 3: Crude Edge Filtering
‣ Layer 7: Crude Load Balancing
‣ DNS
‣ CDN

It Works!

Storage

[Chart: storage systems plotted by capacity and latency]
‣ DIRT: Redis
‣ ACID: PostgreSQL
‣ BASE: Apache Cassandra
‣ BASE: Amazon S3

Callouts: Low memory, Latency, Jittery, Limited throughput

Low Memory

http://aws.amazon.com/ec2/#instance

Jittery, Limited Throughput

https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016

Network Block Storage (EBS)

Limited Throughput In Numbers

RAID 0 EBS Volumes, m1.large instances

Time         DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35

‣ svctm: average service time in ms
‣ await: average wait in ms
‣ rd_sec/s: read throughput in sectors/s. Total: 368Mb/s
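The total can be approximately reproduced from the rd_sec/s column. This is a minimal sketch, assuming the 512-byte sector size that sysstat documents for this column. The per-device figures are copied from the output above; the function name is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: sysstat reports rd_sec/s in 512-byte sectors

# rd_sec/s for the four striped EBS volumes, copied from the sar output
read_sectors_per_sec = {
    "dev8-80": 23614.08,
    "dev8-96": 23749.65,
    "dev8-112": 23693.47,
    "dev8-128": 23721.57,
}


def total_read_throughput(sectors_per_sec: dict[str, float]) -> tuple[float, float]:
    """Return aggregate read throughput as (MiB/s, Mibit/s)."""
    total_bytes = sum(sectors_per_sec.values()) * SECTOR_BYTES
    mib_per_sec = total_bytes / 2**20
    return mib_per_sec, mib_per_sec * 8


mib, mibit = total_read_throughput(read_sectors_per_sec)
print(f"{mib:.1f} MiB/s, {mibit:.0f} Mibit/s")
```

This gives roughly 46 MiB/s, or about 370 Mibit/s, across the whole four-volume stripe, which is close to the 368Mb/s quoted on the slide. The transcript does not say how the slide's figure was rounded, so the small gap is left unexplained here.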

Some Tricks


Software RAID
‣ RAID 0
‣ Offsite backups

Streaming replication
‣ S3 backups

Ephemeral volumes
‣ And offsite backups

Limited by slowest volume

Complexity
‣ Recovery Time Objective
‣ Recovery Point Objective

Database Service
‣ MySQL/Oracle RDS

Trust
‣ RDS outage 2 months ago

Network Block Storage Is The Dark Side
‣ Hard Problem For Cloud Providers
‣ Bait For Enterprise Customers

Some Do’s And Don’ts

Don’t rely on networked block storage
‣ Small data sets only, if you have to

Do use S3 if you can
‣ Object semantics a limitation
‣ Slow but durable

Don’t trust data-at-rest
‣ Copy, replicate, back up

Compute

[Chart: DIRT nodes, ACID nodes, and BASE nodes plotted by number of nodes and “performance”, with scaling options labeled “Add more”, “Scale up”, and “Shard”]

Some Do’s And Don’ts

Don’t rely on scale-ups
‣ Low memory a hard limit for DBs
‣ Noisy neighbors
‣ Individual performance poor and jittery

Scale out
‣ First scale up
‣ Then shard
‣ Parallelize across machines
‣ Vector-processing via GPUs
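As an illustration of the "then shard" step, here is a minimal sketch of hash-based sharding: each key is mapped to one of N machines by a hash that is stable across processes. The function names are hypothetical, and the talk does not say which sharding scheme Datadog uses.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(keys: list[str], shard_count: int) -> dict[int, list[str]]:
    """Group keys by shard so each group can be handled by a separate machine."""
    shards: dict[int, list[str]] = {i: [] for i in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards
```

Plain modulo hashing like this moves most keys when the shard count changes, so it suits a fixed number of shards; consistent hashing is a common alternative when nodes are added and removed often.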

Management

An API for everything
‣ Compute
‣ Storage
‣ Network
‣ Management

Questions!

http://datadoghq.com

Twitter: @alq