Events and metrics the Lifeblood of Webops
Events & Metrics: The Lifeblood Of Webops
Alexis Lê-Quôc (Product Guy) at Datadog
NYCBUG, July 6th, 2011
I <3 BSD
‣ OpenBSD user since 2.8 (pf)
‣ Love the documentation
‣ m0n0wall/pfSense
‣ ZFS-envy
What I’m going to talk about
‣ Briefly, what we do and for whom
‣ Where we started
‣ The kind of data we deal with
‣ How it all fits together
‣ A few things we learned along the way
‣ Q+A
What We Do
SaaS Platform for Dev & Ops
‣ Aggregation
‣ Correlation
‣ Collaboration
Where We Started
The Mess
[Diagram: Dev team and Ops team surrounded by siloed tools and services (Servers and Devices, Monitoring, IaaS/PaaS, Cap. Planning, CDNs, Asset Mgmt, SDLC support, Applications, Usage Analytics, Issue Resolution, Hosting), exchanging metrics, events + feedback, advice, billing, insight, changes and choices]
WHERE WE STARTED
‣ Too many data streams, too many silos
‣ Too many choices to make, too often
‣ Only getting worse as SaaS silos multiply
‣ Separate Dev and Ops teams, looking at separate data streams
‣ Discourages exploration
Very Specific View
Different View, Same Reality
Dev Interdiction, Part 1
Where We Are
In Action
https://app.datad0g.com/dash/host/8#/date_range/1309383780732-1309988580732
TRENDS
Visible through Datadog and others
‣ Data exploration: d3, protovis
‣ Context matters: Ganglia Event API
‣ Welcome developers: Graphite, statsd
‣ Large datasets: OpenTSDB
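As an illustration of why statsd lowers the barrier for developers, here is a minimal sketch of emitting a metric in the statsd plain-text format (`name:value|type`) over UDP. The metric names are made up for the example, and the host and port are assumptions (8125 is the conventional statsd port); this is not taken from the talk.

```python
import socket


def statsd_line(name: str, value: float, kind: str = "c") -> bytes:
    """Format one metric in the statsd wire format: name:value|type."""
    # c = counter, ms = timer, g = gauge
    if kind not in {"c", "ms", "g"}:
        raise ValueError(f"unsupported statsd type: {kind}")
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(line: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget over UDP, so instrumented code never blocks on monitoring."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))

# Example (hypothetical metric name):
#   send_metric(statsd_line("web.page_views", 1))
```

Because the transport is UDP, a missing or overloaded collector costs the application nothing, at the price of silently dropped samples.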
Sides Of A Coin
Metrics
‣ Unique visitors
‣ Load
‣ Transaction duration
‣ etc.
Events
‣ User comments
‣ Alert
‣ Build
‣ Batch job
‣ etc.
Aggregation
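One way to picture the two sides of the coin is as two small record types: a metric is a numeric sample over time, an event is a discrete occurrence with text attached. The sketch below is hypothetical (the class and field names are invented for illustration and are not Datadog's data model); it also shows the simplest possible correlation, finding the events that happened near a metric sample.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricPoint:
    name: str         # e.g. "transaction.duration"
    timestamp: float  # seconds since the epoch
    value: float


@dataclass(frozen=True)
class Event:
    kind: str         # e.g. "build", "alert", "batch_job", "comment"
    timestamp: float  # seconds since the epoch
    text: str


def events_near(point: MetricPoint, events: List[Event],
                window: float = 300.0) -> List[Event]:
    """Return the events within `window` seconds of a metric sample."""
    return [e for e in events if abs(e.timestamp - point.timestamp) <= window]
```

For example, a spike in transaction duration at a given timestamp can be lined up against a build or batch job event that landed within the same five minutes.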
Taxonomy
CLASSICS
http://en.wikipedia.org/wiki/Eventual_consistency
ACID: Atomicity, Consistency, Isolation, Durability
e.g. SQL DBs
BASE: Basically Available, Soft-state, Eventual consistency
e.g. DNS
NEWCOMER
Bryan Cantrill: http://dtrace.org/resources/bmc/DIRT.pdf
DIRT: Data-Intensive Real-Time
e.g. real-time web
‣ Collaboration: real-time updates, on-the-fly data analysis
‣ Correlation: on-demand visualization, background data analysis
‣ Aggregation: constant data influx, large data sets
[Slide builds label these layers BASE and DIRT]
Datadog = DIRT + BASE + a tiny bit of ACID
How It All Fits Together
http://www.flickr.com/photos/tom-margie/1253798184/
Architecture, Simplified
[Diagram: architecture components labelled BASE, DIRT and ACID]
The Environment
4 Dimensions
‣ Compute
‣ Storage
‣ Network
‣ Management
ON-PREMISE TRAITS
http://www.flickr.com/photos/theplanetdotcom/4879419788/sizes/l/in/photostream/
‣ Compute: fast, inelastic
‣ Storage: fast, centralized, redundant
‣ Network: fast, localized
‣ Management: people-based, full access
CLOUD TRAITS
‣ Compute: slow, elastic
‣ Storage: slow, jittery, maybe durable, low memory
‣ Network: “fast”, geo-distributed
‣ Management: no bare-metal, “magic” API
What We Have Found
Network
‣ Layer 2: Virtual Domain
‣ Layer 3: Crude Edge Filtering
‣ Layer 7: Crude Load Balancing
‣ DNS
‣ CDN
It Works!
Storage
[Diagram: data stores plotted by capacity and latency]
‣ DIRT: Redis
‣ ACID: PostgreSQL
‣ BASE: Apache Cassandra
‣ BASE: Amazon S3
Constraints called out: low memory, latency, jittery, limited throughput
Low Memory
http://aws.amazon.com/ec2/#instance
Jittery, Limited Throughput
https://app.datad0g.com/dash/dash/1032#/date_range/1308608717016-1309213517016
Network Block Storage (EBS)
Limited Throughput In Numbers
RAID 0 EBS Volumes, m1.large instances

Time         DEV       tps     rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
03:35:02 PM  dev8-80   375.95  23614.08  5.70      62.83     47.21     125.58  1.26   47.34
03:35:02 PM  dev8-96   373.63  23749.65  5.64      63.58     45.55     121.91  1.22   45.72
03:35:02 PM  dev8-112  375.28  23693.47  5.52      63.15     45.52     121.22  1.23   46.31
03:35:02 PM  dev8-128  375.31  23721.57  7.19      63.22     56.00     148.96  1.34   50.35

‣ svctm: average service time in ms
‣ await: average wait in ms
‣ rd_sec/s: read throughput in sectors/s; total: 368Mb/s
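The arithmetic behind these numbers can be sketched as follows. It assumes 512-byte sectors (the unit sysstat uses for rd_sec/s) and copies the four rows from the sar output above. The function names are invented for the example. The second function is a sanity check using Little's law: average queue depth should be close to request rate multiplied by average wait.

```python
SECTOR_BYTES = 512  # assumption: sysstat reports rd_sec/s in 512-byte sectors

# (device, tps, rd_sec/s, avgqu-sz, await in ms), copied from the sar output
ROWS = [
    ("dev8-80", 375.95, 23614.08, 47.21, 125.58),
    ("dev8-96", 373.63, 23749.65, 45.55, 121.91),
    ("dev8-112", 375.28, 23693.47, 45.52, 121.22),
    ("dev8-128", 375.31, 23721.57, 56.00, 148.96),
]


def total_read_bits_per_second(rows) -> float:
    """Sum read sectors/s over all volumes and convert to bits/s."""
    return sum(row[2] for row in rows) * SECTOR_BYTES * 8


def expected_queue_depth(tps: float, await_ms: float) -> float:
    """Little's law: requests in flight = arrival rate x time in system."""
    return tps * await_ms / 1000.0


if __name__ == "__main__":
    mbits = total_read_bits_per_second(ROWS) / 2**20
    print(f"total read throughput: {mbits:.0f} Mbit/s (2^20 bits per Mbit)")
    for dev, tps, _, avgqu, await_ms in ROWS:
        print(dev, f"reported avgqu-sz={avgqu}",
              f"tps*await={expected_queue_depth(tps, await_ms):.2f}")
```

With a megabit taken as 2^20 bits the total comes out at roughly 370 Mbit/s, in line with the slide's figure. The queue-depth check also explains the gap between svctm (about 1 ms) and await (over 120 ms): each request spends nearly all of its time queued behind dozens of others.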
Some Tricks
‣ Software RAID: RAID 0, offsite backups
‣ Limited by slowest volume
‣ Streaming replication, S3 backups
‣ Ephemeral volumes and offsite backups
‣ Complexity: Recovery Time Objective, Recovery Point Objective
‣ Database service: MySQL/Oracle RDS
‣ Trust: RDS outage 2 months ago
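The "limited by slowest volume" caveat for RAID 0 can be made concrete with a deliberately simplified model: a request striped evenly across all volumes finishes only when the slowest member finishes, and sustained throughput is capped at the member count times the slowest member's rate. The functions below are an illustrative sketch under that assumption, not a measurement tool; the await values are the ones from the sar output earlier in the deck.

```python
from typing import Sequence


def striped_latency_ms(per_volume_ms: Sequence[float]) -> float:
    """A full-stripe request completes when its slowest member completes."""
    return max(per_volume_ms)


def striped_throughput(per_volume_rate: Sequence[float]) -> float:
    """With even striping, every member moves at the pace of the slowest one."""
    return len(per_volume_rate) * min(per_volume_rate)


# await (ms) of the four EBS volumes in the RAID 0 set
awaits = [125.58, 121.91, 121.22, 148.96]
stripe_latency = striped_latency_ms(awaits)  # set by dev8-128, the slowest volume
```

Adding more volumes raises the ceiling but also raises the odds that at least one of them is having a bad moment, which is why jitter hurts striped sets in particular.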
Network Block Storage Is The Dark Side
‣ Bait For Enterprise Customers
‣ Hard Problem For Cloud Providers
Some Do’s And Don’ts
‣ Don’t rely on networked block storage: small data sets only, if you have to
‣ Do use S3 if you can: object semantics a limitation; slow but durable
‣ Don’t trust data-at-rest: copy, replicate, back up
Compute
[Diagram: number of nodes against “performance”]
‣ DIRT nodes
‣ ACID nodes
‣ BASE nodes
Scaling options shown: add more, scale up, shard
Some Do’s And Don’ts
‣ Don’t rely on scale-ups: low memory a hard limit for DBs; noisy neighbors; individual performance poor and jittery
‣ Scale out: first scale up, then shard; parallelize across machines; vector-processing via GPUs
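A minimal sketch of the "then shard" step, assuming data is keyed by a string such as a metric or host name. The function names and the modulo scheme are illustrative choices, not the approach described in the talk. A stable digest is used instead of Python's built-in `hash()`, which is randomized per process for strings and so would route the same key differently on different machines.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index that is stable across processes and hosts."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(keys: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group keys by shard so each group can be handled by a separate node."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return dict(shards)
```

Plain modulo sharding is easy to reason about, but changing `shard_count` remaps most keys; consistent hashing is the usual alternative when nodes are added or removed often.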
Management
An API for everything
‣ Compute
‣ Storage
‣ Network
‣ Management