Datadog: A Real-Time Metrics Database for Trillions of Points/Day
Ian NOWLAND (https://twitter.com/inowland), VP, Metrics and Monitors
Joel BARCIAUSKAS (https://twitter.com/JoelBarciauskas), Director, Aggregation Metrics
QCon NYC '19
Some of Our Customers
Some of What We Store
Changing Source Lifecycle
[Figure: source lifetimes range from months/years to seconds across Datacenter, Cloud/VM, and Containers]
Changing Data Volume
[Figure: volume grows from 100's to 10,000's across System, Application, Per User Device, and SLIs]
Applying Performance Mantras
• Don't do it
• Do it, but don't do it again
• Do it less
• Do it later
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
*From Craig Hanson and Pat Crain, and the performance engineering community - see http://www.brendangregg.com/methodology.html
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
Example Metrics Query 1
“What is the system load on instance i-xyz across the last 30 minutes”
A Time Series
metric system.load.1
timestamp 1526382440
value 0.92
tags host:i-xyz,env:dev,...
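As an illustration only, the point on this slide could be modeled as a small record. The class and helper names below are hypothetical, and only the two tags shown on the slide are used (the elided ones are left out).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One sample of a time series (illustrative layout, not Datadog's)."""
    metric: str            # what is being measured
    timestamp: int         # Unix epoch seconds
    value: float
    tags: frozenset[str]   # "key:value" strings


def series_key(point: Point) -> tuple[str, frozenset[str]]:
    """Assumption: a series is identified by metric name plus tag set."""
    return (point.metric, point.tags)


point = Point(
    metric="system.load.1",
    timestamp=1526382440,
    value=0.92,
    tags=frozenset({"host:i-xyz", "env:dev"}),
)
print(series_key(point))
```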
Example Metrics Query 2
“Alert when the system load, averaged across our fleet in us-east-1a for a 5 minute interval, goes above 90%”
Highlighted parts of the query: Aggregate, Dimension, Take Action
Metrics Name and Tags
Name: single string defining what you are measuring, e.g.
system.cpu.user
aws.elb.latency
dd.frontend.internal.ajax.queue.length.total
Tags: list of k:v strings, used to qualify metric and add dimensions to filter/aggregate over, e.g.
['host:server-1', 'availability-zone:us-east-1a', 'kernel_version:4.4.0']
['host:server-2', 'availability-zone:us-east-1a', 'kernel_version:2.6.32']
['host:server-3', 'availability-zone:us-east-1b', 'kernel_version:2.6.32']
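A minimal sketch of how k:v tags support filtering and grouping, using the three tag lists above. The helper names (`parse_tags`, `filter_by`, `group_by`) are made up for this example and are not part of any Datadog API.

```python
from collections import defaultdict

# The three tag lists from the slide.
SERIES = [
    ['host:server-1', 'availability-zone:us-east-1a', 'kernel_version:4.4.0'],
    ['host:server-2', 'availability-zone:us-east-1a', 'kernel_version:2.6.32'],
    ['host:server-3', 'availability-zone:us-east-1b', 'kernel_version:2.6.32'],
]


def parse_tags(tags):
    """Turn ['k:v', ...] into {'k': 'v', ...}, splitting on the first colon."""
    return dict(tag.split(":", 1) for tag in tags)


def filter_by(series, key, value):
    """Hosts whose tag `key` equals `value`."""
    parsed = [parse_tags(tags) for tags in series]
    return [p["host"] for p in parsed if p.get(key) == value]


def group_by(series, key):
    """Hosts grouped by the value of tag `key`."""
    groups = defaultdict(list)
    for tags in series:
        parsed = parse_tags(tags)
        groups[parsed[key]].append(parsed["host"])
    return dict(groups)


print(filter_by(SERIES, "availability-zone", "us-east-1a"))
print(group_by(SERIES, "kernel_version"))
```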
Tags for all the dimensions
Host / container: system metrics by host
Application: internal cache hit rates, timers by module
Service: hits, latencies or errors/s by path and/or response code
Business: # of orders processed, $'s per second by customer ID
Pipeline Architecture
[Diagram components: Metrics sources, Intake, Data Stores (multiple), Query System, Web frontend & APIs, Customer Browser, Customer, Monitors and Alerts, Slack/Email/PagerDuty etc]
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
Pipeline Architecture
[Diagram components: Metrics sources, Intake, Data Stores (multiple), Query System, Query Cache, Web frontend & APIs, Customer Browser, Customer, Monitors and Alerts, Slack/Email/PagerDuty etc]
Metrics Store Characteristics
• Most metrics report with the same tag set for quite some time
=> Therefore separate tag stores from time series stores
Kafka for Independent Storage Systems
[Diagram components: Incoming Data, Intake, Kafka Points, Kafka Tag Sets, Store 1, Store 2, Tag Index, Tag Describer, S3 Writer, S3, Query System, Outgoing Data]
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently
• Do it cheaper
Scaling through Kafka
Data is separated by partition to distribute it.
Partitions are customers, or a mod hash of their metric name.
This also gives us isolation.
[Diagram components: Incoming Data, Intake, Kafka partitions 0 to 3, multiple Store 1 and Store 2 consumers]
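The partitioning rule on this slide could be sketched as follows. The function name, the `by_metric` switch, and the use of CRC32 as the hash are assumptions made for illustration; the slide only says partitions are customers, or a mod hash of their metric name.

```python
import zlib


def partition_for(customer_id: str, metric_name: str,
                  num_partitions: int, by_metric: bool = False) -> int:
    """Pick a Kafka partition for a point.

    By default every point of a customer lands on one partition. With
    by_metric=True the metric name is hashed as well, which spreads a
    large customer across partitions. CRC32 is only a stand-in for
    whatever stable hash a real system would use.
    """
    key = f"{customer_id}:{metric_name}" if by_metric else customer_id
    return zlib.crc32(key.encode("utf-8")) % num_partitions


print(partition_for("customer-42", "system.load.1", num_partitions=4))
print(partition_for("customer-42", "system.load.1", num_partitions=4, by_metric=True))
```

Because the hash is stable, a given customer (or customer and metric) always maps to the same partition, so each store consumer sees a consistent slice of the data.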
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper
Per Customer Volume Ballparking
10^4 Number of apps; 1,000's of hosts times 10's of containers
10^3 Number of metrics emitted from each app/container
10^0 1 point a second per metric
10^5 Seconds in a day (actually 86,400)
10^1 Bytes/point (8 byte float, amortized tags)
= 10^13 10 Terabytes a Day For One Average Customer
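The ballpark above is a product of powers of ten. A short sketch of the arithmetic, with variable names chosen here for readability, also shows the effect of using the exact 86,400 seconds in a day:

```python
TB = 10**12  # decimal terabyte, in bytes

apps = 10**4               # 1,000's of hosts times 10's of containers
metrics_per_app = 10**3
points_per_second = 10**0  # 1 point a second per metric
seconds_per_day = 10**5    # rounded up from 86,400
bytes_per_point = 10**1    # 8 byte float plus amortized tags

bytes_per_day = (apps * metrics_per_app * points_per_second
                 * seconds_per_day * bytes_per_point)
ballpark_tb = bytes_per_day / TB

# Same estimate with the exact number of seconds in a day.
exact_tb = (apps * metrics_per_app * points_per_second
            * 86_400 * bytes_per_point) / TB

print(ballpark_tb, exact_tb)
```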
Volume Math
• $210 to store 10 TB in S3 for a month
• $60,000 for a month rolling queryable (300 TB)
• But S3 is not for real time, high throughput queries
Cloud Storage Characteristics
| Type | Max Capacity | Bandwidth | Latency | Cost/TB for 1 month | Volatility |
|---|---|---|---|---|---|
| DRAM [1] | 4 TB | 80 GB/s | 0.08 us | $1,000 | Instance Reboot |
| SSD [2] | 60 TB | 12 GB/s | 1 us | $60 | Instance Failures |
| EBS io1 | 432 TB | 12 GB/s | 40 us | $400 | Data Center Failures |
| S3 | Infinite | 12 GB/s [3] | 100+ ms | $21 [4] | 11 nines durability |
| Glacier | Infinite | 12 GB/s [3] | hours | $4 [4] | 11 nines durability |

1. x1e.32xlarge, 3 year non convertible, no upfront reserved instance
2. i3en.24xlarge, 3 year non convertible, no upfront reserved instance
3. Assumes can highly parallelize to load network card of 100Gbps instance type. Likely does not scale out.
4. Storage cost only
Volume Math
• 80 x1e.32xlarge DRAM for a month
• $300,000 to store for a month
• This is with no indexes or overhead
• And people want to query much more than a month.
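The DRAM figures can be checked against the Cloud Storage Characteristics table (4 TB per x1e.32xlarge, $1,000 per TB-month) and the 10 TB/day ballpark. The variable names are illustrative. The exact instance count comes out at 75, which the slide presumably rounds up to 80.

```python
import math

daily_tb = 10                    # ballpark volume for one average customer
days_queryable = 30              # one rolling month
dram_tb_per_instance = 4         # x1e.32xlarge, from the storage table
dram_cost_per_tb_month = 1_000   # USD, from the storage table

month_tb = daily_tb * days_queryable
instances = math.ceil(month_tb / dram_tb_per_instance)
dram_cost = month_tb * dram_cost_per_tb_month

print(month_tb, instances, dram_cost)
```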
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper
Returning to an Example Query
“Alert when the system load, averaged across our fleet in us-east-1a for a 5 minute interval, goes above 90%”
Queries We Need to Support
DESCRIBE TAGS: What tags are queryable for this metric?
TAG INDEX: Given a time series id, what tags were used?
TAG INVERTED INDEX: Given some tags and a time range, what were the time series ingested?
POINT STORE: What are the values of a time series between two times?
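To make the four queries concrete, here is a toy, single-process sketch that answers each of them from in-memory dictionaries. The class and method names are hypothetical, and it says nothing about how the real stores are built; it only shows what each query takes and returns.

```python
from collections import defaultdict


class ToyMetricsStore:
    """In-memory stand-in for the four stores (illustration only)."""

    def __init__(self):
        self.tags_by_metric = defaultdict(set)     # DESCRIBE TAGS
        self.tags_by_series = {}                   # TAG INDEX
        self.series_by_tag = defaultdict(set)      # TAG INVERTED INDEX
        self.points_by_series = defaultdict(list)  # POINT STORE

    def ingest(self, metric, tags, timestamp, value):
        series_id = (metric, frozenset(tags))
        self.tags_by_metric[metric].update(tags)
        self.tags_by_series[series_id] = frozenset(tags)
        for tag in tags:
            self.series_by_tag[tag].add(series_id)
        self.points_by_series[series_id].append((timestamp, value))
        return series_id

    def describe_tags(self, metric):
        """What tags are queryable for this metric?"""
        return sorted(self.tags_by_metric[metric])

    def tag_index(self, series_id):
        """Given a time series id, what tags were used?"""
        return self.tags_by_series[series_id]

    def point_store(self, series_id, start, end):
        """Values of a time series between two times (inclusive)."""
        return [(t, v) for t, v in self.points_by_series[series_id]
                if start <= t <= end]

    def tag_inverted_index(self, tags, start, end):
        """Series carrying all given tags (at least one) with points in range."""
        candidates = set.intersection(*(self.series_by_tag[t] for t in tags))
        return {s for s in candidates if self.point_store(s, start, end)}
```

Keeping the four lookups separate mirrors the slide: each can then be backed by a different storage type with its own retention.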
Performance mantras
• Don't do it
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
Hybrid Data Storage Types
| System | Type | Persistence |
|---|---|---|
| DESCRIBE TAGS | Local SSD | Years |
| TAG INDEX | DRAM | Cache (Hours) |
| | Local SSD | Years |
| TAG INVERTED INDEX | DRAM | Hours |
| | On SSD | Days |
| | S3 | Years |
| POINT STORE | DRAM | Hours |
| | Local SSD | Days |
| | S3 | Years |
| QUERY RESULTS | DRAM | Cache (Days) |
Hybrid Data Storage Technologies
| System | Type | Persistence | Technology | Why? |
|---|---|---|---|---|
| DESCRIBE TAGS | Local SSD | Years | LevelDB | High performing single node k,v |
| TAG INDEX | DRAM | Cache (Hours) | Redis | Very high performance, in memory k,v |
| | Local SSD | Years | Cassandra | Horizontal scaling, persistent k,v |
| TAG INVERTED INDEX | DRAM | Hours | In house | Very customized index data structures |
| | On SSD | Days | RocksDB + SQLite | Rich and flexible queries |
| | S3 | Years | Parquet | Flexible Schema over time |
| POINT STORE | DRAM | Hours | In house | Very customized index data structures |
| | Local SSD | Days | In house | Very customized index data structures |
| | S3 | Years | Parquet | Flexible Schema over time |
| QUERY RESULTS | DRAM | Cache (Days) | Redis | Very high performance, in memory k,v |
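One way to read the POINT STORE rows is as an age-based routing rule: recent points come from DRAM, older ones from local SSD, and the oldest from S3. The sketch below is purely illustrative. The slide gives only "Hours", "Days", and "Years", so the 24 hour and 30 day cutoffs are assumptions, as are all names.

```python
from datetime import timedelta

# (tier, maximum age served by that tier). The cutoffs are ASSUMED values;
# the slide states only "Hours", "Days", and "Years".
POINT_STORE_TIERS = [
    ("DRAM", timedelta(hours=24)),
    ("Local SSD", timedelta(days=30)),
    ("S3", None),  # no upper bound: everything older
]


def tier_for_age(age: timedelta) -> str:
    """Return the first tier whose retention window covers `age`."""
    for name, max_age in POINT_STORE_TIERS:
        if max_age is None or age <= max_age:
            return name
    raise ValueError("no tier covers this age")


print(tier_for_age(timedelta(minutes=30)))
print(tier_for_age(timedelta(days=7)))
print(tier_for_age(timedelta(days=400)))
```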
Alerts/Monitors Synchronization
• Level sensitive
• False positives are almost as important as false negatives
• Small delay preferable to evaluating incomplete data
• Synchronization need is to be sure the evaluation bucket is filled before processing
Pipeline Architecture
[Diagram components: Metrics sources, Intake, Data Stores (multiple), Query System, Query Cache, Web frontend & APIs, Customer Browser, Customer, Monitors and Alerts, Slack/Email/PagerDuty etc]
[Diagram annotations: "Inject heartbeat here", "And test it gets to here"]
Heartbeats for Synchronization
Semantics:
- 1 second tick time for metrics
- Last write wins to handle agent concurrency
- Inject fake data as heartbeat through pipeline
Then:
- Monitor evaluator ensures heartbeat gets through before evaluating next period
Challenges:
- With sharding and multiple stores, lots of independent paths to make sure heartbeats go through
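A minimal sketch of the gating idea, assuming each independent path (for example a shard and store pair) reports the timestamp of the latest heartbeat it has seen. The class and method names are hypothetical; this is not the actual monitor evaluator.

```python
class HeartbeatGate:
    """Hold back evaluation of a period until every path has caught up."""

    def __init__(self, paths):
        # Latest heartbeat timestamp seen on each independent path.
        self.latest = {path: None for path in paths}

    def observe(self, path, timestamp):
        """Record a heartbeat that made it through `path`."""
        current = self.latest[path]
        if current is None or timestamp > current:
            self.latest[path] = timestamp

    def ready(self, period_end):
        """True once every path has seen a heartbeat at or after period_end."""
        return all(ts is not None and ts >= period_end
                   for ts in self.latest.values())


gate = HeartbeatGate(["partition:0/store-1", "partition:1/store-1"])
gate.observe("partition:0/store-1", 60)
print(gate.ready(60))   # one path is still behind
gate.observe("partition:1/store-1", 61)
print(gate.ready(60))   # both paths have passed the period end
```

The gate waits on the slowest path, which matches the preference for a small delay over evaluating incomplete data.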
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
Types of metrics
Counters, aggregate by sum
Ex: Requests, errors/s, total time spent (stopwatch)
Gauges, aggregate by last or avg
Ex: CPU/network/disk use, queue length
Aggregation
Space: four series. Time: ten points each, t0 to t9.
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
{0, 1, 0, 1, 0, 1, 0, 1, 0, 1}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{5, 5, 5, 5, 5, 5, 5, 5, 5, 5}
{0, 2, 4, 8, 16, 32, 64, 128, 256, 512}
Query output (each series aggregated over time):
Counters: {5, 45, 50, 1022}
Gauges (average): {0.5, 4.5, 5, 102.2}
Gauges (last): {1, 9, 5, 512}
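The three aggregations can be reproduced directly from the four series on the slide. This is a plain sketch of aggregating each series over time; the function name and the `how` parameter are made up for the example.

```python
SERIES = [
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [0, 2, 4, 8, 16, 32, 64, 128, 256, 512],
]


def aggregate(points, how):
    """Collapse one series over time: 'sum' for counters, 'avg' or 'last' for gauges."""
    if how == "sum":
        return sum(points)
    if how == "avg":
        return sum(points) / len(points)
    if how == "last":
        return points[-1]
    raise ValueError(f"unknown aggregation: {how}")


for how in ("sum", "avg", "last"):
    print(how, [aggregate(points, how) for points in SERIES])
```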
Query characteristics
User:
• Bursty and unpredictable
• Latency sensitive - ideal end user response is 100ms, 1s at most
• Skews to recent data, but want same latency on old data
Query characteristics
Dashboards:
• Predictable
• Important enough to save
• Looking for step-function changes, e.g. performance regressions, changes in usage
Focus on outputs
These graphs are both aggregating 70k series.
That is not a lot, but the output is still 10x to 2000x smaller than the input!
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking?
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
Pipeline Architecture
[Diagram components: Metrics sources, Intake, Streaming Aggregator, Data Stores (multiple), Query System, Query Cache, Web frontend & APIs, Customer Browser, Customer, Monitors and Alerts, Slack/Email/PagerDuty etc]
[Diagram annotations: "Aggregation Points", "No one's looking here!"]
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
59
![Page 60: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/60.jpg)
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
60
![Page 61: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/61.jpg)
Distributions
61
Aggregate by percentile or SLO (count of values above or below a threshold)
Ex: Latency, request size
![Page 62: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/62.jpg)
Calculating distributions
62
{0, 1, 0, 1, 0, 1, 0, 1, 0, 1}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{5, 5, 5, 5, 5, 5, 5, 5, 5, 5}
{0, 2, 4, 8, 16, 32, 64, 128, 256, 512}
Time
Space
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
{0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 8, 9, 16, 32, 64, 128, 256, 512}
p90
p50
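As a rough illustration of the slide above, the sketch below merges the four example series (aggregating across both space and time) and reads percentiles off the sorted result. The nearest-rank percentile definition and the helper name `percentile` are assumptions made for this example; the slides do not say which definition is used.

```python
import math

# The four example series from the slide: one per source ("space"),
# each holding ten points t0..t9 ("time").
series = [
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    [0, 2, 4, 8, 16, 32, 64, 128, 256, 512],
]

def percentile(sorted_values, pct):
    """Nearest-rank percentile (assumed definition): value at rank ceil(pct/100 * n)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Exact percentiles need every raw value, sorted.
merged = sorted(v for s in series for v in s)
p50 = percentile(merged, 50)
p90 = percentile(merged, 90)
print(len(merged), p50, p90)
```

Note that this exact approach has to keep and sort all 40 raw points; at larger scale that cost is what motivates an approximate structure.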
![Page 63: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/63.jpg)
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper again?
63
![Page 64: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/64.jpg)
What are "sketches"?
64
Data structures designed for operating on streams of data
• Examine each item a limited number of times (ideally once)
• Limited memory usage (logarithmic in the size of the stream, or fixed)
Max size
![Page 65: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/65.jpg)
Examples of sketches
HyperLogLog
• Cardinality / unique count estimation
• Used in Redis PFADD, PFCOUNT, PFMERGE
Others: Bloom filters (also for set membership), frequency sketches (top-N lists)
65
![Page 66: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/66.jpg)
Tradeoffs
Understand the tradeoffs - speed, accuracy, space
What other characteristics do you need?
• Well-defined or arbitrary range of inputs?
• What kinds of queries are you answering?
66
![Page 67: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/67.jpg)
Approximation for distribution metrics
What's important for approximating distribution metrics?
• Bounded error
• Performance - size, speed of inserts
• Aggregation (aka "merging")
67
![Page 68: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/68.jpg)
How do you compress a distribution?
68
![Page 69: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/69.jpg)
Histograms
Basic example from OpenMetrics / Prometheus
69
![Page 70: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/70.jpg)
Histograms
Basic example from OpenMetrics / Prometheus
70
Time spent        Count (cumulative)
<= 0.05 (50ms)    24054
<= 0.1 (100ms)    33444
<= 0.2 (200ms)    100392
<= 0.5 (500ms)    129389
<= 1s             133988
> 1s (total)      144320
median = ~158ms (using linear interpolation; the median rank is 72160 of 144320)
p99 = ?!
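The arithmetic behind the ~158ms median, and the reason p99 cannot be answered, can be sketched as follows. This is a minimal illustration using the bucket counts from the slide, not Prometheus's own implementation; the function name `histogram_quantile_sketch`, the assumed lower edge of 0 for the first bucket, and returning `None` for the open-ended bucket are all choices made for this example.

```python
import math

# Cumulative bucket counts from the slide: (upper bound in seconds,
# observations <= bound). The last bucket has no upper bound.
buckets = [
    (0.05, 24054),
    (0.1, 33444),
    (0.2, 100392),
    (0.5, 129389),
    (1.0, 133988),
    (math.inf, 144320),
]

def histogram_quantile_sketch(q, buckets):
    """Linear interpolation inside the bucket that contains the target rank.

    Returns None when the rank falls in the unbounded last bucket, because
    there is no upper edge to interpolate towards.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0  # assumption: values start at 0
    for upper_bound, count in buckets:
        if rank <= count:
            if math.isinf(upper_bound):
                return None
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count
    return None

median = histogram_quantile_sketch(0.5, buckets)
p99 = histogram_quantile_sketch(0.99, buckets)
print(f"median ~ {median * 1000:.0f} ms, p99 = {p99}")
```

The median rank (72160) lands in the 100ms to 200ms bucket, so interpolation gives a usable estimate. The p99 rank lands above the 1s boundary, where fixed bucket edges give no information about how large the values actually are.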
![Page 71: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/71.jpg)
Rank and relative error
71
![Page 72: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/72.jpg)
Rank and relative error
72
![Page 73: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/73.jpg)
Relative error
In metrics, and latency metrics in particular, we care about both the distribution of the data and specific values
E.g., for an SLO, I want to know: is my p99 500ms or less?
Relative error bounds mean we can answer this: yes, 99% of requests are <= 500ms, +/- 1%
Otherwise stated: 99% of requests are guaranteed <= 505ms
73
![Page 74: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/74.jpg)
Fast insertion
Each insertion is just two operations - find the bucket, increase the count (sometimes there's an allocation)
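A toy version of the two-step insertion can be sketched as below. It uses logarithmically sized buckets with gamma = (1 + alpha) / (1 - alpha), the scheme described in the published DDSketch work, but the class `LogBucketSketch` and its method names are hypothetical and this is not Datadog's implementation. It handles positive values only.

```python
import math
from collections import defaultdict

class LogBucketSketch:
    """Toy quantile sketch with logarithmically sized buckets."""

    def __init__(self, relative_accuracy=0.01):
        self.alpha = relative_accuracy
        self.gamma = (1 + self.alpha) / (1 - self.alpha)
        self.log_gamma = math.log(self.gamma)
        self.counts = defaultdict(int)  # bucket index -> count
        self.total = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this toy sketch only accepts positive values")
        # Operation 1: find the bucket. Bucket i covers (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value) / self.log_gamma)
        # Operation 2: increase the count (the dict allocates new buckets lazily).
        self.counts[index] += 1
        self.total += 1

    def quantile(self, q):
        if self.total == 0:
            raise ValueError("empty sketch")
        # Walk buckets in index order until the cumulative count passes the rank.
        rank = q * (self.total - 1)
        seen = 0
        for index in sorted(self.counts):
            seen += self.counts[index]
            if seen > rank:
                # Representative value within relative distance alpha of
                # every point in the bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)

sketch = LogBucketSketch(relative_accuracy=0.01)
for latency_ms in range(1, 1001):
    sketch.add(latency_ms)
p99 = sketch.quantile(0.99)
print(round(p99, 1), len(sketch.counts))
```

Because bucket widths grow with the value, the estimate stays within the chosen relative accuracy at any magnitude, while the number of buckets grows only with the logarithm of the value range.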
74
![Page 75: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/75.jpg)
Fixed Size - how?
With certain distributions, we may reach the maximum number of buckets (in our case, 4000)
• Roll up lower buckets - lower percentiles are generally not as interesting!*
*Note that we've yet to find a data set that actually needs this in practice
75
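One way to picture "rolling up lower buckets" is shown below. This is a sketch under assumptions: buckets are a plain dict from bucket index to count, higher indexes hold larger values, and the helper name `collapse_lowest` is hypothetical. The example uses a limit of 4 buckets for readability rather than the 4000 mentioned on the slide.

```python
def collapse_lowest(counts, max_buckets):
    """Fold the lowest-indexed buckets together until at most max_buckets remain.

    The total count is preserved; only resolution at the low end is lost.
    """
    indexes = sorted(counts)
    excess = len(indexes) - max_buckets
    if excess <= 0:
        return dict(counts)
    # Everything up to and including indexes[excess] lands in one bucket.
    keep_from = indexes[excess]
    collapsed = {i: c for i, c in counts.items() if i > keep_from}
    collapsed[keep_from] = sum(c for i, c in counts.items() if i <= keep_from)
    return collapsed

counts = {1: 4, 2: 1, 3: 7, 8: 2, 9: 5, 10: 3}
small = collapse_lowest(counts, max_buckets=4)
print(small)
```

The upper buckets are untouched, so high percentiles such as p99 keep their accuracy; only the lowest percentiles become coarser.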
![Page 76: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/76.jpg)
Aggregation and merging
76
"a binary operation is commutative if changing the order of the operands does not change the result"
Why is this important?
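A small example may help answer the question. If a sketch is a set of bucket counts, merging is per-bucket addition, which is commutative (and also associative), so partial results can be combined in any order and grouping. The sketch below assumes bucket counts held in a `Counter`; the `merge` helper and the host names are hypothetical.

```python
from collections import Counter

def merge(a, b):
    """Merge two bucketed sketches by adding counts bucket by bucket."""
    return Counter(a) + Counter(b)

# Bucket index -> count, as reported by three independent sources.
host_a = Counter({10: 3, 11: 1})
host_b = Counter({11: 4, 20: 2})
host_c = Counter({9: 1, 20: 5})

ab = merge(host_a, host_b)
ba = merge(host_b, host_a)
left = merge(merge(host_a, host_b), host_c)
right = merge(host_a, merge(host_b, host_c))
print(ab == ba, left == right)
```

Since order and grouping do not matter, the same sketches can be merged wherever it is convenient in a pipeline without changing the final answer.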
![Page 77: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/77.jpg)
Talk Plan
1. What Are Metrics Databases?
2. Our Architecture
3. Deep Dive On Our Datastores
4. Handling Synchronization
5. Introducing Aggregation
6. Aggregation For Deeper Insights Using Sketches
7. Sketches Enabling Flexible Architecture
77
![Page 78: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/78.jpg)
Before, during, save for later
If we have two-way mergeable sketches, we can re-aggregate the aggregations
• Agent
• Streaming during ingestion
• At query time
• In the data store (saving partial results)
78
![Page 79: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/79.jpg)
Pipeline Architecture
79
[Pipeline diagram. Components: Metrics sources, Intake, Streaming Aggregator, Data Stores, Query System, Query Cache, Web frontend & APIs, Customer Browser, Customer, Monitors and Alerts, Slack/Email/PagerDuty etc. Legend: Aggregation Points]
![Page 80: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/80.jpg)
DDSketch
DDSketch (Distributed Distribution Sketch) is open source (part of the agent today)
• Presenting at VLDB2019 in August
• Open-sourcing standalone versions in several languages
80
![Page 81: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/81.jpg)
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies
81
![Page 82: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/82.jpg)
Performance mantras
• Don't do it - build the minimal synchronization needed
• Do it, but don't do it again - query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies, and use compression techniques based on what customers really need
82
![Page 83: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/83.jpg)
Summary
• Don't do it - build the bare minimum synchronization needed
• Do it, but don't do it again - use query caching
• Do it less - only index what you need
• Do it later - minimize processing on path to persistence
• Do it when they're not looking - pre-aggregate where it is cost-effective
• Do it concurrently - use independent horizontally scalable data stores
• Do it cheaper - use hybrid data storage types and technologies, and use compression techniques based on what customers really need
83
![Page 84: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/84.jpg)
Thank You
![Page 85: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/85.jpg)
Challenges and opportunities of aggregation
• Challenges:
  • Accuracy
  • Latency
• Opportunity:
  • Orders of magnitude performance improvement on common and highly visible queries
85
![Page 86: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/86.jpg)
Human factors and dashboards
86
• Human-latency sensitive - high visibility
Late-arriving data makes people nervous
• Human granularity - how many lines can you reason about on a dashboard?
Oh no...
![Page 87: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/87.jpg)
Where aggregation happens
87
At the metric source (agent/lambda/etc)
• Counts by sum
• Gauges by last
At query time
• Arbitrary user selection (avg/sum/min/max)
• Impacts user experience
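The two aggregation points listed above can be sketched roughly as follows. This is an illustration only: the function names, the `(timestamp, value)` point format and the idea of a single flush interval are assumptions for the example, not a description of the agent's actual code.

```python
def aggregate_at_source(points, metric_type):
    """Collapse the points seen in one flush interval into a single value.

    points is a list of (timestamp, value) pairs.
    """
    if not points:
        return None
    if metric_type == "count":
        return sum(value for _, value in points)  # counts aggregate by sum
    if metric_type == "gauge":
        return max(points, key=lambda p: p[0])[1]  # gauges aggregate by last
    raise ValueError(f"unsupported metric type: {metric_type}")

# At query time the user picks the aggregator.
QUERY_AGGREGATORS = {
    "avg": lambda xs: sum(xs) / len(xs),
    "sum": sum,
    "min": min,
    "max": max,
}

def aggregate_at_query(values, how):
    """Combine per-source values with the aggregator the user selected."""
    return QUERY_AGGREGATORS[how](values)

requests = [(0, 3), (1, 5), (2, 2)]
latency = [(0, 120.0), (1, 950.0), (2, 110.0)]
count_value = aggregate_at_source(requests, "count")
gauge_value = aggregate_at_source(latency, "gauge")
query_value = aggregate_at_query([10, 20, 60], "avg")
print(count_value, gauge_value, query_value)
```

In the gauge example the 950 ms spike disappears, because only the last value in the interval survives. That loss is acceptable for a level such as queue depth but not for latency.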
![Page 88: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/88.jpg)
Adding a new metric type
Counters, gauges, distributions!
We used gauges for latency, etc., but aggregating by last is not what you want
Need to update the agent, libraries, integrations
We're learning and building on what we have today
88
![Page 89: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/89.jpg)
Building blocks
We have a way to move data around (Kafka)
We have ways to index that data (tagsets)
We know how to separate recent and historical data
Plan for the future
[Lego / puzzle with gaps]
89
![Page 90: Database for Trillions of Datadog: A Real-Time Metrics Points/Day · Cloud Storage Characteristics 31 Type Max Capacity Bandwidth Latency Cost/TB for 1 month Volatility DRAM1 4 TB](https://reader030.fdocuments.us/reader030/viewer/2022040613/5f05ef7c7e708231d415760a/html5/thumbnails/90.jpg)
Connect the dots
90