DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Transcript of DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Parquet at Datadog: How we use Parquet for tons of metrics data
Doug Daniels, Director of Engineering
Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production
Datadog is a monitoring service for large scale cloud applications
Collect Everything
Integrations for 100+ components
Monitor Everything
Alert on Critical Issues. Collaborate to Fix Them Together.
Monitor Everything
We collect a lot of data
We collect a lot of data…
the biggest and most important of which is
Metric timeseries data
timestamp 1447020511
metric system.cpu.idle
value 98.16687
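A point like the one above can be modeled as a tiny record type. This is an illustrative sketch only; the field names mirror the slide, not Datadog's actual internal schema:

```python
from dataclasses import dataclass

# Illustrative model of one metric timeseries point (names from the slide;
# this is not Datadog's real schema).
@dataclass(frozen=True)
class MetricPoint:
    timestamp: int   # Unix epoch seconds
    metric: str      # metric name, e.g. "system.cpu.idle"
    value: float     # observed value

point = MetricPoint(timestamp=1447020511,
                    metric="system.cpu.idle",
                    value=98.16687)
print(point)
```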
We collect hundreds of billions of these per day… and growing every week
And we do massive computation on them
• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing
One size does not fit all.
We want the best framework for each job:
• ETL and aggregation: Pig / Hive
• ML and iterative algorithms: Spark
• Interactive SQL: Presto
How do we do that without duplicating data storage, writing redundant glue code, or copying data definitions and schema?
1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere
2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community
Parquet is a column-oriented data storage format
What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community
Quick Parquet primer: file layout

Row Group 0
  Column A: Page 0, Page 1, Page 2
  Column B: Page 0, Page 1
…
Footer
  File Meta Data
    Row Group 0 Metadata
      Column A Metadata
      Column B Metadata
      …
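The layout above can be sketched in plain Python: within a row group, each column's values are stored contiguously, so a reader can fetch only the columns a query needs. This is illustrative pivoting of rows into column chunks, not the actual Parquet binary format:

```python
# Rows as a query engine sees them (values from the earlier slide).
rows = [
    {"timestamp": 1447020511, "metric": "system.cpu.idle", "value": 98.16687},
    {"timestamp": 1447020521, "metric": "system.cpu.idle", "value": 97.50123},
    {"timestamp": 1447020531, "metric": "system.cpu.idle", "value": 98.90001},
]

# A row group stores one contiguous chunk per column ("column chunks").
row_group = {col: [r[col] for r in rows] for col in rows[0]}

# A query like avg(value) only needs to read this one column chunk.
print(row_group["value"])
```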
Efficient storage and fast reads
• Space efficiencies (per page)
  • Type-specific encodings: run-length, delta, …
  • Compression
• Query efficiencies (support varies by framework)
  • Projection pushdown (skip columns)
  • Predicate pushdown (skip row groups)
  • Vectorized read (many rows at a time)
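Run-length encoding, one of the per-page encodings listed above, can be sketched in a few lines. Parquet's real RLE is a bit-packed hybrid; this only shows the core idea of collapsing runs of repeated values, which is why low-cardinality columns compress so well:

```python
from itertools import groupby

# Minimal run-length encoding sketch: each run of equal values becomes
# a (value, run_length) pair. Illustrative, not Parquet's actual codec.
def rle_encode(values):
    return [(v, len(list(run))) for v, run in groupby(values)]

# A repeated metric-name column collapses from 8 values to 2 pairs:
column = ["system.cpu.idle"] * 5 + ["system.cpu.user"] * 3
print(rle_encode(column))
# → [('system.cpu.idle', 5), ('system.cpu.user', 3)]
```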
Broad ecosystem support
Our Parquet pipeline
• Kafka → Go workers: buffer, sort, dedupe, upload (csv-gz to Amazon S3)
• Luigi/Pig: partition, write Parquet to S3, update Metastore
• Metadata: Hive Metastore
• Readers: Hadoop, Spark, Presto (via EMRFS / PrestoS3FileSystem)
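The buffer/sort/dedupe step the Go workers perform before uploading can be sketched as follows. This is illustrative, assuming a point is uniquely identified by (timestamp, metric); it is not Datadog's actual implementation:

```python
# Sketch of the buffer -> sort -> dedupe stage before upload.
# Assumption (not from the slides): (timestamp, metric) identifies a point.
def sort_and_dedupe(points):
    seen = set()
    out = []
    for p in sorted(points, key=lambda p: (p["timestamp"], p["metric"])):
        key = (p["timestamp"], p["metric"])
        if key not in seen:   # drop duplicate submissions of the same point
            seen.add(key)
            out.append(p)
    return out

batch = [
    {"timestamp": 2, "metric": "cpu", "value": 1.0},
    {"timestamp": 1, "metric": "cpu", "value": 0.5},
    {"timestamp": 2, "metric": "cpu", "value": 1.0},  # duplicate
]
print(sort_and_dedupe(batch))
```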
What we see in production
Excellent storage efficiency. For just 5 columns:
• 3.5X less storage than gz-compressed CSV
• 2.5X less than our internal query-optimized columnar format
…a little too efficient
• One 80 MB Parquet file with 160M rows per row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
Slower read performance with AvroParquet
[Chart: runtime for our test job in minutes (0 to 40 min), comparing CSV + gz, AvroParquet + gz, AvroParquet + snappy, and Parquet + gz]
• Tried reading schema with AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + Hive metastore for schema in production
Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema metadata: Pig (we actually use the Hive metastore)
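The settings above correspond to parquet-mr writer properties. A Pig fragment applying them might look like the following; the property names come from parquet-mr's ParquetOutputFormat and the storer class path may differ by parquet-pig version, so treat this as a sketch:

```
-- Hypothetical Pig fragment matching the slide's values (128 MB row
-- groups aligned to the dfs block size, 1 MB pages, gzip compression).
SET parquet.block.size 134217728;
SET dfs.blocksize 134217728;
SET parquet.page.size 1048576;
SET parquet.compression gzip;

STORE metrics INTO 's3://bucket/metrics' USING org.apache.parquet.pig.ParquetStorer;
```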
Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?
DM me @ddaniels888 or [email protected]