DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Transcript of DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Parquet at Datadog: How we use Parquet for tons of metrics data
Doug Daniels, Director of Engineering
Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production
Datadog is a monitoring service for large scale cloud applications
Collect Everything
Integrations for 100+ components
Monitor Everything
Alert on Critical Issues. Collaborate to Fix Them Together.
Monitor Everything
We collect a lot of data
We collect a lot of data…
the biggest and most important of which is
Metric timeseries data
timestamp 1447020511
metric system.cpu.idle
value 98.16687
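A point like the one above can be modeled as a tiny record type. This is an illustrative sketch only; the field names mirror the slide, not Datadog's actual internal schema:

```python
from dataclasses import dataclass

# Illustrative model of one metric timeseries point (names from the slide;
# this is not Datadog's real schema).
@dataclass(frozen=True)
class MetricPoint:
    timestamp: int   # Unix epoch seconds
    metric: str      # metric name, e.g. "system.cpu.idle"
    value: float     # observed value

point = MetricPoint(timestamp=1447020511,
                    metric="system.cpu.idle",
                    value=98.16687)
print(point)
```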
We collect hundreds of billions of these per day… and growing every week
And we do massive computation on them
• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing
One size does not fit all.
We want the best framework for each job:
• ETL and aggregation: Pig / Hive
• ML and iterative algorithms: Spark
• Interactive SQL: Presto
How do we do that without duplicating data storage, writing redundant glue code, or copying data definitions and schema?
1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere
2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community
Parquet is a column-oriented data storage format
What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community
Quick Parquet primer: file layout

Row Group 0
  Column A: Page 0, Page 1, Page 2
  Column B: Page 0, Page 1
…
Footer
  File Meta Data
    Row Group 0 Metadata
      Column A Metadata
      Column B Metadata
      …
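The layout above can be sketched in plain Python: within a row group, each column's values are stored contiguously, so a reader can fetch only the columns a query needs. This is illustrative pivoting of rows into column chunks, not the actual Parquet binary format:

```python
# Rows as a query engine sees them (values from the earlier slide).
rows = [
    {"timestamp": 1447020511, "metric": "system.cpu.idle", "value": 98.16687},
    {"timestamp": 1447020521, "metric": "system.cpu.idle", "value": 97.50123},
    {"timestamp": 1447020531, "metric": "system.cpu.idle", "value": 98.90001},
]

# A row group stores one contiguous chunk per column ("column chunks").
row_group = {col: [r[col] for r in rows] for col in rows[0]}

# A query like avg(value) only needs to read this one column chunk.
print(row_group["value"])
```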
Efficient storage and fast reads
• Space efficiencies (per page)
  • Type-specific encodings: run-length, delta, …
  • Compression
• Query efficiencies (support varies by framework)
  • Projection pushdown (skip columns)
  • Predicate pushdown (skip row groups)
  • Vectorized read (many rows at a time)
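Run-length encoding, one of the per-page encodings listed above, can be sketched in a few lines. Parquet's real RLE is a bit-packed hybrid; this only shows the core idea of collapsing runs of repeated values, which is why low-cardinality columns compress so well:

```python
from itertools import groupby

# Minimal run-length encoding sketch: each run of equal values becomes
# a (value, run_length) pair. Illustrative, not Parquet's actual codec.
def rle_encode(values):
    return [(v, len(list(run))) for v, run in groupby(values)]

# A repeated metric-name column collapses from 8 values to 2 pairs:
column = ["system.cpu.idle"] * 5 + ["system.cpu.user"] * 3
print(rle_encode(column))
# → [('system.cpu.idle', 5), ('system.cpu.user', 3)]
```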
Broad ecosystem support
Our Parquet pipeline
• Kafka → Go workers: buffer, sort, dedupe, upload (csv-gz to Amazon S3)
• Luigi/Pig: partition, write Parquet to S3, update Metastore
• Metadata: Hive Metastore
• Readers: Hadoop, Spark, Presto (via EMRFS / PrestoS3FileSystem)
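The buffer/sort/dedupe step the Go workers perform before uploading can be sketched as follows. This is illustrative, assuming a point is uniquely identified by (timestamp, metric); it is not Datadog's actual implementation:

```python
# Sketch of the buffer -> sort -> dedupe stage before upload.
# Assumption (not from the slides): (timestamp, metric) identifies a point.
def sort_and_dedupe(points):
    seen = set()
    out = []
    for p in sorted(points, key=lambda p: (p["timestamp"], p["metric"])):
        key = (p["timestamp"], p["metric"])
        if key not in seen:   # drop duplicate submissions of the same point
            seen.add(key)
            out.append(p)
    return out

batch = [
    {"timestamp": 2, "metric": "cpu", "value": 1.0},
    {"timestamp": 1, "metric": "cpu", "value": 0.5},
    {"timestamp": 2, "metric": "cpu", "value": 1.0},  # duplicate
]
print(sort_and_dedupe(batch))
```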
What we see in production
Excellent storage efficiency. For just 5 columns:
• 3.5X less storage than gz-compressed CSV
• 2.5X less than our internal query-optimized columnar format
…a little too efficient
• One 80 MB Parquet file with 160M rows per row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
Slower read performance with AvroParquet
[Chart: runtime for our test job in minutes (0 to 40 min), comparing CSV + gz, AvroParquet + gz, AvroParquet + snappy, and Parquet + gz]
• Tried reading schema with AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + Hive metastore for schema in production
Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema metadata: Pig (we actually use the Hive metastore)
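The settings above correspond to parquet-mr writer properties. A Pig fragment applying them might look like the following; the property names come from parquet-mr's ParquetOutputFormat and the storer class path may differ by parquet-pig version, so treat this as a sketch:

```
-- Hypothetical Pig fragment matching the slide's values (128 MB row
-- groups aligned to the dfs block size, 1 MB pages, gzip compression).
SET parquet.block.size 134217728;
SET dfs.blocksize 134217728;
SET parquet.page.size 1048576;
SET parquet.compression gzip;

STORE metrics INTO 's3://bucket/metrics' USING org.apache.parquet.pig.ParquetStorer;
```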
Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?
DM me @ddaniels888 or [email protected]