AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Eva Tse, Director, Big Data Services, Netflix

Kurt Brown, Director, Data Platform, Netflix

November 29, 2016

Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem

BDM306

What to Expect from the Session

How we use Amazon S3 as our centralized data hub

Our big data ecosystems on AWS

- Big data processing engines

- Architecture

- Tools and services

Why Amazon S3?


‘The only valuable thing is intuition.’

– Albert Einstein

Why is it Intuitive?

It is a cloud native service! (free engineering)

‘Practically infinitely’ scalable

99.999999999% durable

99.99% available

Decouple compute and storage

Why is it Counterintuitive?

Eventual consistency?

Performance?

Our Data Hub Scale

60+ PB

1.5+ billion objects

[Diagram] Data hub (60+ PB):

- Ingest: 100+ TB daily

- ETL processing: read ~3.5 PB daily, write 500+ TB daily

- Expiration: 400+ TB daily

Data Velocity

Ingest

Event Data Pipeline

Business events

~500 billion events/day

5 min SLA from source to data hub

[Diagram] Cloud apps → Kafka → Amazon S3 in each of Regions 1, 2, and 3; Ursula picks up the event data (via AWS SQS) and lands it in the Data Hub

Dimension Data Pipeline

Stateful data in Cassandra clusters

Extract from tens of Cassandra clusters

Daily or more granular extracts

[Diagram] Cassandra clusters in Regions 1, 2, and 3 → SSTables on AWS S3 → Aegisthus → Data Hub (dimension data)

Transform Data

Data Processing

Our Data Processing Engines

[Diagram] Data Hub (S3) serving Hadoop YARN clusters: ~250–400 r3.4xl and ~3,500 d2.4xl instances

Look at It from a Scalability Angle

1 d2.4xl has 24 TB of local storage

60 PB / 24 TB = 2,560 machines

To achieve 3-way replication for redundancy in one zone, we would need 7,680 machines!

The data we have is beyond what we could fit into our clusters!
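
A quick back-of-the-envelope check of the numbers above (a minimal sketch that only restates the slide's arithmetic):

# Machines needed to hold the data hub on local disk instead of S3.
data_hub_tb = 60 * 1024          # 60 PB expressed in TB (binary units)
disk_per_d2_4xl_tb = 24          # local storage of one d2.4xl

machines_one_copy = data_hub_tb / disk_per_d2_4xl_tb
machines_replicated = machines_one_copy * 3   # 3-way replication in one zone

print(machines_one_copy)    # 2560.0
print(machines_replicated)  # 7680.0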

Tradeoffs

What are the tradeoffs?

Eventual consistency

Performance

Eventual Consistency

Updates (overwrite puts)

- Always put new files with new keys when updating data; then delete the old files (see the sketch below)

List

- We need to know when a listing has missed something

- Keep a prefix manifest in S3mper (or EMRFS)
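
A minimal boto3 sketch of the "new keys on update, then delete old files" pattern described above (the bucket, prefix, and file layout are hypothetical, not Netflix's actual implementation; note that delete_objects accepts at most 1,000 keys per call):

import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-data-hub-bucket'    # hypothetical bucket

def update_partition(prefix, local_files):
    # Remember which keys currently make up the partition.
    pages = s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET, Prefix=prefix)
    old_keys = [obj['Key'] for page in pages for obj in page.get('Contents', [])]

    # 1) Write the new data under brand-new keys -- never overwrite in place,
    #    so readers never hit a stale or half-written object.
    batch = uuid.uuid4().hex
    for i, path in enumerate(local_files):
        s3.upload_file(path, BUCKET, '%s/batch=%s/part-%05d.parquet' % (prefix, batch, i))

    # 2) Only after the new files are fully uploaded, delete the old ones.
    if old_keys:
        s3.delete_objects(Bucket=BUCKET,
                          Delete={'Objects': [{'Key': k} for k in old_keys]})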

Parquet

Majority of our data is in Parquet file format

Supported across Hive, Pig, Presto, Spark

Performance benefits in read (see the sketch below)

- Column projections

- Predicate pushdown

- Vectorized read
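
For illustration, column projection and predicate pushdown when reading Parquet with PyArrow (a generic sketch, not the engines' internal read path; the S3 path and column names are hypothetical, and it assumes PyArrow's S3 filesystem support is available):

import pyarrow.parquet as pq

# Only the three projected columns are fetched, and the country filter is
# pushed down so non-matching row groups can be skipped entirely.
table = pq.read_table(
    's3://my-bucket/warehouse/playback_summary/dateint=20161129/',
    columns=['account_id', 'title_id', 'view_seconds'],
    filters=[('country', '=', 'US')],
)
print(table.num_rows)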

Performance Impact

Read

- Penalty: Throughput and latency

- Impact depends on amount of data read

- Improvement: I/O manager in Parquet

Write

- Penalty: Writing to local disk before upload to S3

- Improvement: Direct write via multi-part uploads

Performance Impact

List

- Penalty: List thousands of partitions for split calculation

- Each partition is an S3 prefix

- Improvement: Track files instead of prefixes
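
A toy contrast between the two approaches (the manifest structure below is hypothetical; in practice the file list is recorded by the metastore tooling at write time):

# (a) Prefix-based split calculation: one LIST per partition prefix,
#     i.e. thousands of S3 round trips for a table with thousands of partitions.
def files_by_listing(s3, bucket, partition_prefixes):
    for prefix in partition_prefixes:
        pages = s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=prefix)
        for page in pages:
            for obj in page.get('Contents', []):
                yield obj['Key']

# (b) File-based split calculation: the files per partition were recorded at
#     write time, so no LIST calls are needed at query-planning time.
def files_from_manifest(manifest):
    # manifest: dict of {partition_name: [list of S3 keys]}
    for keys in manifest.values():
        yield from keys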

Performance Impact – some good news

ETL jobs:

- Mostly CPU bound, not network bound

- Performance converges w/ volume and complexity

Interactive queries:

- % impact is higher … but they run fast

Benefits still outweigh the cost!

Job and Cluster Mgmt Service


For Users

Should I run my job on my laptop?

Where can I find the right version of the tools?

Which cluster should I run my high-priority ETL job on?

Where can I see all my jobs that ran yesterday?


For Admins

How do I manage different versions of tools in different clusters?

How can I upgrade/swap the clusters with no downtime to users?

Genie – Job and Cluster Mgmt Service

Users:

- Discovery: find the right cluster to run the jobs

- Gateway: to run different kinds of jobs (see the job submission sketch below)

- Orchestration: and the one place to find all jobs!

Admins:

- Config mgmt: multiple versions of multiple tools

- Deployment: cluster swap/updates with no downtime
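
To make the "gateway by tags, not by cluster" idea concrete, a hedged sketch of submitting a Hive job through Genie's REST-style job API (the host, tags, script location, and exact field names are assumptions for illustration, modeled on the open-source Genie 3 API):

import requests

# Describe *what* the job needs (tags) rather than *which* cluster to use;
# Genie matches the criteria to a cluster and command, runs the job, and
# archives its output.
job_request = {
    'name': 'daily_playback_summary',
    'user': 'etl',
    'version': '1.0',
    'commandArgs': '-f s3://my-bucket/scripts/daily_playback_summary.hql',
    'clusterCriterias': [{'tags': ['sched:sla', 'type:yarn']}],
    'commandCriteria': ['type:hive'],
}
resp = requests.post('https://genie.example.com/api/v3/jobs', json=job_request)
resp.raise_for_status()
print('job id:', resp.headers['Location'].rsplit('/', 1)[-1])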

[Diagram] Genie on Netflix OSS: S3 holds the archived job output, job scripts & jars, and tools & cluster configs

Data Mgmt Services

Metacat

Federated metadata service. A proxy across data sources.

[Diagram] Metacat as a proxy in front of the metastore (Amazon RDS), Amazon Redshift, and the Data Hub

Metacat

Common APIs for our applications and tools. Thrift APIs for interoperability.

Metadata discovery across data sources

Additional business context

- Lifecycle policy (TTL) per table

- Table owner, description, tags

- User-defined custom metrics

Data Lifecycle Management

Janitor tools

- Delete ‘dangling’ data after 60 days

- Delete data obsoleted by ‘data updates’ after 3 days

- Delete partitions based on table TTL

Scaling Deletes from S3

Deletion Service

Centralized service to handle errors, retries, and backoffs of S3 deletes (sketched below)

Cool-down period: actual deletion happens only after a few days

Store history and statistics

Allow easy recovery based on time and tags
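
A minimal sketch of the cool-down plus retry/backoff behavior (purely illustrative; the real deletion service is a standalone system with its own store of history, statistics, and tags):

import time
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client('s3')
COOL_DOWN = timedelta(days=3)     # hypothetical cool-down window

def process_delete_requests(requests_batch):
    # requests_batch: iterable of {'bucket': ..., 'key': ..., 'requested_at': tz-aware datetime}
    now = datetime.now(timezone.utc)
    for req in requests_batch:
        if now - req['requested_at'] < COOL_DOWN:
            continue   # still recoverable: nothing is deleted inside the cool-down window
        for attempt in range(5):
            try:
                s3.delete_object(Bucket=req['bucket'], Key=req['key'])
                break
            except Exception:
                time.sleep(2 ** attempt)   # exponential backoff, then retry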

Backup Strategy

Core S3

Versioned buckets

20 days

Scale

Simplicity

Above and beyond

Other data stores

Heterogeneous cloud platform

CRR (cross-region replication)

Data Accessibility

Data Tracking

Approach

Tell us who you are

User agent (see the sketch below)

S3 access logs

Metrics pipeline

Charlotte
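
One way to "tell us who you are" from a Python client is to stamp a custom user agent on every S3 request so the S3 access logs can attribute traffic to an app and an owner (a minimal sketch; the tag format below is made up):

import boto3
from botocore.config import Config

# Every request made through this client carries the extra user-agent string,
# which then shows up in the S3 access logs and the downstream metrics pipeline.
cfg = Config(user_agent_extra='app/recs-etl owner/alice')   # hypothetical tag format
s3 = boto3.client('s3', config=cfg)
s3.list_objects_v2(Bucket='my-data-hub-bucket', MaxKeys=1)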

Data Cost

Approach

The calculation

Tableau reports

Data Doctor


TTLs

Future: Tie to job cost and leverage S3 Standard-IA / Amazon Glacier?

Best Supporting Actor

Amazon Redshift

Faster, interactive subset of data

Some use, some don’t

Auto-synch (tag-based)

Fast loading!

Backups, restore, & expansion

Druid

Interactive at scale

S3 (source of truth)

S3 for Druid deep storage

Tableau

S3 (source of truth)

Mostly extracts (vs. Direct Connect)

Backups (multi-region)

Big Data Portal

Big Data API (aka Kragle)

import kragle as kg

trans_info = kg.transport.Transporter() \
    .source('metacat://prodhive/default/my_table') \
    .target('metacat://redshift/test/demo_table') \
    .execute()


Next Steps

Add caching?

Storage efficiency

Partner with the S3 team to improve S3 for big data

Take-aways

Amazon S3 = Data hub

Extend and improve as you go

It takes an ecosystem


Thank you!

Remember to complete your evaluations!