AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
Transcript of AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eva Tse, Director, Big Data Services, Netflix
Kurt Brown, Director, Data Platform, Netflix
November 29, 2016
Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem
BDM306
What to Expect from the Session
How we use Amazon S3 as our centralized data hub
Our big data ecosystems on AWS
- Big data processing engines
- Architecture
- Tools and services
Why Amazon S3?
S3
‘The only valuable thing is intuition.’
– Albert Einstein
Why is it Intuitive?
It is a cloud native service! (free engineering)
‘Practically infinitely’ scalable
99.999999999% durable
99.99% available
Decouple compute and storage
Why is it Counterintuitive?
Eventual consistency?
Performance?
Our Data Hub Scale
Data hub: 60+ PB, 1.5 billion+ objects
Ingest: 100+ TB daily
Expiration: 400+ TB daily
ETL processing: read ~3.5 PB daily, write 500+ TB daily
Data Velocity
Ingest
Event Data Pipeline
Business events
~500 billion events/day
5 min SLA from source to data hub
[Diagram: Event data pipeline. Cloud apps in Region 1, Region 2, and Region 3 publish business events to Kafka; Ursula moves the events into the data hub on AWS S3, with AWS SQS for event data.]
Dimension Data Pipeline
Stateful data in Cassandra clusters
Extract from tens of Cassandra clusters
Daily or more granular extracts
[Diagram: Dimension data pipeline. Cassandra clusters in Region 1, Region 2, and Region 3 back up SSTables to AWS S3; Aegisthus reads the SSTables and loads the data into the data hub.]
Transform Data
Data Processing
Our Data Processing Engines
[Diagram: The data hub on S3 feeds Hadoop YARN clusters of ~250-400 r3.4xl nodes and ~3,500 d2.4xl nodes.]
Looking at It from a Scalability Angle
1 d2.4xl has 24 TB of storage
60 PB / 24 TB = 2,560 machines
To achieve 3-way replication for redundancy in one zone,
we would need 7,680 machines!
The data size we have is beyond what we could fit into our
clusters!
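The same arithmetic as a quick Python sketch (numbers taken from the slide above):

# Rough capacity math: machines needed to hold the data hub locally
data_hub_tb = 60 * 1024        # 60+ PB, in TB
node_storage_tb = 24           # storage on one d2.4xl
replication_factor = 3         # 3-way replication for redundancy

nodes = data_hub_tb / node_storage_tb            # 2,560 machines
nodes_replicated = nodes * replication_factor    # 7,680 machines
print(int(nodes), int(nodes_replicated))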
Tradeoffs
What are the tradeoffs?
Eventual consistency
Performance
Eventual Consistency
Updates (overwrite puts)
- Always put new files with new keys when updating data;
then delete old files
List
- We need to know when a listing has missed something
- Keep prefix manifest in S3mper (or EMRFS)
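For illustration only, the "write new keys, then delete the old ones" update pattern might look like this with boto3 (the bucket, prefixes, and object body are made-up placeholders, not Netflix tooling):

import boto3

s3 = boto3.client('s3')
bucket = 'my-data-hub'                               # hypothetical bucket
old_prefix = 'warehouse/my_table/batchid=100/'       # data being replaced
new_prefix = 'warehouse/my_table/batchid=101/'       # brand-new keys for the update

# 1. Write updated data under new keys; never overwrite an existing key,
#    so readers see either the old complete set or the new complete set.
s3.put_object(Bucket=bucket, Key=new_prefix + 'part-00000.parquet', Body=b'...')

# 2. Only after the new data is fully written, delete the old keys.
old = s3.list_objects_v2(Bucket=bucket, Prefix=old_prefix)
for obj in old.get('Contents', []):
    s3.delete_object(Bucket=bucket, Key=obj['Key'])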
Parquet
Majority of our data is in Parquet file format
Supported across Hive, Pig, Presto, Spark
Performance benefits in read
- Column projections
- Predicate pushdown
- Vectorized read
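A small PySpark sketch of those read benefits (the S3 path and column names are illustrative): only the selected columns are fetched from S3 (column projection), and the filter can skip whole row groups using Parquet statistics (predicate pushdown).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('parquet-read-demo').getOrCreate()

df = (spark.read.parquet('s3://my-data-hub/warehouse/playback_events/')  # hypothetical table path
      .select('account_id', 'title_id', 'dateint')        # column projection
      .filter(F.col('dateint') == 20161129))              # predicate pushdown
df.show(10)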
Performance Impact
Read
- Penalty: Throughput and latency
- Impact depends on amount of data read
- Improvement: I/O manager in Parquet
Write
- Penalty: Writing to local disk before upload to S3
- Improvement: Direct write via multi-part uploads
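As a hedged sketch of the direct-write idea using boto3's low-level multipart API (bucket, key, and the chunk generator are placeholders): parts are streamed to S3 as they are produced instead of spooling the whole file to local disk first.

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-data-hub', 'warehouse/my_table/part-00000.parquet'  # placeholders

def output_chunks():
    # Stand-in for a task's output stream; every part except the last must be >= 5 MB.
    yield b'x' * (5 * 1024 * 1024)
    yield b'tail bytes'

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(output_chunks(), start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=chunk)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})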
Performance Impact
List
- Penalty: List thousands of partitions for split calculation
- Each partition is an S3 prefix
- Improvement: Track files instead of prefixes
Performance Impact – some good news
ETL jobs:
- Mostly CPU bound, not network bound
- Performance converges w/ volume and complexity
Interactive queries:
- % impact is higher … but they run fast
Benefits still outweigh the cost!
Job and Cluster Mgmt Service
For Users
Should I run my job on my laptop?
Where can I find the right version of
the tools?
Which cluster should I run my high-priority ETL job on?
Where can I see all my jobs run yesterday?
For Admins
How do I manage different versions
of tools in different clusters?
How can I upgrade/swap the
clusters with no downtime to users?
Genie – Job and Cluster Mgmt Service
Users:
- Discovery: find the right cluster to run the jobs
- Gateway: to run different kinds of jobs
- Orchestration: the one place to find all jobs!
Admins:
- Config mgmt: multiple versions of multiple tools
- Deployment: cluster swap/updates with no downtime
[Diagram: Genie architecture, showing archived job output, job scripts & jars, and tools & clusters configs.]
Genie on Netflix OSS
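To make the gateway idea concrete, submitting a job to a Genie-style service over REST could look roughly like this; the endpoint, tags, and payload fields follow the open-source Genie 3 API as a best guess and are assumptions, not taken from the talk.

import requests

job = {
    'name': 'daily_etl',
    'user': 'etl-user',
    'version': '1.0',
    # Genie resolves the cluster and command from tags, so users never
    # hard-code which cluster their job runs on.
    'clusterCriterias': [{'tags': ['sched:sla']}],   # assumed tag
    'commandCriteria': ['type:spark'],               # assumed tag
    'commandArgs': '--class com.example.DailyEtl etl.jar 20161129',
}
resp = requests.post('https://genie.example.com/api/v3/jobs', json=job)  # assumed endpoint
resp.raise_for_status()
print('submitted:', resp.headers.get('Location'))    # URL of the new job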
Data Mgmt Services
Metacat
Federated metadata service. A proxy across data sources.
[Diagram: Metacat sits between the data hub and data sources such as the Hive metastore, Amazon RDS, and Amazon Redshift.]
Metacat
Common APIs for our applications and tools. Thrift APIs for
interoperability.
Metadata discovery across data sources
Additional business context
- Lifecycle policy (TTL) per table
- Table owner, description, tags
- User-defined custom metrics
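As an illustration of the common-API idea, fetching a table's metadata from a Metacat-style REST endpoint might look like this; the base path and response field names are assumptions, not quoted from the talk.

import requests

BASE = 'https://metacat.example.com/mds/v1'   # assumed REST base path

# Qualified names mirror the catalog/database/table convention.
resp = requests.get(f'{BASE}/catalog/prodhive/database/default/table/my_table')
resp.raise_for_status()
table = resp.json()

# Business context (TTL, owner, tags) lives alongside the schema;
# 'definitionMetadata' and 'lifetime' are illustrative field names.
print(table.get('definitionMetadata', {}).get('lifetime'))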
Data Lifecycle Management
Janitor tools
- Delete ‘dangling’ data after 60 days
- Delete data obsoleted by ‘data updates’ after 3 days
- Delete partitions based on table TTL
Scaling Deletes from S3
Deletion Service
Centralized service to handle errors, retries, and backoffs
of S3 deletes
Cool-down period to delete after a few days
Store history and statistics
Allow easy recovery based on time and tags
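A minimal sketch of the cool-down-plus-retry behavior (this is not the actual service; names and limits are made up):

import time
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
COOL_DOWN = timedelta(days=3)   # wait a few days before a delete becomes eligible

def process_delete_request(bucket, key, requested_at, max_attempts=5):
    """Delete one object after its cool-down, retrying S3 errors with backoff."""
    if datetime.now(timezone.utc) - requested_at < COOL_DOWN:
        return 'pending'                      # still recoverable; try again later
    for attempt in range(max_attempts):
        try:
            s3.delete_object(Bucket=bucket, Key=key)
            return 'deleted'                  # record history and statistics here
        except ClientError:
            time.sleep(2 ** attempt)          # exponential backoff
    return 'failed'                           # park for manual inspection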
Backup Strategy
Core S3: versioned buckets, 20 days
Scale and simplicity
Above and beyond: other data stores, heterogeneous cloud platform, CRR (cross-region replication)
Data Accessibility
Data Tracking
Approach
Tell us who you are
User agent
S3 access logs
Metrics pipeline
Charlotte
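One concrete piece of "tell us who you are" is stamping every client with a descriptive user agent so S3 access logs can be attributed; with boto3 that can be done as below (the tag format is made up).

import boto3
from botocore.config import Config

# The extra user-agent string shows up in S3 access logs, so access can be
# joined back to a team or application in the metrics pipeline.
cfg = Config(user_agent_extra='team=data-platform;app=daily-etl')   # hypothetical format
s3 = boto3.client('s3', config=cfg)

s3.list_objects_v2(Bucket='my-data-hub', Prefix='warehouse/')        # logged with the tag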
Data Cost
Approach
The calculation
Tableau reports
Data Doctor
TTLs
Future: Tie to job cost and leverage S3 Standard-IA / Amazon Glacier?
Best Supporting Actor
Amazon Redshift
Faster, interactive subset of data
Some use, some don’t
Auto-sync (tag-based)
Fast loading!
Backups, restore, & expansion
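Fast loading here means Redshift's parallel COPY straight from S3; a sketch of what that looks like from Python (endpoint, credentials, table, S3 path, and IAM role are all placeholders):

import psycopg2

conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',  # placeholder
                        port=5439, dbname='analytics', user='loader', password='...')

# COPY reads the S3 prefix in parallel across the cluster's slices.
copy_sql = """
    COPY test.demo_table
    FROM 's3://my-data-hub/warehouse/demo_table/part-'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV GZIP;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)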
Druid
Interactive at scale
S3 (source of truth)
S3 for Druid deep storage
Tableau
S3 (source of truth)
Mostly extracts (vs. Direct Connect)
Backups (multi-region)
Big Data Portal
Big Data API (aka Kragle)
# Move a table from the Hive warehouse into Redshift via the big data API
import kragle as kg

trans_info = kg.transport.Transporter() \
    .source('metacat://prodhive/default/my_table') \
    .target('metacat://redshift/test/demo_table') \
    .execute()
S3
Next Steps
Add caching?
Storage efficiency
Partner with the S3 team to
improve S3 for big data
Take-aways
Amazon S3 = Data hub
Extend and improve as you go
It takes an ecosystem
S3
Thank you!
Remember to complete
your evaluations!