AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)


Transcript of AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)

Page 1

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Eva Tse, Director, Big Data Services, Netflix

Kurt Brown, Director, Data Platform, Netflix

November 29, 2016

Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem

BDM306

Page 2

What to Expect from the Session

How we use Amazon S3 as our centralized data hub

Our big data ecosystems on AWS

- Big data processing engines

- Architecture

- Tools and services

Page 3

Why Amazon S3?

Page 4

S3

Page 5

‘The only valuable thing is intuition.’

– Albert Einstein

Page 6

Why is it Intuitive?

It is a cloud-native service! (free engineering)

‘Practically infinitely’ scalable

99.999999999% durable

99.99% available

Decouple compute and storage

Page 7

Why is it Counterintuitive?

Eventual consistency?

Performance?

Page 8

Our Data Hub Scale

60+ PB

1.5+ billion objects

Page 9

Data Velocity

- Data hub: 60+ PB

- Ingest: 100+ TB daily

- Expiration: 400+ TB daily

- ETL processing: read ~3.5 PB daily, write 500+ TB daily

Page 10

Ingest

Page 11

Event Data Pipeline

Business events

~500 billion events/day

5 min SLA from source to data hub

Page 12

[Diagram: Event data pipeline. In each of three regions (Region 1, 2, 3), cloud apps publish to Kafka, which lands event data on AWS S3; Ursula (with AWS SQS) moves it into the data hub.]

Page 13

Dimension Data Pipeline

Stateful data in Cassandra clusters

Extract from tens of Cassandra clusters

Daily or more granular extracts

Page 14

[Diagram: Dimension data pipeline. In each of three regions (Region 1, 2, 3), Cassandra clusters land SSTables on AWS S3; Aegisthus processes them into the data hub.]

Page 15

Transform

Page 16

Data

Page 17

Data Processing

Page 18

Our Data Processing Engines

[Diagram: Data hub feeding Hadoop YARN clusters (~250-400 r3.4xl, ~3,500 d2.4xl)]

Page 19

Look at It from a Scalability Angle

1 d2.4xl has 24 TB of storage

60 PB / 24 TB = 2,560 machines

To achieve 3-way replication for redundancy in one zone, we would need 7,680 machines!

The data we have is beyond what we could fit into our clusters!

Page 20

Tradeoffs

Page 21

What are the tradeoffs?

Eventual consistency

Performance

Page 22

Eventual Consistency

Updates (overwrite PUTs)

- Always put new files under new keys when updating data; then delete the old files (see the sketch below)

List

- We need to know when a listing has missed something

- Keep a prefix manifest in S3mper (or EMRFS)
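
A minimal sketch (not Netflix's code) of the "new keys on update" pattern using boto3; the bucket name and key layout are hypothetical. Writing under a never-before-used key sidesteps eventual consistency on overwrite PUTs, and the old files are deleted only after the new ones land:

import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = 'example-data-hub'  # hypothetical bucket

def update_partition(prefix, local_files):
    # 1. Write the new data under keys that have never existed.
    batch = uuid.uuid4().hex
    for i, path in enumerate(local_files):
        s3.upload_file(path, BUCKET, '%s/%s-part-%d.parquet' % (prefix, batch, i))
    # 2. Only then delete the files the new batch replaces.
    listed = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix + '/')
    for obj in listed.get('Contents', []):
        if batch not in obj['Key']:
            s3.delete_object(Bucket=BUCKET, Key=obj['Key'])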

Page 23

Parquet

Majority of our data is in Parquet file format

Supported across Hive, Pig, Presto, Spark

Performance benefits in read

- Column projections

- Predicate pushdown

- Vectorized read
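
To make the read-side benefits concrete, a small illustration (not Netflix's code) using pyarrow; the file, column, and filter names are made up:

import pyarrow.parquet as pq

table = pq.read_table(
    'playback_events.parquet',
    columns=['member_id', 'title_id', 'watch_secs'],  # column projection
    filters=[('dateint', '=', 20161129)],             # predicate pushdown
)
print(table.num_rows)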

Page 24

Performance Impact

Read

- Penalty: Throughput and latency

- Impact depends on amount of data read

- Improvement: I/O manager in Parquet

Write

- Penalty: Writing to local disk before upload to S3

- Improvement: Direct write via multi-part uploads
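
A hedged sketch of the direct-write idea, assuming boto3's multi-part upload API; the bucket, key, and part contents are illustrative. Streaming parts straight to S3 avoids staging the whole file on local disk first:

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'example-data-hub', 'warehouse/my_table/part-0.parquet'  # hypothetical

def part_stream():
    # Stand-in for the writer's output; every part except the last must be >= 5 MB.
    yield b'x' * (5 * 1024 * 1024)
    yield b'tail'

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts = []
for n, chunk in enumerate(part_stream(), start=1):
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                          PartNumber=n, Body=chunk)
    parts.append({'PartNumber': n, 'ETag': resp['ETag']})
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})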

Page 25

Performance Impact

List

- Penalty: List thousands of partitions for split calculation

- Each partition is an S3 prefix

- Improvement: Track files instead of prefixes
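
The contrast on this slide, sketched in Python; the metastore client is hypothetical. Split calculation can either LIST every partition prefix, or read file paths the metastore already tracks:

import boto3

s3 = boto3.client('s3')

def files_by_listing(bucket, partition_prefixes):
    # Slow path: one paginated LIST per partition prefix.
    paginator = s3.get_paginator('list_objects_v2')
    paths = []
    for prefix in partition_prefixes:
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            paths.extend(o['Key'] for o in page.get('Contents', []))
    return paths

def files_by_manifest(metastore, table):
    # Fast path: partitions carry their file lists, so no LIST calls at all.
    return [f for part in metastore.partitions(table) for f in part.files]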

Page 26

Performance Impact – some good news

ETL jobs:

- Mostly CPU bound, not network bound

- Performance converges w/ volume and complexity

Interactive queries:

- % impact is higher … but they run fast

Benefits still outweigh the cost!

Page 27

Job and Cluster Mgmt Service

Page 28


For Users

Should I run my job on my laptop?

Where can I find the right version of the tools?

Which cluster should I run my high-priority ETL job on?

Where can I see all the jobs I ran yesterday?

Page 29


For Admins

How do I manage different versions of tools in different clusters?

How can I upgrade/swap the clusters with no downtime to users?

Page 30

Genie – Job and Cluster Mgmt Service

Users:

- Discovery: find the right cluster to run the jobs

- Gateway: to run different kinds of jobs

- Orchestration: and the one place to find all jobs!

Admins:

- Config mgmt: multiple versions of multiple tools

- Deployment: cluster swap/updates with no downtime
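
A hedged sketch of what discovery and the gateway look like in practice, posting a job to Genie's REST API (the Genie 3 OSS endpoint; the host, tags, and field values here are illustrative):

import requests

job = {
    'name': 'daily-etl',
    'user': 'data-eng',
    'version': '1.0',
    'clusterCriterias': [{'tags': ['sched:sla', 'type:yarn']}],  # discovery: pick a cluster by tags
    'commandCriteria': ['type:spark'],                           # config mgmt: pick a tool version
    'commandArgs': 'etl.py 20161129',
}
resp = requests.post('http://genie.example.com/api/v3/jobs', json=job)
resp.raise_for_status()
print('submitted:', resp.headers.get('Location'))  # job URL, the one place to find it later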

Page 31

[Diagram: Genie, with archived job output]

Page 32

[Diagram: Genie, with job scripts & jars and tools & clusters configs]

Page 33

Genie on Netflix OSS

Page 34

Data Mgmt Services

Page 35
Page 36

[Diagram: Metacat in front of the metastore and the data hub]

Page 37

Metacat

Federated metadata service. A proxy across data sources: the metastore, Amazon RDS, Amazon Redshift.

Page 38

Metacat

Common APIs for our applications and tools. Thrift APIs for interoperability.

Metadata discovery across data sources

Additional business context

- Lifecycle policy (TTL) per table

- Table owner, description, tags

- User-defined custom metrics
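
A hedged sketch of reading that business context over Metacat's REST API; the path follows the Netflix OSS project, but the host, names, and response fields should be treated as illustrative:

import requests

resp = requests.get('http://metacat.example.com/mds/v1/catalog/prodhive'
                    '/database/default/table/my_table')
resp.raise_for_status()
meta = resp.json()
# Business context (TTL, owner, tags, custom metrics) rides along with the schema.
print(meta.get('definitionMetadata', {}))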

Page 39

Data Lifecycle Management

Janitor tools

- Delete ‘dangling’ data after 60 days

- Delete data obsoleted by ‘data updates’ after 3 days

- Delete partitions based on table TTL
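
An illustrative janitor pass, not Netflix's implementation; every client object and field here is hypothetical, but it shows the TTL rule routing deletes through a central deletion service:

from datetime import datetime, timedelta

def janitor_pass(metacat, deletion_service, table):
    ttl_days = metacat.get_table(table)['ttl_days']       # per-table TTL from the metadata service
    cutoff = datetime.utcnow() - timedelta(days=ttl_days)
    for part in metacat.get_partitions(table):
        if part.created_at < cutoff:
            # The central service handles the actual (recoverable) S3 deletes.
            deletion_service.request_delete(part.s3_location, tag='ttl:' + table)
            metacat.drop_partition(table, part.name)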

Page 40

Scaling Deletes from S3

Page 41

Deletion Service

Centralized service to handle errors, retries, and backoffs

of S3 deletes

Cool-down period to delete after a few days

Store history and statistics

Allow easy recovery based on time and tags
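
A minimal sketch of those ideas, assuming a simple queue of pending delete requests (the record schema is hypothetical): nothing is removed inside the cool-down window, S3 errors get retries with backoff, and completed requests are kept as tagged history:

import time
import boto3

s3 = boto3.client('s3')
COOL_DOWN_SECS = 3 * 24 * 3600  # recoverable for a few days after the request

def process(pending):  # pending: list of dicts with bucket, key, requested_at, tags
    for req in pending:
        if time.time() - req['requested_at'] < COOL_DOWN_SECS:
            continue  # still in the cool-down window; easy to recover
        for attempt in range(5):  # retries with exponential backoff
            try:
                s3.delete_object(Bucket=req['bucket'], Key=req['key'])
                req['status'] = 'deleted'  # kept as history, queryable by time and tags
                break
            except Exception:
                time.sleep(2 ** attempt)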

Page 42
Page 43

Backup Strategy

Page 44

Core S3

Versioned buckets

20 days

Scale

Simplicity
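
A sketch of the "versioned buckets, 20 days" setup using standard boto3 calls (the bucket name is hypothetical): versioning makes overwrites and deletes recoverable, and a lifecycle rule expires noncurrent versions after 20 days:

import boto3

s3 = boto3.client('s3')
BUCKET = 'example-data-hub'

s3.put_bucket_versioning(Bucket=BUCKET,
                         VersioningConfiguration={'Status': 'Enabled'})
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={'Rules': [{
        'ID': 'expire-old-versions',
        'Filter': {'Prefix': ''},  # whole bucket
        'Status': 'Enabled',
        'NoncurrentVersionExpiration': {'NoncurrentDays': 20},
    }]},
)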

Page 45

Above and beyond

Other data stores

Heterogeneous cloud platform

CRR (cross-region replication)

Page 46

Data Accessibility

Page 47
Page 48
Page 49

Data Tracking

Page 50

Approach

Tell us who you are

User agent

S3 access logs

Metrics pipeline

Charlotte
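
The "tell us who you are" step can be as simple as stamping a custom user agent on every S3 call so access logs can attribute traffic to a tool or job; a small sketch with boto3 (the suffix format here is made up):

import boto3
from botocore.config import Config

cfg = Config(user_agent_extra='bdp tool=pig job_id=1234')  # hypothetical suffix
s3 = boto3.client('s3', config=cfg)
# Every request now carries the suffix, which surfaces in S3 access logs
# and can be rolled up by the metrics pipeline.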

Page 51
Page 52

Data Cost

Page 53

Approach

The calculation

Tableau reports

Data Doctor

Page 54
Page 55

Approach

The calculation

Tableau reports

Data Doctor

TTLs

Future: tie to job cost and leverage S3 Standard-IA / Amazon Glacier?

Page 56

Best Supporting Actor

Page 57

Amazon Redshift

Faster, interactive subset of data

Some use, some don’t

Auto-sync (tag-based)

Fast loading!

Backups, restore, & expansion
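
Fast loading in practice means Redshift's COPY pulling straight from S3 in parallel; a hedged sketch in Python, where the connection details, IAM role, bucket, and table are all hypothetical:

import psycopg2

conn = psycopg2.connect(host='demo.example.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='test', user='etl', password='...')
with conn, conn.cursor() as cur:
    # COPY fans the load out across the cluster's slices.
    cur.execute("""
        COPY test.demo_table
        FROM 's3://example-data-hub/exports/demo_table/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
        DELIMITER '\\t' GZIP;
    """)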

Page 58

Druid

Interactive at scale

S3 (source of truth)

S3 for Druid deep storage

Page 59

Tableau

S3 (source of truth)

Mostly extracts (vs. Direct Connect)

Backups (multi-region)

Page 60

Big Data Portal

Page 61
Page 62
Page 63
Page 64

Big Data API (aka Kragle)

import kragle as kg

trans_info = kg.transport.Transporter() \
    .source('metacat://prodhive/default/my_table') \
    .target('metacat://redshift/test/demo_table') \
    .execute()

Page 65

S3

Page 66

Next Steps

Add caching?

Storage efficiency

Partner with the S3 team to improve S3 for big data

Page 67

Take-aways

Amazon S3 = Data hub

Extend and improve as you go

It takes an ecosystem

Page 68

Thank you!

Page 69

Remember to complete your evaluations!