AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
Transcript of AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eva Tse, Director, Big Data Services, Netflix
Kurt Brown, Director, Data Platform, Netflix
November 29, 2016
Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem
BDM306
What to Expect from the Session
How we use Amazon S3 as our centralized data hub
Our big data ecosystems on AWS
- Big data processing engines
- Architecture
- Tools and services
Why Amazon S3?
S3
‘The only valuable thing is intuition.’
– Albert Einstein
Why is it Intuitive?
It is a cloud native service! (free engineering)
‘Practically infinitely’ scalable
99.999999999% durable
99.99% available
Decouple compute and storage
Why is it Counterintuitive?
Eventual consistency?
Performance?
Our Data Hub Scale
Data hub: 60+ PB, 1.5 billion+ objects
Ingest: 100+ TB daily
Expiration: 400+ TB daily
ETL processing: read ~3.5 PB daily, write 500+ TB daily
Data Velocity
Ingest
Event Data Pipeline
Business events
~500 billion events/day
5 min SLA from source to data hub
[Diagram: Event data pipeline. Cloud apps in Region 1, Region 2, and Region 3 publish business events to Kafka; Ursula moves the events into the data hub on AWS S3, with AWS SQS for event data.]
Dimension Data Pipeline
Stateful data in Cassandra clusters
Extract from tens of Cassandra clusters
Daily or more granular extracts
[Diagram: Dimension data pipeline. Cassandra clusters in Region 1, Region 2, and Region 3 back up SSTables to AWS S3; Aegisthus reads the SSTables and loads the data into the data hub.]
Transform Data
Data Processing
Our Data Processing Engines
[Diagram: The data hub on S3 feeds Hadoop YARN clusters of ~250-400 r3.4xl nodes and ~3,500 d2.4xl nodes.]
Looking at It from a Scalability Angle
1 d2.4xl has 24 TB of storage
60 PB / 24 TB = 2,560 machines
To achieve 3-way replication for redundancy in one zone,
we would need 7,680 machines!
The data size we have is beyond what we could fit into our
clusters!
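The same arithmetic as a quick Python sketch (numbers taken from the slide above):

# Rough capacity math: machines needed to hold the data hub locally
data_hub_tb = 60 * 1024        # 60+ PB, in TB
node_storage_tb = 24           # storage on one d2.4xl
replication_factor = 3         # 3-way replication for redundancy

nodes = data_hub_tb / node_storage_tb            # 2,560 machines
nodes_replicated = nodes * replication_factor    # 7,680 machines
print(int(nodes), int(nodes_replicated))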
Tradeoffs
What are the tradeoffs?
Eventual consistency
Performance
Eventual Consistency
Updates (overwrite puts)
- Always put new files with new keys when updating data;
then delete old files
List
- We need to know when a listing has missed something
- Keep prefix manifest in S3mper (or EMRFS)
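For illustration only, the "write new keys, then delete the old ones" update pattern might look like this with boto3 (the bucket, prefixes, and object body are made-up placeholders, not Netflix tooling):

import boto3

s3 = boto3.client('s3')
bucket = 'my-data-hub'                               # hypothetical bucket
old_prefix = 'warehouse/my_table/batchid=100/'       # data being replaced
new_prefix = 'warehouse/my_table/batchid=101/'       # brand-new keys for the update

# 1. Write updated data under new keys; never overwrite an existing key,
#    so readers see either the old complete set or the new complete set.
s3.put_object(Bucket=bucket, Key=new_prefix + 'part-00000.parquet', Body=b'...')

# 2. Only after the new data is fully written, delete the old keys.
old = s3.list_objects_v2(Bucket=bucket, Prefix=old_prefix)
for obj in old.get('Contents', []):
    s3.delete_object(Bucket=bucket, Key=obj['Key'])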
Parquet
Majority of our data is in Parquet file format
Supported across Hive, Pig, Presto, Spark
Performance benefits in read
- Column projections
- Predicate pushdown
- Vectorized read
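A small PySpark sketch of those read benefits (the S3 path and column names are illustrative): only the selected columns are fetched from S3 (column projection), and the filter can skip whole row groups using Parquet statistics (predicate pushdown).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('parquet-read-demo').getOrCreate()

df = (spark.read.parquet('s3://my-data-hub/warehouse/playback_events/')  # hypothetical table path
      .select('account_id', 'title_id', 'dateint')        # column projection
      .filter(F.col('dateint') == 20161129))              # predicate pushdown
df.show(10)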
Performance Impact
Read
- Penalty: Throughput and latency
- Impact depends on amount of data read
- Improvement: I/O manager in Parquet
Write
- Penalty: Writing to local disk before upload to S3
- Improvement: Direct write via multi-part uploads
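As a hedged sketch of the direct-write idea using boto3's low-level multipart API (bucket, key, and the chunk generator are placeholders): parts are streamed to S3 as they are produced instead of spooling the whole file to local disk first.

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-data-hub', 'warehouse/my_table/part-00000.parquet'  # placeholders

def output_chunks():
    # Stand-in for a task's output stream; every part except the last must be >= 5 MB.
    yield b'x' * (5 * 1024 * 1024)
    yield b'tail bytes'

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(output_chunks(), start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=chunk)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})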
Performance Impact
List
- Penalty: List thousands of partitions for split calculation
- Each partition is an S3 prefix
- Improvement: Track files instead of prefixes
Performance Impact – some good news
ETL jobs:
- Mostly CPU bound, not network bound
- Performance converges w/ volume and complexity
Interactive queries:
- % impact is higher … but they run fast
Benefits still outweigh the cost!
Job and Cluster Mgmt Service
For Users
Should I run my job on my laptop?
Where can I find the right version of
the tools?
Which cluster should I run my high-priority ETL job on?
Where can I see all my jobs run yesterday?
For Admins
How do I manage different versions
of tools in different clusters?
How can I upgrade/swap the
clusters with no downtime to users?
Genie – Job and Cluster Mgmt Service
Users:
- Discovery: find the right cluster to run the jobs
- Gateway: to run different kinds of jobs
- Orchestration: the one place to find all jobs!
Admins:
- Config mgmt: multiple versions of multiple tools
- Deployment: cluster swap/updates with no downtime
[Diagram: Genie architecture, showing archived job output, job scripts & jars, and tools & clusters configs.]
Genie on Netflix OSS
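To make the gateway idea concrete, submitting a job to a Genie-style service over REST could look roughly like this; the endpoint, tags, and payload fields follow the open-source Genie 3 API as a best guess and are assumptions, not taken from the talk.

import requests

job = {
    'name': 'daily_etl',
    'user': 'etl-user',
    'version': '1.0',
    # Genie resolves the cluster and command from tags, so users never
    # hard-code which cluster their job runs on.
    'clusterCriterias': [{'tags': ['sched:sla']}],   # assumed tag
    'commandCriteria': ['type:spark'],               # assumed tag
    'commandArgs': '--class com.example.DailyEtl etl.jar 20161129',
}
resp = requests.post('https://genie.example.com/api/v3/jobs', json=job)  # assumed endpoint
resp.raise_for_status()
print('submitted:', resp.headers.get('Location'))    # URL of the new job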
Data Mgmt Services
Metacat
Federated metadata service. A proxy across data sources.
[Diagram: Metacat sits between the data hub and data sources such as the Hive metastore, Amazon RDS, and Amazon Redshift.]
Metacat
Common APIs for our applications and tools. Thrift APIs for
interoperability.
Metadata discovery across data sources
Additional business context
- Lifecycle policy (TTL) per table
- Table owner, description, tags
- User-defined custom metrics
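As an illustration of the common-API idea, fetching a table's metadata from a Metacat-style REST endpoint might look like this; the base path and response field names are assumptions, not quoted from the talk.

import requests

BASE = 'https://metacat.example.com/mds/v1'   # assumed REST base path

# Qualified names mirror the catalog/database/table convention.
resp = requests.get(f'{BASE}/catalog/prodhive/database/default/table/my_table')
resp.raise_for_status()
table = resp.json()

# Business context (TTL, owner, tags) lives alongside the schema;
# 'definitionMetadata' and 'lifetime' are illustrative field names.
print(table.get('definitionMetadata', {}).get('lifetime'))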
Data Lifecycle Management
Janitor tools
- Delete ‘dangling’ data after 60 days
- Delete data obsoleted by ‘data updates’ after 3 days
- Delete partitions based on table TTL
Scaling Deletes from S3
Deletion Service
Centralized service to handle errors, retries, and backoffs
of S3 deletes
Cool-down period to delete after a few days
Store history and statistics
Allow easy recovery based on time and tags
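A minimal sketch of the cool-down-plus-retry behavior (this is not the actual service; names and limits are made up):

import time
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
COOL_DOWN = timedelta(days=3)   # wait a few days before a delete becomes eligible

def process_delete_request(bucket, key, requested_at, max_attempts=5):
    """Delete one object after its cool-down, retrying S3 errors with backoff."""
    if datetime.now(timezone.utc) - requested_at < COOL_DOWN:
        return 'pending'                      # still recoverable; try again later
    for attempt in range(max_attempts):
        try:
            s3.delete_object(Bucket=bucket, Key=key)
            return 'deleted'                  # record history and statistics here
        except ClientError:
            time.sleep(2 ** attempt)          # exponential backoff
    return 'failed'                           # park for manual inspection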
Backup Strategy
Core S3: versioned buckets, 20 days
Scale and simplicity
Above and beyond: other data stores, heterogeneous cloud platform, CRR (cross-region replication)
Data Accessibility
Data Tracking
Approach
Tell us who you are
User agent
S3 access logs
Metrics pipeline
Charlotte
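One concrete piece of "tell us who you are" is stamping every client with a descriptive user agent so S3 access logs can be attributed; with boto3 that can be done as below (the tag format is made up).

import boto3
from botocore.config import Config

# The extra user-agent string shows up in S3 access logs, so access can be
# joined back to a team or application in the metrics pipeline.
cfg = Config(user_agent_extra='team=data-platform;app=daily-etl')   # hypothetical format
s3 = boto3.client('s3', config=cfg)

s3.list_objects_v2(Bucket='my-data-hub', Prefix='warehouse/')        # logged with the tag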
Data Cost
Approach
The calculation
Tableau reports
Data Doctor
TTLs
Future: Tie to job cost and leverage S3 Standard-IA / Amazon Glacier?
Best Supporting Actor
Amazon Redshift
Faster, interactive subset of data
Some use, some don’t
Auto-sync (tag-based)
Fast loading!
Backups, restore, & expansion
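Fast loading here means Redshift's parallel COPY straight from S3; a sketch of what that looks like from Python (endpoint, credentials, table, S3 path, and IAM role are all placeholders):

import psycopg2

conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',  # placeholder
                        port=5439, dbname='analytics', user='loader', password='...')

# COPY reads the S3 prefix in parallel across the cluster's slices.
copy_sql = """
    COPY test.demo_table
    FROM 's3://my-data-hub/warehouse/demo_table/part-'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    CSV GZIP;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)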
Druid
Interactive at scale
S3 (source of truth)
S3 for Druid deep storage
Tableau
S3 (source of truth)
Mostly extracts (vs. Direct Connect)
Backups (multi-region)
Big Data Portal
Big Data API (aka Kragle)
# Move a table from the Hive warehouse into Redshift via the big data API
import kragle as kg

trans_info = kg.transport.Transporter() \
    .source('metacat://prodhive/default/my_table') \
    .target('metacat://redshift/test/demo_table') \
    .execute()
S3
Next Steps
Add caching?
Storage efficiency
Partner with the S3 team to
improve S3 for big data
Take-aways
Amazon S3 = Data hub
Extend and improve as you go
It takes an ecosystem
S3
Thank you!
Remember to complete
your evaluations!