2012 re:Invent Netflix: embracing the cloud final
Netflix: Embracing the Cloud
Neil Hunt, CPO / Yury Izrailevsky, VP Engineering
Embracing the Cloud: Confronting the Challenge
Neil Hunt
Motivation
Netflix – Service Unavailable – Database Crashed
Rest assured that the right people are losing sleep to fix this problem!
We expect to resume service in approximately 72h
12 Aug 2008 03:12am
A Business in Transition
OLD – DVD delivery
• Value from DVDs at home
• Website load small and predictable
• Traditional DC technology: Linux, Apache, Oracle, Java
NEW – Streaming
• Value via Internet delivery
• Website and APIs high load and rapidly growing
• Need more robustness
• Cloud as opportunity for fresh start
Mission: Cloud – High Level Goals
• Availability: 4 x nines
• Scale: unconstrained horizontal scaling
• Performance: unlimited compute
Forklift, or Rewrite?
OLD: Monolithic App, Oracle
NEW: Service Assembly, NoSQL
Old Style – A large 18 wheeler
• Big
• Reliable
• Efficient (when full)
• Expensive
• Inflexible capacity
• Many single points of failure
New Style – A fleet of leased pickups with drivers
• Scalable to small or large loads
• Reliability through redundancy
• Requires rethinking the whole problem
SQL or NoSQL?
MySQL/RDB:
• Developer familiarity
• Developers imagine transactional consistency requirements in every scenario
NoSQL
• Availability & Scale
• Avoid overhead and risk of managing SQL
Outcome:
• Experimented with both
• Ended up with NoSQL for almost everything important
Service Oriented Architecture
• Optimizes for small independent teams with well-defined interfaces
• Better independence from subsystem failures
• Scaling applied to each tier separately
How to Manage the Migration?
Rebuilding a complex system while in operation
NoSQL
Monolithic App
Oracle
Transitional Infrastructure: “Roman Riding”
Transitional Infrastructure: Create a read-only copy
NoSQL
Source of Truth
Display only
Example: Membership records
Monolithic App
Oracle
Transitional Infrastructure: Move the master copy
NoSQL
Source of Truth
Display only
Example: AB Test Data (account tags controlling test experience)
Monolithic App
Oracle
Transitional Infrastructure: Full Multi-Master duplicate
NoSQL
Multi-master
Example: Queue
Monolithic App
Oracle
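The three transitional stages above (read-only copy, moved master, full multi-master) can be sketched as a small facade over two stores. This is only an illustration under stated assumptions: the class and method names are hypothetical, plain dicts stand in for Oracle and the NoSQL store, replication is shown as synchronous for brevity, and the reading that Oracle is the source of truth in the first stage and NoSQL in the second is an interpretation of the slide titles rather than something the slides spell out.

```python
from enum import Enum


class Phase(Enum):
    READ_ONLY_COPY = 1   # assumed: legacy store is the source of truth
    MASTER_MOVED = 2     # assumed: new store is the source of truth
    MULTI_MASTER = 3     # both stores accept writes


class TransitionalStore:
    """Hypothetical facade over a legacy store and a new store."""

    def __init__(self, phase):
        self.phase = phase
        self.legacy = {}   # stand-in for Oracle
        self.new = {}      # stand-in for the NoSQL store

    def write(self, key, value, via="legacy"):
        """Accept a write on one side and replicate it to the other."""
        if self.phase is Phase.READ_ONLY_COPY and via != "legacy":
            raise PermissionError("new store is read-only in this phase")
        if self.phase is Phase.MASTER_MOVED and via != "new":
            raise PermissionError("legacy store is read-only in this phase")
        # Synchronous replication for brevity; a real system would more
        # likely replicate asynchronously and reconcile conflicts.
        self.legacy[key] = value
        self.new[key] = value

    def read(self, key, via="new"):
        """Read from whichever side the caller lives on."""
        store = self.new if via == "new" else self.legacy
        return store.get(key)


store = TransitionalStore(Phase.READ_ONLY_COPY)
store.write("member:1", "active", via="legacy")
print(store.read("member:1", via="new"))
```

Switching `phase` one data set at a time mirrors the slide examples: membership records start as a read-only copy, AB test data moves its master, and the queue runs multi-master.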
Organizational Challenges
IT Ops: A gradually diminishing role
• Initial extensive role managing legacy DC
• Raised visibility during transition
• New DC vulnerabilities and dependencies to manage
DevOps: A rapidly expanding role
• Components at a higher level abstraction
• More opportunities for automation
• Automated build-push tools
• Autoscaling
• Monitoring and automatic cutouts and failover
The Journey
Phase                    Components                      Data & Prerequisites
Trial (2009)             Streaming Player                Content keys (RO); Membership status (RO)
Development (2010-11)    Member product pages and APIs   Content catalog (RW); Personalization data (RW) & recs algorithms; AB Test data (RW)
Followthrough (2011-12)  Account and membership          Membership data (RW)
Final (2013)             Payments                        PCI and SOX data
Lessons Learned…
• Embrace the whole concept: take the opportunity to build a modern architecture rather than forklifting SQL and monolithic apps
• Plan to discard your first experiments: you’ll learn so much that you’ll be glad to redo it right
• Invest in transitional infrastructure: migration will take a while, and it’s worth the effort to make it easy
• Expect your team to learn new ways … but some won’t make the transition
Embracing the Cloud: Delivering the Cloud Solution
Yury Izrailevsky
Mission: Cloud – High Level Goals
• Availability: 4 x nines
• Scale: unconstrained horizontal scaling
• Performance: unlimited compute
Performance Scalability Availability
[Chart: weekly data points, x-axis spanning 1/4/2009 to 8/8/2012]
Scaling Netflix Streaming Service: Weekly Streaming Starts
Netflix Cross-Regional Cloud Architecture
Goal: Regional Failover
Building Global Netflix Streaming Product
Performance Scalability Availability
Weekly Cloud Cost Per Streaming Start (last 12 months)
Simian Army: Cloud Efficiency Automation
Janitor Monkey
Regularly scrape unused capacity
Clean up instances, ASGs, ELBs, SGs, etc.
Efficiency Monkey
AI-based resource under-usage detection (CPU, memory, etc.)
Automated Deletion of Old Data
TTL for S3 (using ObjectExpiration)
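As a rough illustration of the TTL idea, the sketch below builds an expiration rule as plain Python data. The dictionary shape follows the S3 lifecycle configuration format (the kind of structure accepted, for example, by boto3's `put_bucket_lifecycle_configuration` as its `LifecycleConfiguration` argument); the helper name, the `logs/` prefix, and the 30-day retention are hypothetical values chosen for the example, not settings taken from the talk.

```python
def expiration_rule(prefix, days, rule_id=None):
    """Build one S3 lifecycle rule that expires objects under `prefix`.

    Hypothetical helper: only constructs the rule, it does not call AWS.
    """
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id or f"expire-{prefix.strip('/') or 'all'}-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Assumed example: delete anything under logs/ once it is 30 days old.
lifecycle = {"Rules": [expiration_rule("logs/", 30)]}
print(lifecycle)
```

Once a rule like this is attached to a bucket, the storage service deletes the old objects itself, so no cleanup job has to run on the application side.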
Cyclical Streaming Usage Pattern
Load-Based Auto Scaling
Move to Load-Based Scaling
• Scale up/down by 70%+
• 50%+ cost saving
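A minimal sketch of what a load-based scaling decision might look like, assuming a fleet is sized from the current request rate. The function name, the per-instance capacity, the headroom percentage, and the traffic numbers are all hypothetical; the slides do not describe the actual scaling policy.

```python
def desired_instances(requests_per_sec, capacity_per_instance,
                      headroom_pct=20, min_instances=1, max_instances=1000):
    """Instances needed to serve the load with some spare headroom.

    Integer arithmetic keeps the ceiling division exact.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    numerator = requests_per_sec * (100 + headroom_pct)
    denominator = capacity_per_instance * 100
    needed = -(-numerator // denominator)  # ceiling division
    return max(min_instances, min(max_instances, needed))


# Assumed daily cycle: evening peak versus overnight trough.
peak = desired_instances(1000, capacity_per_instance=100)
trough = desired_instances(300, capacity_per_instance=100)
print(peak, trough)
```

Sizing to the trough instead of holding peak capacity around the clock is where the saving comes from when usage is cyclical.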
Performance Scalability Availability
A Truly Great Service…
Availability Goal: 99.99% (30 secs/week at peak traffic)
Has To Just Work!
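For orientation, the sketch below converts an availability target into a weekly downtime budget. It only computes a plain time-based budget, which for 99.99% comes to roughly one minute per week; the 30 secs/week figure on the slide is quoted at peak traffic, and whatever weighting lies behind it is not given in the slides and is not modelled here. The function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def weekly_downtime_budget(availability):
    """Seconds per week a service may be down at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_WEEK


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {weekly_downtime_budget(target):.1f} s/week")
```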
[Chart: x-axis spanning 7/17/2011 to 11/4/2012; annotations and caption follow]
June 29th, 2012 AWS / Netflix Outage
Other AWS Outages
Historical Streaming Availability (13wkMA)
Using Redundancy in AWS Infrastructure to Survive Failures
Cascading Failures
API
InstantQueue
SimpleDB
Netflix Cloud Architecture
Cascading Failures
99% Availability × 99% Availability × … × 99% Availability
99%^300 = 4.90%
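The arithmetic behind the cascading-failure figure (99% raised to the 300th power) can be checked with a few lines. This assumes the 300 components fail independently and that a request needs all of them, which is the simplification the slide appears to make; the function names are illustrative.

```python
def compound_availability(per_component, count):
    """Availability of a chain in which every component must be up."""
    return per_component ** count


def required_per_component(target, count):
    """Per-component availability needed for the chain to hit `target`."""
    return target ** (1.0 / count)


chain = compound_availability(0.99, 300)
print(f"300 components at 99% each: {chain:.2%}")

needed = required_per_component(0.9999, 300)
print(f"per-component availability for a 99.99% chain: {needed:.6%}")
```

The second function shows why per-component reliability alone cannot reach four nines at this scale, which is the motivation for the strategies that follow.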
Strategies to Improve Availability
• Graceful Degradation
• Redundancy
Graceful Degradation
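One common way to degrade gracefully is to wrap a call to a dependency so that a failure yields a simpler default instead of an error. The sketch below is a generic illustration of that pattern, not the mechanism used in the talk; the function names and the example data are hypothetical.

```python
def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and degrades to `fallback`."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Broad catch for brevity; real code would catch narrower
            # errors and usually add a timeout as well.
            return fallback(*args, **kwargs)
    return call


def personalized_rows(member_id):
    """Stand-in for a call to a dependency that is currently down."""
    raise ConnectionError("personalization service unavailable")


def popular_rows(member_id):
    """Non-personalized default that needs no remote call."""
    return ["Popular on Netflix", "New Releases"]


home_page_rows = with_fallback(personalized_rows, popular_rows)
print(home_page_rows("member-42"))
```

The member still gets a usable page, just a less tailored one, so the failure of one service does not cascade to the whole request.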
Redundancy
Zone A
Zone B
Zone C
Redundancy Across Availability Zones
Storage Redundancy Across Regions, Vendors
S3 Backup
Secure Cloud Backup
A B C
Cassandra
Testing Fault Tolerance: Simian Army
• Chaos Monkey
• Latency Monkey
• Chaos Gorilla
Open Source Portal at http://netflix.github.com
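To make the idea concrete, here is a toy simulation of fault injection against a fleet spread over three zones. It is not the Simian Army code: the functions below are hypothetical stand-ins that only mimic the behaviours usually attributed to the tools, Chaos Monkey terminating a random instance and Chaos Gorilla taking out a whole availability zone.

```python
import random

ZONES = ("Zone A", "Zone B", "Zone C")


def build_fleet(instances_per_zone):
    """Fleet as a list of (zone, index) pairs, spread evenly over the zones."""
    return [(zone, i) for zone in ZONES for i in range(instances_per_zone)]


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen instance."""
    victim = rng.choice(fleet)
    return [inst for inst in fleet if inst != victim]


def chaos_gorilla(fleet, zone):
    """Take out every instance in one zone."""
    return [inst for inst in fleet if inst[0] != zone]


def is_serving(fleet, min_instances):
    """The service survives while enough instances remain."""
    return len(fleet) >= min_instances


rng = random.Random(2012)        # seeded so runs are repeatable
fleet = build_fleet(instances_per_zone=2)
fleet = chaos_monkey(fleet, rng)
fleet = chaos_gorilla(fleet, "Zone B")
print(len(fleet), is_serving(fleet, min_instances=3))
```

Running such drills regularly is what turns redundancy across zones from a diagram into something that has actually been exercised.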
Superstorm Sandy
AWS Infrastructure Held Up
>2x Netflix Streaming Usage in East Coast Markets
Boston
New York
Philadelphia
Baltimore
D.C.
Focus on Building a Great Streaming Product
Netflix at 2012 re:Invent
Date/Time        Presenter             Topic
Wed 8:30-10:00   Reed Hastings         Keynote with Andy Jassy
Wed 1:00-1:45    Coburn Watson         Optimizing Costs with AWS
Wed 2:05-2:55    Kevin McEntee         Netflix’s Transcoding Transformation
Wed 3:25-4:15    Neil Hunt / Yury I.   Netflix: Embracing the Cloud
Wed 4:30-5:20    Adrian Cockcroft      High Availability Architecture at Netflix
Thu 10:30-11:20  Jeremy Edberg         Rainmakers – Operating Clouds
Thu 11:35-12:25  Kurt Brown            Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25  Jason Chan            Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50    Adrian Cockcroft      Compute & Networking Masters Customer Panel
Thu 3:00-3:50    Ruslan M./Gregg U.    Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55    Ariel Tseitlin        Intro to Chaos Monkey and the Simian Army
We are sincerely eager to hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form when you have a
chance.