The Forces That Disrupt Netflix
Haley Tucker Nov. 7, 2016 ACROBAT...
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
DECOMPOSING THE MONOLITH
[Diagram: Devices, Proxy/Routing, Edge Services, Netflix Playback Service, and other Netflix Services, connected by traffic arrows]
"Distributed systems are different because they fail often."
--Jeff Hodges, Notes on Distributed Systems for Young Bloods
TABLE OF CONTENTS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Metadata impacts on availability
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Crashing services and cascading failures
CHAPTER 3: THE THROTTLE
• Latency spikes and the impact of fallbacks
FORCES AT WORK
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
45 MINUTES!!
Clock, by heyyobecky4lyfe, Tumblr
VIDEO METADATA ARCHITECTURE
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services (including the Netflix Playback Service), and traffic]
{
    String msg = "This should never happen!";
    throw new IllegalStateException(msg);
}
MITIGATION: BLAST RADIUS
Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr
Amazon WS Global Infrastructure
STAGGERED ROLLOUT
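A staggered rollout keeps the blast radius small by applying a change to one region at a time and stopping at the first sign of trouble. The following is a minimal sketch of that idea, not the mechanism used at Netflix: the region names and the health-check predicate are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: push a change one region at a time and stop at the
// first region that fails its health check, so later regions stay untouched.
class StaggeredRollout {

    /** Returns the regions that received the change before the rollout stopped. */
    static List<String> rollOut(List<String> regions, Predicate<String> healthyAfterDeploy) {
        List<String> deployed = new ArrayList<>();
        for (String region : regions) {
            deployed.add(region);                  // apply the change to this region only
            if (!healthyAfterDeploy.test(region)) {
                break;                             // halt: remaining regions keep the old state
            }
        }
        return deployed;
    }

    public static void main(String[] args) {
        List<String> regions = List.of("region-a", "region-b", "region-c");
        // Assumed health signal: pretend the change breaks region-b.
        List<String> touched = rollOut(regions, r -> !r.equals("region-b"));
        System.out.println("Regions touched: " + touched);
    }
}
```

In this sketch a bad change reaches at most the region where it is first detected; the regions after it are never touched.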
Canary, CC BY 2.0, Steve P2008 2014, Flickr
PREVENTION: CANARIES
TRADITIONAL CANARY
[Diagram: a Canary (New Code) and a Baseline (Old Code), each receiving traffic]
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, and traffic]
DATA CANARY
[Diagram: Source Systems, Video Metadata Service, Amazon S3, a Netflix Data Canary Service with a Data Tester, Netflix Services, and traffic]
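A data canary applies the canary idea to data instead of code: a candidate data set is exercised by checks before any consumer switches to it. The sketch below assumes a toy snapshot (video ID to title) and two illustrative checks; the real metadata model and the tests run by the Data Tester are not shown in the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a data canary gate: a new metadata snapshot is
// exercised by checks first and only promoted if every check passes.
class DataCanary {

    /** True only if the snapshot satisfies every check. */
    static boolean passes(Map<String, String> snapshot,
                          List<Predicate<Map<String, String>>> checks) {
        for (Predicate<Map<String, String>> check : checks) {
            if (!check.test(snapshot)) {
                return false;    // reject: consumers keep the last good version
            }
        }
        return true;
    }

    /** Returns the snapshot that services should use after the canary run. */
    static Map<String, String> promote(Map<String, String> current,
                                       Map<String, String> candidate,
                                       List<Predicate<Map<String, String>>> checks) {
        return passes(candidate, checks) ? candidate : current;
    }

    public static void main(String[] args) {
        List<Predicate<Map<String, String>>> checks = List.of(
                s -> !s.isEmpty(),                                       // catalog must not be empty
                s -> s.values().stream().noneMatch(String::isBlank));    // every title must be set
        Map<String, String> current = Map.of("v1", "Title One");
        Map<String, String> bad = Map.of("v1", "Title One", "v2", " ");
        // The blank title fails the second check, so the current snapshot stays live.
        System.out.println(promote(current, bad, checks) == current);
    }
}
```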
Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia
SEEING RETURNS
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
LOG DATA
[Diagram: Devices, Proxy/Routing, Edge Services, Playback Service, Log Data Service, Cassandra, and traffic]
Proxy
[Diagram: Devices, Proxy/Routing, Edge Services, Playback Service, Log Data Service, Cassandra, and traffic]
CASCADING FAILURE
[Diagram: Devices, Proxy/Routing, Edge Services, Playback Service, Log Data Service, Cassandra, and traffic]
PREVENTION: MANAGING RESOURCE CONSTRAINTS
Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr
Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr
1 REDUCE SURFACE AREA
2 LIMIT “MAGIC”
Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr
Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr
3 ADD KILL SWITCHES
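One way to read "kill switch" is a runtime flag that lets an operator turn off a non-critical dependency without a redeploy. The sketch below assumes that reading; the class and method names are hypothetical, and a production switch would typically be driven by dynamic configuration rather than a direct method call.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a kill switch is a runtime flag that lets operators
// turn off a non-critical dependency without redeploying.
class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void disable() { enabled.set(false); }
    void enable()  { enabled.set(true); }

    /** Runs the action only while the switch is on; returns whether it ran. */
    boolean runIfEnabled(Runnable action) {
        if (!enabled.get()) {
            return false;        // feature is switched off: skip the call entirely
        }
        action.run();
        return true;
    }

    public static void main(String[] args) {
        KillSwitch logDataSwitch = new KillSwitch();
        logDataSwitch.runIfEnabled(() -> System.out.println("sending log data"));
        logDataSwitch.disable(); // operator flips the switch during an incident
        logDataSwitch.runIfEnabled(() -> System.out.println("never printed"));
    }
}
```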
try {
    remoteService.call();
} catch (Throwable t) {
    // Oops!
    System.exit(1);
}
[Diagram: Traffic, Proxy/Routing, Playback Service, Log Data Service, and Cassandra]
It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr
MITIGATION: CIRCUIT BREAKERS
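A circuit breaker stops calling a failing dependency for a while and answers with a fallback instead, which keeps a struggling service from dragging its callers down with it. The sketch below is a deliberately simplified, hypothetical breaker and not the API of any real library; the threshold, the open window, and the injected clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker (not a real library's API):
// after `threshold` consecutive failures it opens and serves the fallback
// without calling the dependency until `openMillis` has elapsed.
class SimpleCircuitBreaker<T> {
    private final int threshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1;    // -1 means the circuit is closed

    SimpleCircuitBreaker(int threshold, long openMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (openedAt >= 0 && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get();           // open: fail fast, spare the dependency
        }
        try {
            T result = remoteCall.get();     // closed, or a trial call after the open window
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= threshold) {
                openedAt = clock.getAsLong();    // trip (or re-trip) the breaker
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker<String> breaker =
                new SimpleCircuitBreaker<>(2, 1_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            String r = breaker.call(() -> { throw new IllegalStateException("down"); },
                                    () -> "fallback");
            System.out.println(r);
        }
    }
}
```

The fallback passed to `call` should itself be cheap; a breaker that swaps a failing call for an expensive fallback only moves the load elsewhere.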
Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr
FAILURE TESTING
[Diagram: Devices, Proxy/Routing, Playback Service, Log Data Service, Cassandra, and traffic]
Automating Chaos Experiments in Production, by Ali Basiri
Applying Failure Testing Research @Netflix, by Kolton Andrus and Peter Alvaro
Manage resource constraints by reducing surface area.
Leverage circuit breakers and rigorously test failures.
PLAYBACK ARCHITECTURE
[Diagram: Devices, Proxy/Routing, Edge Services, Playback Service, URL Service, and traffic]
NETFLIX CLIENT JARS
[Diagram: the Playback Service embeds the URL Client, which calls the URL Service. The client bundles circuit breakers and fallbacks, metrics, retries and timeouts, RPC, and service discovery.]
[Diagram: the same client, with the fallback labeled "Heavy Fallback"]
FALLBACK TESTING
With 100% fallback, CPU held at 90%: 15 RPS
No fallback, CPU held at 90%: 58 RPS
Siege: https://github.com/JoeDog/siege
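The two load-test figures above imply a per-request cost for the fallback path. Assuming CPU is the limiting resource and cost scales linearly with request rate (both assumptions, not stated in the slides), the cost per request is the CPU utilization divided by the throughput:

```latex
\[
\frac{c_{\text{fallback}}}{c_{\text{normal}}}
  = \frac{0.90 / 15\ \text{RPS}}{0.90 / 58\ \text{RPS}}
  = \frac{58}{15}
  \approx 3.9
\]
```

Under those assumptions a request served by the fallback costs roughly four times as much CPU as a normal request, which is what makes the fallback "heavy".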
[Diagram: Traffic, Proxy/Routing, Edge Service, Playback Service, and URL Service]
    }
    return Response
            .status(503)
            .build();
}
REQUEST BUCKETING
NON-CRITICAL: Experience or Performance Impact
CRITICAL: Customer Streaming Impact
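The two buckets above can drive load shedding: when the service is overloaded, non-critical requests get a cheap 503 while critical requests are still served. The sketch below is one hypothetical way to express that; the request type names and the rule deciding what counts as critical are made up for illustration.

```java
// Hypothetical sketch of request bucketing: under load, non-critical requests
// are shed with a 503 while critical (streaming-impacting) requests proceed.
class RequestBucketing {
    enum Bucket { CRITICAL, NON_CRITICAL }

    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    /** Assumed rule for illustration: only playback-start requests are critical. */
    static Bucket classify(String requestType) {
        return "playback-start".equals(requestType) ? Bucket.CRITICAL : Bucket.NON_CRITICAL;
    }

    /** Returns the HTTP status the service would answer with. */
    static int handle(String requestType, boolean overloaded) {
        if (overloaded && classify(requestType) == Bucket.NON_CRITICAL) {
            return SERVICE_UNAVAILABLE;    // shed cheaply instead of running a fallback
        }
        return OK;
    }

    public static void main(String[] args) {
        System.out.println(handle("playback-start", true));   // critical request while overloaded
        System.out.println(handle("prefetch", true));         // non-critical request while overloaded
        System.out.println(handle("prefetch", false));        // non-critical request under normal load
    }
}
```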
Fire Buckets at Oakworth Station, CC BY 2.0, Tim Green 2015, Flickr
APPLICATION SHARDING
[Diagram: Devices, Proxy/Routing, Edge Services, a Critical Playback Service and a Non-Critical Playback Service, a URL Service and a Non-Critical URL Service, and traffic]
CRITICAL
Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilsson 2014, Flickr
NON-CRITICAL
Traffic, CC BY-NC 2.0, jonbgem 2008, Flickr
APPLICATION SHARDING
[Diagram: Devices, Proxy/Routing, Edge Services, a Critical Playback Service and a Non-Critical Playback Service, a URL Service and a Non-Critical URL Service, and traffic]
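Sharding by criticality means the routing layer sends each class of request to its own cluster, so a surge in non-critical traffic cannot starve critical playback. The sketch below shows only the routing decision; the cluster names are hypothetical and the slides do not say how the routing is configured.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: the routing layer sends each request bucket to its own
// cluster, so each shard can be sized and scaled independently.
class ShardRouter {
    enum Bucket { CRITICAL, NON_CRITICAL }

    private final Map<Bucket, String> clusters = new EnumMap<>(Bucket.class);

    ShardRouter(String criticalCluster, String nonCriticalCluster) {
        clusters.put(Bucket.CRITICAL, criticalCluster);
        clusters.put(Bucket.NON_CRITICAL, nonCriticalCluster);
    }

    /** Returns the name of the cluster that should serve requests in this bucket. */
    String clusterFor(Bucket bucket) {
        return clusters.get(bucket);
    }

    public static void main(String[] args) {
        // Cluster names are made up for this example.
        ShardRouter router = new ShardRouter("playback-critical", "playback-noncritical");
        System.out.println(router.clusterFor(Bucket.CRITICAL));
        System.out.println(router.clusterFor(Bucket.NON_CRITICAL));
    }
}
```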
No heavy fallbacks!! Fallbacks should be light and fast.
Shard your application based on operational characteristics.
KEY TAKEAWAYS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Verify consistency prior to applying state changes.
• One tool is a data canary.
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Manage resource constraints by reducing surface area.
• Leverage circuit breakers and rigorously test failures.
CHAPTER 3: THE THROTTLE
• No heavy fallbacks!! Fallbacks should be light and fast.
• Shard your application based on operational characteristics.