Netflix Global Cloud Architecture
Globally Distributed Cloud Applications at Netflix
October 2012 Adrian Cockcroft @adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
Adrian Cockcroft • Director, Architecture for Cloud Systems, Netflix Inc.
– Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7 – Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004 – 2003-4 Chief Architect High Performance Technical Computing – 2001 Author: Capacity Planning for Web Services – 1999 Author: Resource Management – 1995 & 1998 Author: Sun Performance and Tuning – 1996 Japanese Edition of Sun Performance and Tuning
• SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
• More – Twitter @adrianco – Blog http://perfcap.blogspot.com – Presentations at http://www.slideshare.net/adrianco
The Netflix Streaming Service
Now in USA, Canada, Latin America, UK, Ireland, Sweden, Denmark, Norway and Finland
US Non-Member Web Site: Advertising and Marketing Driven
Member Web Site: Personalization Driven
Streaming Device API
Netflix Ready Devices: from May 2008 to May 2010
Content Delivery Service – Distributed storage nodes controlled by Netflix cloud services
Abstract
• Netflix on Cloud – What, Why and When
• Globally Distributed Architecture
• Open Source Components
Why Use Cloud?
Things we don’t do
What Netflix Did
• Moved to SaaS – Corporate IT – OneLogin, Workday, Box, Evernote… – Tools – Pagerduty, AppDynamics, EMR (Hadoop)
• Built our own PaaS – Customized to make our developers productive – Large scale, global, highly available, leveraging AWS
• Moved incremental capacity to IaaS – No new datacenter space since 2008 as we grew – Moved our streaming apps to the cloud
Keeping up with Developer Trends
• Big Data/Hadoop • AWS Cloud • Application Performance Management • Integrated DevOps Practices • Continuous Integration/Delivery • NoSQL • Platform as a Service; Fine grain SOA • Social coding, open development/github
In production at Netflix: 2009, 2009, 2010, 2010, 2010, 2010, 2010, 2011 (adoption year for each trend above)
AWS-specific feature dependence…
Portability vs. Functionality
• Portability – the Operations focus – Avoid vendor lock-in – Support datacenter based use cases – Possible operations cost savings
• Functionality – the Developer focus – Less complex test and debug, one mature supplier – Faster time to market for your products – Possible developer time/cost savings
Functional PaaS
• IaaS base - all the features of AWS – Very large scale, mature, global, evolving rapidly – ELB, Autoscale, VPC, SQS, EIP, EMR, etc., etc. – E.g. Large files (TB) and multipart writes in S3
• Functional PaaS – Netflix added features – Continuous build/deploy, SOA, HA patterns – Asgard console, Monkeys, Big data tools – Cassandra/Zookeeper data store automation
How Netflix Works
[Architecture diagram: Customer Device (PC, PS3, TV…) talks to the Web Site or Discovery API and the Streaming API running as AWS Cloud Services (User Data, Personalization, DRM, QoS Logging, CDN Management and Steering, Content Encoding); content is served from OpenConnect CDN Boxes at CDN Edge Locations to Consumer Electronics devices]
Component Services (Simplified view using AppDynamics)
Web Server Dependencies Flow (Home page business transaction as seen by AppDynamics)
[Flow diagram nodes: Start Here, memcached, Cassandra, Web service, S3 bucket]
One Request Snapshot (captured because it was unusually slow)
Current Architectural Patterns for Availability
• Isolated Services – Resilient Business logic
• Three Balanced Availability Zones – Resilient to Infrastructure outage
• Triple Replicated Persistence – Durable distributed Storage
• Isolated Regions – US and EU don’t take each other down
Isolated Services – Test with Chaos Monkey, Latency Monkey
Three Balanced Availability Zones – Test with Chaos Gorilla
[Diagram: Load Balancers in front of Cassandra and Evcache Replicas spread across Zones A, B and C]
Triple Replicated Persistence – Cassandra maintenance affects individual replicas
Isolated Regions
[Diagram: US-East Load Balancers and EU-West Load Balancers, each fronting Cassandra Replicas in Zones A, B and C of their own region]
Failure Modes and Effects

Failure Mode | Probability | Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
Netflix Deployed on AWS
• Content (2009) – Content Management, EC2 Encoding, S3 Petabytes
• Logs (2009) – S3 Terabytes, EMR, Hive & Pig, Business Intelligence
• Play (2010) – DRM, CDN routing, Bookmarks, Logging
• WWW (2010) – Sign-Up, Search (Solr), Movie Choosing, Ratings
• API (2010) – Metadata, Device Config, TV Movie Choosing, Social (Facebook)
• CS (2011) – International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
Content is delivered to customers at Terabit scale via CDNs and ISPs
Cloud Architecture Patterns
Where do we start?
Datacenter to Cloud Transition Goals
• Faster – Lower latency than the equivalent datacenter web pages and API calls – Measured as mean and 99th percentile – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable – Avoid needing any more datacenter capacity as subscriber count increases – No central vertically scaled databases – Leverage AWS elastic capacity effectively
• Available – Substantially higher robustness and availability than datacenter services – Leverage multiple AWS availability zones – No scheduled down time, no central database schema to change
• Productive – Optimize agility of a large development team with automation and tools – Leave behind complex tangled datacenter code base (~8 year old architecture) – Enforce clean layered interfaces and re-usable components
Netflix Datacenter vs. Cloud Arch

Datacenter | Cloud
Central SQL Database | Distributed Key/Value NoSQL
Sticky In-Memory Session | Shared Memcached Session
Chatty Protocols | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code | Instrumented Service Patterns
Fat Complex Objects | Lightweight Serializable Objects
Components as Jar Files | Components as Services
Cassandra on AWS
A highly available and durable deployment pattern
Cassandra Service Pattern – Cassandra Cluster Managed by Priam – Between 6 and 72 nodes
Data Access REST Service – Astyanax Cassandra Client
Datacenter Update Flow
Service REST Clients
Appdynamics Service Flow Visualization
Production Deployment – Totally Denormalized Data Model
Over 50 Cassandra Clusters, over 500 nodes, over 30TB of daily backups; biggest cluster 72 nodes; 1 cluster over 250K writes/s
Astyanax - Cassandra Write Data Flows: Single Region, Multiple Availability Zone, Token Aware
[Diagram: Token Aware Clients writing to Cassandra nodes (each with local disks) spread across Zones A, B and C]
1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously. (A client-side sketch follows below.)
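Not from the deck, but as a rough illustration of the client side of this flow: a minimal Astyanax sketch with a token-aware connection pool and an explicit write consistency level. The cluster, keyspace, column family and seed address below are made up, and the exact builder methods vary slightly between Astyanax versions.

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class TokenAwareWriteSketch {
    public static void main(String[] args) throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("hypothetical_cluster")                                // made-up names
            .forKeyspace("hypothetical_keyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)             // learn the ring from the cluster
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)         // route to the replica owning the key
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_QUORUM))  // wait for 2 of 3 zone replicas
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                .setPort(9160)
                .setSeeds("10.0.0.1:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        ColumnFamily<String, String> bookmarks =
            new ColumnFamily<>("bookmarks", StringSerializer.get(), StringSerializer.get());

        // The coordinator in the local zone forwards the mutation to the replicas in the other zones
        MutationBatch batch = keyspace.prepareMutationBatch();
        batch.withRow(bookmarks, "customer-1234").putColumn("position", "00:42:15", null);
        batch.execute();

        context.shutdown();
    }
}
```

Switching the default consistency level between CL_ONE, CL_QUORUM and CL_ALL corresponds to the "one node, a quorum, or all nodes" choice described above.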
Data Flows for Multi-Region Writes: Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. (A keyspace definition sketch follows the diagram below.)
[Diagram: US Clients and EU Clients each write to Cassandra nodes (with local disks) in Zones A, B and C of their own region; the US and EU regions are separated by 100+ms of latency]
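For illustration only: one way the replication topology above might be declared through the Astyanax client. The keyspace is hypothetical, and "us-east" / "eu-west" stand in for whatever data center names the EC2 snitch reports; the deck does not show the real Netflix keyspace definitions.

```java
import com.google.common.collect.ImmutableMap;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;

public class MultiRegionKeyspaceSketch {
    // NetworkTopologyStrategy with 3 replicas per region: writes at LOCAL_QUORUM
    // only wait for 2 of the 3 replicas in the caller's own region, and Cassandra
    // ships the mutation to the other region asynchronously (steps 3-5 above).
    static void createKeyspace(Keyspace keyspace) throws ConnectionException {
        keyspace.createKeyspace(ImmutableMap.<String, Object>builder()
            .put("strategy_class", "NetworkTopologyStrategy")
            .put("strategy_options", ImmutableMap.<String, Object>builder()
                .put("us-east", "3")   // replicas spread over Zones A, B, C in US-East
                .put("eu-west", "3")   // replicas spread over Zones A, B, C in EU-West
                .build())
            .build());
    }
}
```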
ETL for Cassandra
• Data is de-normalized over many clusters! • Too many to restore from backups for ETL • Solution – read backup files using Hadoop • Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing – Re-normalizes many clusters to a consistent view – Extract, Transform, then Load into Teradata
Benchmarks and Scalability
Cloud Deployment Scalability – New Autoscaled AMI: zero to 500 instances from 21:38:52 to 21:46:32, 7m40s
Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core 34GB)
Launch time distribution: Min. 41.0 | 1st Qu. 104.2 | Median 149.0 | Mean 171.8 | 3rd Qu. 215.8 | Max. 562.0
Scalability from 48 to 288 nodes on AWS – http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Chart: Client Writes/s by node count (Replication Factor = 3) – 174,373 / 366,828 / 537,172 / 1,099,837 writes/s as the cluster scales from 48 to 288 nodes
Used 288 of m1.xlarge (4 CPU, 15 GB RAM, 8 ECU), Cassandra 0.8.6; benchmark config only existed for about 1hr
Cassandra on AWS
Spec | The Past (m2.4xlarge) | The Future (hi1.4xlarge)
Storage | 2 drives, 1.7TB | 2 SSD volumes, 2TB
CPU | 8 cores, 26 ECU | 8 HT cores, 35 ECU
RAM | 68GB | 64GB
Network | 1Gbit | 10Gbit
IOPS | ~500 | ~100,000
Throughput | ~100 Mbyte/s | ~1 Gbyte/s
Cost | $1.80/hr | $3.10/hr

Cassandra Disk vs. SSD Benchmark: Same Throughput, Lower Latency, Half Cost
Availability and Resilience
Chaos Monkey – http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient – Allow any instance to fail without customer impact
• Chaos Monkey hours – Monday-Friday 9am-3pm random instance kill
• Application configuration option – Apps now have to opt out from Chaos Monkey
Responsibility and Experience
• Make developers responsible for failures – Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix – Make sure it's not about finding “who to blame”
• Keep timeouts short, fail fast – Don't let cascading timeouts stack up
• Make configuration options dynamic – You don't want to push code to tweak an option
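As a sketch of the last two points, using the Archaius library listed later in the deck; the property name and default value are invented for illustration.

```java
import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RemoteCallConfig {
    // Re-evaluated at runtime: changing the property in the central configuration
    // source takes effect without pushing code or restarting the service.
    private static final DynamicIntProperty TIMEOUT_MS =
        DynamicPropertyFactory.getInstance()
            .getIntProperty("myservice.client.timeoutMillis", 250); // short default: fail fast

    public static int timeoutMillis() {
        return TIMEOUT_MS.get();
    }
}
```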
Resilient Design – Circuit Breakers – http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
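The techblog post describes a dependency-command pattern; the toy sketch below only shows the general shape of a circuit breaker with a fallback. It is not the Netflix implementation (Netflix later released its version as Hystrix), and the thresholds and structure are made up.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Toy circuit breaker: fail fast while a dependency is unhealthy, serve a fallback instead. */
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAt = new AtomicLong();

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public interface Call<T> { T run() throws Exception; }

    public <T> T execute(Call<T> primary, Call<T> fallback) throws Exception {
        // While the breaker is open, skip the remote call entirely and serve the degraded response
        if (consecutiveFailures.get() >= failureThreshold
                && System.currentTimeMillis() - openedAt.get() < openMillis) {
            return fallback.run();
        }
        try {
            T result = primary.run();      // the primary call should carry its own short timeout
            consecutiveFailures.set(0);    // success closes the breaker
            return result;
        } catch (Exception e) {
            if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                openedAt.set(System.currentTimeMillis());  // trip: fail fast for a while
            }
            return fallback.run();         // degraded response instead of a cascading failure
        }
    }
}
```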
Distributed Operational Model
• Developers – Provision and run their own code in production – Take turns to be on call if it breaks (PagerDuty) – Configure autoscalers to handle capacity needs (see the sketch after this list)
• DevOps and PaaS (aka NoOps) – DevOps is used to build and run the PaaS – PaaS constrains Dev to use automation instead – PaaS puts more responsibility on Dev, with tools
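A hedged sketch of what "configure autoscalers" can look like against the raw AWS Java SDK. The group name, sizes and zones are invented, and at Netflix this was typically driven through Asgard and the autoscaling scripts rather than hand-written SDK calls.

```java
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;

public class AutoscalerSetupSketch {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();  // default credential chain

        // ASG spread over three zones, so losing one zone leaves 2/3 of capacity running
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
            .withAutoScalingGroupName("myservice-v042")
            .withLaunchConfigurationName("myservice-v042-launchconfig")
            .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
            .withMinSize(12)
            .withMaxSize(60));

        // Simple scale-up policy; in practice it would be attached to a CloudWatch alarm
        autoscaling.putScalingPolicy(new PutScalingPolicyRequest()
            .withAutoScalingGroupName("myservice-v042")
            .withPolicyName("scale-up-on-load")
            .withScalingAdjustment(6)
            .withAdjustmentType("ChangeInCapacity"));
    }
}
```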
Culture
Unconventional Culture – See culture deck at http://jobs.netflix.com
• Brave/Aggressive from the top down • Focus on talent density above everything • Reduce process, remove complexity • Freedom and Responsibility • One product focus for the whole company • (Almost) full information sharing across the company • Simplified manager's role
Manager's Role
• Hiring, Architecture, Project Management • No vacation policy to track • (Almost) no remote employees or contractors • No bonuses to allocate • No expenses to approve • Pay mark to market handled at VP level
Netflix Organization – DevOps Org Reporting into Product Group, not IT Ops
CEO – Reed Hastings
CPO – Chief Product Officer – Neil Hunt
VP – Cloud and Platform Engineering – Yury
[Org chart teams under Cloud and Platform Engineering:]
• Architecture – Future planning, Security Arch, Efficiency; AWS VPC, Hyperguard, Powerpoint ☺
• Platform and Persistence Engineering – Base Platform, Zookeeper, Cassandra Ops, AWS Instances
• Cloud Solutions – Monitoring, Monkeys, Build Tools; AWS Instances, AWS API
• Cloud Ops Reliability Engineering – Alert Routing, Incident Lifecycle, PagerDuty
• Personalization Platform and Performance Eng – Metadata, Benchmarking, Memcached, AWS Instances
• Membership and Billing – Data sources, Vault processing, Cassandra
• Data Science Platform – Business Intelligence, Hadoop on EMR
Build Your Own PaaS
Components
• Continuous build framework turns code into AMIs • AWS accounts for test, production, etc. • Cloud access gateway • Service registry (see the Eureka sketch below) • Configuration properties service • Persistence services • Monitoring, alert forwarding • Backups, archives
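For the service registry component, a minimal sketch of how a service might register itself and find a peer using the Eureka client of that era; the VIP address is hypothetical and the surrounding eureka-client.properties configuration is assumed.

```java
import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

public class ServiceRegistrySketch {
    public static void main(String[] args) {
        // Register this instance with Eureka; the app name and Eureka server URLs
        // come from eureka-client.properties on the classpath.
        DiscoveryManager.getInstance().initComponent(
            new MyDataCenterInstanceConfig(),
            new DefaultEurekaClientConfig());
        ApplicationInfoManager.getInstance().setInstanceStatus(InstanceInfo.InstanceStatus.UP);

        // Look up a healthy instance of another service by its (made-up) VIP address
        InstanceInfo peer = DiscoveryManager.getInstance()
            .getDiscoveryClient()
            .getNextServerFromEureka("someservice:7001", false);
        System.out.println("Calling " + peer.getHostName() + ":" + peer.getPort());
    }
}
```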
Netflix Open Source Strategy
• Release PaaS Components git-by-git – Source at github.com/netflix – we build from it… – Intros and techniques at techblog.netflix.com – Blog post or new code every few weeks
• Motivations – Give back to Apache licensed OSS community – Motivate, retain, hire top engineers – “Peer pressure” code cleanup, external contributions
Instance creation, application launch and runtime (component diagram):
• Instance creation – image baked, ASG / instance started, instance running: Asgard, Autoscaling scripts, Odin, Bakery & Build tools (Base AMI + Application Code → Instance)
• Application Launch – registering, configuration, application initializing: Eureka, Entrypoints, Archaius, Governator (Guice), Async logging, Servo
• Runtime – managing service, resiliency aids, calling other services: Priam, Exhibitor, Explorers, NIWS LB, Astyanax, Curator, Dependency Command, REST client, Chaos Monkey, Latency Monkey, Janitor Monkey, Cass JMeter
Open Source Projects – Github / Techblog (legend on the original slide: Apache Contributions, Techblog Post, Coming Soon)
• Priam – Cassandra as a Service
• Astyanax – Cassandra client for Java
• CassJMeter – Cassandra test suite
• Cassandra – Multi-region EC2 datastore support
• Aegisthus – Hadoop ETL for Cassandra
• Explorers
• Governator – Library lifecycle and dependency injection
• Odin – Workflow orchestration
• Async logging
• Exhibitor – Zookeeper as a Service
• Curator – Zookeeper Patterns
• EVCache – Memcached as a Service
• Eureka / Discovery – Service Directory
• Archaius – Dynamic Properties Service
• EntryPoints – Server-side latency/error injection
• REST Client + mid-tier LB
• Configuration REST endpoints
• Servo and Autoscaling Scripts
• Honu – Log4j streaming to Hadoop
• Circuit Breaker – Robust service pattern
• Asgard – AutoScaleGroup based AWS console
• Chaos Monkey – Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries and AMI
• Build dynaslaves
Roadmap for 2012
• More resiliency and improved availability • More automation, orchestration • “Hardening” the platform, code clean-up • Lower latency for web services and devices • IPv6 – now running in prod, rollout in process • More open sourced components • See you at AWS Re:Invent in November…
Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
Amazon Cloud Terminology Reference See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud) • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code) • EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations. – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept. – Reserved Instances – pre-paid to reduce cost for long-term usage – Availability Zone – datacenter with own power and cooling hosting cloud instances – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI) • S3 – Simple Storage Service (http access) • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance) • RDS – Relational Database Service (managed MySQL master and slaves) • DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB) • SQS – Simple Queue Service (http based message queue) • SNS – Simple Notification Service (http and email based topics and messages) • EMR – Elastic Map Reduce (automatically managed Hadoop cluster) • ELB – Elastic Load Balancer • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB) • VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs) • DirectConnect – secure pipe from AWS VPC to external datacenter • IAM – Identity and Access Management (fine grain role based security keys)