QConSP 2013• Code deployment ... Netflix built a global PaaS ... Django Optional Apache...

Tweet @jedberg with feedback! QConSP 2013


Page 1:

QConSP 2013

Page 2:

Do you have...

• A release engineer?

• A QA department?

• Chef or Puppet to manage your systems?

Page 3:

Do you have...

• Upwards of 100 releases a day?

Page 4:

Jeremy Edberg

Page 5:

Page 6:

What is Netflix?

Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. For one low monthly price, Netflix members can watch as much as they want, anytime, anywhere, on nearly any Internet-connected screen.

Source: http://ir.netflix.com

Page 7:

The Netflix way

• Everything is “built for three”

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Independent teams responsible for both Dev and Ops

Page 8:

Philosophy

Page 9:

Freedom and Responsibility

• We hire responsible adults and keep rules and policies to a minimum

• Developers can change any code in production at any time

• And things don’t break (usually)

• Not eXtreme Go Horse

Page 10:

Automate all the things!

Page 11:

Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment

Page 12:

Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment

Page 13:

Shared state should be stored in a shared service

Data on an instance should be replicated to other instances

Page 14:

“Build for three”

We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.

Page 15:

Page 16:

2B requests per day into the Netflix API

12B outbound requests per day to API dependencies

[Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine]

Page 17:

[Diagram labels]

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Browse

Play

Watch

Page 18:

Highly aligned, loosely coupled

• Services are built by different teams who work together to figure out what each service will provide.

• The service owner publishes an API that anyone can use.

Page 19:

Advantages to a Service Oriented Architecture

• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow down the effects of a change

• More efficient local caching

Page 20:

Freedom and Responsibility

• Developers deploy when they want

• They also manage their own capacity and autoscaling

• And fix anything that breaks at 4am!

Page 21:

Decision making

Risk to my service

Risk to Netflix

Time of Day/Week

Page 22:

All systems choices assume some part will fail at some point.

Page 23:

Reliability and $$

Page 24:

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Page 25:

Execution

Photo from I, Robot, copyright 20th Century Fox

Page 26:

Netflix built a global PaaS

• Service Oriented Architecture

• HTTP/REST interfaces between services

Page 27:

Netflix PaaS features

• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

Page 28:

What AWS Provides

• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling groups

• Availability zones and regions

Page 29:

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

Tomcat

Optional Apache

Monitoring

Log Rotation to S3

Appdynamics Machine Agent

Appdynamics App Agent monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

Page 30:

The Netflix Platform

• Discovery (Eureka)

• Entrypoints (Edda)

• Configuration (Archaius)

• Zookeeper (Exhibitor)

• Logging (Blitz4j & Honu)

• NIWS (Ribbon)

• Geo

• Base

• Circuit Breakers (Hystrix)

• Cassandra (Priam & Astyanax & CassJMeter)

• Cryptex

• AKMS

• EvCache

• Zuul

• i18n

• L10n

Open Source

Page 31:

Page 32:

Page 33:

Page 34:

Finding things

• Discovery (Eureka)

  • Application to instance mapping

  • Heartbeat to keep track of health

• Entrypoints (Edda)

  • Local database of AWS resources

• NIWS (Ribbon)

  • On instance software load balancer

  • Handles retry logic

• Geo (Geolocation library)

  • Provides IP to Lat/Lon mapping for any service that needs it.
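
The Discovery bullets (application to instance mapping, plus a heartbeat to track health) can be illustrated with a toy in-memory registry. This is only a sketch of the idea, not Eureka's API: the `Registry` class, its method names and the 90-second timeout are all assumptions made for illustration.

```python
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry.

    Hypothetical illustration only; it does not mirror Eureka's real interface.
    """

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self.ttl = ttl_seconds          # assumed timeout, not a documented value
        self.clock = clock              # injectable so the sketch is testable
        self._last_seen = {}            # (app, instance_id) -> last heartbeat time

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew it if it is already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return the instances of `app` whose heartbeat is still fresh."""
        now = self.clock()
        return sorted(
            inst
            for (name, inst), seen in self._last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```

An instance that stops sending heartbeats simply drops out of `instances()` once the timeout passes, so callers never need to be told explicitly that it died.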

Page 35:

Entrypoints (Edda)

• REST API

  • GET /REST/v1/instance/$id

• Keeps track of all resources

  • Autoscaling groups, EIPs, Instances, Applications, Clusters, History
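
A call to the `GET /REST/v1/instance/$id` endpoint shown on this slide might look like the sketch below. The base URL is a placeholder, and the assumption that the response body is JSON is mine; only the path comes from the slide.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def instance_url(base_url, instance_id):
    """Build the URL for GET /REST/v1/instance/$id (path taken from the slide)."""
    return f"{base_url.rstrip('/')}/REST/v1/instance/{quote(instance_id, safe='')}"


def fetch_instance(base_url, instance_id, timeout=5.0):
    """Fetch one instance record; assumes the service answers with JSON."""
    with urlopen(instance_url(base_url, instance_id), timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical host name, used only to show the resulting URL shape.
print(instance_url("http://edda.example.com", "i-4a12d3b9"))
```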

Page 36:

Entrypoints Exploration

Find all active instances:
all()

Find all instances in a group:
%(cloudmonkey)

How many instances are not in an autoscale group?
count(all(),-info(eval(INSTANCES;asg())))

Which ELB contains a particular instance?
filter(TYPE;asg;*(i-4a12d3b9))

Page 37:

Keeping it all straight

• Configuration (Archaius)

  • Global variables (Fast properties)

• Base

  • Base system. Prod vs. Test, etc.

• Zookeeper (Curator)

  • Locks, other similar coordination

• Logging (Blitz4j and Honu)

  • Keep track of what happened and store it for post analysis.

Page 38:

Keeping it secure

• Cryptex

  • Service for key management

  • High, medium and low value keys

• AKMS (Amazon Key Management System)

  • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance

Page 39:

Key Management

• Cryptex service provides keys

  • Low value: Cookie encryption keys

  • Medium value: Device activation keys

  • High value: Credit card encryption

Page 40:

Cryptex

• Pass in encrypted string, get decrypted string out

• Decryption is in a different place depending on value of key

• Always try to design for lowest value key

Page 41:

Translating it

• i18n (Internationalization)

  • Make it easy to translate things from one language to another

• L10n (Localization)

  • The library that actually does the translations

Page 42:

Storing it

• Cassandra (Priam, Astyanax)

  • Configure and access Cassandra

  • Provide OO abstractions; handle connection pooling and discovery of hosts

• EVCache (Eccentric Volatile Cache)

  • Wrapper for memcached to handle zone awareness and replication

• Proxies

  • Get data out of the datacenter and into the cloud.
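
The EVCache bullet describes a memcached wrapper that adds zone awareness and replication. A minimal sketch of that idea follows, with plain dictionaries standing in for per-zone memcached clients. The class name, the write-to-every-zone policy and the fallback read are assumptions for illustration, not EVCache's actual behavior.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: write everywhere, read from the local zone first.

    Hypothetical sketch; dict objects stand in for memcached clients.
    """

    def __init__(self, local_zone, backends):
        if local_zone not in backends:
            raise ValueError(f"no backend for local zone {local_zone!r}")
        self.local_zone = local_zone
        self.backends = backends        # zone name -> dict-like store

    def set(self, key, value):
        # Replicate the write to every zone so any zone can serve the read.
        for store in self.backends.values():
            store[key] = value

    def get(self, key):
        # Prefer the local zone to avoid cross-zone latency.
        local = self.backends[self.local_zone]
        if key in local:
            return local[key]
        # Assumed policy: fall back to other zones and repair the local copy.
        for zone, store in self.backends.items():
            if zone != self.local_zone and key in store:
                local[key] = store[key]
                return store[key]
        return None
```

Because every zone holds a copy, losing the cache nodes in one zone degrades to a cross-zone read instead of a cache miss.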

Page 43:

Data

What do we do with it all?

Page 44:

We store it!

• Cache (memcached)

• Cassandra

• RDS (MySQL)

Page 45:

Cassandra

Page 46:

Why Cassandra?

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Page 47:

Cassandra Benefits

• Fast writes

• Fast negative lookups

• Easy incremental scalability

• Distributed -- No SPoF

Page 48:

Things we store in Cassandra

• Video Quality

• Network issues

• Usage History

• Playback Errors

• A/B Tests

Page 49:

A/B Testing

Page 50:

A/B Testing

Online Data: Test Cell allocation, Test Metadata, Start/End date, UI Directives

Offline Data: Test tracking, Retention, Fraction Viewed, Pages Viewed

Page 51:

Using Cassandra at Netflix

• Priam

  • Zero touch auto-config

  • State management

  • Token assignment

  • Node replacement

  • Backup/restore to/from S3

• Astyanax

  • OO abstraction to Cassandra

  • Multi-region support

Page 52:

Page 53:

Page 54:

Cassandra Architecture

Page 55:

Cassandra Architecture

For more info, see DAT202: Optimizing your Cassandra Database on AWS

Page 56:

Tools

• Asgard

• AWS usage

• Atlas

• Chronos

• Build system

• Explorers (Cassandra and SimpleDB)

Page 57:

Page 58:

Deploying Code; Step 1

Page 59:

Page 60:

Page 61:

Auto Scaling Group

Launch Configuration

Security Group

Amazon Machine Image

Instances

Configuration

Elastic Load Balancer

Page 62:

api-usprod-v007

api-frontend

api-usprod-v008

Page 63:

api-usprod-v007

api-frontend

api-usprod-v008
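
The two group names on this slide differ only in their version suffix (`v007` and `v008`), which suggests that each deployment launches a new versioned group behind the same front end. The helper below sketches how the next name could be derived. The function name, the three-digit suffix format and the wrap-around after `v999` are assumptions made for illustration.

```python
import re

# Matches names such as "api-usprod-v007": a cluster name plus a 3-digit version.
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<version>\d{3})$")


def next_group_name(current):
    """Return the name for the next deployment of the same cluster."""
    match = _VERSIONED.match(current)
    if match is None:
        # Unversioned name: start numbering at v000 (assumed convention).
        return f"{current}-v000"
    version = (int(match.group("version")) + 1) % 1000  # assumed wrap-around
    return f"{match.group('cluster')}-v{version:03d}"


print(next_group_name("api-usprod-v007"))
```

Keeping the old group around until the new one is healthy means a rollback is just a matter of sending traffic back to the previous version.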

Page 64:

Page 65:

Page 66:

Page 67:

Page 68:

Page 69:

Netflix has moved the granularity from the instance to the cluster

Page 70:

Why Bake?

Traditional: Generic AMI → Instance

• launch OS

• install packages

• install app

Netflix: App AMI → Instance

• launch OS+app

Page 71:

Getting Baked

[Diagram: Jenkins build pipeline, with Groovy all over]

• sync: source from Perforce / Git

• resolve: libraries from Artifactory via Ivy

• build: Ant targets (compile, test, report)

• publish: snapshot / release libraries and apps, app bundles

Page 72:

Base Image Baking

[Diagram: the Bakery, running on ec2 slave instances, mounts the foundation AMI (Linux: CentOS, Fedora, Ubuntu) from S3 / EBS, installs RPMs (Apache, Java...) from Yum / Apt, and snapshots the result as the base AMI]

Ready for app bake

Page 73:

App Image Baking

[Diagram: the Bakery, running on ec2 slave instances, mounts the base AMI (Linux, Apache, Java, Tomcat) from S3 / EBS, installs the app bundle from Jenkins / Yum / Artifactory, and snapshots the result as the app AMI]

Ready to launch!

Page 74:

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

Tomcat

Optional Apache

Monitoring

Log Rotation to S3

Appdynamics Machine Agent

Appdynamics App Agent monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

Page 75:

Linux Base AMI (CentOS or Ubuntu)

Java (JDK 6 or 7)

JBoss

Optional Apache

Monitoring

Log Rotation to S3

Appdynamics Machine Agent

Appdynamics App Agent monitoring

Application war file, base servlet, platform, interface jars for dependent services

GC and thread dump logging

Healthcheck, status servlets, JMX interface, Servo autoscale

Page 76:

Linux Base AMI (CentOS or Ubuntu)

Python

Django

Optional Apache

Monitoring

Log Rotation to S3

Appdynamics Machine Agent monitoring

Application file, base server, platform, interface libs for dependent services

logging

Page 77:

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Page 78:

Page 79:

The simian army

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates
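
The first monkey in the list, Chaos, "kills random instances". The selection step can be sketched as below. This is not the Simian Army's code: the function name, the per-group probability and the shape of the `groups` mapping are assumptions, and the sketch only chooses victims without terminating anything.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Choose at most one instance per group to terminate.

    groups: mapping of group name -> list of instance ids (hypothetical shape).
    probability: chance, per group and per run, that one instance is chosen.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for group, instances in groups.items():
        # Skip empty groups; otherwise roll the dice once per group.
        if instances and rng.random() < probability:
            victims[group] = rng.choice(sorted(instances))
    return victims


groups = {"api-usprod-v008": ["i-aaa", "i-bbb", "i-ccc"], "empty-group": []}
print(pick_victims(groups, probability=1.0))
```

Limiting the damage to one instance per group per run keeps the exercise survivable for any service that really is "built for three".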

Page 80:

What’s going on?!

Page 81:

Atlas

Page 82:

Example Alert Config

{
  "clusters": [
    "epic_aggregator",
    "epic_aggregator-dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    }
  ]
}

Page 83:

Alert Tuning

Page 84:

Alert Systems

[Diagram components: alerting API, CORE Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, Atlas, Appdynamics]

Page 85:

Page 86:

Chronos

Page 87:

Best Practices

Page 88:

Incident Reviews

Ask the key questions:

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of problem in the future?

• How can we improve our behavior for next time?

Page 89:

Best Practices for Data

• Have multiple copies of all data

• Keep those copies in multiple AZs

• Avoid keeping state on a single instance

• Take frequent snapshots of EBS disks

• No secret keys on the instance

Circuit Breakers (Hystrix)

Be liberal in what you accept, strict in what you send
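
To make the circuit-breaker idea concrete, here is a minimal sketch of the pattern in Python. It is not Hystrix's API (Hystrix is a Java library); the class name, the `failure_threshold` and `reset_timeout` parameters, and the fallback handling are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    """Illustrative breaker: trips after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # seconds, hypothetical default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the struggling dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            if fallback is not None:
                return fallback()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, fallback=lambda: [])`, where `fetch_recommendations` is a hypothetical function. While the circuit is open the dependency is not called at all, which gives it room to recover.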

Netflix autoscaling

[Chart: annotated points 1 and 2; labels: Traffic Peak, Deployment]

AWS Usage

Dollar amounts have been carefully removed

Going multi-zone

Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region

Going Multi-region

Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money
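
A rough back-of-the-envelope calculation shows why multiple regions help. It takes the 99.95% per-region figure quoted earlier and assumes, purely for illustration, that regions fail independently; real outages can be correlated, so treat the combined number as an upper bound. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(availability, regions):
    """Availability when the service is up if at least one region is up.

    Assumes region failures are independent, which is an idealization.
    """
    return 1 - (1 - availability) ** regions


single = 0.9995
print(f"one region:  {downtime_hours_per_year(single):.2f} hours down per year")
two = combined_availability(single, 2)
print(f"two regions: {downtime_hours_per_year(two) * 3600:.1f} seconds down per year")
```

Under these assumptions one region allows roughly 4.4 hours of downtime a year, while two independent regions bring the expected figure down to seconds. That is the sense in which 100% is "theoretically" within reach, provided the data is replicated and failover works.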

Multi-Region Challenges

• Data replication

• Cache invalidation

• Misdirected users

• Sudden load increase during failover

• When do you fail over?

Data Replication

Cache Replication

• Three strategies available to users:

    • No replication

    • Invalidation only

    • Full copy
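
The three strategies can be sketched with plain dictionaries standing in for each region's cache. The slide only names the strategies, so the semantics below (what happens to the remote caches on a local write) are an interpretation, and `Replication` and `write` are hypothetical names.

```python
from enum import Enum


class Replication(Enum):
    NONE = "no replication"
    INVALIDATE = "invalidation only"
    FULL_COPY = "full copy"


def write(key, value, local, remotes, strategy):
    """Write to the local region's cache, then apply the strategy to remote regions."""
    local[key] = value
    for remote in remotes:
        if strategy is Replication.INVALIDATE:
            # Drop the stale entry; the remote region refills it on its next read.
            remote.pop(key, None)
        elif strategy is Replication.FULL_COPY:
            # Ship the new value so the remote cache stays warm.
            remote[key] = value
        # Replication.NONE: remote caches are left untouched and may be stale.
```

The trade-off is bandwidth against freshness: no replication is free but tolerates stale reads, invalidation sends only keys across regions, and full copy sends keys and values so the other region is warm if traffic fails over to it.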

Traffic Routing and Failover

• Need to scale up and not get overwhelmed

• Don’t want to suddenly give a bad experience to people

• Make sure that misrouted users are sent “home”

• Can’t fail over at the first sign of trouble; need to strike a balance
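
One way to picture the balance described above is a decision rule that ignores isolated bad health checks, paired with a gradual traffic shift so the receiving region has time to scale up. This is only a sketch under assumed numbers; the window size, threshold, and step count are hypothetical, and the talk does not describe the actual mechanism.

```python
from collections import deque


class FailoverDecider:
    """Fail over only when most checks in a sliding window are unhealthy."""

    def __init__(self, window=5, min_bad=4):
        self.min_bad = min_bad
        self.recent = deque(maxlen=window)

    def record(self, healthy):
        self.recent.append(bool(healthy))

    def should_fail_over(self):
        # Decide only once the window is full, so one bad check never triggers it.
        if len(self.recent) < self.recent.maxlen:
            return False
        return self.recent.count(False) >= self.min_bad


def ramp_steps(total_percent=100, steps=4):
    """Cumulative traffic percentages to shift at each step of the failover."""
    return [total_percent * (i + 1) // steps for i in range(steps)]
```

A larger window or threshold makes failover slower but less jumpy; `ramp_steps()` yields cumulative percentages, so traffic moves in increments rather than all at once.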

Coming soon...

• We’re in the testing phases now

• Expect to see more info and a tech blog post in the future

Just a quick reminder...

(Some of) Netflix is open source:

https://github.com/netflix

Netflix is hiring

http://jobs.netflix.com/jobs.html

Please don’t forget to vote!

Voting is how we know what to present to you next time. :)

Questions?

Getting in touch

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

LinkedIn: www.linkedin.com/in/jedberg