Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Coburn Watson
Manager, Cloud Performance, Netflix
Surge '13

2

Netflix, Inc.

• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent notables
  • Increased Originals catalog
  • Large open source contribution
  • OpenConnect (homegrown CDN)

3

About Me

• Manage Cloud Performance Engineering Team
  • Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
  • Large-scale billing applications, eCommerce, datacenter mgmt., etc.
  • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• [email protected]

4

Freedom and Responsibility

• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
  • cumbersome processes
  • deployment inefficiency
  • restricted access
  • restricted technical freedom
  • lack of trust
• If removed…maximize:
  • Engineering velocity
  • Engineer satisfaction

http://www.slideshare.net/reed2001/culture-1798664?ref=http://www.slideshare.net/Netflix/slideshelf

5

Maximizing: Engineering Velocity

6

How

• Implementation freedom
  • SCM, libraries, language
  • That said, platform benefits exist
• Deployment freedom
• Service team owns
  • push schedule, functionality, performance
  • operational activities (being paged)
• On-demand cloud capacity
  • Thousands of instances at the push of a button

7

Rapid Deployment?

Impossible..

3-6 Months?

8

Rapid (Cloud) Deployment

3-5 Minutes

9

BaseAMI

• Supply the foundation
  • Monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator

10

Pushing Code: Red-Black

• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
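
The red-black idea of rolling code in or out can be sketched roughly as below. This is an illustration only and does not use Asgard's API: `ServerGroup`, `red_black_push`, `roll_back`, and the `healthy` callback are hypothetical names. It assumes the old group stays running after the push so that rolling back is just a traffic switch.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical stand-in for one group of instances running a version."""
    version: str
    instances: int
    takes_traffic: bool = False


def red_black_push(old, new_version, healthy):
    """Bring up a new group next to the old one, then switch traffic.

    Returns (active, standby). If the new group fails the health check it
    never receives traffic and the old group stays active.
    """
    new = ServerGroup(new_version, old.instances)
    if not healthy(new):
        return old, None
    new.takes_traffic = True
    old.takes_traffic = False  # kept running, so rollback is only a switch
    return new, old


def roll_back(active, standby):
    """Send traffic back to the standby group and drain the active one."""
    standby.takes_traffic = True
    active.takes_traffic = False
    return standby, active
```

For example, `red_black_push(ServerGroup("v1", 10, True), "v2", lambda g: True)` returns the v2 group as active with the v1 group kept as standby.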

11

Not all Roses

Compounded risks with increased velocity

Risks: Decreased Reliability, Performance, and Scalability

12

Goal: CI (Continuous Improvement)

13

Maximizing: Reliability

14

Fear (Revere) the Monkeys

• Simulate
  • Latency
  • Errors
• Initiate
  • Instance Termination
  • Availability Zone Failure
• Identify
  • Configuration Drift

… in Test and Production
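
A minimal sketch of the "simulate latency and errors" idea, assuming failures are injected by wrapping a call to a dependency. It is not the Netflix Simian Army code: `with_chaos`, `InjectedFailure`, and every parameter are hypothetical names chosen for illustration.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised when the wrapper simulates a failing dependency."""


def with_chaos(func, error_rate=0.0, latency_rate=0.0, delay_s=0.05,
               rng=None, sleep=time.sleep):
    """Wrap func so that some calls fail or are delayed.

    error_rate and latency_rate are probabilities in [0, 1]. rng and sleep
    are injectable so the behavior can be made deterministic in tests.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < error_rate:
            raise InjectedFailure("simulated dependency error")
        if rng.random() < latency_rate:
            sleep(delay_s)  # simulated slow dependency
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a client call with a small `error_rate` in a test environment shows whether callers degrade gracefully before the same failure happens for real.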

15

Tracking Change: Chronos

• Aggregate Significant Events *
• Current Sources:
  • Pushes (Asgard)
  • Production Change Requests (JIRA)
  • AWS Notifications
  • Dynamic Property Changes
  • ASG Scaling Events
• Implementation
  • Simple REST-service; customized adapters

* - “can disrupt production service”
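
The "customized adapters" idea can be sketched as below: each source gets a small function that normalizes its payload into a common event record, and the log answers "what changed in this time window?". The slides do not describe Chronos's schema or API, so the `Event` fields, the payload shapes, and all names here are assumptions. The REST layer is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the epoch
    source: str
    summary: str


def asgard_push_adapter(payload):
    # Hypothetical payload shape: {"time": ..., "app": ...}
    return Event(payload["time"], "asgard", "push of " + payload["app"])


def asg_scaling_adapter(payload):
    # Hypothetical payload shape: {"ts": ..., "group": ..., "size": ...}
    summary = "{} scaled to {}".format(payload["group"], payload["size"])
    return Event(payload["ts"], "asg", summary)


class EventLog:
    """In-memory stand-in for the store behind the REST service."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def between(self, start, end):
        """Return events with start <= timestamp <= end, oldest first."""
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)
```

Keeping one normalized record type means a new source only needs a new adapter; the query side stays unchanged.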

16

Chronos, cont.

17

Automated Canary Analysis

• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
  • Typically analyze an hour's worth of time series data
  • Compare ratio of averages between canary and baseline
  • Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
  • Multiple classifiers available
  • Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
  • Constrained: along metric dimensions
  • Final: Score the canary
• Implementation: R-based analysis
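
A rough sketch of the ratio-of-averages bucketing, written in Python rather than the R used by the actual implementation. The slides give no thresholds, noise test, or scoring formula, so `tolerance`, `max_cv`, the coefficient-of-variation noise check, and the "percent of metrics that are OK" score are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def classify(baseline, canary, tolerance=0.2, max_cv=0.5):
    """Bucket one metric as HOT, COLD, NOISY, or OK.

    baseline and canary are time series of the same metric. The canary is
    NOISY if its coefficient of variation exceeds max_cv (assumed test);
    otherwise the ratio of averages decides the bucket.
    """
    base_avg, canary_avg = mean(baseline), mean(canary)
    if base_avg == 0 or canary_avg == 0:
        return "NOISY"  # ratio is undefined or uninformative
    if pstdev(canary) / abs(canary_avg) > max_cv:
        return "NOISY"
    ratio = canary_avg / base_avg
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(classifications):
    """Final rollup (assumed form): percentage of metrics bucketed OK."""
    ok = sum(1 for c in classifications if c == "OK")
    return 100.0 * ok / len(classifications)
```

A canary whose latency series averages 1.5x the baseline with little spread lands in HOT; one that swings wildly lands in NOISY regardless of its average.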

18

ACA: in Action

[Figure: per-metric classifications (HOT, COLD, NOISY, OK), with the constrained rollup (dashed) and the final rollup]

19

Hystrix: Defend Your App

• Protection from downstream service failures
  • Functional (unavailable) or performance in nature
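
The protection described above is commonly built on the circuit-breaker pattern. The sketch below is a generic Python illustration of that pattern, not Hystrix's API (Hystrix is a Java library); the class name, thresholds, and `call` signature are assumptions. It covers functional failures only; guarding against slow calls would also need timeouts, which are left out here.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Once the threshold is reached, callers get the fallback immediately instead of waiting on a dependency that is known to be failing, which keeps the failure from spreading upstream.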

20

Maximizing: Scalability and Performance

21

Dynamic Scaling

EC2 footprint autoscales 2500-3500 instances per day
• order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily

Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity

22

Dynamic Scaling, cont.

Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B

Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
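
A request-rate based policy can be reduced to simple arithmetic: pick a target request rate per instance and size the group to match. The slides do not give the actual policy parameters, so the function below, its name, and the numbers in the usage note are illustrative assumptions rather than the configuration used for services B and C.

```python
import math


def desired_instances(total_rps, target_rps_per_instance, min_size, max_size):
    """Return the group size needed to keep each instance near its target.

    total_rps is the aggregate request rate across the group. The result
    is clamped to the group's configured minimum and maximum sizes.
    """
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    needed = math.ceil(total_rps / target_rps_per_instance)
    return max(min_size, min(max_size, needed))
```

With an assumed target of 100 requests per second per instance and bounds of 200 to 900, an aggregate rate of 45,000 requests per second yields a group of 450 instances.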

23


24


• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec

25


• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%

26

Cassandra Performance

Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
  • Increase read ops from 30k to 70k in ~3 minutes
  • Increase write ops from 750 to 1500 in ~3 minutes

Results:
• 95th pctl response time increase: ~17 msec to 45 msec
• 99th pctl response time increase: ~35 msec to 80 msec

27

EVcache (memcached) Scalability

Response times consistent during 4x increase in load *

* Due to upstream code change

28

Cloud-scale Load Testing

• Ad-Hoc or CI-based load test model
  • (CI) Run-over-run comparison; email on rule violation

1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
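
The run-over-run comparison in the CI model might look something like the sketch below, which flags metrics that regressed beyond an allowed percentage between two runs. The slides do not show the actual rules or data format, so the metric names, the percentage-based rule, and `rule_violations` itself are assumptions. The email step is left out.

```python
def rule_violations(previous, current, max_increase_pct):
    """Compare two load-test runs and list the rules that were broken.

    previous and current map metric name -> value (lower is better).
    max_increase_pct maps metric name -> largest allowed increase in percent.
    """
    violations = []
    for metric, limit in max_increase_pct.items():
        before, after = previous[metric], current[metric]
        change_pct = 100.0 * (after - before) / before
        if change_pct > limit:
            violations.append(
                "{}: {} -> {} (+{:.1f}%, limit {}%)".format(
                    metric, before, after, change_pct, limit))
    return violations
```

A non-empty result would be the trigger for the notification; an empty list means the run passed every rule.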

29

Conclusions

• Continually accelerate engineering velocity
• Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
• Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
• Leverage pre-existing OSS PaaS when possible

30

Netflix Open Source

Our Open Source Software simplifies mgmt at scale

Great projects, stunning colleagues: jobs.netflix.com

http://netflix.github.com/

http://jobs.netflix.com/

31

Q&A

• [email protected]

• Netflix Tech Blog: http://techblog.netflix.com
