Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud. Coburn Watson, Manager, Cloud Performance, Netflix. Surge ‘13

description

Surge 2013 presentation which covers how Netflix maximizes engineering velocity while keeping risks to scalability, reliability, and performance in check.

Transcript of Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Page 1: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Coburn Watson
Manager, Cloud Performance, Netflix
Surge ‘13

Page 2

Netflix, Inc.

• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent notables
  • Increased Originals catalog
  • Large open source contribution
  • OpenConnect (homegrown CDN)

Page 3

About Me

• Manage Cloud Performance Engineering Team
  • Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
  • Large-scale billing applications, eCommerce, datacenter mgmt., etc.
  • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• [email protected]

Page 4

Freedom and Responsibility

• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
  • cumbersome processes
  • deployment inefficiency
  • restricted access
  • restricted technical freedom
  • lack of trust
• If removed… maximize:
  • Engineering velocity
  • Engineer satisfaction

Page 5

Maximizing: Engineering Velocity

Page 6

How

• Implementation freedom
  • SCM, libraries, language
  • that said, platform benefits exist
• Deployment freedom
  • Service team owns
    • push schedule, functionality, performance
    • operational activities (being paged)
• On-demand cloud capacity
  • Thousands of instances at the push of a button

Page 7

Rapid Deployment?

Impossible..

3-6 Months?

Page 8

Rapid (Cloud) Deployment

3-5 Minutes

Page 9

BaseAMI

• Supply the foundation
  • Monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator

Page 10

Pushing Code: Red-Black

• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
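The red-black idea on this slide can be sketched in a few lines. This is a minimal illustration, not Asgard's API: the `Cluster` type, the `red_black_push` function, and the `healthy` callback are hypothetical names chosen for the example. The sketch assumes the new cluster is launched next to the old one, takes traffic, and the old cluster is only taken out of rotation (not destroyed) once the new one looks healthy.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Cluster:
    """Hypothetical stand-in for one auto scaling group running one build."""
    name: str
    version: str
    taking_traffic: bool = False


def red_black_push(old: Cluster, new_name: str, new_version: str,
                   healthy: Callable[[Cluster], bool]) -> Tuple[Cluster, Cluster]:
    """Launch a new cluster beside the old one, then swap traffic.

    The old cluster is never torn down here, so rolling back only
    means re-enabling traffic on it.
    """
    new = Cluster(name=new_name, version=new_version)
    new.taking_traffic = True        # new code starts receiving requests
    if healthy(new):
        old.taking_traffic = False   # old cluster stays up, idle, for rollback
    else:
        new.taking_traffic = False   # roll back: old never stopped serving
    return old, new
```

Keeping the old cluster running but idle is what makes rolling code "out" as cheap as rolling it "in".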

Page 11

Compounded risks with increased velocity

Risks: Decreased Reliability, Performance, and Scalability

Not all Roses

Page 12

Goal: CI (Continuous Improvement)

Page 13

Maximizing: Reliability

Page 14

Fear (Revere) the Monkeys

• Simulate
  • Latency
  • Errors
• Initiate
  • Instance Termination
  • Availability Zone Failure
• Identify
  • Configuration Drift

… in Test and Production
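The "simulate latency and errors" bullets can be illustrated with a small wrapper around a service call. This is a sketch only and not how the Netflix monkeys are implemented: `inject_faults` and its parameters are hypothetical, and the random source and sleep function are passed in so the behavior can be controlled in a test.

```python
import random
import time


def inject_faults(call, latency_s=0.0, error_rate=0.0,
                  rng=random.random, sleep=time.sleep):
    """Wrap `call` so it is slowed down and sometimes fails.

    latency_s  -- extra delay added before every call (assumed fixed)
    error_rate -- probability in [0, 1] that the call raises instead
    """
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)                  # simulated latency
        if rng() < error_rate:
            raise RuntimeError("injected failure")  # simulated error
        return call(*args, **kwargs)
    return wrapped
```

Wrapping a client call this way in a test environment shows whether callers time out, retry, or fall back as intended.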

Page 15

Tracking Change: Chronos

• Aggregate Significant Events *
• Current Sources:
  • Pushes (Asgard)
  • Production Change Requests (JIRA)
  • AWS Notifications
  • Dynamic Property Changes
  • ASG Scaling Events
• Implementation
  • Simple REST-service; customized adapters

* - “can disrupt production service”
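A change-tracking service of this kind boils down to a normalized event record plus a time-window query. The sketch below is an assumption about the general shape, not the Chronos implementation: `ChangeEvent`, `EventLog`, and the field names are hypothetical, and the REST layer and per-source adapters are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One significant event, normalized by a source-specific adapter."""
    timestamp: datetime
    source: str    # e.g. "asgard", "jira", "aws" (illustrative labels)
    summary: str


class EventLog:
    """In-memory store; a real service would persist events."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def around(self, when, window=timedelta(minutes=30)):
        """Return events within `window` of `when`, oldest first."""
        hits = [e for e in self._events if abs(e.timestamp - when) <= window]
        return sorted(hits, key=lambda e: e.timestamp)
```

The useful query during an incident is "what changed around the time the graphs moved", which is what `around` answers across all sources at once.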

Page 16

Chronos, cont.

Page 17

Automated Canary Analysis

• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
  • Typically analyze an hour's worth of time series data
  • Compare ratio of averages between canary and baseline
  • Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
  • Multiple classifiers available
  • Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
  • Constrained: along metric dimensions
  • Final: Score the canary
• Implementation: R-based analysis
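The ratio-of-averages comparison and the Hot/Cold/Noisy/OK buckets can be sketched as follows. The slide says the real analysis is R-based; this Python version is only an illustration under stated assumptions. The `tolerance` and `max_cv` thresholds are invented, using the coefficient of variation as the noise test is a guess at one possible classifier, and scoring as "percent of metrics that are OK" is one simple way to do a final rollup.

```python
from statistics import mean, pstdev


def classify(baseline, canary, tolerance=0.2, max_cv=0.5):
    """Bucket one metric by comparing canary and baseline averages.

    tolerance -- allowed relative deviation of the ratio from 1.0 (assumed)
    max_cv    -- coefficient of variation above which a series is
                 treated as too noisy to judge (assumed)
    """
    b_avg, c_avg = mean(baseline), mean(canary)
    for series, avg in ((baseline, b_avg), (canary, c_avg)):
        if avg == 0 or pstdev(series) / abs(avg) > max_cv:
            return "NOISY"
    ratio = c_avg / b_avg
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(metrics):
    """Final rollup: percent of metrics classified OK.

    metrics maps a metric name to (baseline series, canary series).
    """
    labels = {name: classify(b, c) for name, (b, c) in metrics.items()}
    ok = sum(1 for label in labels.values() if label == "OK")
    return 100.0 * ok / len(labels), labels
```

A constrained rollup would apply the same scoring to a subset of metrics grouped along one dimension before the final score is computed.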

Page 18

ACA: in Action

[Figure: per-metric classifications (HOT, OK, NOISY, COLD) with the constrained rollup (dashed) and the final rollup]

Page 19

Hystrix: Defend Your App

● Protection from downstream service failures
  ● Functional (unavailable) or performance in nature
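The protection described here is commonly built around a circuit breaker with a fallback. The sketch below shows that pattern in Python; it is not the Hystrix API (Hystrix is a Java library), and the class name, thresholds, and injected clock are assumptions made for illustration. It covers functional failures (exceptions) only; guarding against slow dependencies would also need timeouts, which are omitted.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated downstream failures."""

    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()      # open: skip the downstream call
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the circuit
            return fallback()
        self.failures = 0              # success closes the circuit fully
        return result
```

While the circuit is open the failing dependency receives no traffic from this caller, which gives it room to recover and keeps the caller's threads free.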

Page 20

Maximizing: Scalability and Performance

Page 21

Dynamic Scaling

EC2 footprint autoscales 2500-3500 instances per day
• order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily

Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
  • Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity

Page 22

Dynamic Scaling, cont.

Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B

Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
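A request-rate based policy can be reduced to one calculation: how many instances are needed so that each one stays at a target request rate. The function below is a minimal sketch of that idea, not the AWS Auto Scaling policy format or the policies used for services B and C; the function name and the per-instance target are hypothetical.

```python
import math


def desired_instances(total_rps, target_rps_per_instance, min_size, max_size):
    """Instances needed to keep each one at the target request rate.

    The result is clamped to the group's allowed size range.
    """
    needed = math.ceil(total_rps / target_rps_per_instance)
    return max(min_size, min(max_size, needed))
```

For example, with an invented target of 100 requests per second per instance and a group allowed to range from 200 to 900 instances, `desired_instances(45000, 100, 200, 900)` evaluates to 450. A load-average policy such as the one on service A follows the same shape, with system load in place of request rate.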

Page 23

Dynamic Scaling, cont.

Page 24

Dynamic Scaling, cont.

• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec

Page 25

Dynamic Scaling, cont.

• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
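A quick check on the two ratios above, assuming both are measured between the same daily trough and peak (the slide does not say so explicitly): if the instance count grows 3x while aggregate requests grow 4.5x, the request rate handled by each instance grows by their quotient.

$$\frac{\text{requests per instance at peak}}{\text{requests per instance at trough}} = \frac{4.5}{3} = 1.5$$

Under that assumption each instance serves roughly 1.5 times as many requests at peak as at trough, which points in the same direction as the rise in per-instance CPU utilization.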

Page 26

Cassandra Performance

Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
  • Increase read ops from 30k to 70k in ~3 minutes
  • Increase write ops from 750 to 1500 in ~3 minutes

Results:
• 95th pctl response time increase: ~17 msec to 45 msec
• 99th pctl response time increase: ~35 msec to 80 msec

Page 27

EVCache (memcached) Scalability

Response times consistent during 4x increase in load *

* Due to upstream code change

Page 28

Cloud-scale Load Testing

• Ad-Hoc or CI-based load test model
  • (CI) Run-over-run comparison; email on rule violation

1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
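The run-over-run comparison in the CI model can be sketched as a function that compares the current run's metrics against the previous run and reports any that regressed beyond a limit. This is an illustration only: the function name, the metric names, and the 10% limit are assumptions, and the sketch treats every metric as "lower is better". Sending the email and fetching results from S3 or Atlas are out of scope here.

```python
def rule_violations(previous, current, max_regression=0.10):
    """List metrics that got worse by more than `max_regression`.

    previous, current -- dicts mapping metric name to value, where a
    lower value is better (e.g. response time in msec).
    """
    violations = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None or old <= 0:
            continue   # metric missing from this run, or no usable baseline
        change = (new - old) / old
        if change > max_regression:
            violations.append(f"{name}: {old:g} -> {new:g} (+{change:.0%})")
    return violations
```

A CI job could call this after step 5 and send the notification only when the returned list is non-empty.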

Page 29

Conclusions

• Continually accelerate engineering velocity
  • Evolve architecture and processes to mitigate risks
  • Stateless micro-service architectures win!
• Remove barriers for engineers
  • Last option should be to reduce rate of change
  • Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
  • Leverage pre-existing OSS PaaS when possible

Page 30

Netflix Open Source

Our Open Source Software simplifies mgmt at scale

Great projects, stunning colleagues: jobs.netflix.com

Page 31

Q&A

[email protected]

• Netflix Tech Blog: http://techblog.netflix.com