Operating a Highly Available Cloud Service - …files.meetup.com/1460349/Operating a Highly...

Post on 13-Apr-2018

Agenda

•Intuit and QuickBase

•Building and Running Highly Available Cloud Services

–People & Process

–Technology

The single most important thing to keep in mind when designing for High Availability is to anticipate failure.

20% of GDP & Pay 1 in 12

Improving

Lives 60M

Apps for >50% of Fortune 500

Facilitate $40B Tax Refunds

#1 Financial Management Software

#1 for Innovation in Computer Software Industry

What is QuickBase?

One platform solves jobs across the enterprise. Project Management, IT helpdesk, CRM, Field service, Human resources, etc.

An Enterprise platform to empower your team to build applications

Easily customized to meet unique business needs

Requirements, processes and teams evolving constantly

Excel to QuickBase in less than 5 minutes

500,000+ current users

Brand NEW modern UI enables Ease of Use

More than 4,500 companies use QuickBase

QuickBase – Customized applications matching your unique requirements

Open extensible APIs

Common Infrastructure Services

Roles Based UI

Dashboards & Reports

Business logic & workflow

Secure Access Control

Relational Data Tables

Data Storage & Backup

Modern, Easy, Productive, Dynamic, Fast

30 million requests per day

80 K unique visitors per day

100,000 active apps at any time

25 milliseconds median processing time

Supports Dynamic DML, DDL, CRUD

Cloud based Database with a beautiful UX

New QuickBase DIY Data Access

1. QuickBase UI Extended with new DIY data sharing

2. New Data Sharing Service

3. Connections to Popular Industry Data

[Diagram labels: Data Mapping WSQL Transforms; Virtual tables Cache; Warehouse Scheduler Repository; Liberator Library; Liberators; ANY API]

Intuit-class infrastructure (security, billing, HADR, hosting)

AVAILABILITY

PSTN Systems Availability SLA

Availability and the downtime it allows:

99.9999 % (“six nines”): 31.5 secs/yr, 2.59 secs/month, 0.605 secs/week

99.999 % (“five nines”): 5.26 mins/yr, 25.9 secs/month, 6.05 secs/week

Web Services Availability SLA

Availability and the downtime it allows:

99.95 %: 4.38 hrs/yr, 21.56 mins/month, 5.04 mins/week

99.9 %: 8.76 hrs/yr, 43.8 mins/month, 10.1 mins/week
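The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a 30-day month (the function name and the month convention are assumptions, not part of the slides):

```python
# Downtime budget implied by an availability SLA.
# Assumptions: 365-day year, 30-day month, 7-day week.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def downtime_seconds(availability_percent: float) -> dict:
    """Return the allowed downtime in seconds per year, month and week."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return {
        "year": unavailable_fraction * SECONDS_PER_YEAR,
        "month": unavailable_fraction * SECONDS_PER_MONTH,
        "week": unavailable_fraction * SECONDS_PER_WEEK,
    }


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.999, 99.9999):
        budget = downtime_seconds(sla)
        print(f"{sla} %: {budget['year'] / 3600:.2f} h/yr, "
              f"{budget['month'] / 60:.2f} min/month, "
              f"{budget['week']:.1f} s/week")
```

Monthly figures depend on the month length assumed, so they can differ slightly from the monthly numbers quoted above.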

http://www.google.com/apps/intl/en/terms/sla.html

PEOPLE & PROCESSES Operating High Availability Service

People & Process: Monitoring Business Metrics

• It’s critical to detect a problem before your customers have to tell you or you have to ask them.

• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.

• Five evolutionary questions that monitoring should answer:

1. Is there a problem?

2. Where is the problem?

3. What is the problem?

4. Why is there a problem?

5. Will there be a problem?

• External versus Internal Monitoring
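Comparing a live business metric to its historical curve can be sketched as follows. This is an illustration only: the metric (logins per minute), the sample values, and the three-sigma tolerance are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(actual: float, history: list[float], tolerance_sigmas: float = 3.0) -> bool:
    """Flag a reading that deviates from past readings for the same time slot."""
    if len(history) < 2:
        return False  # not enough history to judge
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return actual != baseline
    return abs(actual - baseline) > tolerance_sigmas * spread


# Hypothetical data: logins per minute at the same weekday and hour in past weeks.
past_weeks = [1180, 1225, 1210, 1195, 1240]

print(is_anomalous(610, past_weeks))   # a sharp drop answers "Is there a problem?"
print(is_anomalous(1200, past_weeks))  # a normal reading stays quiet
```

A check like this only answers the first of the five questions; locating and explaining the problem still needs system-level monitoring.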

http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/

People & Process: Invest in Good Tools

95 K Requests in 12 hour window

Peak Request: 4.3 req/sec (1286 request/5 min window)

Processing Time: 61 millisecond per request

A good tool will help you find the needle in a haystack - fast

People & Process: Incident Management Process

• Incident Management Team (IMT)

• Incident Management Response Plan

• Activating the IMT, notifications

• Having the right break-out rooms

• Classification of the incident

• Communication of the incident

• Time keeper

• Management versus Technical Process

• Tracking:

– SLA

– RPO (recovery point objective)

– RTO (recovery time objective)

• Incident closure, recovery

• Evaluation process

People & Process: Runbook and messaging

• Runbook

– Detailed process for managing the incident

– Contact Information

– Managing data center cutover, recovery steps, testing, managing replication

• Messaging book

– Who is responsible for communication

– Who creates and approves the message

– How you communicate

– At what cadence

– What you tell your customers

• Social Media Strategy

– If you are not transparent, your customers will let you know

– Social Media coordinator – own the channels

People & Process: Service Page

Give customers the ability to check the health of the system and to be notified of any service-related issues.

Transparency is Key. If you let the customers know what you know, they will respect you and may remain loyal to your business.

People & Process: Business Fault Isolation

• What if your data center went down

• And the production server is down because the data center is down

• And your email server was in the same data center

• And your marketing server was in the same data center

• And your service page was on a server in the same data center

• How do you communicate with all your customers?

Business Fault Isolation protects your business from a single point of failure (SPOF).
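The scenario above can be checked mechanically against an inventory of where each business function is hosted. A minimal sketch, with a hypothetical inventory and hypothetical data center names:

```python
PRODUCTION = "production"

# Hypothetical inventory: which data center hosts each business function.
placement = {
    "production": "dc-east",
    "email": "dc-east",
    "marketing_site": "dc-east",
    "service_page": "dc-west",
}


def channels_lost_with_production(placement: dict[str, str]) -> list[str]:
    """Return the functions that go down together with the production data center."""
    production_dc = placement[PRODUCTION]
    return sorted(
        name for name, dc in placement.items()
        if name != PRODUCTION and dc == production_dc
    )


print(channels_lost_with_production(placement))
```

Any communication channel that shows up in the result would be unavailable exactly when customers most need to hear from you.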

People & Process: Review Process

• SaaS or Operations Review Process should have a fixed cadence and be led by a company leader

• Review Team should include leaders from:

– Finance

– Compliance & Risk

– CTO

– Operations

– Product

• Dashboard with KPIs

• Review Fire drills

• Change Control Process

– Preferably change one thing at a time

TECHNOLOGIES Operating High Availability Service

The Three Pillars of High Availability

The goal of High Availability and Disaster Recovery (HA/DR) is to provide Business Continuance through:

HA/DR directly enhances a customer’s experience through greater offering availability

Lack of Service Outage = Happy Customers = Greater Business Value

High Availability Architecture Principles

•Design for Failure

–Avoid Single Points of Failure

–Graceful Degradation and Soft Dependencies

–Asynchronous Design

–Keep State Confined to Where it is Needed

•Design for Operability

–Design to be Monitored

–Design for Hot Deployment and Rollback

–Automate Where Possible

•Keep Everything “In Production”

•Scale Out (Not Up)

•Keep it Fresh…and Mature
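Graceful degradation with a soft dependency can be sketched as a wrapper that serves a fallback when the dependency fails. The recommendation service below is a hypothetical example, not something named in the slides.

```python
import logging

logger = logging.getLogger("degradation")


def with_fallback(primary, fallback_value):
    """Wrap a soft dependency so its failure degrades the result instead of failing the caller."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception as exc:  # broad on purpose: any failure of a soft dependency is tolerated
            logger.warning("soft dependency failed, serving fallback: %s", exc)
            return fallback_value
    return wrapped


def fetch_recommendations(user_id):
    # Hypothetical soft dependency; it always fails here to show the degraded path.
    raise TimeoutError("recommendation service unavailable")


safe_recommendations = with_fallback(fetch_recommendations, fallback_value=[])
print(safe_recommendations("user-1"))  # the page still renders, just without recommendations
```

The same wrapper should not be applied to hard dependencies, where a silent fallback would hide a real outage.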

Architecture Patterns for High Availability

Swimlanes

1) Active/Passive

2) Active/Active

3) Single Write Master

4) Store and Forward
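The slides name Store and Forward without detailing it. As a rough sketch of the general pattern, messages are stored locally first and forwarded in order once the destination accepts them. The class name and the in-memory queue are assumptions; a production version would persist the queue to disk.

```python
from collections import deque


class StoreAndForward:
    """Queue messages locally and forward them in order when the destination is reachable."""

    def __init__(self, send):
        self.send = send          # callable that raises ConnectionError when delivery fails
        self.pending = deque()    # in-memory only; a real system would use durable storage

    def submit(self, message):
        self.pending.append(message)
        return self.flush()

    def flush(self):
        """Forward stored messages in order, stop at the first failure, return the count delivered."""
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # keep the message and retry on the next flush
            self.pending.popleft()
            delivered += 1
        return delivered
```

Callers get an immediate acknowledgement from `submit` even while the remote side is down, which is what lets the sending service stay available.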

Active / Passive

[Diagram: the Primary Data Center holds the active data; near real-time replication copies it to a passive backup in the Secondary Data Center.]

Swimlane Principle

A “Swimlane” is:

A set of predefined systems and software infrastructure tuned to support a predefined workload

•Only a portion of an offering’s total users are hosted on any given swimlane

Within a Swimlane:

–Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes

–Offering transactions occur within a Swimlane

–Only access to Shared Services goes outside the Swimlane

–Standard Fault Detection and Fault Recovery methods are used
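Hosting only a portion of users on each swimlane requires a stable mapping from user to swimlane. The slides do not say how that mapping is made; the sketch below assumes a simple hash of a customer ID, with hypothetical lane names.

```python
import hashlib

# Hypothetical lane names; the slides show four swimlanes per data center.
SWIMLANES = ["swimlane-1", "swimlane-2", "swimlane-3", "swimlane-4"]


def swimlane_for(customer_id: str, lanes: list[str] = SWIMLANES) -> str:
    """Map a customer to one swimlane deterministically."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(lanes)
    return lanes[index]


print(swimlane_for("acme-corp"))
```

Modulo hashing moves most customers when the number of lanes changes, so a lookup table that pins each customer to a lane is a common alternative.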

Intuit Proprietary & Confidential

High Availability with Swimlanes

[Diagram: Internet traffic reaches two data centers (DC 1 and DC 2) through DNS and F5 GTM/LTM load balancers. Each data center is its own fault domain (Fault Domain 1, Fault Domain 2) and hosts Swimlanes 1 to 4; each swimlane has its own web server (WS), app server (AS) and storage. Application partitioning is done via swimlanes.]

Swimlanes Support Application Needs

• Scalability

– Replicated swimlanes add capacity with linear scalability

• Fault Isolation

– Complete failure only impacts a subset of users due to application partitioning and data sharding

• High Availability

– Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart

• Disaster Recovery

– Disaster recovery is achieved through swimlane failover, either in the same or a remote data center

• Automation

– The identical nature of a swimlane allows for a high degree of operational automation

Active / Active – Swim Lanes

[Diagram: a Global Load Balancer splits customers across four database groups, 25% of customers each, in Data Center 1 and Data Center 2, with replication between the data centers.]

DB1 active / DB3 passive

DB2 active / DB4 passive

DB3 active / DB1 passive

DB4 active / DB2 passive
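Failover in this layout amounts to promoting the passive copy of every database whose active copy sat in the failed data center. A minimal sketch, assuming DB1 and DB2 are active in Data Center 1 and DB3 and DB4 in Data Center 2 (the diagram text does not state the placement explicitly):

```python
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    active_dc: str
    passive_dc: str


# Assumed placement: each shard serves 25% of customers.
SHARDS = [
    Shard("DB1", active_dc="DC1", passive_dc="DC2"),
    Shard("DB2", active_dc="DC1", passive_dc="DC2"),
    Shard("DB3", active_dc="DC2", passive_dc="DC1"),
    Shard("DB4", active_dc="DC2", passive_dc="DC1"),
]


def fail_over(shards: list[Shard], failed_dc: str) -> list[Shard]:
    """Promote the passive copy of every shard whose active copy was in failed_dc."""
    result = []
    for shard in shards:
        if shard.active_dc == failed_dc:
            result.append(Shard(shard.name, active_dc=shard.passive_dc, passive_dc=failed_dc))
        else:
            result.append(shard)
    return result


for shard in fail_over(SHARDS, "DC1"):
    print(shard.name, "now active in", shard.active_dc)
```

After a failover the surviving data center carries all four shards, so each data center needs enough headroom to serve 100% of customers.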

Active / Active – Single Write Master

[Diagram: four data centers (DC1 to DC4), each with its own Read Cache. Writes and updates go to a single write master, and cache updates flow from it to every read cache.]
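The single write master idea can be sketched with one authoritative store and a read cache per data center. The class and method names are illustrative, and cache updates are applied synchronously here for brevity; in a real deployment they would typically propagate asynchronously, so reads may briefly be stale.

```python
class SingleWriteMaster:
    """One master accepts all writes; each data center serves reads from a local cache."""

    def __init__(self, data_centers):
        self.master = {}
        self.read_caches = {dc: {} for dc in data_centers}

    def write(self, key, value):
        # All writes go to the single master...
        self.master[key] = value
        # ...which then pushes a cache update to every data center.
        for cache in self.read_caches.values():
            cache[key] = value

    def read(self, data_center, key):
        # Reads never touch the master, so they keep working if the master is down.
        return self.read_caches[data_center].get(key)


store = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"])
store.write("plan", "premium")
print(store.read("DC3", "plan"))
```

The trade-off is that write availability still depends on the one master, while read availability scales with the number of data centers.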

Design for Failure: Resiliency Patterns

Throttling versus Circuit Breaker
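Throttling limits how many requests are accepted, while a circuit breaker stops calls to a dependency that is already failing. One common way to throttle is a token bucket; the sketch below is an illustration with assumed parameter names, not the mechanism used in the talk.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the request
```

A request handler would call `allow()` first and return an error such as HTTP 429 when it yields `False`.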

Circuit Breaker Pattern

http://techblog.netflix.com/2012_02_01_archive.html

Circuit Breaker State Diagram

C: Caller; D: Dependency

Closed

– On call / pass through

– Call succeeds / reset count

– Call fails / count failure

– Threshold reached / trip breaker

Open

– On call / fail

– On timeout / attempt reset

Half Open

– On call / pass through

– On success / reset

– On failure / trip breaker

Transitions: Closed to Open (trip breaker); Open to Half Open (attempt reset); Half Open to Open (trip breaker); Half Open to Closed (reset)
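The three states and their transitions can be sketched as a small wrapper around calls to a dependency. The class name, default threshold, and timeout are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency is not called."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")  # Open: on call / fail
            self.state = self.HALF_OPEN                            # on timeout / attempt reset
        try:
            result = func(*args, **kwargs)                         # Closed or Half Open: pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0           # call succeeds / reset count
        self.state = self.CLOSED    # a success in Half Open closes the breaker

    def _on_failure(self):
        self.failures += 1          # call fails / count failure
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN  # threshold reached or trial call failed / trip breaker
            self.opened_at = self.clock()
```

While the breaker is open the caller fails fast instead of tying up threads on timeouts, which gives the dependency room to recover.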

Circuit Breaker Pattern: Example

http://techblog.netflix.com/2012_02_01_archive.html

Circuit Breaker Pattern: Example

Example of how threads, network timeouts and retries combine

http://techblog.netflix.com/2012_02_01_archive.html

Examples of Tools for Building HA Systems

• Highly Available DNS – Akamai, Dyn, AWS Route53

• Load Balancing – F5 LTM, F5 GTM, AWS ELB

• Data Replication – Golden Gate

• Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti

• Application Performance – DynaTrace, NewRelic

• Deployment – Perforce, Maven, Nexus, Hudson, Puppet

• Distributed Databases – NuoDB, VoltDB, several NoSQL types

• Distributed Storage – GlusterFS, Atmos, OpenStack

• HA Devices – Veritas Cluster Server

• OS Virtualization – AWS, VMware, Xen, Parallels

• Network Virtualization – AWS, VMware NSX, PLUMgrid

• Caching – Memcached, Akamai, CloudFront

• Failure Testing – Netflix Chaos Monkey

• DDoS Protection – Arbor, Riverbed

Trust Not the Execution Environment

“Everything Fails, All the Time.” – Werner Vogels, CTO of Amazon.com

Summary: Operating HA Service

Monitoring Business Metrics

Incident Management Process

Runbooks

Social Media & Messaging

Service Page

Business Fault Isolation

SLA, RPO, RTO

Failover Drills

Review Process

Change one thing at a time

Principles:

– Design for Failure

– Design for Operability

– Keep Everything “In Production”

– Scale Out (stateless)

– Keep it Fresh

Patterns:

– Active/Active

– Swimlanes

– Active/Passive

– Store-Forward

Design:

– Throttling

– Circuit Breaker

– Caching

– Rollback

– Healthchecks

Tools

Thank You!