Operating a Highly Available Cloud Service

Depankar Neogi Chief Architect QuickBase, Intuit Inc. November 14, 2013

http://www.meetup.com/Boston-cloud-services/events/141118632/

Presented at Boston Cloud Services Meetup

Agenda

•Intuit and QuickBase

•Building and Running Highly Available Cloud Services

–People & Process

–Technology

The single most important thing to keep in mind when designing for High Availability is to anticipate failure.

20% of GDP & Pay 1 in 12

Improving 60M Lives

Apps for >50% of Fortune 500

Facilitate $40B Tax Refunds

#1 Financial Management Software

#1 for Innovation in Computer Software Industry

What is QuickBase?

One platform solves jobs across the enterprise. Project Management, IT helpdesk, CRM, Field service, Human resources, etc.

An Enterprise platform to empower your team to build applications

Easily customized to meet unique business needs

Requirements, processes and teams evolving constantly

Excel to QuickBase in less than 5 minutes

500,000+ current users

Brand NEW modern UI enables Ease of Use

More than 4,500 companies use QuickBase

QuickBase – Customized applications matching your unique requirements

Open extensible APIs

Common Infrastructure Services

Roles Based UI Dashboards & Reports

Business logic & workflow

Secure Access Control

Relational Data Tables

Data Storage & Backup

Modern, Easy, Productive, Dynamic, Fast

30 million requests per day

80 K unique visitors per day

100,000 active apps at any time

25 milliseconds median processing time

Supports Dynamic DML, DDL, CRUD

Cloud based Database with a beautiful UX

New QuickBase DIY Data Access

Data Mapping WSQL Transforms

Virtual tables Cache

Warehouse Scheduler Repository

Liberator Library

Liberators

1. QuickBase UI Extended with new DIY data sharing

2. New Data Sharing Service

3. Connections to Popular Industry Data

Intuit-class infrastructure (security, billing, HADR, hosting)

AVAILABILITY

PSTN Systems

Availability SLA          Downtime
99.9999% (“six nines”)    31.5 secs/yr, 2.59 secs/month, 0.605 secs/week
99.999% (“five nines”)    5.26 mins/yr, 25.9 secs/month, 6.05 secs/week

Web Services

Availability SLA          Downtime
99.95%                    4.38 hrs/yr, 21.56 mins/month, 5.04 mins/week
99.9%                     8.76 hrs/yr, 43.8 mins/month, 10.1 mins/week

http://www.google.com/apps/intl/en/terms/sla.html
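The yearly and weekly downtime figures follow from (1 − availability) × period. A minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative and not from the slides. Monthly figures depend on the month length you assume, so the period is left as a parameter.

```python
# Illustrative helper (not from the slides): convert an availability
# percentage into the downtime it allows over a period.
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year
SECONDS_PER_WEEK = 7 * 24 * 3600


def allowed_downtime_seconds(availability_pct: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the seconds of downtime permitted by an availability SLA."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9999, 99.999, 99.95, 99.9):
        per_year = allowed_downtime_seconds(pct)
        per_week = allowed_downtime_seconds(pct, SECONDS_PER_WEEK)
        print(f"{pct}%: {per_year:.1f} s/yr, {per_week:.3f} s/week")
```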

PEOPLE & PROCESSES Operating High Availability Service

People & Process: Monitoring Business Metrics

• It’s critical to detect a problem before your customers have to tell you or you have to ask them.

• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.

• Five evolutionary questions that monitoring should answer:

1. Is there a problem?

2. Where is the problem?

3. What is the problem?

4. Why is there a problem?

5. Will there be a problem?

• External versus Internal Monitoring

http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
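Comparing a live business metric to its historical curve can be sketched as follows. This is one simple approach (a standard-deviation band around the historical mean for the same time slot), not necessarily the method the speaker used; the function name, threshold, and sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def deviates_from_history(actual: float, history: list[float],
                          max_sigma: float = 3.0) -> bool:
    """Return True when `actual` is more than `max_sigma` standard
    deviations away from the mean of `history` (same time slot, past periods)."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any difference counts as a deviation.
        return actual != baseline
    return abs(actual - baseline) / spread > max_sigma


# Hypothetical data: sign-ins seen at the same hour on five previous weeks.
past_signins = [1180, 1225, 1210, 1195, 1240]
print(deviates_from_history(640, past_signins))   # far below the usual curve
print(deviates_from_history(1205, past_signins))  # within normal variation
```

A check like this answers only the first question, "Is there a problem?"; locating and explaining the problem still needs system-level monitoring.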

People & Process: Invest in Good Tools

95 K requests in a 12-hour window

Peak request rate: 4.3 req/sec (1,286 requests per 5-minute window)

Processing time: 61 milliseconds per request

A good tool will help you find the needle in a haystack - fast

People & Process: Incident Management Process

• Incident Management Team (IMT)

• Incident Management Response Plan

• Activating the IMT, notifications

• Having the right break-out rooms

• Classification of the incident

• Communication of the incident

• Time keeper

• Management versus Technical Process

• Tracking:

– SLA

– RPO (recovery point objective)

– RTO (recovery time objective)

• Incident closure, recovery

• Evaluation process

People & Process: Runbook and messaging

• Runbook

– Detail process for managing the incident

– Contact Information

– Managing data center cutover, recovery steps, testing, managing replication

• Messaging book

– Who is responsible for communication

– Who creates and approves the message

– How you communicate

– At what cadence

– What you tell your customers

• Social Media Strategy

– If you are not transparent, your customers will let you know

– Social Media coordinator – own the channels

People & Process: Service Page

Give customers the ability to check the health of the system and be notified of any service-related issues.

People & Process: Service Page

Transparency is Key. If you let the customers know what you know, they will respect you and may remain loyal to your business.

People & Process: Business Fault Isolation

• What if your data center went down

• And the production server is down because the data center is down

• And your email server was in the same data center

• And your marketing server was in the same data center

• And your service page was on a server in the same data center

• How do you communicate with all your customers?

Business Fault Isolation protects your business from a single point of failure (SPOF).

People & Process: Review Process

• SaaS or Operations Review Process should have a fixed cadence and be led by a company leader

• Review Team should include leaders from:

– Finance

– Compliance & Risk

– CTO

– Operations

– Product

• Dashboard with KPI

• Review Fire drills

• Change Control Process

– Preferably change one thing at a time

TECHNOLOGIES Operating High Availability Service

The Three Pillars of High Availability

The goal of High Availability and Disaster Recovery (HA/DR) is to provide Business Continuance through:

HA/DR directly enhances a customer’s experience through greater offering availability

Lack of Service Outage = Happy Customers = Greater Business Value

High Availability Architecture Principles

•Design for Failure

–Avoid Single Points of Failure

–Graceful Degradation and Soft Dependencies

–Asynchronous Design

–Keep State Confined to Where it is Needed

•Design for Operability

–Design to be Monitored

–Design for Hot Deployment and Rollback

–Automate Where Possible

•Keep Everything “In Production”

•Scale Out (Not Up)

•Keep it Fresh…and Mature

Architecture Patterns for High Availability

Swimlanes

1) Active/Passive

2) Active/Active

3) Single Write Master

4) Store and Forward

Active / Passive

[Diagram: active data in the Primary Data Center is copied by near real-time replication to a passive backup in the Secondary Data Center.]
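The failover decision in an active/passive setup can be sketched as a small controller driven by health checks. This is an illustrative sketch only: the class, the threshold of three consecutive failures, and the manual-failback choice are assumptions, not details from the slides. Because replication is near real-time rather than synchronous, the most recent writes may not have reached the secondary when failover happens.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks primary health and decides which data center serves traffic."""
    failure_threshold: int = 3   # assumed: consecutive failed checks before failover
    consecutive_failures: int = 0
    active: str = "primary"

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the serving data center."""
        if self.active == "secondary":
            # In this sketch, failing back to the primary is a manual step.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active


controller = FailoverController()
for healthy in (True, False, False, False):
    print(controller.record_health_check(healthy))
```

Requiring several consecutive failures avoids failing over on a single dropped health check.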

Swimlane Principle

A “Swimlane” is:

A set of predefined systems and software infrastructure tuned to support a predefined workload

•Only a portion of an offering’s total users are hosted on any given swimlane

Within a Swimlane:

–Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes

–Offering transactions occur within a Swimlane

–Only access to Shared Services go outside the Swimlane

–Standard Fault Detection and Fault Recovery methods are used
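Hosting only a portion of users on each swimlane requires a stable mapping from customer to swimlane. A minimal sketch, assuming hash-based assignment; the lane names and the function are hypothetical, and the slides do not say how QuickBase assigns customers.

```python
import hashlib

SWIMLANES = ["lane-1", "lane-2", "lane-3", "lane-4"]  # hypothetical lane names


def swimlane_for(customer_id: str, lanes: list[str] = SWIMLANES) -> str:
    """Map a customer to exactly one swimlane, deterministically."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(lanes)
    return lanes[index]


print(swimlane_for("acme-corp"))
```

Modulo hashing reassigns many customers when a lane is added, so a production system would more likely keep an explicit customer-to-lane directory as a shared service; the hash is used here only to keep the sketch short.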

Intuit Proprietary & Confidential

High Availability with Swimlanes

[Diagram: traffic from the Internet passes through DNS, F5 GTM and F5 LTM to Fault Domain 1 and Fault Domain 2, each with web servers (WS), app servers (AS) and storage. Application partitioning is done via swimlanes.]

Swimlanes Support Application Needs

• Scalability
– Replicated swimlanes add capacity with linear scalability

• Fault Isolation
– Complete failure only impacts a subset of users due to application partitioning and data sharding

• High Availability
– Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart

• Disaster Recovery
– Disaster recovery is achieved through swimlane failover, either in the same or a remote data center

• Automation
– The identical nature of a swimlane allows for a high degree of operational automation

Active / Active – Swim Lanes

[Diagram: a Global Load Balancer routes customers to Data Center 1 and Data Center 2, with replication between the two. Each swimlane serves 25% of customers and pairs an active database with a passive copy of another:]

– DB1 active / DB3 passive
– DB2 active / DB4 passive
– DB3 active / DB1 passive
– DB4 active / DB2 passive
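The effect of losing a data center in this layout can be sketched as below. The placement table is an assumption: the diagram does not state which data center holds which database, so DB1 and DB2 are taken to be active in Data Center 1 and DB3 and DB4 in Data Center 2, each with its passive replica on the other side.

```python
# Assumed placement (not stated in the diagram): each database is active in
# one data center and has a passive replica in the other.
PLACEMENT = {
    "DB1": {"active": "DC1", "passive": "DC2"},
    "DB2": {"active": "DC1", "passive": "DC2"},
    "DB3": {"active": "DC2", "passive": "DC1"},
    "DB4": {"active": "DC2", "passive": "DC1"},
}
SHARE_PER_DATABASE = 0.25  # 25% of customers per swimlane, as in the diagram


def serving_data_center(database: str, down: set[str]) -> str:
    """Return the data center that should serve `database` right now."""
    placement = PLACEMENT[database]
    for role in ("active", "passive"):
        if placement[role] not in down:
            return placement[role]
    raise RuntimeError(f"no data center available for {database}")


def load_share(down: set[str]) -> dict[str, float]:
    """Fraction of customers served by each data center."""
    share: dict[str, float] = {}
    for database in PLACEMENT:
        dc = serving_data_center(database, down)
        share[dc] = share.get(dc, 0.0) + SHARE_PER_DATABASE
    return share


print(load_share(set()))     # both data centers up
print(load_share({"DC1"}))   # Data Center 1 lost
```

Under this assumed placement, each data center normally carries half the customers but must have headroom for all of them when the other is down.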

Active / Active – Single Write Master

[Diagram: data centers DC1, DC2, DC3 and DC4 with read caches; labeled flows: Writes, Updates, Cache Updates.]
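One reading of the single write master pattern: every write is routed to one master data center, and every data center answers reads from its own read cache, which the master keeps updated. The sketch below is an in-memory illustration under that reading; the class and method names are hypothetical, and the cache update is synchronous here only for brevity.

```python
class SingleWriteMaster:
    """In-memory illustration: one master takes writes, every DC serves reads."""

    def __init__(self, data_centers: list[str], master: str):
        if master not in data_centers:
            raise ValueError("master must be one of the data centers")
        self.master = master
        self.store: dict[str, str] = {}  # authoritative copy, held by the master
        self.read_caches: dict[str, dict[str, str]] = {dc: {} for dc in data_centers}

    def write(self, origin_dc: str, key: str, value: str) -> str:
        """Apply a write arriving at `origin_dc`; return the DC that handled it."""
        if origin_dc not in self.read_caches:
            raise KeyError(origin_dc)
        self.store[key] = value
        # Push the update to every read cache. A real system would likely
        # propagate asynchronously, so reads could briefly return stale data.
        for cache in self.read_caches.values():
            cache[key] = value
        return self.master

    def read(self, dc: str, key: str):
        """Serve a read from the local read cache of `dc` (None if absent)."""
        return self.read_caches[dc].get(key)


service = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"], master="DC1")
print(service.write("DC3", "plan", "enterprise"))  # handled by the master
print(service.read("DC4", "plan"))
```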

Design for Failure: Resiliency Patterns

Throttling versus Circuit Breaker
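Throttling caps the rate of calls regardless of whether the dependency is healthy, whereas a circuit breaker stops calls only after failures are observed. A minimal throttling sketch using a token bucket, which is one common technique and not necessarily the one the speaker had in mind; the class and parameters are illustrative, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_sec` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity       # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the call may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2)
print([bucket.allow() for _ in range(3)])  # the third immediate call is throttled
```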

Circuit Breaker Pattern

http://techblog.netflix.com/2012_02_01_archive.html

Closed
– On call / pass through
– Call succeeds / reset count
– Call fails / count failure
– Threshold reached / trip breaker

Open
– On call / fail
– On timeout / attempt reset

Half Open
– On call / pass through
– On success / reset
– On failure / trip breaker

Participants: Caller, Dependency

Circuit Breaker State Diagram
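The state diagram can be sketched in code as follows. This is a minimal single-threaded illustration of the closed, open and half-open states, not the Netflix implementation referenced above; the class name, default thresholds, and injectable clock are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable")  # on call / fail
            self.state = "half_open"               # on timeout / attempt reset
        try:
            result = func(*args, **kwargs)         # on call / pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0                          # call succeeds / reset count
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1                         # call fails / count failure
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"                    # trip breaker
            self.opened_at = self.clock()
```

Wrapping each dependency call as `breaker.call(fetch, request)` lets the caller fail fast while the dependency is down, instead of tying up a thread until a timeout expires.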

Circuit Breaker Pattern: Example

http://techblog.netflix.com/2012_02_01_archive.html

Example of how threads, network timeouts and retries combine
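How threads, network timeouts and retries combine can be estimated with simple arithmetic. In the sketch below, the worst-case time one call holds a caller thread is the per-attempt timeout multiplied by the number of attempts, and the number of busy threads follows from Little's law (arrival rate × time held). The function names and all numbers are hypothetical, chosen only to illustrate the effect.

```python
def worst_case_hold_seconds(connect_timeout: float, read_timeout: float,
                            retries: int) -> float:
    """Longest one logical call can tie up a caller thread (backoff ignored)."""
    return (connect_timeout + read_timeout) * (1 + retries)


def threads_needed(requests_per_sec: float, hold_seconds: float) -> float:
    """Little's law: calls in flight = arrival rate x time each call is held."""
    return requests_per_sec * hold_seconds


# Hypothetical numbers: 50 req/sec against a dependency.
healthy = threads_needed(50, 0.02)                        # 20 ms normal latency
hung = threads_needed(50, worst_case_hold_seconds(1.0, 5.0, retries=2))
print(healthy, hung)
```

With these assumed values, a dependency that normally occupies about one thread would occupy hundreds once every call runs to its timeout and is retried, which is how a slow dependency can exhaust a caller's thread pool.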

Examples of Tools for Building HA Systems

• Highly Available DNS – Akamai, Dyn, AWS Route 53

• Load Balancing – F5 LTM, F5 GTM, AWS ELB

• Data Replication – Oracle GoldenGate

• Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti

• Application Performance – DynaTrace, NewRelic

• Deployment – Perforce, Maven, Nexus, Hudson, Puppet

• Distributed Databases – NuoDB, VoltDB, several NoSQL types

• Distributed Storage – GlusterFS, Atmos, OpenStack

• HA Devices – Veritas Cluster Server

• OS Virtualization – AWS, VMware, Xen, Parallels

• Network Virtualization – AWS, VMware NSX, PLUMgrid

• Caching – Memcached, Akamai, CloudFront

• Failure Testing – Netflix Chaos Monkey

• DDoS Protection – Arbor, Riverbed

Trust Not the Execution Environment

“Everything Fails, All the Time.” – Werner Vogels, CTO of Amazon.com

Summary: Operating HA Service

Monitoring Business Metrics

Incident Management Process

Runbooks

Social Media & Messaging

Service Page

Business Fault Isolation

SLA, RPO, RTO

Failover Drills

Review Process

Change one thing at a time

Principles:

– Design for Failure

– Design for Operability

– Keep Everything “In Production”

– Scale Out (stateless)

– Keep it Fresh

Patterns:

– Active/Active

– Swimlanes

– Active/Passive

– Store-Forward

Design:

– Throttling

– Circuit Breaker

– Caching

– Rollback

– Healthchecks

Thank You!
