Post on 13-Apr-2018
Operating a Highly Available Cloud Service
Depankar Neogi Chief Architect QuickBase, Intuit Inc. November 14, 2013
http://www.meetup.com/Boston-cloud-services/events/141118632/
Presented at Boston Cloud Services Meetup
Agenda
•Intuit and QuickBase
•Building and Running Highly Available Cloud Services
–People & Process
–Technology
The single most important thing to keep in mind when designing for High Availability is to anticipate failure.
20% of GDP & Pay 1 in 12
Improving 60M Lives
Apps for >50% of Fortune 500
Facilitate $40B Tax Refunds
#1 Financial Management Software
#1 for Innovation in Computer Software Industry
What is QuickBase?
One platform solves jobs across the enterprise. Project Management, IT helpdesk, CRM, Field service, Human resources, etc.
An Enterprise platform to empower your team to build applications
Easily customized to meet unique business needs
Requirements, processes and teams evolving constantly
Excel to QuickBase in less than 5 minutes
500,000+ current users
Brand NEW modern UI enables Ease of Use
More than 4,500 companies use QuickBase
QuickBase – Customized applications matching your unique requirements
Open extensible APIs
Common Infrastructure Services
Roles Based UI Dashboards & Reports
Business logic & workflow
Secure Access Control
Relational Data Tables
Data Storage & Backup
Modern, Easy, Productive, Dynamic, Fast
30 million requests per day
80 K unique visitors per day
100,000 active apps at any time
25 milliseconds median processing time
Supports Dynamic DML, DDL, CRUD
Cloud based Database with a beautiful UX
New QuickBase DIY Data Access
Data Mapping WSQL Transforms
Virtual tables Cache
Warehouse Scheduler Repository
Liberator Library
Liberators
1. QuickBase UI Extended with new DIY data sharing
2. New Data Sharing Service
3. Connections to Popular Industry Data
ANY API
Intuit-class infrastructure (security, billing, HADR, hosting)
AVAILABILITY
PSTN Systems Availability SLA
Availability: Downtime
99.9999 % “six nines”: 31.5 secs/yr, 2.59 secs/month, 0.605 secs/week
99.999 % “five nines”: 5.26 mins/yr, 25.9 secs/month, 6.05 secs/week
Web Services Availability SLA
Availability: Downtime
99.95 %: 4.38 hrs/yr, 21.56 mins/month, 5.04 mins/week
99.9 %: 8.76 hrs/yr, 43.8 mins/month, 10.1 mins/week
http://www.google.com/apps/intl/en/terms/sla.html
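The downtime budgets on the two SLA slides follow from simple arithmetic. The sketch below assumes a 365-day year and a month of one twelfth of a year; published tables use slightly different month lengths, so the monthly figures can differ by a few percent. The function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float) -> dict:
    """Return the downtime budget in seconds per year, month and week."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    per_year = SECONDS_PER_YEAR * unavailable_fraction
    return {
        "year": per_year,
        "month": per_year / 12,       # assumption: month = 1/12 of a year
        "week": per_year * 7 / 365,
    }


for pct in (99.9, 99.95, 99.999, 99.9999):
    budget = allowed_downtime_seconds(pct)
    print(f"{pct}%: {budget['year']:.1f} s/yr, "
          f"{budget['month']:.1f} s/month, {budget['week']:.2f} s/week")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the step from a 99.9 % web SLA to a telephony-grade SLA changes how a service must be operated.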
PEOPLE & PROCESSES Operating High Availability Service
People & Process: Monitoring Business Metrics
• It’s critical to detect a problem before your customers have to tell you or you have to ask them.
• By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.
• Five evolutionary questions that monitoring should answer:
1. Is there a problem?
2. Where is the problem?
3. What is the problem?
4. Why is there a problem?
5. Will there be a problem?
• External versus Internal Monitoring
http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
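Comparing a live business metric to its historical curve can be sketched as below. This is a minimal illustration, not the monitoring used at QuickBase: the metric (logins per minute), the sample values and the three-standard-deviation tolerance are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(actual: float, history: list[float], tolerance: float = 3.0) -> bool:
    """Flag `actual` when it deviates from the historical mean for the same
    time slot by more than `tolerance` standard deviations."""
    baseline = mean(history)
    spread = stdev(history)  # needs at least two historical samples
    if spread == 0:
        return actual != baseline
    return abs(actual - baseline) > tolerance * spread


# Hypothetical data: logins per minute at 10:00 on the previous five Mondays.
history_10am = [410, 395, 402, 420, 398]

print(is_anomalous(405, history_10am))  # close to the historical curve
print(is_anomalous(120, history_10am))  # far below the historical curve
```

A check like this answers only the first question, "Is there a problem?"; locating and explaining the problem still needs system-level monitoring.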
People & Process: Invest in Good Tools
95 K requests in a 12-hour window
Peak request rate: 4.3 req/sec (1,286 requests in a 5-minute window)
Processing time: 61 milliseconds per request
A good tool will help you find the needle in a haystack - fast
People & Process: Incident Management Process
• Incident Management Team (IMT)
• Incident Management Response Plan
• Activating the IMT, notifications
• Having the right break-out rooms
• Classification of the incident
• Communication of the incident
• Time keeper
• Management versus Technical Process
• Tracking:
– SLA
– RPO (recovery point objective)
– RTO (recovery time objective)
• Incident closure, recovery
• Evaluation process
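Tracking RTO and RPO for an incident comes down to two time differences. The sketch below is one hedged way to record them; the `Incident` fields and the `evaluate` helper are hypothetical names, not part of any Intuit tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    outage_start: datetime
    service_restored: datetime
    last_good_replica: datetime  # newest data that survived the outage


def evaluate(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare actual downtime and data-loss window with the objectives."""
    downtime = incident.service_restored - incident.outage_start
    data_loss_window = incident.outage_start - incident.last_good_replica
    return {
        "downtime": downtime,
        "rto_met": downtime <= rto,
        "data_loss_window": data_loss_window,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: 45 minutes of downtime, 2 minutes of lost writes.
incident = Incident(
    outage_start=datetime(2013, 11, 14, 10, 0),
    service_restored=datetime(2013, 11, 14, 10, 45),
    last_good_replica=datetime(2013, 11, 14, 9, 58),
)
print(evaluate(incident, rto=timedelta(hours=1), rpo=timedelta(minutes=1)))
```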
People & Process: Runbook and messaging
• Runbook
– Detailed process for managing the incident
– Contact Information
– Managing data center cutover, recovery steps, testing, managing replication
• Messaging book
– Who is responsible for communication
– Who creates and approves the message
– How you communicate
– At what cadence
– What you tell your customers
• Social Media Strategy
– If you are not transparent, your customers will let you know
– Social Media coordinator – own the channels
People & Process: Service Page
Give customers the ability to check the health of the system and to be notified of any service-related issues.
People & Process: Service Page
Transparency is Key. If you let the customers know what you know, they will respect you and may remain loyal to your business.
People & Process: Business Fault Isolation
• What if your data center went down
• And the production server is down because the data center is down
• And your email server was in the same data center
• And your marketing server was in the same data center
• And your service page was on a server in the same data center
• How do you communicate with all your customers?
Business Fault Isolation protects your business from a single point of failure (SPOF).
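The scenario above can be turned into a simple inventory check. The sketch below assumes you keep a mapping from each customer-facing communication channel to the site that hosts it; the site and channel names are made up for illustration.

```python
def shared_fault_domains(production_site: str, channels: dict[str, str]) -> list[str]:
    """Return the communication channels that would go down together with
    production because they are hosted at the same site."""
    return sorted(name for name, site in channels.items() if site == production_site)


# Hypothetical inventory of where each communication channel is hosted.
channels = {
    "email": "dc-east",
    "marketing_site": "dc-east",
    "service_page": "external-host",
}

at_risk = shared_fault_domains("dc-east", channels)
print(at_risk)
```

Any channel the check returns is one you cannot rely on to reach customers during a data center outage.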
People & Process: Review Process
• SaaS or Operations Review Process should have a fixed cadence and be led by a company leader
• Review Team should include leaders from:
– Finance
– Compliance & Risk
– CTO
– Operations
– Product
• Dashboard with KPI
• Review Fire drills
• Change Control Process
– Preferably change one thing at a time
TECHNOLOGIES Operating High Availability Service
The Three Pillars of High Availability
The goal of High Availability and Disaster Recovery (HA/DR) is to provide Business Continuance through:
HA/DR directly enhances a customer’s experience through greater offering availability
Lack of Service Outage = Happy Customers = Greater Business Value
High Availability Architecture Principles
•Design for Failure
–Avoid Single Points of Failure
–Graceful Degradation and Soft Dependencies
–Asynchronous Design
–Keep State Confined to Where it is Needed
•Design for Operability
–Design to be Monitored
–Design for Hot Deployment and Rollback
–Automate Where Possible
•Keep Everything “In Production”
•Scale Out (Not Up)
•Keep it Fresh…and Mature
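"Graceful Degradation and Soft Dependencies" can be illustrated with a page that needs one service (a hard dependency) and merely benefits from another (a soft dependency). The function and parameter names below are hypothetical.

```python
def render_dashboard(load_apps, load_recommendations) -> dict:
    """Build a page from a hard dependency and a soft dependency."""
    # Hard dependency: without the app list there is no page, so errors propagate.
    page = {"apps": load_apps()}
    # Soft dependency: if it fails, serve the page without that section.
    try:
        page["recommendations"] = load_recommendations()
    except Exception:
        page["recommendations"] = []  # degraded but still usable
    return page


def broken_recommendations():
    raise TimeoutError("recommendation service unavailable")


print(render_dashboard(lambda: ["CRM", "Helpdesk"], broken_recommendations))
```

The design choice is to decide per dependency, ahead of time, whether its failure should fail the request or only remove a feature.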
Architecture Patterns for High Availability
Swimlanes
1) Active/Passive
2) Active/Active
3) Single Write Master
4) Store and Forward
Active / Passive
Primary Data Center: Active Data
Secondary Data Center: Passive Back Up
Near Real-time Replication from the Primary to the Secondary Data Center
Swimlane Principle
A “Swimlane” is:
A set of predefined systems and software infrastructure tuned to support a predefined workload
•Only a portion of an offering’s total users are hosted on any given swimlane
Within a Swimlane:
–Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes
–Offering transactions occur within a Swimlane
–Only access to Shared Services goes outside the Swimlane
–Standard Fault Detection and Fault Recovery methods are used
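Hosting only a portion of users on each swimlane requires a stable mapping from a user or account to a lane. The sketch below uses a hash of the account ID; this is an assumption made for illustration, and the lane names and `swimlane_for` helper are hypothetical. A real deployment might use a directory lookup instead.

```python
import hashlib

SWIMLANES = ["swimlane-1", "swimlane-2", "swimlane-3", "swimlane-4"]


def swimlane_for(account_id: str, lanes: list[str] = SWIMLANES) -> str:
    """Map an account to one swimlane with a stable hash, so that every
    request for that account lands on the same lane."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(lanes)
    return lanes[index]


for account in ("acme-corp", "globex", "initech"):
    print(account, "->", swimlane_for(account))
```

Modulo hashing reshuffles most accounts when the number of lanes changes, so adding a swimlane with this scheme implies a data migration; a lookup table avoids that at the cost of one more shared service.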
Intuit Proprietary & Confidential
High Availability with Swimlanes
Figure: Application Partitioning via Swimlanes. Traffic arrives from the Internet through GTM DNS; each of the two data centers (DC 1 = Fault Domain 1, DC 2 = Fault Domain 2) is fronted by F5 GTM and F5 LTM. Swimlanes 1 to 4 and their counterparts 1’ to 4’ are spread across the two data centers, and every swimlane has its own WS, AS and Storage.
WS: web server; AS: app server
Swimlanes Support Application Needs
• Scalability
– Replicated swimlanes add capacity with linear scalability
• Fault Isolation
– Complete failure only impacts a subset of users due to application partitioning and data sharding
• High Availability
– Individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart
• Disaster Recovery
– Disaster recovery is achieved through swimlane failover, either in the same or a remote data center
• Automation
– The identical nature of a swimlane allows for a high degree of operational automation
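Swimlane failover for disaster recovery can be sketched as a lookup from a primary lane to its standby. The naming below (a trailing apostrophe for the standby, as in the swimlane diagram) and the health set are assumptions for illustration; how health is detected is out of scope here.

```python
# Hypothetical mapping from each primary swimlane to its standby copy.
STANDBY_OF = {f"swimlane-{n}": f"swimlane-{n}'" for n in range(1, 5)}


def resolve_swimlane(primary: str, healthy: set[str]) -> str:
    """Return the lane that should serve traffic for `primary`."""
    if primary in healthy:
        return primary
    standby = STANDBY_OF[primary]
    if standby in healthy:
        return standby  # fail over to the standby lane
    raise RuntimeError(f"no healthy swimlane available for {primary}")


healthy_lanes = {"swimlane-1", "swimlane-2'", "swimlane-3", "swimlane-4"}
print(resolve_swimlane("swimlane-1", healthy_lanes))
print(resolve_swimlane("swimlane-2", healthy_lanes))
```

Because lanes are identical and independent, failing over one lane moves only that lane's subset of users.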
Active / Active – Swim Lanes
Figure: a Global Load Balancer in front of Data Center 1 and Data Center 2, with Replication between the data centers. Each of the four database nodes serves 25% of customers:
DB1 active / DB3 passive
DB2 active / DB4 passive
DB3 active / DB1 passive
DB4 active / DB2 passive
Active / Active – Single Write Master
Figure: DC1, DC2, DC3 and DC4, each with its own Read Cache. Arrows are labeled Writes, Updates and Cache Updates.
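One reading of the single write master pattern is that all writes go to one master copy while each data center answers reads from a local cache that the master keeps up to date. The in-memory sketch below illustrates that reading only; the class and method names are hypothetical, and the cache update is synchronous here for simplicity, whereas a real system would likely propagate it asynchronously.

```python
class SingleWriteMaster:
    """Toy model: one master store, one read cache per data center."""

    def __init__(self, data_centers: list[str]):
        self.master: dict[str, str] = {}
        self.read_caches: dict[str, dict[str, str]] = {dc: {} for dc in data_centers}

    def write(self, key: str, value: str) -> None:
        # All writes are applied to the single master first ...
        self.master[key] = value
        # ... and then pushed to every read cache as a cache update.
        for cache in self.read_caches.values():
            cache[key] = value

    def read(self, data_center: str, key: str):
        # Reads never touch the master; they are served locally.
        return self.read_caches[data_center].get(key)


store = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"])
store.write("app:42:name", "Helpdesk")
print(store.read("DC3", "app:42:name"))
```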
Design for Failure: Resiliency Patterns
Throttling versus Circuit Breaker
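Throttling limits how much load a caller may place on a service, whereas a circuit breaker stops calls to a dependency that is already failing. A token bucket is one common way to throttle; the sketch below is a minimal, single-threaded version with an injectable clock, and its names and parameters are assumptions.

```python
import time


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or queue the request


bucket = TokenBucket(rate_per_sec=5, capacity=10)
accepted = sum(bucket.allow() for _ in range(25))
print(f"accepted {accepted} of 25 back-to-back requests")
```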
Circuit Breaker Pattern
http://techblog.netflix.com/2012_02_01_archive.html
Circuit Breaker State Diagram (C: Caller, D: Dependency)
Closed
– On call / pass through
– Call succeeds / reset count
– Call fails / count failure
– Threshold reached / trip breaker
Open
– On call / fail
– On timeout / attempt reset
Half Open
– On call / pass through
– On succeed / reset
– On fail / trip breaker
Transitions
– Closed to Open: trip breaker
– Open to Half Open: attempt reset
– Half Open to Open: trip breaker
– Half Open to Closed: reset
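The three states in the diagram translate into a small state machine. The sketch below is a minimal, single-threaded illustration and not the Netflix implementation referenced on the slide; the class name, default threshold and timeout are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")  # Open: on call / fail
            self.state = "half_open"  # Open: on timeout / attempt reset
        try:
            result = func(*args, **kwargs)  # Closed or Half Open: pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0      # Closed: reset count; Half Open: reset
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1     # Closed: count failure
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"  # trip breaker
            self.opened_at = self.clock()
```

A caller wraps each dependency call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treats `CircuitOpenError` as the signal to fall back or degrade.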
Circuit Breaker Pattern: Example
http://techblog.netflix.com/2012_02_01_archive.html
Circuit Breaker Pattern: Example
Example of how threads, network timeouts and retries combine
http://techblog.netflix.com/2012_02_01_archive.html
Examples of Tools for Building HA Systems
• Highly Available DNS – Akamai, Dyn, AWS Route53
• Load Balancing – F5 LTM, F5 GTM, AWS ELB
• Data Replication – Golden Gate
• Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti
• Application Performance – DynaTrace, NewRelic
• Deployment – Perforce, Maven, Nexus, Hudson, Puppet
• Distributed Databases – NuoDB, VoltDB, several NoSQL types
• Distributed Storage – GlusterFS, Atmos, OpenStack
• HA Devices – Veritas Cluster Server
• OS Virtualization – AWS, VMware, Xen, Parallels
• Network Virtualization – AWS, VMware NSX, PLUMgrid
• Caching – Memcached, Akamai, CloudFront
• Failure Testing – Netflix Chaos Monkey
• DDoS Protection – Arbor, Riverbed
Trust Not the Execution Environment
“Everything Fails, All the Time.” – Werner Vogels, CTO of Amazon.com
Summary: Operating HA Service
Monitoring Business Metrics
Incident Management Process
Runbooks
Social Media & Messaging
Service Page
Business Fault Isolation
SLA, RPO, RTO
Failover Drills
Review Process
Change one thing at a time
Principles:
– Design for Failure
– Design for Operability
– Keep Everything “In Production”
– Scale Out (stateless)
– Keep it Fresh
Patterns:
– Active/Active
– Swimlanes
– Active/Passive
– Store-Forward
Design:
– Throttling
– Circuit Breaker
– Caching
– Rollback
– Healthchecks
Tools
Thank You!