Design (Cloud systems) for Failures
Design for Failures (and for Availability)
IV Jornadas de Cloud Computing & Big Data
Rodolfo Kohn, Cloud Architect
Intel
Original Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Failure modes
• Redundancy (process and data)
• Failure detection
• Failure recovery
• Cascade failures and recovery
Redundancy and high availability in AWS
Eventual consistency problems
Performance and scalability problems
Operations monitoring
• Techniques to avoid false positives
Logs and counters
Design software for failures
Testing availability
Measuring availability
Education
7/10/2016
Agenda
Remembering “Distributed System Design” and availability
Introduction to Design for Failures
• Redundancy
– Process
– Data: Replication (multi-master, master-slave)
– Flat groups and hierarchical groups
• Synchronization Model
• Stateful vs Stateless
• Eventual consistency
• CAP Theorem
• Failure detection
• Failure recovery
• Cascade failures and recovery
(Cloud or Distributed) Applications are Complex
[Diagram: DNS server (.com, root); Datacenter-1 and Datacenter-2, each with GLB and Auth; Service; Caches; DNS; Disk; Network; SMTP; CDN; NoSQL; SQL; Monitoring, Logs, Configuration Management]
Multiple Opportunities for Unexpected Failures
Brittle Systems shall not Survive
Load bursts & Response time deterioration
Micro-services dependencies
In distributed systems, and cloud systems in particular, the dependencies between components are complex enough that the failure of one component can bring down the whole system
What is Availability?
Distributed Systems: Principles and Paradigms (2nd Edition), Andrew Tanenbaum, Maarten Van Steen
“Availability is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.”
3/4/5 9’s of Availability: see Wikipedia :)
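As a rough illustration of what the "nines" mean, the sketch below converts an availability level into a yearly downtime budget. It assumes a 365-day year and reads "N nines" as an availability of 1 - 10^-N; the function name is made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for `nines` nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 0.001 for three nines
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, three nines allow roughly 8.8 hours of downtime per year, four nines about 53 minutes, and five nines about 5 minutes.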
The system is always running correctly
When users access it, they have it
Systems fail …
http://techcrunch.com/2012/10/22/aws-ec2-issues-in-north-virginia-affect-heroku-reddit-and-others-heroku-still-down/
What started as a small issue affecting some instances of Amazon’s Elastic Cloud Compute (EC2) in North Virginia became a full-blown outage of AWS in North Virginia. Major services, such as Reddit, Foursquare, Minecraft and Heroku, are down. GitHub, imgur, Pocket, HipChat, Coursera and others are affected …
And DOWNTIME COMES …
Consequences of Unavailability
http://blog.smartbear.com/news/motorolas-site-collapses-under-cyber-monday-traffic/
Talk about failures
We don’t avoid failures, we live with them
Design for Failures is about focusing on the Error Path
PAINFUL AND TIME CONSUMING
Failures affecting Availability
Different types of failures
• Infrastructure failures
• Software failures
• Operations failures
• Deployment failures
System updates or upgrades may affect availability if they require downtime
Bad response time affects availability
• Unacceptable response time = system unavailable
• Bad scalability eventually affects response time
– Vulnerability to load peaks
Manual Path to Production affects availability
Neglected business/process situations affect availability
Valid for all business
As core business moves to the Internet, downtime means money
More possibilities of failure:
• (Cloud) systems are becoming increasingly complex
• Software runs under stringent conditions
• There is a demand for excellent user experience
• In the cloud, applications run on commodity hardware
It’s about the whole big machinery
Product/Service Requirements
Development
Deployment and Operation
Path to production
PDM and CXD must think about alternative paths on error conditions
Architects design for Availability (Software and Infrastructure)
Agile teams
Distributed Systems Skills
Availability, Scalability, Performance mindset
Fast, automated, error free
DevOps, Monitoring, Operations Automation
From Architecture to Development
Architecture: redundancy model and management, dependency management, state model, synchronization model, failure detection, recovery, scalability model, administration/configuration management
Design: logging design, monitoring design, dependency handling, state management design (stateful and stateless), consistency, fallback actions on failures per operation…
Development: consistency handling, retries, error analysis, logging, error path (if ... else …), …
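The development-level items (retries, logging, an explicit error path with a fallback) can be sketched roughly as follows. This is only an illustration: `TransientError`, `call_with_retries` and the default values are hypothetical names and numbers, not something taken from the talk.

```python
import logging
import time

log = logging.getLogger("retry_sketch")


class TransientError(Exception):
    """Hypothetical error type for failures that a retry might fix."""


def call_with_retries(operation, fallback, attempts=3, base_delay=0.01,
                      sleep=time.sleep):
    """Try `operation`; on transient errors retry with exponential backoff.

    When every attempt fails, take the error path and return `fallback()`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            # Log each failure so operations can see the error path in use.
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return fallback()
```

The `sleep` parameter is injectable only so the backoff can be exercised in tests without real waiting.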
Topics
Redundancy (process and data)
Flat (P2P) Groups vs Hierarchical groups
State: stateless vs stateful
Replication
Synchronization: asynchronous vs. synchronous
Eventual Consistency
CAP
Failure detection
Recovery actions
Cascade Failures
Client recovery in client/server
Redundancy
It is about provisioning in excess, replicating hardware or software components or data
It allows masking failures as a mechanism of fault tolerance
Additional hardware equipment or software processes are provided
When a component fails, another one in the group takes over its work
Data replication, associated with component replication, keeps data safe in the face of a component failure
Redundancy and groups
Process redundancy implies the creation of groups of replicated processes
The group is seen by other processes as a single process
• Replication is abstracted to be seen as one entity
• The same happens with hardware
Two types of groups
[Figure: a flat (peer-to-peer) group next to a hierarchical group with a coordinator and workers]
Design Considerations
Group creation and destruction
• Group bootstrapping
Group membership
• Processes can join and leave a group
Decision making
• Task distribution, synchronization, consistency, etc.
Different challenges
Hierarchical group
• The coordinator, primary, or master knows and controls all workers
• Simpler control and management
• If the coordinator fails, the whole group fails
Flat group or peer-to-peer
• Agreement or consensus algorithms are needed
– For coordinator election
– For consistency
– For synchronization
– For faulty process detection
– For membership change detection
• Data distribution
• If any member crashes, the group continues working; it just shrinks
Hierarchical group: Pool of servers controlled by a Load Balancer
Load balancer detects unresponsive server and removes it
A new server is added to the pool, manually or automatically.
All other processes/applications/systems sending requests to this group see it as just one process
The LB distributes work and controls workers
Faulty process and server detection
The load balancer sends health checks to the servers in the pool to detect failing servers
• It can monitor at different stack layers
– In the case of AWS ELB: TCP, SSL, HTTP, HTTPS
– F5 can also test at different stack layers
• Failing servers can be automatically de-registered
• New healthy servers can be added to the pool
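The health-check cycle described above could look roughly like the sketch below. `ServerPool`, the `probe` callback and the unhealthy threshold of two consecutive failures are assumptions made for the example; it is not how AWS ELB or F5 are implemented.

```python
class ServerPool:
    """Toy load-balancer pool: probes servers and drops the failing ones."""

    def __init__(self, servers, unhealthy_threshold=2):
        # Consecutive failed health checks per registered server.
        self.failures = {server: 0 for server in servers}
        self.unhealthy_threshold = unhealthy_threshold
        self._next = 0

    def run_health_checks(self, probe):
        """`probe(server)` returns True when the server answers correctly."""
        for server in list(self.failures):
            if probe(server):
                self.failures[server] = 0
            else:
                self.failures[server] += 1
                if self.failures[server] >= self.unhealthy_threshold:
                    del self.failures[server]  # automatic de-registration

    def register(self, server):
        """Add a new healthy server to the pool."""
        self.failures[server] = 0

    def pick(self):
        """Round-robin over the servers that are still registered."""
        servers = sorted(self.failures)
        if not servers:
            raise RuntimeError("no healthy servers in the pool")
        server = servers[self._next % len(servers)]
        self._next += 1
        return server
```

Requiring more than one failed check before removal is one simple way to avoid de-registering a server on a single lost probe.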
Flat group: Cassandra
A cluster of Cassandra nodes
• Information is transmitted with a gossip protocol
• If a node detects a new node or a faulty node, it spreads that information through the gossip protocol
• Nodes exchange heartbeats and detect faulty nodes with Phi Accrual Failure Detectors
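To give an idea of how an accrual detector turns heartbeats into a suspicion level, here is a simplified sketch. It assumes heartbeat intervals follow an exponential distribution, so phi is minus the base-10 logarithm of the probability that a heartbeat is still to come; the class and its parameters are illustrative and this is not Cassandra's actual code.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector for one monitored node."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # P(heartbeat arrives later than `elapsed`) = exp(-elapsed / mean),
        # and phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

A caller would compare `phi(now)` with a threshold of its choosing and treat the node as faulty above it; a higher threshold means fewer false positives but slower detection.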
Flat group: OSPF
I would say OSPF routers form a flat group
• Routers use a link-state routing protocol to transmit connectivity information
• Routers can detect neighbor failures through the Hello protocol and transmit the change as link states
Data Redundancy
Data stores may be replicated for high availability
• Database replication
• Disk replication
Data redundancy is also found at other levels
• RAID disks
• In communications: CDMA uses Hamming code to recover from errors
We focus on higher level failures that affect operations: a database, SAN, whole platform, datacenter
Data Redundancy
SQL and NoSQL Databases allow different replication models
• Master-Master
– All replicas can be read and written
• Master-Slave
– All replicas can be read; only the master can be written
– In case of master failure, a slave must take over
Database Replication (1)
Replication: Data is replicated in all instances
Partitioning: Data is partitioned across different instances
• This is not replication
[Figure: replication shows the same data in every instance; partitioning shows data split across instances, e.g. clients from America and clients from Europe]
Database Replication (2)
Replication Master-Slave: Write in one instance, Read from all instances
[Figure: WRITE goes to one instance, REPLICATION copies the data to the other instances, READ is served by any instance]
Database Replication (3)
Replication Master-Master or Multi-master or peer-to-peer: Write in all instances, Read from all instances
Possibility of conflicts in asynchronous mode:
• Same row updated in different replicas
• Two inserts in different replicas
• Delete and insert/update
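The first two conflict cases can be reproduced with a toy model. The sketch below is an assumption-laden illustration: each row carries a version counter, and a replica rejects a remote update whose base version does not match its own. Real databases use their own conflict detection and resolution schemes; the `Replica` class is hypothetical.

```python
class Replica:
    """Toy multi-master replica; rows map key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def local_write(self, key, value):
        """Apply a client write locally and return the update to replicate."""
        version, _ = self.rows.get(key, (0, None))
        self.rows[key] = (version + 1, value)
        return {"key": key, "base_version": version, "value": value}

    def apply_remote(self, update):
        """Apply an update from a peer; return False on a write conflict."""
        version, _ = self.rows.get(update["key"], (0, None))
        if version != update["base_version"]:
            return False  # the row changed here too: conflicting writes
        self.rows[update["key"]] = (version + 1, update["value"])
        return True


if __name__ == "__main__":
    a, b = Replica("a"), Replica("b")
    # Same row inserted in both masters before replication runs.
    from_a = a.local_write("row1", "written on a")
    from_b = b.local_write("row1", "written on b")
    print(b.apply_remote(from_a), a.apply_remote(from_b))  # both rejected
```

In this model both replicas detect the conflict but still hold different values, which is where an alarm or a repair step would have to take over.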
[Figure: WRITE and READ on every instance, with REPLICATION between the instances]
Synchronous vs. Asynchronous Replication
Synchronous replication assures a write will occur in all instances at the same time
• Either multi-master or master-slave
In asynchronous replication, a write is sent to one node and then replicated to the other nodes
• Either multi-master or master-slave
• There is a lag in write replication
• At a point in time data might not be the same in all nodes (eventual consistency)
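The difference between the two modes can be shown with a small in-memory model. This is a sketch under strong simplifications: replicas are plain dictionaries, the synchronous path just writes everywhere before returning (no two-phase commit or locking), and `ReplicatedStore` is a name invented for the example.

```python
class ReplicatedStore:
    """Toy primary with replicas, in synchronous or asynchronous mode."""

    def __init__(self, replica_count, synchronous):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.synchronous = synchronous
        self.backlog = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # The write returns only after every replica has applied it.
            for replica in self.replicas:
                replica[key] = value
        else:
            # The write returns immediately; replication happens later.
            self.backlog.append((key, value))

    def replicate_backlog(self):
        """Stand-in for the background replication process."""
        for key, value in self.backlog:
            for replica in self.replicas:
                replica[key] = value
        self.backlog.clear()

    def in_sync(self):
        return all(replica == self.primary for replica in self.replicas)
```

In asynchronous mode, `in_sync()` is false between `write` and `replicate_backlog`; that window is the replication lag.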
Synchronous Replication
Synchronous replication assures a write will occur in all instances at the same time
• All servers (both masters and slaves) have up-to-date data (A and C in ACID)
• Provides ACID capabilities
• High availability
• Simpler for developers
• Implementation through two-phase commit or distributed locks, which may slow the system down
• No write scalability
• Performance might be affected
• Possibility of deadlocks
Galera cluster for MySQL
http://galeracluster.com/
Galera Replication is a synchronous multi-master replication plug-in for InnoDB
Asynchronous Replication
Write occurs in one node and then it replicates to other nodes
• Less complex (no two-phase commit or distributed locks)
• High availability across datacenters
• Better write scalability
• Eventual consistency
• Write conflicts among masters
• Loss of synchronization is a problem to solve
• More difficult for developers (eventual consistency, write conflicts)
This type of replication is the basic one offered by MySQL, PostgreSQL and MariaDB (and SQL Server???)
Multi-master with Cassandra
Source: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Asynchronous replication
Tunable consistency
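Tunable consistency is usually explained with a simple overlap rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read touches at least one replica holding the latest write. The helper names below are made up for this sketch.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_acks: int,
                           read_acks: int) -> bool:
    """True when read and write replica sets are guaranteed to overlap."""
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("quorum/quorum:", reads_see_latest_write(rf, q, q))
    print("one/one:", reads_see_latest_write(rf, 1, 1))
```

With three replicas, quorum writes and quorum reads overlap, while single-replica writes and reads leave the application exposed to eventual consistency.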
P2P Database Solutions
• Dynamo DB
– http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• Cassandra
– https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
• Netflix’s Dynomite (Redis and Memcached)
– http://techblog.netflix.com/2014/11/introducing-dynomite.html
– https://github.com/Netflix/dynomite
Consistency and Design for Failures
When working with asynchronous replication you need to deal with eventual consistency
• With asynchronous processes in general it is possible that when a process goes to read something that should be there it is not there yet
Replication could take milliseconds or many seconds
Under heavy load it gets worse
Write conflicts are another issue you need to deal with
• You need alarms and repair scripts if an automated solution is not possible
Asynchronous, Fire and forget, Future, Let it be …
Eventual consistency
[Diagram: applications behind a load balancer, two data instances; 1-WRITE goes to one instance, then replication to the other instance after some time]
• Eventually both DB instances have the same data
Eventual consistency problem
[Diagram: same setup; 1-WRITE goes to one data instance and 4-READ reaches the other instance before replication has happened]
• Read-after-write problem
• Specific solution for each case
• Cannot trust replication will occur after some time
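One possible case-specific solution, sketched under simple assumptions: the writer keeps the version number returned by its write and passes it along with the read, and the read falls back to the primary whenever the replica has not caught up to that version. `Store`, `write`, `replicate` and `read_your_writes` are hypothetical names for the illustration.

```python
class Store:
    """Toy data instance with a monotonically increasing version."""

    def __init__(self):
        self.version = 0
        self.data = {}


def write(primary, key, value):
    """Write to the primary and return a version token for later reads."""
    primary.version += 1
    primary.data[key] = value
    return primary.version


def replicate(primary, replica):
    """Stand-in for asynchronous replication finally happening."""
    replica.data = dict(primary.data)
    replica.version = primary.version


def read_your_writes(key, token, primary, replica):
    """Read from the replica only if it already contains the caller's write."""
    source = replica if replica.version >= token else primary
    return source.data.get(key)
```

Instead of waiting "long enough" for replication, the read path checks explicitly whether the replica is up to date.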
From Architecture to Development
Designers and developers must understand the consequences of each architecture
Typical questions/comments that predict issues in distributed systems (100% certainty)
• By comparing operations’ time I can determine order
• How long does it take to replicate data?
• We tested it and it is replicating very fast, no problems
• It’s fast. It’s just fire and forget (asynchronous): check if there is a subsequent read associated
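The first comment in the list is risky because wall clocks on different machines drift apart. The sketch below first shows two events whose timestamps suggest the wrong order under an assumed clock skew, and then a Lamport logical clock, a technique not named in the slides, which orders a message receipt after its send without relying on wall-clock time. All numbers are invented for the example.

```python
class LamportClock:
    """Logical clock: counts events instead of reading wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, remote_time) + 1
        return self.time


if __name__ == "__main__":
    # Assumed skew: node A's wall clock runs 5 seconds ahead of node B's.
    skew = {"a": 5.0, "b": 0.0}
    stamp_on_a = 100.0 + skew["a"]  # really happened first (true time 100)
    stamp_on_b = 101.0 + skew["b"]  # really happened second (true time 101)
    print("timestamps claim B came first:", stamp_on_b < stamp_on_a)

    a, b = LamportClock(), LamportClock()
    sent = a.tick()           # A sends a message
    received = b.receive(sent)  # B receives it
    print("logical order preserved:", sent < received)
```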
Asynchronous Replication in Active-Active
Network partitioning
[Diagram: DNS server (.com, root); Datacenter-1 and Datacenter-2, each with GLB, Auth, Service, Caches, DNS and Disk]
Why the hassle of P2P/flat?
Best solution for high availability
Self-managed system
Best horizontal and dynamic scalability
Usually, writes are still possible after a network partition
Brewer’s Conjecture and CAP Theorem
• Consistency, Availability, and Partition Tolerance are all desired features of database systems.
• However it is not possible to have all of them: pick only two.
[Diagram: CAP triangle with vertices C, A and P]
Availability: Each client can always read and write
Consistency: All clients always have the same view of the data
Partition Tolerance: System works well despite physical network partitions
CA: RDBMS
AP: Dynamo, Cassandra
CP: MongoDB, BigTable
MongoDB
Source: https://docs.mongodb.com/manual/core/replica-set-elections/