MS CLOUD DB - AZURE SQL DB Fault Tolerance by Subha Vasudevan Christina Burnett.
MS CLOUD DB - AZURE SQL DB Fault Tolerance
by Subha Vasudevan, Christina Burnett
Windows AZURE Cloud Services
AZURE Storage Services
● Blob
● Table
● Queue
● File Storage
Azure SQL Database
Database as a Service
● Predictable performance
● Scalability
● Business continuity
● Data protection
● Zero administration
Azure DB
Fault Tolerance and Failure
Why is it so important?
● Supports concurrency control
● Provides transactional guarantee
● ACID
Why does it fail?
● Inevitable software/hardware failure
● Human errors
Fault Tolerant SQL Database
● Redundant computers rather than redundant components.
● Fault tolerance at the highest level of the stack - Fault tolerant DB rather than fault tolerant DB servers.
● Database replication across fault zones.
● Failure Detection and Failover.
Fault Zones/Domains
Each fault zone is a fully independent physical sub-system with its own server racks and network routers.
Assigning Storage to a Fault Domain
Proximity vs. Isolation
● Proximity of replicas affects network latency
● Isolation helps ensure availability of replicas in the event of a failure
Selection of replica location
● MDS codes
● (N, K) coding
(Banerjee, Das, Mazumder, Derakhshandeh, & Sen, 2014)
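To make the (N, K) coding trade-off concrete: with an MDS code, data is split into K fragments and encoded into N, and any K of the N fragments are enough to rebuild it. The sketch below only computes the resulting storage overhead and the number of fragment losses that can be survived. The function names are illustrative and are not taken from the cited paper.

```python
from fractions import Fraction


def mds_properties(n: int, k: int) -> dict:
    """Basic properties of an (N, K) MDS code (illustrative helper)."""
    if not 0 < k <= n:
        raise ValueError("require 0 < K <= N")
    return {
        # Any K fragments rebuild the data, so N - K losses are survivable.
        "tolerated_losses": n - k,
        # N fragments are stored for every K fragments of original data.
        "storage_overhead": Fraction(n, k),
    }


def can_reconstruct(k: int, surviving_fragments: int) -> bool:
    """Data is recoverable as long as at least K fragments survive."""
    return surviving_fragments >= k
```

Plain three-way replication can be read as the (3, 1) case: three times the storage, and any single surviving copy is enough. A (6, 4) code survives the same two losses at 1.5 times the storage, at the cost of spreading fragments over more locations.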
Database Replication
There are 3 copies of each DB: a primary and two secondary replicas. The primary database performs the transactions and sends the updates and DDL to the replicas.
Database Replication
Each replica is stored in a different fault zone.
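A minimal sketch of this placement rule, assuming a hypothetical map from fault-zone name to the nodes available in it. It simply takes one node from each of three distinct zones; how Azure actually selects nodes is not described in these slides.

```python
def place_replicas(zone_nodes: dict, copies: int = 3) -> dict:
    """Pick one node from each of `copies` distinct fault zones.

    `zone_nodes` maps a zone name to a list of candidate nodes
    (hypothetical input format). Returns a role -> node mapping.
    """
    # Only zones that still have a free node can host a replica.
    zones = [zone for zone in sorted(zone_nodes) if zone_nodes[zone]]
    if len(zones) < copies:
        raise ValueError("not enough fault zones for the requested copies")
    roles = ["primary"] + [f"secondary-{i}" for i in range(1, copies)]
    # zip() pairs each role with a different zone, so no zone is reused.
    return {role: zone_nodes[zone][0] for role, zone in zip(roles, zones)}
```

Because every role is paired with a different zone, losing a whole zone (rack or router) can take out at most one of the three copies.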
Quorum-Based Commit
● At least two copies required.
● Data must be written to the primary and at least one secondary before it is considered committed.
Dynamic Quorum
PRIMARY FAILS: When the server containing the primary database fails, one of the secondary replicas is promoted to primary.
SECONDARY FAILS: When a server fails that contains secondary replicas, new replicas are created.
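The two failure cases can be sketched as a single reconfiguration step. This is an illustration only: the node-to-role dictionary and the list of spare nodes are hypothetical, and creating a replacement replica after a primary failure (to get back to three copies) is an assumption, since the slides only state it for the secondary case. Fault-zone placement of the replacement is ignored here.

```python
def reconfigure(replicas: dict, failed: str, spare_nodes: list) -> dict:
    """Return the new node -> role map after `failed` goes down."""
    roles = dict(replicas)
    lost_role = roles.pop(failed, None)
    if lost_role is None:
        return roles  # the failed node hosted no replica of this DB
    if lost_role == "primary":
        survivors = sorted(n for n, r in roles.items() if r == "secondary")
        if not survivors:
            raise RuntimeError("no secondary left to promote")
        roles[survivors[0]] = "primary"  # promote one secondary
    if spare_nodes:
        # Assumption: a fresh secondary is created to restore the copy count.
        roles[spare_nodes[0]] = "secondary"
    return roles
```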
Transactional Consistency
● Updates are persisted in log
● Primary DB streams updates to secondaries
● Secondaries are asked to commit first
● Secondaries return acknowledgement
● Primary commits after quorum
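The steps above can be sketched as a toy in-memory model. The `Replica` class and `quorum_commit` function are hypothetical names, and the "log" is just a Python list; the point is only the ordering, where secondaries persist and acknowledge first and the primary commits once two copies are assured.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    up: bool = True
    log: list = field(default_factory=list)

    def apply(self, record) -> bool:
        """Persist the record and acknowledge; a down replica never acks."""
        if not self.up:
            return False
        self.log.append(record)
        return True


def quorum_commit(primary: Replica, secondaries: list, record, quorum: int = 2) -> bool:
    """Stream `record` to the secondaries, then commit on the primary
    only if primary + acknowledging secondaries reach `quorum` copies."""
    acks = sum(1 for s in secondaries if s.apply(record))
    if 1 + acks < quorum:
        # With quorum=2 this means no secondary acked, so nothing needs
        # undoing; a larger quorum would need a rollback step here.
        return False
    primary.log.append(record)
    return True
```

With one secondary down the commit still succeeds, because the primary plus the remaining secondary form the required two copies.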
Recovering Transactions
If a secondary fails, on restart it checks with the primary for transactions it may have missed.
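A rough sketch of that catch-up step, assuming both logs are append-only lists and that the restarted secondary's log is a prefix of the primary's. The function name and log representation are illustrative, not the actual recovery mechanism.

```python
def catch_up(primary_log: list, secondary_log: list) -> list:
    """Append to `secondary_log` the records it missed while down
    and return those records."""
    have = len(secondary_log)
    # Assumption: the secondary never holds records the primary lacks.
    if primary_log[:have] != secondary_log:
        raise ValueError("secondary log diverged from primary")
    missed = primary_log[have:]
    secondary_log.extend(missed)
    return missed
```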
Failure Detection
● The database is paired with the SQL Engine to detect failures in the neighborhood.
● Distributed failure detection - every node monitored by several neighbors.
● Efficient, localized and fast.
● Prevents ping storms and avoids delayed failure detection
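One way to picture neighbor-based detection is a logical ring in which each node is watched by the next k nodes. The ring layout, the `last_seen` heartbeat table and the rule "failed only when every monitor has timed out" are all assumptions made for this sketch; the slides do not specify the actual topology or decision rule.

```python
def monitors_of(nodes: list, target: str, k: int = 2) -> list:
    """The k nodes that follow `target` on the ring watch it (k < len(nodes))."""
    i = nodes.index(target)
    return [nodes[(i + j) % len(nodes)] for j in range(1, k + 1)]


def failed_nodes(nodes: list, last_seen: dict, now: float, timeout: float, k: int = 2) -> list:
    """Report nodes whose every monitor has heard nothing for > timeout.

    `last_seen[(monitor, target)]` is the time the monitor last received
    a heartbeat from the target (hypothetical data layout).
    """
    failed = []
    for target in nodes:
        watchers = monitors_of(nodes, target, k)
        silent = [
            now - last_seen.get((m, target), float("-inf")) > timeout
            for m in watchers
        ]
        if all(silent):
            failed.append(target)
    return failed
```

Each node exchanges heartbeats with only k neighbors, so a cluster of n nodes produces on the order of n * k messages per round instead of every node pinging every other node.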
Failover
● If the primary node fails unexpectedly, a standby backup node automatically assumes the role of primary.
● Managed by GPM (Global Partition Manager).
● Distributed fabric maintains a global map.
● GPM maintains the health, state and location of every DB.
● Fabric informs GPM of any node failure.
● GPM reconfigures the assignment of primary and secondary DBs on the failed node.
Gateway Processes
[Figure: a client connecting through gateway processes to database nodes; node labels: psss, ssps, sssp]
Fault Tolerance in Application Design
Data Failure
● application specific
● catastrophic consequences
● not addressed by Azure
Computational Failure
● addressed by Azure
● controlled by application
Monitoring and Logging
● diagnosis
● debugging
(Jie Li et al., 2010)
References
Fault-tolerance in Windows Azure SQL Database. [Online]. Available: http://azure.microsoft.com/blog/2012/07/30/fault-tolerance-in-windows-azure-sql-database/
Banerjee, S., Das, A., Mazumder, A., Derakhshandeh, Z., & Sen, A. (2014). On the impact of coding parameters on storage requirement of region-based fault tolerant distributed file system design. Paper presented at the Computing, Networking and Communications (ICNC), 2014 International Conference On, 78-82. doi:10.1109/ICCNC.2014.6785309
Jie Li, Humphrey, M., You-Wei Cheah, Youngryel Ryu, Agarwal, D., Jackson, K., & van Ingen, C. (2010). Fault tolerance and scaling in e-science cloud applications: Observations from the continuing development of MODIS Azure. Paper presented at the E-Science (E-Science), 2010 IEEE Sixth International Conference On, 246-253. doi:10.1109/eScience.2010.47
Rajan, D., Canino, A., Izaguirre, J. A., & Thain, D. (2011). Converting a high performance application to an elastic cloud application. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference On, 383-390. doi:10.1109/CloudCom.2011.58
QUESTIONS?