Availability in Globally Distributed Storage Systems
Presented By Ala`a Ibrahim
Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan
OUTLINE
• Introduction
• Disk Failures
• Correlated Failures
• Fault Tolerance Mechanisms
• Markov Model of Stripe Availability
• Markov Model Findings
• Conclusions
Data Center
Data Center Components
Server Components
Racks
Interconnects
Cluster of Racks
ALL THESE COMPONENTS CAN FAIL
Cell, Stripe and Chunk
[Figure: Cell 1 and Cell 2, each a GFS instance (GFS Instance 1, GFS Instance 2); each cell holds Stripe 1 and Stripe 2, and each stripe is made up of chunks.]
Failure Sources
• Hardware – disks, memory, etc.
• Software – chunk server process
• Network interconnect
• Power distribution unit
Availability
• Reasons a node becomes unavailable
  • Overloaded
  • Crash or restart
  • Hardware error
  • Automated repair processes
Disk Failures
• Node restarts
• Planned machine reboots
• Unplanned machine reboots
• Unknown
Fault Tolerance Mechanisms
• Replication (R = n)
  • ‘n’ identical chunks (the replication factor) are placed across storage nodes in different racks, cells, or data centers
• Erasure Coding (RS(n, m))
  • ‘n’ distinct data blocks and ‘m’ code blocks
  • Can recover at most ‘m’ lost blocks from the remaining ‘n’ blocks
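The space trade-off between the two mechanisms can be made concrete with a little arithmetic. The sketch below computes the raw storage consumed per byte of user data and the number of lost blocks each scheme tolerates. The function names and the example parameters (R = 3 and RS(6, 3)) are illustrative choices, not values taken from these slides.

```python
def replication_overhead(r: int) -> float:
    """Raw bytes stored per byte of user data with r identical replicas."""
    return float(r)


def rs_overhead(n: int, m: int) -> float:
    """Raw bytes stored per byte of user data with n data and m code blocks."""
    return (n + m) / n


def replication_tolerated_losses(r: int) -> int:
    """Replicas that can be lost while one copy still survives."""
    return r - 1


def rs_tolerated_losses(n: int, m: int) -> int:
    """Blocks that can be lost while any n of the n + m blocks survive."""
    return m


if __name__ == "__main__":
    # Illustrative parameters (assumed, not from the slides).
    print("R = 3    :", replication_overhead(3), replication_tolerated_losses(3))
    print("RS(6, 3) :", rs_overhead(6, 3), rs_tolerated_losses(6, 3))
```

With these assumed parameters, the erasure-coded layout tolerates more lost blocks while storing half as many raw bytes, which is the sense in which erasure coding is space efficient.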
Replication
• 1 chunk, 5 replicas
• Fast encoding / decoding
• Very space inefficient
Erasure Coding
‘n’ data blocks → Encode → ‘n + m’ blocks (‘n’ data blocks plus ‘m’ code blocks)
Erasure Coding
• Highly space efficient
• Slow encoding / decoding
‘n’ data blocks → Encode → ‘n + m’ blocks (‘n’ data blocks plus ‘m’ code blocks) → Decode → ‘n’ data blocks
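The encode and decode steps can be illustrated with the simplest possible erasure code: a single XOR parity block, which corresponds to the m = 1 case. This is not Reed-Solomon and is not taken from the slides; it is a minimal sketch, assuming equal-length blocks and a lost block marked as `None`, that shows how ‘n’ data blocks are rebuilt from the surviving blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the n data blocks followed by one parity (code) block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def decode(stored_blocks):
    """Recover the n data blocks when at most one stored block is None."""
    missing = [i for i, block in enumerate(stored_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair at most one lost block")
    blocks = list(stored_blocks)
    if missing:
        survivors = [block for block in blocks if block is not None]
        # XOR of all surviving blocks reproduces the missing one.
        blocks[missing[0]] = xor_blocks(survivors)
    return blocks[:-1]  # drop the parity block, keep the data blocks


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stored = encode(data)
    stored[1] = None  # simulate losing one block
    print(decode(stored) == data)
```

Tolerating m > 1 lost blocks requires a code such as Reed-Solomon, whose finite-field arithmetic is a large part of why encoding and decoding are slower than plain replication.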
Correlated Failures
• Failure Domain
  • A set of machines that can fail simultaneously from a common source of failure
• Failure Burst
  • A sequence of node failures, each occurring within a time window ‘w’ of the next
  • Window size: w = 120 s
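The failure burst definition translates directly into a single pass over sorted failure timestamps. The sketch below is one possible reading of it: the function name is hypothetical, timestamps are assumed to be in seconds, and treating a gap of exactly ‘w’ as part of the same burst is an assumption.

```python
def group_into_bursts(failure_times, window=120.0):
    """Split node-failure timestamps (seconds) into failure bursts.

    A failure joins the current burst if it occurs within `window`
    seconds of the previous failure; otherwise it starts a new burst.
    """
    bursts = []
    for t in sorted(failure_times):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    times = [0, 50, 160, 400, 410, 900]
    print(group_into_bursts(times))
    # → [[0, 50, 160], [400, 410], [900]]
```

Note that a burst can span far more than ‘w’ seconds in total, because the window is measured between consecutive failures rather than from the start of the burst.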
Correlated Failures…
• Failure Burst (Window Size)
Markov Model
• Chunk placement policy
• Cell Simulation
  • Trace-based simulation
  • Priority queue
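A trace-based simulation driven by a priority queue can be sketched as follows. This is a minimal illustration and not the simulator behind these slides: the trace format `(time, node, event)`, the function name, and the rule that a stripe is unavailable when fewer than `needed` of its chunks sit on live nodes are all assumptions.

```python
import heapq


def stripe_unavailable_time(trace, stripe_nodes, needed, end_time):
    """Replay a failure trace and return how long a stripe was unavailable.

    trace        : iterable of (time, node, event), event is "down" or "up"
    stripe_nodes : nodes holding the stripe's chunks (all start up)
    needed       : minimum number of live chunks for the stripe to be readable
    """
    queue = list(trace)
    heapq.heapify(queue)  # priority queue ordered by event time
    down = set()
    unavailable = 0.0
    last_time = 0.0
    while queue:
        time, node, event = heapq.heappop(queue)
        if time > end_time:
            break
        # Account for the interval that just ended, using the old state.
        live = sum(1 for n in stripe_nodes if n not in down)
        if live < needed:
            unavailable += time - last_time
        last_time = time
        if event == "down":
            down.add(node)
        else:
            down.discard(node)
    # Account for the tail between the last event and the end of the run.
    live = sum(1 for n in stripe_nodes if n not in down)
    if live < needed:
        unavailable += end_time - last_time
    return unavailable


if __name__ == "__main__":
    trace = [(10, "a", "down"), (20, "b", "down"), (50, "a", "up"), (70, "b", "up")]
    print(stripe_unavailable_time(trace, ["a", "b", "c"], needed=2, end_time=100))
```

The priority queue matters once the simulator generates its own events, for example scheduling a chunk recovery some time after a failure: new events can be pushed with `heapq.heappush` and are still processed in time order.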
Markov Chain
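As an illustration of how a Markov chain yields a stripe availability estimate, the sketch below computes the mean time until a stripe becomes unavailable. The chain structure is an assumption, since the slide shows only a figure: the state is the number of available chunks, each chunk fails independently at rate `fail_rate`, and one chunk is recovered at a time at rate `recovery_rate`. The function and parameter names are hypothetical.

```python
def stripe_mttf(total_chunks, min_chunks, fail_rate, recovery_rate):
    """Mean time until fewer than `min_chunks` chunks remain available,
    starting with all `total_chunks` chunks available.

    Assumed birth-death chain over the number of available chunks i:
      i -> i - 1 at rate i * fail_rate   (independent chunk failures)
      i -> i + 1 at rate recovery_rate   (one recovery at a time)
    """
    mttf = 0.0
    step_down = 0.0  # expected time to first drop from state i + 1 to i
    for i in range(total_chunks, min_chunks - 1, -1):
        recover = recovery_rate if i < total_chunks else 0.0
        # D_i = (1 + recover * D_{i+1}) / (i * fail_rate)
        step_down = (1.0 + recover * step_down) / (i * fail_rate)
        mttf += step_down
    return mttf


if __name__ == "__main__":
    # Two replicas, one needed; rates in events per unit time (illustrative).
    print(stripe_mttf(total_chunks=2, min_chunks=1, fail_rate=1.0, recovery_rate=10.0))
```

The recurrence follows from conditioning on the first transition out of state i: the expected time to step down is the holding time plus, if a recovery happens first, the time to come back down through state i + 1. Raising `recovery_rate` increases the result, which is why recovery rate appears alongside replication and encoding schemes as a lever for availability.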
Conclusion
• The findings provide feedback for improving
  • Replication and encoding schemes
  • Recovery rate