Michael Sevilla | Mantle, Symposium ‘15
Mantle: A Programmable Metadata Load Balancer for the Ceph File System
Michael A. Sevilla, Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^
UC Santa Cruz, *Red Hat, ^HP Storage
Published at Supercomputing 2015
Separating Metadata & Data IO
File System
Separating Metadata & Data IO
[Diagram: a distributed file system routes metadata IO to a metadata service and data IO to an object store]
History: A Simple Solution
• 1 MDS is insufficient [McKusick et al., ;login: ‘10], [Beaver et al., OSDI ‘10], [Thusoo et al., SIGMOD ‘10]
• How do we distribute metadata?
History: Scalable Solutions
1. Hash the file identifier
2. Subtree partitioning
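The two strategies can be contrasted in a short sketch (Python used purely for illustration; `NUM_MDS`, the md5-based hash, and the `subtree_map` layout are assumptions, not CephFS code):

```python
import hashlib

NUM_MDS = 4  # hypothetical cluster size

def mds_by_hash(path: str) -> int:
    """Strategy 1: hash the file identifier across the MDS cluster.
    Spreads load evenly, but destroys locality: files in the same
    directory land on different servers."""
    digest = hashlib.md5(path.encode()).hexdigest()
    return int(digest, 16) % NUM_MDS

def mds_by_subtree(path: str, subtree_map: dict) -> int:
    """Strategy 2: subtree partitioning. Walk up the path until we find
    the deepest subtree explicitly assigned to an MDS. Preserves
    locality, but can leave load unbalanced."""
    parts = path.strip("/").split("/")
    for depth in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in subtree_map:
            return subtree_map[prefix]
    return subtree_map["/"]  # root is always assigned

subtrees = {"/": 0, "/home": 1, "/home/alice": 2}
print(mds_by_subtree("/home/alice/src/main.c", subtrees))  # 2
print(mds_by_subtree("/etc/passwd", subtrees))             # 0
```

With subtree partitioning, everything under `/home/alice` stays on one server; with hashing, those same files scatter across all four.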
Outline
1. File System Metadata Management
2. CephFS Background
3. Complexity of Dynamic Subtree Partitioning
4. Mantle
5. Evaluation
CephFS Background
Example File System Workload
• Linux kernel compile locality
[Heatmap: shade of red indicates locality over time, ranging from few to many inode reads/writes]
CephFS Hotspot Detection!
Migration!
Does CephFS Work?
[Graph: CephFS behavior vs. “what we want”; three regions are marked “bad”]
Complexity of Dynamic Subtree Partitioning
Migration Policies
• How to calculate load?
• When to move load?
• Where to move load?
• How much to move?
[Diagram: an MDS cluster over RADOS serving a hierarchical namespace; each MDS receives heartbeats (recv HB), partitions the cluster and the namespace, fragments directories, and migrates subtrees when it rebalances; RADOS rebalances the underlying objects]
CephFS’s Policies
• How to calculate load? “weighted ∑ ops” per subtree, and “weighted ∑ metaload” per MDS
• When to move load? “greater than average”
• Where to move load? “underloaded MDS”
• How much to move? “equal load across cluster”
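Spelled out as code, the four hard-wired answers look roughly like this (a Python sketch, not CephFS source; the operation names and weights are illustrative):

```python
# Illustrative weights; CephFS's real metric weights several op types.
OP_WEIGHTS = {"inode_read": 1.0, "inode_write": 2.0, "readdir": 1.0}

def subtree_load(op_counts):
    """'weighted sum of ops': the load of one subtree."""
    return sum(OP_WEIGHTS[op] * n for op, n in op_counts.items())

def mds_load(subtree_loads):
    """'weighted sum of metaload': the load of one MDS, over its subtrees."""
    return sum(subtree_loads)

def should_migrate(my_load, all_loads):
    """'greater than average': only overloaded servers shed load."""
    return my_load > sum(all_loads) / len(all_loads)

def migration_plan(my_id, all_loads):
    """'underloaded MDS' + 'equal load across cluster': send the surplus
    to below-average servers, proportionally to each one's deficit."""
    avg = sum(all_loads.values()) / len(all_loads)
    surplus = all_loads[my_id] - avg
    deficits = {m: avg - l for m, l in all_loads.items() if l < avg}
    total = sum(deficits.values())
    return {m: surplus * d / total for m, d in deficits.items()}

loads = {0: 80.0, 1: 10.0, 2: 10.0}
if should_migrate(loads[0], list(loads.values())):
    plan = migration_plan(0, loads)  # splits the surplus between MDS 1 and 2
```

The point of the next slides is that these answers are baked into the MDS, yet no single set of them suits every workload.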
Different Balancers for Different Workloads
• Which heuristics should we use? [Weil et al., Supercomputing ‘04], [Patil et al., FAST ‘11], [Pai et al., ASPLOS ‘98]
• Good for mixed workloads
• Good for create-heavy workloads
• Simple implementation
Mantle
http://synapostasy.blogspot.com/2007/10/cephalopod-awareness-day.html
Different Balancers for Different Workloads
• Which heuristics should we use? [Weil et al., Supercomputing ‘04], [Patil et al., FAST ‘11], [Pai et al., ASPLOS ‘98]
[Diagram: the Mantle API sits between administrator-written balancers and the MDS cluster]
Implementation: API + Environment
[Diagram: the MDS cluster’s rebalance mechanism consults the injected balancer]
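Mantle’s actual balancers are small Lua scripts evaluated inside the MDS; as a rough illustration of the policy/mechanism split (all names below are hypothetical, not Mantle’s real API), the shape is:

```python
class Env:
    """Metrics the mechanism exposes to every injected policy
    (hypothetical names; Mantle exposes per-MDS metrics to Lua)."""
    def __init__(self, whoami, loads):
        self.whoami = whoami   # which MDS am I?
        self.loads = loads     # load of every MDS in the cluster

def rebalance(env, when, howmuch):
    """Mechanism: fixed migration machinery; policy decides the rest."""
    if not when(env):
        return {}              # policy says: don't migrate now
    return howmuch(env)        # policy says: {target_mds: load_to_send}

# Two swappable policies for the same mechanism:
def when_above_avg(env):
    return env.loads[env.whoami] > sum(env.loads.values()) / len(env.loads)

def spill_to_least_loaded(env):
    target = min(env.loads, key=env.loads.get)
    return {target: env.loads[env.whoami] / 2}

env = Env(whoami=0, loads={0: 80.0, 1: 10.0, 2: 30.0})
plan = rebalance(env, when_above_avg, spill_to_least_loaded)  # {1: 40.0}
```

Swapping either callback changes the balancing behavior without touching the MDS internals, which is exactly the separation the slide’s diagram depicts.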
Balancers
• Greedy Spill Balancer
• Fill & Spill Balancer
• Adaptable Balancer
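As hedged sketches (Python rather than Mantle’s Lua; the spill fraction and capacity threshold are illustrative, not the paper’s tuned values), the three balancer families differ mainly in how much load they shed:

```python
def greedy_spill(my_load):
    """Greedy Spill: always shed a fixed fraction (here, half) of the
    load to a neighbor, however busy we are."""
    return my_load / 2 if my_load > 0 else 0.0

def fill_and_spill(my_load, capacity):
    """Fill & Spill: keep everything until this server nears its
    capacity, then spill only the overflow."""
    return max(0.0, my_load - capacity)

def adaptable(my_load, all_loads):
    """Adaptable: shed only the surplus above the cluster average,
    reacting to the current state of the whole cluster."""
    return max(0.0, my_load - sum(all_loads) / len(all_loads))
```

The evaluation that follows compares these strategies on create-heavy and compile workloads.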
Evaluation
Evaluation: Creates Workload
• % of total load on each of the four MDSs:
  25 / 25 / 25 / 25
  25 / 0 / 0 / 75
  25 / 13 / 13 / 50
Workload: Creates in Same Directory
[Graph: speedup over 1 MDS for each strategy; annotations: “best speedup”, “stable”, “distribution not worthwhile”, “overloaded MDS”; regions marked “better than 1 MDS” and “worse than 1 MDS”; x-axis: strategy]
Workload: Compiling Code
[Graph: Adaptable Balancer performance; annotations: “system not saturated”, “best speedup, most stable”, “too aggressive = bad performance”; regions marked “better than 1 MDS” and “worse than 1 MDS”]
Conclusion: Separate Policy and Mechanism
• Benefits of understanding server capacity: lower resource utilization, better performance/stability
• Distribution can hurt performance/stability
• Being too aggressive thrashes the workload
Thanks! Questions?
Acknowledgements:
Co-authors: Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^
Collaborators: Ivo Jimenez, Adam Crume
Funding: HP Enterprise, storage division
Extra Slides
Why is Locality Important?
More Recent History: Distributed Metadata
• Mechanisms for migrating load
• Heuristics for migrating resources
Evaluation: Compile Workload
Background: CephFS
• Why layering a file system over RADOS is effective:
  • Random access
  • Significant engineering effort
  • Specialized subsystem for handling the namespace