DHT2 - O Brother, Where Art Thou with Shyam Ranganathan
DHT2 - O Brother, Where Art Thou?
Shyamsundar Ranganathan, Developer
Session aims to explore...
- "The hypothetical treasure at the end of the journey": Why DHT2
- "The plan...": DHT2 design
- "Known adventures along the way!": Challenges in DHT2
- "The strange characters": Challenges because of DHT2
- "Trouble escaping the chain gang!": Where are we with DHT2
Loosely inspired by the movie: https://en.wikipedia.org/wiki/O_Brother,_Where_Art_Thou%3F
Why DHT2
DHT pitfalls
- Directories on all subvolumes
- Layout per directory
- Rebalance IO path handling and non-optimal data movement
This impacts scale and correctness!
Correctness can be addressed in DHT:
- Broader locking semantics for dentry operations
- Possibly single layout adoption
But this increases complexity and could cost performance!
With DHT2 the goal is to fix all of the above, retaining or improving performance
DHT2 Design: The file system objects
View the file system as a collection of related objects
"wait a second... isn't that what inodes and data pointers are?"
Yes, but they are not distributed!
- Directory objects denote hierarchy, storing <name, inode#> tables
- File object maintains inode related metadata
- Actual file data is maintained in data object(s)
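The three object types above can be pictured with a minimal Python sketch. This is only an illustration of the idea, not DHT2 code: the class names, the `resolve` helper, and the inode numbers are all hypothetical, and everything lives in one process here, whereas the point of DHT2 is that these objects are distributed.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DirObject:
    """Denotes hierarchy: a table of <name, inode#> entries."""
    inode: int
    entries: Dict[str, int] = field(default_factory=dict)


@dataclass
class FileObject:
    """Inode-related metadata only; file bytes live in data object(s)."""
    inode: int
    size: int = 0
    mode: int = 0o644


@dataclass
class DataObject:
    """Actual file data, associated with the owning file's inode number."""
    inode: int
    content: bytes = b""


def resolve(path: str, dirs: Dict[int, DirObject], root_inode: int = 1) -> int:
    """Walk the <name, inode#> tables from the root to find a path's inode."""
    inode = root_inode
    for name in [part for part in path.split("/") if part]:
        inode = dirs[inode].entries[name]
    return inode


# Hypothetical inode numbers for a tree: root -> {Dir1 -> {Dir2, File2}, File1}
dirs = {
    1: DirObject(1, {"Dir1": 2, "File1": 5}),
    2: DirObject(2, {"Dir2": 3, "File2": 4}),
    3: DirObject(3),
}
files = {4: FileObject(4, size=11), 5: FileObject(5, size=11)}
data = {4: DataObject(4, b"file2 bytes"), 5: DataObject(5, b"file1 bytes")}

inode = resolve("/Dir1/File2", dirs)
print(inode, files[inode].size, data[inode].content)
```

Note that only directory objects know names; file and data objects are reachable purely by inode number.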
The file system objects (example)
Client View ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1
[Figure: the client view above mapped onto inodes/dinodes and file data. Dir Objects: root, Dir1, Dir2. File Objects: File1, File2. Each file object has an associated Data Object. Successive builds of the slide show:
- The different objects, segregated by type
- Namespace hierarchy representation
- Data association]
DHT2 Design: Distribution details
- Distribute inodes using GFID in the metadata ring
- No hierarchy, a directory object lives only on one subvolume
- Use GFID as the data object# in the data ring
- Distribution is hence not name dependent, and we just use a single layout per ring
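A rough sketch of what name-independent placement means in practice: an object's location on each ring is a function of its GFID alone, so changing its name changes only the parent's directory entry. The `Ring` class, the subvolume names, and the use of CRC32 over equal hash ranges are assumptions made for this illustration, not the DHT2 hashing scheme.

```python
import uuid
import zlib


class Ring:
    """One ring (metadata or data) with a single layout for all objects."""

    def __init__(self, subvolumes):
        self.subvolumes = list(subvolumes)

    def place(self, gfid: uuid.UUID) -> str:
        # Assumption: CRC32 of the GFID bytes, split into equal hash ranges.
        h = zlib.crc32(gfid.bytes)
        span = 2**32 // len(self.subvolumes)
        index = min(h // span, len(self.subvolumes) - 1)
        return self.subvolumes[index]


metadata_ring = Ring(["md-0", "md-1"])                    # few bricks
data_ring = Ring(["d-0", "d-1", "d-2", "d-3", "d-4"])     # many bricks

gfid = uuid.UUID("00ef5a1c-3b7e-4d2a-9c41-7f0e2b6d8a13")  # hypothetical GFID
dentries = {"File1": gfid}  # <name, GFID> table held by the parent directory

before = (metadata_ring.place(gfid), data_ring.place(gfid))

# Renaming touches only the directory entry; the GFID, and hence the
# placement of the file object and its data object, stays the same.
dentries["File1-renamed"] = dentries.pop("File1")
after = (metadata_ring.place(dentries["File1-renamed"]),
         data_ring.place(dentries["File1-renamed"]))

print(before, after)
```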
Distribution details (example)
Client View ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1
[Figure: the same objects, now identified by GFID and spread over a Metadata Ring (few bricks) holding the Dir Objects and File Objects, and a Data Ring (many bricks) holding the Data Objects. GFIDs shown: root 0001, Dir1 BA11, Dir2 7525, File1 00EF, File2 BAC5. Caption: Switch names to GFID, add name to dinodes]
Directory entries:
- root (0001): <File1, 00EF> <Dir1, BA11>
- Dir1 (BA11): <File2, BAC5> <Dir2, 7525>
DHT2 Design: Distribution details (contd.)
- Layout is based on bucket to subvolume assignment
  - Where, buckets >> subvolumes
- Bucket ID is encoded into first n bytes of the GFID
  - Trivial GFID based operations
- Collocates file object with parent object
  - File object# statically inherits parent directory# bucket ID
  - Optimized readdirp and lookup operations (no hopping unless non-trivially renamed, or a link file)
  - IOW, optimized (pGFID, basename) based operations
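The bucket scheme above might look roughly like the following. This is a sketch under stated assumptions: `BUCKET_BYTES` (the slide's "n"), the random bucket choice for new directories, and the round-robin bucket-to-subvolume assignment are all hypothetical, as are the function names. The GFIDs built here are plain 16-byte values and do not set UUID version bits.

```python
import os
import uuid

BUCKET_BYTES = 2  # hypothetical value of "n": 2 bytes gives 65536 buckets


def bucket_of(gfid: uuid.UUID) -> int:
    """The bucket ID is read straight from the first n bytes of the GFID."""
    return int.from_bytes(gfid.bytes[:BUCKET_BYTES], "big")


def new_gfid(bucket: int) -> uuid.UUID:
    """Build a GFID whose first n bytes encode the given bucket ID."""
    prefix = bucket.to_bytes(BUCKET_BYTES, "big")
    return uuid.UUID(bytes=prefix + os.urandom(16 - BUCKET_BYTES))


def new_dir_gfid() -> uuid.UUID:
    # Assumption: a new directory gets a randomly chosen bucket.
    return new_gfid(int.from_bytes(os.urandom(BUCKET_BYTES), "big"))


def new_file_gfid(parent_gfid: uuid.UUID) -> uuid.UUID:
    # A file object statically inherits its parent directory's bucket ID.
    return new_gfid(bucket_of(parent_gfid))


def build_layout(subvolumes, n_buckets=256 ** BUCKET_BYTES):
    # Assumption: round-robin bucket assignment; buckets >> subvolumes.
    return {b: subvolumes[b % len(subvolumes)] for b in range(n_buckets)}


def subvolume_for(gfid: uuid.UUID, layout) -> str:
    return layout[bucket_of(gfid)]


layout = build_layout(["md-0", "md-1", "md-2"])
dir_gfid = new_dir_gfid()
file_gfid = new_file_gfid(dir_gfid)

# Parent and child share a bucket, so they resolve to the same subvolume
# and a (pGFID, basename) lookup needs no extra hop.
print(subvolume_for(dir_gfid, layout), subvolume_for(file_gfid, layout))
```

In the deck's own four-digit example the same pattern is visible: Dir1 is BA11 and its child File2 is BAC5, both in bucket BA.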
Distribution details (example)
[Figure: the same example tree and GFIDs, built up in four steps over the Metadata Ring (few bricks) and the Data Ring (many bricks), with buckets 00, 75 and BA assigned to bricks/subvolumes:
- Add bricks/subvolumes
- Assign buckets to bricks
- Place directories based on bucket encoded in the GFID
- Colocate the files under a directory with the same bucket ID]
DHT2 Design: Rebalance
- Reassign buckets to/from newer/removed subvolumes
  - fix-layout is instantaneous
  - Files travel with directories (same bucket colocation)
- Expand the cluster, but perform no rebalance
  - aka just add-brick and let min-free-disk+link-to do its job
  - This is the tough one, use layout versions/histories to pull this off?
- Split DHT2 into client-server pieces
  - Handle IO traffic, locking during rebalance
  - Better consistency model for transactions
- Ability to have different expansion strategies for the 2 rings
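To illustrate why reassigning buckets makes the layout fix cheap, here is a small sketch of adding a subvolume to a bucket-to-subvolume map. The `add_subvolume` function and its "take buckets from whoever owns more than the fair share" policy are assumptions for illustration only; the slides do not specify how buckets are chosen, and the actual migration of objects in the moved buckets is not shown.

```python
from collections import Counter


def add_subvolume(layout, new_subvol):
    """Return (new_layout, moved_buckets) after adding one subvolume.

    Only the bucket -> subvolume map changes; GFIDs are untouched, so a
    directory and its files (same bucket) move together.
    """
    new_layout = dict(layout)
    owners = Counter(layout.values())
    # Fair share of buckets for the new subvolume (hypothetical policy).
    target = len(layout) // (len(owners) + 1)
    moved = []
    for bucket in sorted(layout):
        if len(moved) == target:
            break
        donor = layout[bucket]
        # Take buckets only from subvolumes above the new fair share.
        if owners[donor] > target:
            new_layout[bucket] = new_subvol
            owners[donor] -= 1
            moved.append(bucket)
    return new_layout, moved


subvols = ["sv-0", "sv-1", "sv-2"]
layout = {b: subvols[b % len(subvols)] for b in range(256)}

new_layout, moved = add_subvolume(layout, "sv-3")
print(len(moved), Counter(new_layout.values()))
```

Buckets that were not reassigned keep their owner, so objects in those buckets need no movement at all.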
Challenges in DHT2
- Rename ELOOP checking requires hierarchy
  - Object backpointers
- Time and size information should be in sync between data and metadata objects
  - Dirty inode, tracked via open fd
- Orphan GFID cleanup
  - Enter transactions/journals!
- Directories as files/in a DB
  - Reduce local FS inode proliferation
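As an illustration of the first challenge, a rename loop check can be done by following backpointers from the destination directory up to the root. This sketch assumes each object stores its parent's GFID in a `parent_of` map; the function name and the map are hypothetical, and the short GFIDs are the ones used in the deck's example (root 0001, Dir1 BA11, Dir2 7525, File1 00EF, File2 BAC5). In DHT2 each step of this walk may land on a different subvolume, which is part of why the check is hard.

```python
ROOT_GFID = "0001"


def would_create_loop(src_gfid, new_parent_gfid, parent_of):
    """True if moving directory src under new_parent makes src its own ancestor."""
    current = new_parent_gfid
    while True:
        if current == src_gfid:
            return True
        if current == ROOT_GFID:
            return False
        # Follow the backpointer one level up the hierarchy.
        current = parent_of[current]


# Backpointers (child GFID -> parent GFID) for the example tree.
parent_of = {
    "BA11": "0001",  # Dir1 under root
    "7525": "BA11",  # Dir2 under Dir1
    "00EF": "0001",  # File1 under root
    "BAC5": "BA11",  # File2 under Dir1
}

print(would_create_loop("BA11", "7525", parent_of))  # Dir1 -> under Dir2
print(would_create_loop("7525", "0001", parent_of))  # Dir2 -> under root
```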
Challenges because of DHT2
- IO path cannot depend on hierarchy (Ex: quota)
- Quick-read cannot fetch data in lookups
- Anon-fd based operations cannot track dirty inodes
- Others
  - Will changelog play well?
  - EC has to bother with only data?
  - Tier may need a rethink
  - Sharding may accrue cost of missing anon-fd and data/metadata split of shards
- Unknowns!
Where are we with DHT2
- Introduced DHT Version 2 at the Barcelona summit, 2015
- Followed up with 2 discussions upstream on core concepts [1] [2]
- Followed up with a POC and some slides/documents to demonstrate the concepts [3]
- In limbo since then, but not out of the picture yet!
- Targeting an experimental release with 4.0
Questions?
"The treasure you seek shall not be the treasure you find."
References
[1] DHT2 Design Discussion: https://goo.gl/tLpqJO
[2] DHT2 Design Discussion, Round 2: https://goo.gl/dCAO36
[3] POC trail…: http://www.gluster.org/pipermail/gluster-devel/2015-August/046369.html
Other threads of interest:
- http://www.gluster.org/pipermail/gluster-devel/2016-March/048874.html
- http://www.gluster.org/pipermail/gluster-devel/2015-November/047098.html
- http://www.gluster.org/pipermail/gluster-devel/2015-September/046630.html