BabuDB: Fast and Efficient File System Metadata Storage · 2010. 5. 4.
SNAPI 2010 · Jan Stender
BabuDB: Fast and Efficient File System Metadata Storage
Jan Stender, Björn Kolbeck, Mikael Högqvist
Zuse Institute Berlin
Felix Hupfeld
Google GmbH Zurich
Motivation
– Modern parallel / distributed file systems:
– Huge numbers of files and directories
– Many storage servers but few metadata servers
– Examples:
– Lustre, Panasas Active Scale, Google File System
– Metadata access critical wrt. system performance
– ~75% of all file system calls are metadata accesses
– Metadata servers are bottlenecks
Motivation
– B-tree-like data structures used for metadata storage
– ZFS, btrfs, Lustre, PVFS2
– Downsides:
– Hard to implement and test, high code complexity
– Multi-version B-trees even more complex
– On-disk re-balancing expensive
BabuDB
– Key-value store
– FS metadata: key-value pairs stored in DB indices
BabuDB: Index
Example
Example: Insertions
Example: Lookups
Example: Deletions
Example: Range Lookups
Example: Checkpoints
On-disk Index
– Sorted by Keys
– Block index in RAM, blocks mmap'ed
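One way to picture the layout above: the index is a sorted run of key-value pairs cut into blocks, and a small in-memory block index stores the first key of each block. The following is a minimal sketch under stated assumptions, not BabuDB's actual file format. The class name `BlockedSortedIndex`, the `block_size` parameter, and the use of Python lists in place of mmap'ed file regions are all hypothetical.

```python
import bisect


class BlockedSortedIndex:
    """Sorted key-value pairs split into blocks, plus an in-memory block index.

    Assumption: plain Python lists stand in for mmap'ed blocks of a file.
    """

    def __init__(self, sorted_pairs, block_size=4):
        # Cut the sorted run into fixed-size blocks.
        self.blocks = [sorted_pairs[i:i + block_size]
                       for i in range(0, len(sorted_pairs), block_size)]
        # The block index kept in RAM: the first key of every block.
        self.first_keys = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        # Step 1: binary search the block index for the last block
        # whose first key is <= the requested key.
        pos = bisect.bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None  # key sorts before everything in the index
        block = self.blocks[pos]
        # Step 2: binary search inside that single block.
        keys = [k for k, _ in block]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return block[i][1]
        return None


idx = BlockedSortedIndex(
    [("a", 1), ("c", 2), ("e", 3), ("g", 4), ("i", 5)], block_size=2)
```

With this layout a lookup touches the in-memory block index and then exactly one block, which is the property the slide's bullets suggest.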
BabuDB: Related Work
– Inspired by log-structured merge trees (LSM-trees)
– Only one on-disk index
– No "rolling merge"
– Made popular by Google Bigtable
– Insert/lookup/merge similar to Bigtable's tablets
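The insert, lookup, delete, range lookup and checkpoint steps named in the example slides can be sketched roughly as follows. This is an illustrative LSM-style toy under stated assumptions, not BabuDB's implementation: the class `LsmLikeIndex`, the tombstone object `_DELETED`, and the in-memory lists standing in for the single on-disk index are hypothetical, and the sketch leaves out logging, concurrency and persistence.

```python
import bisect

_DELETED = object()  # tombstone: marks a key as deleted in the overlay


class LsmLikeIndex:
    def __init__(self):
        self.overlay = {}      # in-memory changes: key -> value or _DELETED
        self.base_keys = []    # stand-in for the single on-disk index (sorted)
        self.base_values = []  # values parallel to base_keys

    def insert(self, key, value):
        # Writes only touch the in-memory overlay.
        self.overlay[key] = value

    def delete(self, key):
        # A deletion is an insertion of a tombstone.
        self.overlay[key] = _DELETED

    def lookup(self, key):
        # The overlay shadows the base index.
        if key in self.overlay:
            value = self.overlay[key]
            return None if value is _DELETED else value
        i = bisect.bisect_left(self.base_keys, key)
        if i < len(self.base_keys) and self.base_keys[i] == key:
            return self.base_values[i]
        return None

    def range_lookup(self, first, last):
        """Return sorted (key, value) pairs with first <= key < last."""
        lo = bisect.bisect_left(self.base_keys, first)
        hi = bisect.bisect_left(self.base_keys, last)
        merged = dict(zip(self.base_keys[lo:hi], self.base_values[lo:hi]))
        for key, value in self.overlay.items():
            if first <= key < last:
                merged[key] = value  # overlay entries win over base entries
        live = [(k, v) for k, v in merged.items() if v is not _DELETED]
        return sorted(live, key=lambda kv: kv[0])

    def checkpoint(self):
        # Merge overlay and base into one new sorted index in a single pass
        # (no rolling merge), dropping tombstones, then clear the overlay.
        merged = dict(zip(self.base_keys, self.base_values))
        merged.update(self.overlay)
        live = sorted(((k, v) for k, v in merged.items() if v is not _DELETED),
                      key=lambda kv: kv[0])
        self.base_keys = [k for k, _ in live]
        self.base_values = [v for _, v in live]
        self.overlay = {}


db = LsmLikeIndex()
db.insert("a", 1)
db.insert("b", 2)
db.checkpoint()   # "a" and "b" now live in the base index
db.delete("a")    # tombstone in the overlay hides the base entry
db.insert("c", 3)
```

Because the base index is never modified in place, a checkpoint in this sketch simply replaces it with a freshly merged one.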
BabuDB: Metadata Mapping
– Mapping a hierarchical directory tree to a flat database index:
BabuDB: Advantages
– Why BabuDB for File System Metadata?
– Short-lived files
▪ 50% of all files deleted within 5 minutes
– Atomic file system operations w/o locking or transactions
▪ e.g. rename
– Directory content in contiguous disk regions
▪ Efficient readdir + stat
– Snapshots
▪ No need for multi-version data structures
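To illustrate how a flat, sorted index can keep directory content contiguous, the sketch below keys every entry by the ID of its parent directory followed by the file name. This key layout is an assumption made for illustration, not necessarily the mapping BabuDB or XtreemFS uses. The class `FlatMetadataIndex` and its methods are hypothetical, and a sorted in-memory list stands in for the database index.

```python
import bisect


class FlatMetadataIndex:
    """Directory tree stored in one sorted index keyed by (parent_id, name)."""

    def __init__(self):
        self.keys = []    # sorted list of (parent_id, name) tuples
        self.values = []  # metadata records, parallel to self.keys

    def put(self, parent_id, name, metadata):
        key = (parent_id, name)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = metadata  # overwrite an existing entry
        else:
            self.keys.insert(i, key)
            self.values.insert(i, metadata)

    def stat(self, parent_id, name):
        # A stat is a single point lookup.
        key = (parent_id, name)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def readdir(self, parent_id):
        # All entries of one directory share the parent_id prefix, so they
        # form one contiguous key range: readdir + stat is one range scan.
        lo = bisect.bisect_left(self.keys, (parent_id, ""))
        hi = bisect.bisect_left(self.keys, (parent_id + 1, ""))
        return [(name, meta)
                for (_, name), meta in zip(self.keys[lo:hi], self.values[lo:hi])]


fs = FlatMetadataIndex()
fs.put(1, "home", {"id": 3, "type": "dir"})
fs.put(1, "etc", {"id": 2, "type": "dir"})
fs.put(2, "passwd", {"type": "file", "size": 100})
fs.put(3, "alice", {"id": 4, "type": "dir"})
```

Here `fs.readdir(1)` returns the entries for `etc` and `home` together with their metadata in one scan, without a separate lookup per file.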
BabuDB: Evaluation
– Linux kernel build
– ~10M calls: 44% stat, 40% open, 15% readlink, 1% others
– Dovecot mail server + imaptest
– ~2M calls: 51% stat, 48% open, 1% others
[Figure: two bar charts comparing BabuDB and ext4, y-axis in seconds. Panels: "Dovecot test" (axis 0 to 400) and "Kernel build" (axis 0 to 2000).]
BabuDB: Evaluation
– Listing directory content
Summary
– BabuDB is ...
– an efficient key-value store
– optimized for file system metadata but also suitable for other purposes
– suitable for large-scale databases
– available for Java and C++ under BSD license
– used in the XtreemFS metadata server
http://babudb.googlecode.com
http://www.xtreemfs.org
Thank you for your attention!
Background: XtreemFS
– XtreemFS: a distributed replicated Internet file system
– part of the XtreemOS research project
– developed since 2006 by partners from Germany, Spain and Italy
www.xtreemfs.org
– Object-based architecture:
– MRC stores metadata
– OSDs store pure file content as objects
– Clients provide POSIX file system interface
The XtreemOS Project
– Research project funded by the European Commission
– 19 partners from Europe and China
– XtreemFS is the data management component
– developed by ZIB, NEC HPC Europe, Barcelona Supercomputing Center and ICAR-CNR Italy
– ~ 3 years of development
– first public release in August 2008
XtreemFS: Overview
– What is XtreemFS?
– a distributed and replicated POSIX-compliant file system
– off-the-shelf servers, no expensive hardware
– servers in Java, run on Linux / OS X / Solaris
– client in C, runs on Linux / OS X / Windows
– secure (X.509 and SSL)
– easy to install and maintain
– open source (GPL)
File System Landscape
– PC: ext3, ZFS, NTFS
– Network FS / Centralized: NFS, SMB, AFS/Coda
– Cluster FS / Data Center: Lustre, Panasas, GPFS, CEPH...
– Internet: GDM "gridftp", Grid File System GFarm