Boxwood: Abstractions as the Foundation for Storage Infrastructure
Lidong Zhou, Microsoft Research Silicon Valley
Joint work with Chandu Thekkath, John MacCormick, Nick Murphy, and Marc Najork
12/06/2004 Boxwood 2
Distributed Storage Applications are Hard to Build
Distributed storage: low hardware cost, but high development/deployment cost
Application logic on low-level storage interface
Hardware parallelism and concurrency control
Fault tolerance a necessity
Incremental expansion and dynamic reconfiguration vs. system consistency
Our goal: Distributed storage applications made easy to design, build, and deploy
Target Application and Setting
[Diagram: several machines, each with CPU + Memory and Locally Attached Disks, connected by a Local Area Network]
Enterprise storage applications and back-end storage for data-intensive Internet services
Roadmap
Boxwood Vision
Boxwood Architecture
Building Applications on Boxwood
Performance
Related Work and Conclusion
Boxwood Vision
Incorporate rich virtualized abstractions into low levels of the storage
An evolution path for distributed storage:
[Diagram, built up across three slides: Storage Applications alone; then Storage Applications with a Virtual Disk; then Storage Applications with Tree, Table, List, …]
Why High-Level Abstractions
Reduce the complexity of distributed storage applications
Natural continuum of storage virtualization
“High-level programming language” for building distributed storage applications
Potential built-in performance optimization by exploiting structural information
  Caching
  Prefetching
Boxwood Architecture
[Diagram: layered architecture. Labels: Storage Application; High-level Storage Abstractions; B-Tree; Chunk Store; Reliable “Media”; Replicated Logical Device; Magnetic Media; Services (Locking, Logging, Consensus)]
Chunk Store
Persistent storage with “malloc”-like interface
Virtualization layer that hides the distributed nature
Manage address space or free space for higher layers
Reliable storage through replicated logical device
[Diagram: Chunk Store (Allocate, De-allocate, Read, Write) and Replicated Logical Device]
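The four operations named on this slide (Allocate, De-allocate, Read, Write) suggest an interface along the following lines. This is a minimal in-memory sketch, not Boxwood's actual API: the class name, method signatures, integer handles, and the `offset`/`length` parameters are all assumptions made for illustration, and nothing here is distributed or replicated.

```python
import itertools


class ChunkStore:
    """In-memory stand-in for a chunk store with a malloc-like interface.

    Hypothetical sketch: handles are plain integers and data lives in a dict.
    """

    def __init__(self):
        self._chunks = {}                 # handle -> bytearray
        self._next = itertools.count(1)   # source of fresh opaque handles

    def allocate(self, size):
        """Reserve a zero-filled chunk of `size` bytes and return its handle."""
        handle = next(self._next)
        self._chunks[handle] = bytearray(size)
        return handle

    def deallocate(self, handle):
        """Free a chunk; later use of the handle raises KeyError."""
        del self._chunks[handle]

    def write(self, handle, data, offset=0):
        """Overwrite part of a chunk, refusing writes past its end."""
        chunk = self._chunks[handle]
        if offset < 0 or offset + len(data) > len(chunk):
            raise ValueError("write outside chunk bounds")
        chunk[offset:offset + len(data)] = data

    def read(self, handle, offset=0, length=None):
        """Return `length` bytes from `offset` (to the end if length is None)."""
        chunk = self._chunks[handle]
        end = len(chunk) if length is None else offset + length
        return bytes(chunk[offset:end])
```

A caller sees only handles, which is what would let a lower layer place or replicate the data without the caller knowing where it lives.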
B-Tree Abstraction
B-Tree: A proven useful data structure for storage applications
Distributed/reliable B-Link trees in Boxwood
  B-Link trees: high concurrency with simple locking
  Distributed reliable storage from chunk store
  Caching for performance
  Distributed lock service for consistency
  Logging for recovery
[Diagram: B-Link Tree (Insert, Delete, Lookup, Enumerate, Create) with Chunk Store, Locking, and Logging]
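The "high concurrency with simple locking" property of B-Link trees comes from the right-link that every node keeps to its sibling: a reader holding a stale pointer to a node that has since split can still find its key by moving right. The sketch below shows only that idea at the leaf level. It is a single-threaded illustration, not Boxwood's implementation; the `Leaf`, `split`, `find_leaf`, and `lookup` names are hypothetical, and locking and interior nodes are omitted.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int] = field(default_factory=list)   # kept sorted
    high_key: Optional[int] = None                   # largest key allowed here; None means unbounded
    right: Optional["Leaf"] = None                   # link to the right sibling


def split(leaf):
    """Move the upper half of `leaf` into a new right sibling and link to it.

    Assumes the leaf holds at least two keys.
    """
    mid = len(leaf.keys) // 2
    sibling = Leaf(keys=leaf.keys[mid:], high_key=leaf.high_key, right=leaf.right)
    leaf.keys = leaf.keys[:mid]
    leaf.high_key = leaf.keys[-1]
    leaf.right = sibling
    return sibling


def find_leaf(start, key):
    """Follow right-links until reaching the leaf whose range covers `key`."""
    node = start
    while node.high_key is not None and key > node.high_key:
        node = node.right
    return node


def lookup(start, key):
    """Return True if `key` is present, starting from a possibly stale leaf."""
    leaf = find_leaf(start, key)
    i = bisect_left(leaf.keys, key)
    return i < len(leaf.keys) and leaf.keys[i] == key
```

For example, a reader that cached a pointer to a leaf holding keys 1 to 4 still finds key 4 after that leaf splits, because the search moves right once the key exceeds the leaf's new high key.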
Boxwood Services
Distributed lock service for coordinating concurrent access to shared data
Logging and recovery service for atomicity in the face of transient failures
Consensus service for system consistency
Clean design of these services is crucial for scalability and for managing complexity
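As an illustration of what "coordinating concurrent access to shared data" might look like from a client's point of view, here is a single-process sketch of a lock table. The shared/exclusive modes, the non-blocking `acquire` that returns a boolean, and every name in it are assumptions for the example; the slides do not specify the lock service's interface, and a real service would also deal with failures and remote clients.

```python
class LockService:
    """Single-process stand-in for a lock service with shared and exclusive modes."""

    def __init__(self):
        self._holders = {}   # lock name -> (mode, set of client ids)

    def acquire(self, client, name, mode):
        """Try to take lock `name`; return True on success, False on conflict.

        Locks are not re-entrant in this sketch.
        """
        if mode not in ("shared", "exclusive"):
            raise ValueError("unknown mode: " + mode)
        held = self._holders.get(name)
        if held is None:
            self._holders[name] = (mode, {client})
            return True
        held_mode, clients = held
        if mode == "shared" and held_mode == "shared":
            clients.add(client)      # many readers may share the lock
            return True
        return False                 # conflict: the caller retries later

    def release(self, client, name):
        """Drop `client`'s hold on `name`, freeing the lock when nobody holds it."""
        held = self._holders.get(name)
        if held is None or client not in held[1]:
            raise KeyError("lock not held: " + name)
        held[1].discard(client)
        if not held[1]:
            del self._holders[name]
```

Keeping the interface this small is one way to read the slide's point about clean design: higher layers depend only on acquire and release, not on how the service is distributed.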
Distributed Storage Applications on Boxwood: A Recipe
1. Design applications for local storage
   Map application logic to storage abstractions
2. Adapt the design for a distributed storage infrastructure
   Boxwood abstractions are virtualized
   Boxwood offers facilitating distributed services
Separating algorithmic design from distributed system concerns is attractive.
From B-Link Tree Algorithm to Distributed Reliable B-Link Trees
B-Link trees on a single machine
[Diagram: B-Link Tree Algorithm with Local Locks and Logging, on Local Disks]
From B-Link Tree Algorithm to Distributed Reliable B-Link Trees
Distributed and reliable B-Link trees
[Diagram: B-Link Tree Algorithm with Global Lock Service and Reliable Logging, on Chunk Store and Replicated Logical Device]
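The step from the single-machine picture to the distributed one can be read as swapping the components underneath an unchanged algorithm. The sketch below illustrates that separation under loud assumptions: `update_node` stands in for one step of a tree algorithm, and `LocalLocks`, `LocalDisk`, `ListLog`, and `MirroredStore` are hypothetical in-process stand-ins, not Boxwood components.

```python
import threading
from contextlib import contextmanager


class LocalLocks:
    """Per-name mutexes for a single machine."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    @contextmanager
    def held(self, name):
        with self._guard:
            lock = self._locks.setdefault(name, threading.Lock())
        with lock:
            yield


class LocalDisk:
    """Key/value stand-in for local storage."""

    def __init__(self):
        self._blocks = {}

    def read(self, key):
        return self._blocks.get(key)

    def write(self, key, value):
        self._blocks[key] = value


class MirroredStore:
    """Hypothetical stand-in for replicated storage: every write goes to all replicas."""

    def __init__(self, replicas):
        self._replicas = list(replicas)

    def read(self, key):
        return self._replicas[0].read(key)

    def write(self, key, value):
        for replica in self._replicas:
            replica.write(key, value)


class ListLog:
    """Append-only log kept in memory."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


def update_node(storage, locks, log, node_id, new_value):
    """One algorithm step, written once against the three interfaces."""
    with locks.held(node_id):
        old = storage.read(node_id)
        log.append((node_id, old, new_value))   # record the change before applying it
        storage.write(node_id, new_value)
    return old
```

Calling `update_node` with a `LocalDisk` or with a `MirroredStore` requires no change to the function itself, which is the attraction of keeping algorithmic design apart from distributed system concerns.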
BoxFS: Multi-Node File Server on Boxwood
Exported via NFS v2
Directory/File B-Tree
  Directory: maps names to NFS file handle with embedded B-tree handle
  File: maps block number to chunk handle
File blocks map to chunks
Locking/caching at file system level
~2500 lines of C# code
[Diagram: BoxFS with B-Link Tree, Chunk Store, and Services]
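The two mappings on this slide (directory entries to file handles, and block numbers to chunk handles) can be sketched with plain dictionaries standing in for the B-trees and the chunk store. Everything below is illustrative: `TinyFS`, `FileHandle`, and the integer handles are hypothetical, BoxFS itself is written in C#, and NFS, locking, and caching are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandle:
    tree_handle: int   # hypothetical: identifies the file's own block-number tree


class TinyFS:
    """Toy file layer: dicts stand in for B-trees and for the chunk store."""

    def __init__(self):
        self.directory = {}   # name -> FileHandle (directory tree)
        self.trees = {}       # tree handle -> {block number -> chunk handle}
        self.chunks = {}      # chunk handle -> bytes (chunk store)
        self._next = 1

    def _fresh(self):
        """Return a new integer handle."""
        handle = self._next
        self._next += 1
        return handle

    def create(self, name):
        """Create an empty file and record it in the directory."""
        fh = FileHandle(tree_handle=self._fresh())
        self.trees[fh.tree_handle] = {}
        self.directory[name] = fh
        return fh

    def write_block(self, fh, block_no, data):
        """Store one file block in its own chunk, allocating the chunk on first write."""
        tree = self.trees[fh.tree_handle]
        chunk = tree.get(block_no)
        if chunk is None:
            chunk = tree[block_no] = self._fresh()
        self.chunks[chunk] = bytes(data)

    def read_block(self, name, block_no):
        """Resolve name -> file handle -> chunk handle -> data; None if unwritten."""
        fh = self.directory[name]
        chunk = self.trees[fh.tree_handle].get(block_no)
        return None if chunk is None else self.chunks[chunk]
```

Reading a block walks both mappings in turn, so the file layer never needs to know where a chunk is physically stored.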
Prototype Deployment and Performance Evaluation
System setup
  Eight Dell PowerEdge 2650 servers with a single 2.4 GHz Xeon processor, 1 GB of RAM
  Gigabit Ethernet switch
  Adaptec AIC-7899 dual SCSI adapter, and 5 SCSI drives
Performance evaluation
  Single-machine non-replicated performance (BoxFS vs. NFS)
  B-tree operation scalability
  BoxFS operation scalability
BoxFS vs. NFS over NTFS: Connectathon Benchmarks
[Bar chart: BoxFS and NFS compared on create, remove, getwd+stat, chmod+stat, write, read, readdir, rename, symlink, statfs; vertical axis 0 to 12, unit not shown]
B-Tree Scaling (Private Tree)
[Chart: Throughput (Ops/sec) vs. Number of Servers (2, 4, 6, 8); series: Inserts, Deletes, Lookups; Insert & Delete axis 0 to 900, Lookup axis 0 to 10000]
BoxFS Scaling (Read)
[Chart: Read Throughput (MB/sec), axis 0 to 3, vs. Number of BoxFS servers (2 to 8)]
B-Tree Scaling (Shared Tree)
[Chart: Throughput (Ops/sec) vs. Number of Servers (2, 4, 6, 8); series: Inserts, Deletes, Lookups; Insert & Delete axis 0 to 600, Lookup axis 0 to 4000]
BoxFS Scaling (Write/MkDirEnt)
[Chart: Write Throughput (MB/sec), axis 0 to 1, and MkDirEnt Latency (sec), axis 1 to 4, vs. Number of BoxFS servers (2 to 8); series: WriteFile, MkDirEnt]
Related Work
Distributed Storage/Operating Systems
  Virtual/Logical disks
  File systems
  Database systems
Scalable Distributed Data Structures
  Linear Hash Table (LH) and its variants (Litwin, 1980-present)
  Scalable distributed hash table (Gribble et al., 2000)
Highly concurrent B-trees (Lehman and Yao, 1981; Sagiv, 1986)
Conclusion and Future Directions
A storage infrastructure offering virtualized high-level abstractions is promising
Future Work:
  Explore more abstractions and applications; expose flexible interfaces (e.g., through hints)
  Leverage high-level abstractions for better load balancing, prefetching, and caching
  Graceful degradation during massive failures