The BW-Tree: A B-tree for New Hardware Platforms
-
Upload
jacqueline-bowers -
Category
Documents
-
view
25 -
download
2
description
Transcript of The BW-Tree: A B-tree for New Hardware Platforms
![Page 1: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/1.jpg)
The BW-Tree: A B-tree for New Hardware
Platforms
Justin Levandoski
David Lomet
Sudipta Sengupta
![Page 2: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/2.jpg)
An Alternate Title
“The BW-Tree: A Latch-free, Log-structured B-tree for Multi-core
Machines with Large Main Memories and Flash Storage”
BW = “Buzz Word”
![Page 3: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/3.jpg)
The Buzz Words (1)
On Disk
InMemory
…data datadata data datadata
• B-tree• Key-ordered access to records• Efficient point and range lookups• Self balancing• B-link variant (side pointers at each level)
![Page 4: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/4.jpg)
The Buzz Words (2)• Multi-core + large main memories
• Latch (lock) free• Worker threads do not set latches for any reason• Threads never block• No data partitioning
• “Delta” updates• No updates in place• Reduces cache invalidation
• Flash storage• Good at random reads and sequential reads/writes• Bad at random writes• Use flash as append log• Implement novel log-structured storage layer over
flash
![Page 5: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/5.jpg)
Outline
• Overview• Deployment scenarios
• Latch-free main-memory index (Hekaton)• Deuteronomy
• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 6: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/6.jpg)
Multiple Deployment Scenarios
• Standalone (highly concurrent) atomic record store
• Fast in-memory, latch-free B-tree
• Data component (DC) in a decoupled “Deuteronomy” style transactional system
![Page 7: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/7.jpg)
Microsoft SQL Server Hekaton
• Main-memory optimized OLTP engine• Engine is completely latch-free• Multi-versioned, optimistic concurrency
control (VLDB 2012)• Bw-tree is the ordered index in Hekaton
http://research.microsoft.com/main-memory_dbs/
![Page 8: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/8.jpg)
Deuteronomy
Transaction Component (TC)1. Guarantee ACID
Properties2. No knowledge of
physical data storage
Logical locking and logging
1. Physical data storage2. Atomic record modifications3. Data could be anywhere
(cloud/local)
Storage
Data Component (DC)
RecordOperations
Control Operations
Client Request Interaction
Contract1. Reliable messaging
“At least once execution”multiple sends
2. Idempotence“At most once execution”
LSNs
3. Causality “If DC remembers message, TC must also”
Write ahead log (WAL) protocol
4. Contract termination“Mechanism to release contract” Checkpointing
http://research.microsoft.com/deuteronomy/
![Page 9: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/9.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 10: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/10.jpg)
Bw-Tree Architecture
B-TreeLayer
CacheLayer
FlashLayer
• API• B-tree search/update
logic• In-memory pages only• Logical page abstraction
forB-tree layer
• Brings pages from flash to RAM as necessary
• Sequential writes to log-structured storage
• Flash garbage collection
Focus of this talk
![Page 11: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/11.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 12: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/12.jpg)
Mapping Table and Logical Pages
PID Physical Address
Mapping Table
Index page
Data page
Index page
On Flash
In-Memory
Data page
1 bit 63 bits
flash/memflag address
• Pages are logical, identified by mapping table index
• Mapping table• Translates logical page ID to physical
address• Important for latch-free behavior and log-
structuring• Isolates update to a single page
![Page 13: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/13.jpg)
Compare and Swap• Atomic instruction that compares contents of a
memory location M to a given value V• If values are equal, installs new given value
V’ in M• Otherwise operation fails
M
CompareAndSwap(&M, 20, 30)
Address
Compare Value
New Value
CompareAndSwap(&M, 20, 40)
2030
X
![Page 14: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/14.jpg)
Delta Updates
Page P
PID Physical
Address
P
Mapping Table
Δ: Insert record 50
• Each update to a page produces a new address (the delta)
• Delta physically points to existing “root” of the page
• Install delta address in physical address slot of mapping table using compare and swap
Δ: Delete record 48
![Page 15: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/15.jpg)
Update Contention
Page P
PID Physical
Address
P
Mapping Table
Δ: Insert record 50
• Worker threads may try to install updates to same state of the page
• Winner succeeds, any losers must retry• Retry protocol is operation-specific (details in
paper)
Δ: Delete record 48
Δ: Update record 35
Δ: Insert Record 60
![Page 16: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/16.jpg)
Delta Types
• Delta method used to describe all updates to a page
• Record update deltas• Insert/Delete/Update of record on a page
• Structure modification deltas• Split/Merge information
• Flush deltas• Describes what part of the page is on log-
structured storage on flash
![Page 17: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/17.jpg)
In-Memory Page Consolidation
17
Page P
PID Physical
Address
P
Mapping Table
Δ: Insert record 50
• Delta chain eventually degrades search performance
• We eventually consolidate updates by creating/installing new search-optimized page with deltas applied
• Consolidation piggybacked onto regular operations
• Old page state becomes garbage
Δ: Delete record 48
Δ: Update record 35
“Consolidated” Page P
![Page 18: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/18.jpg)
Garbage Collection Using Epochs
• Thread joins an epoch prior to each operation (e.g., insert)
• Always posts “garbage” to list for current epoch (not necessarily the one it joined)
• Garbage for an epoch reclaimed only when all threads have exited the epoch (i.e., the epoch drains)
Epoch 1 Epoch 2
Current Epoch
Members
Garbage Collection List
Thread 1
Members
Garbage Collection List
Δ
Thread 2
Δ
Δ
Thread 3
Δ
ΔΔ
Δ
![Page 19: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/19.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 20: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/20.jpg)
Latch-Free Node Splits
• Page sizes are elastic• No hard physical threshold for splitting• Can split when convenient
• B-link structure allows us to “half-split” without latching
1. Install split at child level by creating new page2. Install new separator key and pointer at parent level
PID Physical Address
2
Mapping Table
Split Δ
Page 1 Page 2 Page 3
Page 4
4
Index Entry Δ
Logical pointerPhysical pointer
![Page 21: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/21.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 22: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/22.jpg)
Cache Management
• Write sequentially to log-structured store using large write buffers
• Page marshalling• Transforms in-memory page state to
representation written to flush buffer• Ensures correct ordering
• Incremental flushing• Usually only flush deltas since last flush• Increased writing efficiency
![Page 23: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/23.jpg)
Representation on LSS
Base page
RAM
Flash Memory
.
.
.
.
.
.
Mapping table
Sequential log
Write ordering in log
Base page
Base page
-record
-record
-record
![Page 24: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/24.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 25: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/25.jpg)
Performance Highlights - Setup• Experimented against
• BerkeleyDB standalone B-tree (no transactions)• Latch-free skiplist implementation
• Workloads• Xbox
• 27M get/set operations from Xbox Live Primetime• 94 byte keys, 1200 byte payloads, read-write ratio of 7:1
• Storage Deduplication• 27M deduplication chunks from real enterprise trace• 20-byte keys (SHA-1 hash), 44-byte payload, read-write ratio of
2.2:1
• Synthetic• 42M operations with keys randomly generated• 8-byte keys, 8-byte payloads, read-write ratio of 5:1
![Page 26: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/26.jpg)
Vs. BerkeleyDB
Xbox Synthetic Deduplication0.0
2000000.0
4000000.0
6000000.0
8000000.0
10000000.0
12000000.0
10402244.61
3829679.23
2837646.88
555480.18 660189.09330204.29
BW-Tree BerkeleyDB
Op
era
tio
ns/S
ec (
M)
![Page 27: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/27.jpg)
Vs. Skiplists
Bw-tree Skiplist0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
L1 hits L2 hits L3 hits RAM
Bw-Tree SkiplistSynthetic workload
3.83M Ops/Sec
1.02 M Ops/Sec
![Page 28: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/28.jpg)
Outline
• Overview• Deployment scenarios• Bw-tree architecture• In-memory latch-free pages• Latch-free structure modifications• Cache management• Performance highlights• Conclusion
![Page 29: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/29.jpg)
Conclusion
• Introduced a high performance B-tree
• Latch free• Install delta updates with CAS (do not update
in place)• Mapping table isolates address change to
single page
• Log structured (details in another paper)• Uses flash as append log• Updates batched and written sequentially• Flush queue maintains ordering
• Very good performance
![Page 30: The BW-Tree: A B-tree for New Hardware Platforms](https://reader035.fdocuments.us/reader035/viewer/2022081519/56812c3b550346895d90c41d/html5/thumbnails/30.jpg)
Questions?