A Practical Scalable Distributed B-Treekmsalem/courses/CS848W10/presentations/... · A Practical...

32
A Practical Scalable Distributed B-Tree CS 848 Paper Presentation Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah PVLDB ’08 March 8, 2010 Presenter: Evguenia (Elmi) Eov

Transcript of A Practical Scalable Distributed B-Treekmsalem/courses/CS848W10/presentations/... · A Practical...

A Practical Scalable Distributed B-TreeCS 848 Paper Presentation

Marcos K. Aguilera, Wojciech Golab,Mehul A. Shah

PVLDB ’08

March 8, 2010

Presenter: Evguenia (Elmi) Eflov

Presentation Outline

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Distributed (key,value) Storage

The paper presents three motivating examples:

The back-end of a multiplayer game. Multiplayergames need to store and manage data for thousands ofplayers while providing low latency access and very highdata consistency

Metadata storage for a cluster file system.Metadata access is often the bottleneck in such systems.Metadata changes, for example, file renaming orrelocation, in cluster file systems need to be atomic

Secondary indexes. A lot of application require morethan one index on a set of data to guarantee fast accessbased on different conditions. As in the two previousexamples, data changes need to be atomic

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

B+tree

B-tree is a tree data structure that stores values sorted bykey and allows updates and lookups in amortizedlogarithmic time

B+tree is a form of B-tree where inner nodes of the treestore keys and pointers, and leaf nodes store key-valuepairs

Distributed B+tree

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Sinfonia

Sinfonia is a distributed data storage service that providesACID properties for the application data

Sinfonia provides a data manipulation primitive, aminitransaction

Minitransaction ...

Consists of 3 (possibly empty) sets of operationsOperations are comparisons, reads, and writesReads and writes are performed only if all of thecomparisons are successfulIs performed as part of a two-phase commitSome varieties can be performed in a single phase

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Assumptions

The B-tree operates in a data center environment. Thisguarantees high bandwidth, low latency connectionsbetween client and server machines

Individual machines can fail without causing the systemto stall, but network partitions will stall the system

B-tree is not going to grow or shrink rapidly

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Design of the B-tree

Design of the B-tree

Each server in the system stores some number of innerand leaf nodes of the B-tree

Each server in the system stores the version table of allthe inner nodes of the B-tree

Each client caches all inner nodes of the B-tree, and usesthis cache while executing a transaction

During a transaction, the client composes a set of readsand writes required

At commit time, Sinfonia’s minitransaction is used toperform the B-tree operations required on the server data

Comparisons are added by the B-tree client library toguarantee data consistency

B-tree Operations Efficiency

To make the B-tree efficient, the following three techniquesare used:

Clients use optimistic concurrency control, which workswell unless the B-tree is rapidly shrinking or growing

Since version numbers of the inner nodes are stored ateach server, inner node versions can be checked at anyserver in the system, for example, at the server where aleaf node being accessed is stored

Inner B-tree nodes are lazily replicated by clients - nodesthat a particular client does not access may be stale ornot present on the client

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Standard B-tree Operations

Migration Operations

The distributed B-tree supports the following additionaloperations:

Migrate(x , s) - migrates node B-tree node x to server s

The following operations for multi-node migration:

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Why are transactions required?

In order to guarantee data consistency, each datamanipulation on the B-tree has to be performedatomically, for example, renaming of a file in the clusterfile system or transferring an item and payment for theitem between characters in the computer game

While a minitransaction provided by Sinfonia is sufficientto perform the necessary B-tree node manipulations, it istedious of the user of the B-tree to code in terms of theminitransaction

The B-tree provides transaction interface as a way for theuser to define all the necessary Read and Writeoperations within a transaction, while adding necessarycomparisons to guarantee data consistency

Transaction Interface

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Extensions

The following extensions are suggested to enhance the existingimplementation

Enhanced migration tasks - migration tasks that helpthe system adapt to seasonal variations or balance theload aggressively can be implemented

Dealing with hot-spots - migration task to migratepopular keys to different servers can be implemented,including migration of the keys that are currently storedin the same node

(continued)

Extensions

Varying the replication factor of inner nodes -replicating version numbers of the inner nodes on lowerlevels of the tree less aggressively could decrease the costof modifying those nodes

Finer-grained concurrency control to avoid falsesharing - if concurrency control operated on keys (orsmall groups of keys) rather than nodes, the number ofconflicts could be decreased

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Experimental Setup

10-byte keys, 8-byte values, 12-byte pointers - 4 bytesspecify server, 8 bytes - offset within the server

4 KB nodes, with leaf nodes storing 220 key-value pairsand inner nodes storing 180 key-pointer pairs

Same number of servers and clients

Each client has 4 parallel threads, each thread issues anew request as soon as the current request is completed

Key space consists of 109 elements, with keys chosenuniformly, at random for each operation

Workloads

The following workloads are used in both scalability andmigration experiments

Insert

Lookup

Update - values for existing keys are updated

Mixed - 60% lookups and 40% updates

“Before the insert workload, the B-tree was pre-populatedwith 40,000 elements rather than starting with an emptyB-tree.”

Were the experiments performed in the order they arepresented in?

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Results of the Scalability Experiments

Results of the Migration Experiments

For migration experiments, Move task was performed by amigration client while the rest of the setup for thecorresponding experiment was executed

Migration rate was 55.3 ± 2.7 nodes/s (around 10000key-value pairs/s) on an idle system, and less than 5 nodes/swhen executed with other tasks.

1 BackgroundProblemDistributed B+treeSinfonia

2 Distributed B-tree ImplementationAssumptionsDesign of the B-treeB-tree OperationsTransactionsExtensions

3 Experimental ResultsWorkloadResults

4 DiscussionQuestions

Some Discussion Questions

Are experiments representative of the workload of themotivating examples?

Would larger transactions have different scalability?

Can co-locating of the lower level inner nodes and thecorresponding leaf nodes increase the throughput of thesystem?