HyperDex: A Consistent, Fault-tolerant, Searchable, Transactional NoSQL Store - VNGRS Tech Talks
-
Upload
firat-arig -
Category
Engineering
-
view
287 -
download
0
Transcript of HyperDex: A Consistent, Fault-tolerant, Searchable, Transactional NoSQL Store - VNGRS Tech Talks
HyperDex: A Consistent,Fault-tolerant, Searchable,Transactional NoSQL Store
Emin Gün SirerCornell University / HyperDex, Inc.
VNGRS/YTU6 January, 2014
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 1 / 51
From RDBMS to NoSQL
I RDBMS have difficulty with scalability andperformance
I ... also, they have been expensive to own and operateI NoSQL systems emerged to fill the gap
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 2 / 51
Problems Typical of NoSQL
Lack of ...
I SearchI ConsistencyI Fault-Tolerance
Specifics vary between systems
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 3 / 51
Typical NoSQL Architecture
K
Consistent hashing maps each key to a server
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 4 / 51
The Search Problem
Searching for objects without the key involves manyserversEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 4 / 51
The Consistency Problem
Clients may read inconsistent data and writes may be lost
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 5 / 51
The Fault-Tolerance Problem
Many systems’ default settings consider a write completeafter writing to just one nodeEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 6 / 51
HyperDex: An OverviewI Hyperspace hashingI Value-dependent chainingI ACID Transactions
⇓I High-Performance: High throughput with low varianceI Consistent: Every GET returns the latest PUTI Available: Tolerates a threshold of failuresI Partition-Tolerant: Operates in the presence of
partitionsI Scalable: Adding resources increases performanceI Rich API: Support for complex datastructures and
searchEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 7 / 51
Introduction
Design and ImplementationHyperspace HashingValue-Dependent ChainingLinear Transactions
Evaluation
Perspective
Additional
Conclusion
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 8 / 51
Attributes map to dimensions in a multidimensionalhyperspace
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Attribute values are hashed independentlyAny hash function may be used
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Objects reside at the coordinate specified by the hashes
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil Armstrong
Lance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil Armstrong
Lance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Different objects reside at different coordinates
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
The hyperspace is divided into regions whereeach object resides in exactly one region
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Each server is responsible for a region of the hyperspace
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Each search intersects a subset of regions of thehyperspace
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
All people named Neil are mapped to the yellow plane
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
All people named Neil are mapped to the yellow plane
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
All people named Armstrong are mapped to the grayplane
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
All people named Armstrong are mapped to the grayplane
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
A more restrictive search for Neil Armstrong contactsfewer servers
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Range searches are natively supported
First Name
Phone Number
Last Name
H(“Neil”)
H(“607-555-1024”)
H(“Armstrong”)
Neil ArmstrongLance ArmstrongNeil Diamond
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 51
Space PartitioningI In a naive implementation, the hyperspace would
grow exponentially in the number of dimensionsI Space partitioning prevents exponential growth in the
number of searchable attributes
k a1 a2 a3 a4 a5 . . . aD-2aD-1 aD
k a1 a2 a3 a4 a5 . . . aD-2aD-1 aD
keysubspace
subspace 0 subspace 1 subspace S
I A search is performed in the most restrictivesubspace
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 10 / 51
Space PartitioningI In a naive implementation, the hyperspace would
grow exponentially in the number of dimensionsI Space partitioning prevents exponential growth in the
number of searchable attributes
k a1 a2 a3 a4 a5 . . . aD-2aD-1 aD
k a1 a2 a3 a4 a5 . . . aD-2aD-1 aD
keysubspace
subspace 0 subspace 1 subspace S
I A search is performed in the most restrictivesubspace
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 10 / 51
Hyperspace Hashing Implications
I searches are efficientI Hyperspace hashing is a mapping, not an index
I No per-object updates to a shared datastructureI No overhead for building and maintaining B-treesI Functionality gained solely through careful placement
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 11 / 51
Introduction
Design and ImplementationHyperspace HashingValue-Dependent ChainingLinear Transactions
Evaluation
Perspective
Additional
Conclusion
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 12 / 51
Replication
I As an object changes, so too must the set of serversholding it
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 13 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=1,B=1,C=1,D=1)
11 1
1 1 1put(k,A=0,B=0,C=1,D=1)
22
2
2
2
2put(k,A=0,B=1,C=1,D=1)
32 3
3
3
2
3
3
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=1,B=1,C=1,D=1)
11 1
1 1 1
put(k,A=0,B=0,C=1,D=1)
22
2
2
2
2put(k,A=0,B=1,C=1,D=1)
32 3
3
3
2
3
3
A put includes one node from each subspace
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=1,B=1,C=1,D=1)
11 1
1 1 1
put(k,A=0,B=0,C=1,D=1)
21
2
2
2 1
2
2
put(k,A=0,B=1,C=1,D=1)
32 3
3
3
2
3
3
The value-dependent chain includes the servers whichhold the old and new versions of the objectEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=1,B=1,C=1,D=1)
11 1
1 1 1
put(k,A=0,B=0,C=1,D=1)
22
2
2
2
2
put(k,A=0,B=1,C=1,D=1)
32 3
3
3
2
3
3
Each put removes all state from the previous put
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=1,B=1,C=1,D=1)
11 1
1 1 1put(k,A=0,B=0,C=1,D=1)
22
2
2
2
2
put(k,A=0,B=1,C=1,D=1)
32 3
3
3
2
3
3
Subsequent operations involve solely the most recentnodesEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=0,B=0,C=1,D=1)
21
2
2
2 2 1 1
2 2
2 2
Servers are replicated in each region for fault tolerance
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 15 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=0,B=0,C=1,D=1)
21
2
2
2 2 1 1
2 2
2 2
The value-dependent chain includes all replicas
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 15 / 51
Value-Dependent Chaining
k
key subspace
A
Bsubspace 1
C
Dsubspace 2
put(k,A=0,B=0,C=1,D=1)
21
2
2
2 2 1 1
2 2
2 2
Failed nodes are removed from the chain
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 15 / 51
Value-Dependent Chaining Implications
No extra mechanism is necessary to provideI AtomicityI OrderingI ReplicationI Relocation
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 16 / 51
Multikey Transactions
I Hyperspace hashing enables HD to locate dataquickly
I Value-dependent chaining enables HD to replicatedata
I And this is sufficient for many applicationsI But some apps require atomic, consistent updates to
multiple items
Options are:I Spray and prayI Use a heavyweight algorithm (e.g. Paxos) for
orderingI HyperDex Warp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 17 / 51
Warp Properties
Warp is a novel, optimistic, concurrent, distributedalgorithm for ensuring isolated updates to a data store.
I Atomic – operations on multiple keys are indivisibleI Consistent – application invariants are preservedI Isolated – one copy serializableI Durability – all transactions are propagated to f+1
replicas
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 18 / 51
Existing Systems Exhibit SpuriousCoordination
Spurious coordination is the unnecessary delay orreordering of a transaction in order to enforce an order(w.r.t. other transactions) when the transaction could havebeen applied without delay/reordering without violatingguarantees.
Examples:I Centralized sequencerI Paxos RSMI Batching
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 19 / 51
Linear Transactions
I Arrange servers into a chainI Track dependencies while traversing the chainI Prevent cycles using the dependenciesI Protocol exhibits no spurious coordination
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 20 / 51
Implementation
I Bindings for C, C++, Python, Go, Java, Ruby,Node.JS
I Open sourced under a BSD-like licenseI Active user community with many contributors
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 21 / 51
Inserting/Retrieving an Object
>>> c.put(’phonebook’, ’jsmith’,... {’first’: ’John’,... ’last’: ’Smith’,... ’phone’: 6075551024})True>>> c.get(’phonebook’, ’jsmith’){’first’: ’John’, ’last’: ’Smith’,’phone’: 6075551024}
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 22 / 51
Performing A Search
>>> [x for x in c.search(’phonebook’,... {’first’: ’John’})][{’first’: ’John’,
’last’: ’Smith’,’phone’: 6075551024, ’username’: ’jsmith’}]
>>> [x for x in c.search(’phonebook’,... {’phone’: (6070000000, 6080000000)})][{’first’: ’John’,
’last’: ’Smith’,’phone’: 6075551024, ’username’: ’jsmith’}]
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 23 / 51
Atomic Operations
>>> c.condput(’phonebook’, ’jsmith’,... {’phone’: 6075551024},... {’phone’: 6075552048})True>>> c.condput(’phonebook’, ’jsmith’,... {’phone’: 6075551024},... {’phone’: 6075552048})False
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 24 / 51
Atomic Operations
>>> c.atomic_add(’phonebook’, ’jsmith’,... {’phone’: 1})True>>> c.get(’phonebook’, ’jsmith’){’first’: ’John’,’last’: ’Smith’,’phone’: 6075552049}
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 25 / 51
Asynchronous Operations
>>> d = c.async_put(’phonebook’, ’js’,... {’first’: ’John’,... ’last’: ’Smith’,... ’phone’: 6075551024})>>> d<hyperclient.DeferredInsert object at ...>>>> do_work()>>> d.wait()True
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 26 / 51
A Space with Datastructures
>>> c.add_space(’’’... space socialnetwork... key username... attributes... first, last,... list(string) pending,... set(string) hobbies,... map(string,string) unread_messages,... map(string,int) upvotes... subspace first, last... tolerate 2 failures’’’)True
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 27 / 51
Manipulating Lists
>>> c.list_rpush(’socialnetwork’, ’js’,... {’pending’: ’bjones1’})True>>> c.get(’socialnetwork’, ’js’)[’pending’][’bjones1’]
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 28 / 51
Transactions
>>> t = c.begin_transaction()>>> amt1 = t.get(’account’, ’egs’)[’balance’]>>> amt2 = t.get(’account’, ’rob’)[’balance’]>>> amt1 -= 1000>>> amt2 += 1000>>> t.put(’account’, ’egs’, {’balance’: amt1})>>> t.put(’account’, ’rob’, {’balance’: amt2})>>> t.commit()
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 29 / 51
Introduction
Design and Implementation
Evaluation
Perspective
Additional
Conclusion
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 30 / 51
Experimental Setup
I Use the Yahoo! Cloud Serving BenchmarkI Each operation writes to two replicas of the data
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 31 / 51
YCSB Throughput
0
50
100
150
200
250
MongoDB HyperDex
Thro
ughp
ut(th
ousa
ndop
/s)
Bulk LoadingWorkload AWorkload BWorkload CWorkload F
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 32 / 51
95% get / 5% put Latency
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
YCSB Workload B
MongoDB (R)MongoDB (U)HyperDex (R)HyperDex (U)
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 33 / 51
100% put Latency
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
YCSB Load Dataset
MongoDBHyperDex
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 34 / 51
Chain Length vs. Put Latency
0
1
2
3
4
5
6
7
2 4 6 8 10 12 14 16 18 20 22
Late
ncy
(ms)
Chain Length (nodes)
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 35 / 51
Scalability
0
1
2
3
4
4 8 12 16 20 24 28 32
Thro
ughp
ut(m
illio
nop
s/s)
Nodes
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 36 / 51
TPC-C BenchmarkI TPC-C for key-value storesI Each row is a separate key/value pairI Key-operations onlyI Transactions: New-Order, Order-Status, Payment,
Stock-LevelI No time wait between transactionsI Each transaction touches multiple objects:
Profile R W RMW % MixNew Order 12 3 11 (1) 45Payment 0 1 3 (2) 45Order Status 12 0 0 5Stock Level 201 (1) 0 0 5
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 37 / 51
TPC-C Throughput
0
50
100
150
200
HyperDex 2PCDex Warp
Thro
ughp
ut(th
ousa
ndop
s/s)
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 38 / 51
TPC-C New Order Transactions
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
HyperDex2PCDexWarp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 39 / 51
TPC-C Order Status Transactions
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
HyperDex2PCDexWarp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 40 / 51
Performance Summary
I Outperforms other systems by 2–4× for get/putI While offering stronger consistency and fault
toleranceI Outperforms 2PC/Sinfonia by 3.3× for xactions
I Within 96% of non-transactional loadsI Latency for chain-operations is predictableI Scales as resources are added
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 41 / 51
The CAP Theorem
I Consistency, Availability, Partition Tolerance: Pick atmost two
I Problem 1. CAP does not type-check.
I Consistency: property of a systemI Availability: property of a systemI Partitions: property of the failure model
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 42 / 51
The CAP Theorem
I Consistency, Availability, Partition Tolerance: Pick atmost two
I Problem 1. CAP does not type-check.I Consistency: property of a system
I Availability: property of a systemI Partitions: property of the failure model
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 42 / 51
The CAP Theorem
I Consistency, Availability, Partition Tolerance: Pick atmost two
I Problem 1. CAP does not type-check.I Consistency: property of a systemI Availability: property of a system
I Partitions: property of the failure model
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 42 / 51
The CAP Theorem
I Consistency, Availability, Partition Tolerance: Pick atmost two
I Problem 1. CAP does not type-check.I Consistency: property of a systemI Availability: property of a systemI Partitions: property of the failure model
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 42 / 51
The CAP Theorem
I CAP Proof SketchI Imagine two hosts, with a single network link.I If partitioned and forced to provide answers (achieve
A), the hosts cannot be consistent (fail at C)I If partitioned and forced to remain consistent (achieve
C), the hosts cannot respond (fail at A)I Problem 2. CAP defintions are not what you want.
I Availability is defined as "every host must respond toevery request it receives."
I System designers care about A at the systemboundary
I CAP defines A per-server
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 43 / 51
The CAP Theorem
I Problem 3. The failure model is all too powerfulI If failures are unconstrained, anything can happenI Scant little is possible when everything can failI The Not-A Theorem: Availability, however you define
it, is not achievable.I Proof Sketch: Imagine a distributed system. If all the
servers are down due to host or network failures, itcannot be available.
I The Not-A Theorem is a dumb tautology, bereft ofinsight or implications
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 44 / 51
The CAP Theorem
I Problem 4. CAP is correct, but has noconsequences.
I It is simply a contradiction of definitions.I Implies nothing about system design.I If a system gives up on C or A (or P!) because of
CAP, it’s doing something wrong.
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 45 / 51
The CAP Theorem
I What CAP really says:I If you cannot limit the number of faultsI and requests can be directed to any serverI and you insist on serving every requestI then you cannot possibly be consistent
I Most NoSQL systems are proud to preemptively giveup desirable properties like consistency in the nameof CAP — even in the case of no failures
I HyperDex allows for f failures without sacrificingconsistency or availability
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 46 / 51
The CAP Theorem
I What of CAP?I There is a 3-way tradeoff involving C, A, and P!I Consistency, Availability and PerformanceI One may sacrifice C or A to gain P. This tradeoff was
behind the success of broken-by-design systemssuch as MongoDB.
I HyperDex achieves C and A, and the P you actuallywant, in the presence of the original P.
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 47 / 51
Backups
I Everyone else implements inconsistent backupsI Inconsistent backups are only useful if you stop the
world and snapshotI HyperDex takes a consistent snapshotI Approximately 200-300ms for a snapshotI Copying takes place in the background
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 48 / 51
Security and Macaroons
I Typically, all items in a database are under the sameACL
I Macaroons were invented at Google to protect userdata from front-end breaches
I Like cookies, but much more powerfulI Every item can be protected individually
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 49 / 51
Deployment
I HyperDex automatically configures itself.I Supported on Linux and OS X.I Works well within Docker.I Install and deploy an auto-scaling cluster in 30
seconds.
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 50 / 51
Conclusion
I HyperDex is a next generation NoSQL systemI NoSQL advantages can be combined with
consistency and fault tolerance guaranteesI http://hyperdex.org/
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 51 / 51
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 1 / 17
YCSB Benchmark WorkloadsName Workload Key Choice Application
A 50% R Zipf Session Store50% U
B 95% R Zipf Photo Tagging5% U
C 100% R Zipf Profile CacheD 95% R Temporal Status Updates
5% IE 95% S Zipf Threads
5% IF 50% R Zipf User Database
50% R-M-U
R = Read, U = Update, I = Insert, S = Scan/Search
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 2 / 17
Hash Functions and Load Balancing
I Out of the box, HyperDex supports hashing stringsand integers
I What about non-uniform inputs?I Select a better hash functionI Use forwarding pointersI Create multiple dimensions in the hyperspace for a
single attributeI The default hash functions work well for workloads
that we’ve seen in practice
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 3 / 17
Experimental Setup
Lab ClusterI 14 MachinesI Intel Xeon 2.5 GHz E5420 ×
2I 16 GB RAMI 500 GB SATA HDDI Debian 6.0I Linux 2.6.32
VICCI ClusterI 70 MachinesI Intel Xeon 2.66 GHz X5650 × 2I 48 GB RAMI 1 TB SATA HDD × 3I Virtualized Fedora 12I Linux 2.6.32
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 4 / 17
Experimental Setup
I Use the Yahoo! Cloud Serving BenchmarkI Each system makes two replicas of the dataI MongoDB: Writes to the client’s outgoing socket
bufferI Cassandra: Writes to one storage node’s filesystemI HyperDex: Writes to both replicas in three
subspaces
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 5 / 17
YCSB Throughput
0
5
10
15
20
25
30
35
40
CassandraMongoDB HyperDex
Thro
ughp
ut(th
ousa
ndop
/s)
Workload AWorkload BWorkload CWorkload DWorkload FWorkload E
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 6 / 17
95% get / 5% put Latency
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
YCSB Workload B
Cassandra (R)Cassandra (U)MongoDB (R)MongoDB (U)HyperDex (R)HyperDex (U)
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 7 / 17
100% put Latency
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
YCSB Load Dataset
CassandraMongoDBHyperDex
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 8 / 17
search Latency
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
YCSB Workload E
CassandraMongoDBHyperDex
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 9 / 17
Cluster Size
I Netflix: App-specific clusters of 6-48 Cassandrainstances
I Google BigTable:I 66% of clusters < 20 tablet serversI 84% of clusters < 100 tablet serversI 96% of clusters < 500 tablet servers
I Justin Sheehy, Basho Inc.:I Typical cluster is 6-12 Riak nodesI Largest clusters < 100 Riak nodes
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 10 / 17
Related WorkI Multi-dimensional database systems on a single host
I Grid File, KD-Tree, Multi-dimensional BST, Quad-Tree,R-Tree, Universal B-Tree
I Distributed database systems maintain distributed indicesI Distributed B-Tree, P-Tree, Sinfonia
I Peer-to-peer systems are only eventually consistentI Arpeggio, CAN, Chord, Consistent Hashing, Mercury,
MURK, Pastry, SkipIndex, SWAM-V, TapestryI Space-filling curves suffer from the curse of dimensionality
I MAAN, SCRAP, Squid, ZNetI NoSQL systems/key-value stores give up search, consistency or
fault-toleranceI CouchDB, MongoDB, Neo4j, PNUTS, Redis, TXCache,
BigTable, Cassandra, COPS, Distributed Data Structures,Dynamo, Fawn KV, HBase, HyperTable, LazyBase,Masstree, Memcached, RAMCloud, Riak, SILT, Spanner,Spinnaker, TSSL, Voldemort
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 11 / 17
Linear Transactions
G-I
D-F
A-CY-Z
V-X
S-U
P-RM-O
J-L
T1
T2
T3
Servers are arranged into a chain; dependencies aretracked at each point where two chains overlapEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 12 / 17
Linear Transactions
G-I
D-F
A-CY-Z
V-X
S-U
P-RM-O
J-L
T1
T2T3
Servers are arranged into a chain; dependencies aretracked at each point where two chains overlapEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 12 / 17
Dependency Management PreventsCycles
s2
s1
s0s8
s7
s6
s5s4
s3
T1
T2
T3
s0 s1 s2
deps = [T2,T3]
deps = [T1,T2]
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 13 / 17
Dependency Management PreventsCycles
s2
s1
s0s8
s7
s6
s5s4
s3
T1
T2
T3
s0 s1 s2deps = [T2,T3]
deps = [T1,T2]
s0 abides by T3 T1 because of the dependencies withinT1Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 13 / 17
Dependency Management PreventsCycles
s2
s1
s0s8
s7
s6
s5s4
s3
T1
T2
T3
s0 s1 s2
deps = [T2,T3]
deps = [T1,T2]
s0 abides by T1 T3 because of the dependencies withinT3Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 13 / 17
Dependency Management PreventsCycles
s2
s1
s0s8
s7
s6
s5s4
s3
T1
T2
T3
s0 s1 s2
deps = [T2,T3]
deps = [T1,T2]
s0 is free to commit T1 and T3 in natural arrival orderEmin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 13 / 17
TPC-C Payment Transactions
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
HyperDex2PCDexWarp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 14 / 17
TPC-C Stock Level Transactions
0
20
40
60
80
100
1 10 100 1000
CD
F(%
)
Latency (ms)
HyperDex2PCDexWarp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 15 / 17
TPC-C Retries
0
20
40
60
80
100
0 1 2 3 4 5
CD
F(%
)
Retries
2PCDexWarp
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 16 / 17
The CAP Theorem
I Fallacy. The CAP Theorem is not equivalent to theFLP Result.
I The FLP Result is a deep result that provides insightsinto the nature and complexity of consensusalgorithms.
I CAP is a simple tautology, like the Not-A Theorem.I CAP and FLP do not reduce to one another.
Emin Gün Sirer Cornell University / HyperDex, Inc. http://hyperdex.org/ 17 / 17