Post on 25-May-2015
The Power of Determinism in Database Systems
Daniel J. Abadi
Yale University
(Joint work with Jose Faleiro, Kun Ren, and Alex Thomson)
Database Systems Are Great
• Protects a dataset from corruption or deletion in the face of media, system, or program crashes
• Allows programs to change state of data in arbitrary ways
• Allows 1000s of such programs to run concurrently
  – Guarantees atomicity and isolation of such programs
• Has served as blueprint for many concurrent, highly complex systems
But …
• Design is incredibly complex
  – Takes $17 million to build a new one
• Components are horribly monolithic
• Corner case bugs nearly impossible to reproduce
• Does not scale horizontally
• Does not scale horizontally (seriously)
Should the DBMS architecture really be a blueprint for concurrent system design?
Nondeterminism is the problem
• Building on top of:
  – OSes that enable threads to be scheduled arbitrarily
  – Networks that deliver messages with arbitrary delays (and sometimes in arbitrary orders)
  – Hardware that can fail arbitrarily
• Only natural to allow the state of the database to be dependent on these nondeterministic events
Nondeterminism is the problem
• OS non-deterministic thread scheduling leads to:
  – Arbitrary transaction interleaving
  – Deadlocks
  – Difficult-to-reproduce bugs
  – Tight interactions between lock manager, recovery manager, access manager, and transaction manager
• Hardware failures and message delivery delays result in transaction aborts
  – Need complicated recovery manager to handle half-completed transactions
  – Need commit protocol for distributed transactions
How to eliminate nondeterminism?
• There exist proposals for:
  – Deterministic operating systems
  – (Somewhat) deterministic networking layers
  – Highly redundant and reliable hardware
• Maybe one day those proposals will come with fewer disadvantages
• In the meantime, we have to create determinism from nondeterministic components
  – Pick and choose what we make deterministic
Possible determinism levels
• Given an input and initial state of the database system, to get to one and only one possible final state:
  – Level 1: System always runs the same sequence of instructions
  – Level 2: System always proceeds through the same sequence of states of the database
  – Level 3: Database is allowed to proceed through states in any order as long as the final state of all external and internal data structures is determined by the input
  – Level 4: Database is allowed to proceed through states in any order as long as the final state of all external structures is determined by the input
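As a tiny illustration of the weaker levels, the sketch below (hypothetical code, not from the talk) shows two executions that pass through different intermediate states yet reach the same final state, which is all that levels 3 and 4 require:

```python
def execute(schedule):
    """Apply a sequence of single-key writes in the given order."""
    state = {}
    for key, val in schedule:
        state[key] = val  # intermediate states differ across schedules
    return state

# Same transactions, different interleavings: level 2 (identical state
# sequence) is violated, but the final state is still determined by the
# input, satisfying levels 3 and 4.
s1 = execute([("a", 1), ("b", 2)])
s2 = execute([("b", 2), ("a", 1)])
assert s1 == s2 == {"a": 1, "b": 2}
```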
Database Systems Problems
• Design is incredibly complex
  – Takes $17 million to build a new one
• Components are horribly monolithic
• Corner case bugs nearly impossible to reproduce
• Does not scale horizontally
• Does not scale horizontally
LEVEL 4 DETERMINISM HELPS WITH ALL OF THESE
Recovery
• Brain-dead version:
  – Log all input to the system
  – Upon a failure, trash the entire database, replay input log from the beginning
• Less brain-dead version:
  – Create checkpoints of database state as of some point in the input log
  – Upon a failure, trash the entire database, load checkpoint, replay input log from point where checkpoint was taken
• Note that logging can happen entirely externally to the DBMS
• Same is true for checkpointing, although may want to perform it inside the DBMS for performance
  – Even in this case, it needs very little knowledge about other components
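The two replay schemes above can be sketched in a few lines; `apply_txn` and the dict-based state are hypothetical stand-ins for the real storage engine:

```python
def recover(input_log, apply_txn, checkpoint=None):
    """Deterministic recovery by log replay.

    checkpoint: optional (state, log_position) pair; None means the
    'brain-dead' variant that trashes everything and replays from
    the beginning of the input log.
    apply_txn: a deterministic function (state, txn) -> state, so
    replaying the same log always yields the same final state.
    """
    if checkpoint is None:
        state, start = {}, 0           # replay the whole log
    else:
        state, start = checkpoint      # load checkpoint, resume mid-log
    for txn in input_log[start:]:      # replay in original log order
        state = apply_txn(state, txn)
    return state
```

As the slides note, capturing the input log can happen entirely outside the DBMS; only `apply_txn` needs to touch database internals.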
Replication
• Send the same input log to replica DBMS
  – User-visible state in replicas will not diverge
  – Can happen entirely externally to the DBMS
Horizontal Scalability
• Active distributed xacts not aborted upon node failure
  – Greatly reduces (or eliminates) cost of distributed commit
    • Don't have to worry about nodes failing during commit protocol
    • Don't have to worry about effects of the transaction making it to disk before promising to commit the transaction
• Just need one message from any node that can potentially deterministically abort the xact
  – This message can be sent in the middle of the xact, as soon as the node knows it will commit
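A minimal sketch of that commit rule (hypothetical names; the real protocol is more involved): each node whose logic could deterministically abort the transaction sends exactly one vote, as soon as it knows, and the transaction commits iff nobody aborts:

```python
def commit_decision(txn, abort_checks):
    """Single-round commit check, in contrast to two-phase commit.

    abort_checks: node_id -> deterministic predicate over the txn
    input (e.g. a constraint check). Because each predicate depends
    only on the txn input and deterministic state, a node failure
    cannot change its vote: after recovery, replaying the same input
    reproduces the same decision, so no prepare/ack round is needed.
    """
    votes = {node: check(txn) for node, check in abort_checks.items()}
    return all(votes.values())

# Example: a transfer commits only if the debited account stays >= 0.
checks = {
    "node1": lambda t: t["src_balance"] + t["delta"] >= 0,
    "node2": lambda t: True,  # node2 has no abort logic for this txn
}
```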
One Way to Implement Determinism
• Use a preprocessor to handle client communications, and create a log of submitted xacts
• Send log in batches to DBMS
• Every xact immediately requests all locks it will need (in order of log)
• If it doesn't know what it will need:
  – Run enough of the xact to find out, but do not change the database state
  – Reissue xact to the preprocessor with lock requirements included as a parameter
  – Run enough of the new xact to find out if it locked the correct items (database state might have changed in the meantime)
    • If so, then xact can proceed as normal
    • If not, reissue again to the preprocessor and repeat as necessary
• Trivial to prove this is deterministic and deadlock-free
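The in-order lock acquisition above is easy to sketch. This `OrderedLockManager` is a hypothetical simplification (single lock mode, no reissue logic), but it shows why deadlock is impossible: every waits-for edge points from a later log position to an earlier one, so no cycle can form:

```python
from collections import defaultdict, deque

class OrderedLockManager:
    """Grant each transaction's full lock set in log order."""

    def __init__(self):
        # key -> txn ids waiting for that key, in log order
        self.queues = defaultdict(deque)

    def request_all(self, txn_id, keys):
        # Called strictly in log order, before any later txn requests
        # anything, so every queue's order matches log order.
        for k in keys:
            self.queues[k].append(txn_id)

    def can_run(self, txn_id, keys):
        # A txn may execute once it heads every queue it needs.
        return all(self.queues[k][0] == txn_id for k in keys)

    def release_all(self, txn_id, keys):
        for k in keys:
            assert self.queues[k][0] == txn_id
            self.queues[k].popleft()
```

With transactions T1 (locks a, b) and T2 (locks b, c) logged in that order, T2 simply waits on b until T1 releases; it can never hold something T1 needs, so the schedule is deadlock-free.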
What’s the Downside?
• Increased latency to log input transactions and send to the DBMS in batches
• No flexibility for the system to abort transactions on a whim
• Can’t reorder transaction execution if one xact stalls mid-transaction
• Need to determine what will be locked in advance
Additional Upside
• Our implementation eliminates deadlocks
  – Distributed deadlock is a major problem for distributed DBMSs
• Lock manager totally separate from the rest of DBMS
  – Increases modularity of the system
Experimental Evaluation
• Experiments conducted on Amazon EC2 using m3.2xlarge (Double Extra Large) instances
• Cluster of 8 nodes
• TPC-C
• Microbenchmark:
  – 10 read-modify-write (RMW) actions
  – 10 RMW actions + CPU computation
TPC-C
Microbenchmark Experiments (Long xacts)
[Figure: transactions per second per node vs. % distributed transactions. Series: Deterministic and Nondeterministic (with and without 2PC), each at high and low contention.]
Microbenchmark Experiments (Short xacts)
[Figure: transactions per second per node vs. % distributed transactions. Series: Deterministic, Nondeterministic, and Deterministic w/ VLL, each at high and low contention.]
Resource Constraints Experiments
Dependent Transactions Experiments
(a) 0% distributed transactions; (b) 100% distributed transactions
Latency CDF
More information
• The Case for Determinism in Database Systems. Alexander Thomson and Daniel J. Abadi. In PVLDB, 3(1), 2010.
• Calvin: Fast Distributed Transactions for Partitioned Database Systems. Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. In Proceedings of SIGMOD, 2012.
• An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems. Kun Ren, Alexander Thomson, and Daniel J. Abadi. In PVLDB, 7(10), 2014.
• Modularity and Scalability in Calvin. Alexander Thomson and Daniel J. Abadi. In IEEE Data Eng. Bull., 36(2): 48-55, 2013.
• Lightweight Locking for Main Memory Database Systems. Kun Ren, Alexander Thomson, and Daniel J. Abadi. In PVLDB, 6(2): 145-156, 2012.
Conclusions
• Determinism not a good fit for latency-sensitive applications
• Fewer options to deal with node overload (true only for lock-based implementation)
• Much improved throughput for distributed transactions
• Much simpler design: recovery manager and lock manager totally separate from rest of DBMS
• Replication is trivial