Distributed Data Management - - TU...

76
Distributed Data Management Summer Semester 2015 TU Kaiserslautern Prof. Dr.-Ing. Sebastian Michel Databases and Information Systems Group (AG DBIS) http://dbis.informatik.uni-kl.de/ Distributed Data Management, SoSe 2015, S. Michel 1

Transcript of Distributed Data Management - - TU...

Page 1: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data ManagementSummer Semester 2015

TU Kaiserslautern

Prof. Dr.-Ing. Sebastian Michel

Databases and Information Systems Group (AG DBIS)

http://dbis.informatik.uni-kl.de/

Distributed Data Management, SoSe 2015, S. Michel 1

Page 2: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Logical Ordering of Events in Distributed System

• Last lecture we looked at consensus algorithm(s) to agree on execution order of requests for State Machine Replication.

• Now: logical (partial) ordering of events

• A kind of “timestamp” attached to records that indicate conflict or “descendant” relation

Distributed Data Management, SoSe 2015, S. Michel 2

Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21 (7): 558–565.

Page 3: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Ordering of Events

• Consider events of a specific process.

• We assume the events form a sequence, so it is know which one appears before another.

• If event a occurs before b in this sequence, then a “happens before” b.

Distributed Data Management, SoSe 2015, S. Michel 3

time

a b c d

• How does this translate to multiple processes?

Page 4: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

The “Happens Before” Relation

• The relation “” is the smallest relation such that:

– (1) If a and b are events in the same process, and a comes before b, then ab

– (2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then ab

– (3) If ab and bc then ac.

• Two distinct events are called concurrent if and

Distributed Data Management, SoSe 2015, S. Michel 4

baab

Page 5: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Illustration

Distributed Data Management, SoSe 2015, S. Michel 5

ab if one can get from a to b by following processes or messagesFor instance, p1r4

Image source: L. Lamport.

Page 6: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Causally Affecting

• Another way of viewing the definition of is to say that ab means that it is possible for event a to affect event b.

• Two events are concurrent if neither can causally affect the other. For instance, p3 and q3

are concurrent (previous Figure).

Distributed Data Management, SoSe 2015, S. Michel 6

P cannot know what process Q didat q3 until it receives the message at p4

Page 7: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Lamport Timestamps (Logical Clocks)

• Each node keeps integer count.

• Incremented at each atomic event.

Distributed Data Management, SoSe 2015, S. Michel 7

time

time

1 2 3 4

1 2 3 4 5 6 7 8

Page 8: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Lamport Timestamps (2)

• At communication: receiver adapts clock to received value + 1 (if not larger than value already)

Distributed Data Management, SoSe 2015, S. Michel 8

time

time

1 2 3 4 9

1 2 3 4 5 6 7 8 9 10 11

10

12

Process 1

Process 2

Page 9: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Lamport Timestamps: Properties

• If event x has happened before y

– either x and y on same process

– or x is sending and y receiving event on diff. process

then C(x) < C(y), i.e., x y => C(x) < C(y)

This is called Clock Condition in Lamport’s paper.

• But not: C(x) < C(y) => x -> y

• Why?

Distributed Data Management, SoSe 2015, S. Michel 9

(we will see later, that vector clocks can say this!)

Page 10: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Total Order

• The relation defines only a partial order.

• How can be break ties of “concurrent” events?

• Assume we have an (arbitrary) total order of the processes.

• Define relation as follows:

– If a is an event in process Pi and b is event in Pj, then a b if and only if either

– (i) Ci(a) < Cj(b) or

– (ii) Ci(a) = Cj(b) and Pi Pj

Distributed Data Management, SoSe 2015, S. Michel 10

Page 11: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Replicated State Machines with Logical Clocks

• Can build a replicated state machine using logical clocks.

• See tutorial by F. Scheider* (Section 3)

• Idea: Use total order described before as order the state machines use

• Needs “stability”: Process next stable request with lowest identifier

• Stable means: No other request with lower timestamp can arrive after the request.

Distributed Data Management, SoSe 2015, S. Michel 11

*Fred B. Schneider: Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv. 22(4): 299-319 (1990) http://www.cs.cornell.edu/fbs/publications/smsurvey.pdf

Page 12: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Recap: Concurrency Control in DBMS

• Addresses the problem of dealing with concurrent reads/writes to same data

• Transactions encapsulate batch of commands

• http://www.youtube.com/watch?v=G3xH2SoMOF0

• Traditional RDBMS: ACID– Isolation: transaction (TA) works with the DB as if it is

the only TA

– Atomicity: transaction is executed entirely or no changes are made at all

– …Distributed Data Management, SoSe 2015, S. Michel 12

Page 13: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Recap: Locking: Two Phase Locking

• Recall: Traditional way to ensure multi user concurrency control: Locking

– (Strict) 2PL (Two Phase Locking)

– Transaction claims required locks

– Starts releasing locks at some point or all at once (strict)

• Distributed: Distributed 2PL, 2 Phase Commit (and variants)

• Or: PAXOS commit protocol (for higher reliability)

Distributed Data Management, SoSe 2015, S. Michel 13#l

ock

s

time

Page 14: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Execution Monitor

Distributed Data Management, SoSe 2015, S. Michel 14

Transaction Manager

(TM)

Scheduler(SC)

Begin_transaction,Read, Write, Commit, Abort Results

Scheduling/DeschedulingRequests

To Data Processors (DPs)

With other SCs

With Data Processors (DPs)

With other TMs

Page 15: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Synchronization based on Timestamp Ordering (T/O)

• Transaction (TA) gets timestamp assigned• Timestamps impose ordering of transactions.• I.e. one specific order of TA is chosen (unlike with

locking!)

• Conflicts are resolved by resetting TAs (which gets a new timestamp). There are no deadlocks.

• Creates conflict serializable schedules.

• Record for each record (item) the largest timestamp of any read or write operation.

Distributed Data Management, SoSe 2015, S. Michel 15

Page 16: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

(Pessimistic) Timestamp Ordering

• With each database item two timestamp (TS) values are associated

• read_TS(X): The read timestamp of item X; i.e., largest TS of all transactions that have successfully read item X

• write_TS(X): The write timestamp of item X; i.e, largest of all TS of transactions that have successfully written item X

Distributed Data Management, SoSe 2015, S. Michel 16

Page 17: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Conflicts in (Pessimistic) T/O

• R/W: If TA T with timestamp TS(T) wants to read X– reject if TS(T) < write_TS(X)

– otherwise: read and set read-timestamp of record x to max(TS, read-timestamp of x)

• R/W: If TA T with timestamp TS(T) wants to write X– reject if TS(T) < read_TS(X)

– otherwise: write and set write-timestamp of x to max(TS, write-timestamp of x)

• W/W: If a TA T with timestamp TS(T) wants to write X– reject if TS < write_TS(X)

– otherwise: write and set write-timestamp to TS

Distributed Data Management, SoSe 2015, S. Michel 17

Page 18: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Thomas’s Write Rule

• Modification of basic T/O algorithm

• Rejects fewer writes. No conflict serializability.

• If read_TS(X) > TS(T) then abort / reject

• If write_TS(X) > TS(T) then do not execute operation, but continue.

• If neither the condition (1) or (2) occurs, execute write_item(X) and set write_TS(X) to TS(T).

• Also called: “Last Write Wins”

Distributed Data Management, SoSe 2015, S. Michel 18

Page 19: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Multiversion Concurrency Control

• No overwriting of old values:

Each write operation creates a new version

• Transactions have timestamp of begin

• Transaction that wants to read can immediately do that (no locking).

Distributed Data Management, SoSe 2015, S. Michel 19

Page 20: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Multiversion Concurrency Control

• Read vs. Write Operations

• Read is never rejected; reads record version that is largest but < TS !

Distributed Data Management, SoSe 2015, S. Michel 20

Alice

Bob t0

v0 v1 v0 v1

t1

read v0

read v0 read v0read v1

write v0 -> v1

vlatest = v0

vlatest = v1

Page 21: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Multiversion Concurrency Control (2)

• Conflict detection

Distributed Data Management, SoSe 2015, S. Michel 21

Alice

Bob t0

v0 v1

t2

read v0

read v0

write v0 -> v1a

write v0 -> v1b

v1

t3t1

vlatest = v0

vlatest = v1a

Conflict!vlatest != v0

Page 22: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Implementation

• Basic implementation

• 2PL can be handled in one central instance, called centralized 2PL (C2PL), or as distributed 2PL (D2PL) with lock managers on each site

• Requires distributed deadlock management

• Timestamp Ordering rather straightforward to realize

• 2PC etc. for agreeing on commit (before changes are made permanent)

Distributed Data Management, SoSe 2015, S. Michel 22

Page 23: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Discussion• Strong consistency as we know it from

traditional database systems is achievable also in distributed fault-tolerant systems

• But it comes with a price to pay

• Synchronization is costly, thus, increases response times

Distributed Data Management, SoSe 2015, S. Michel 23

Literature for the aforementioned techniques:-standard DB books like the one by Ramakrishnan and Gehrke (Database Management Systems) or the one by Elmasri and Navathe (Fundamentals of Database Systems).-distributed case specifically, book by Özsu and Valduriez (Principles of Distributed Database Systems)

Page 24: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Going Distributed: Some Thoughts• What happens if network gets partitioned, e.g.,

falls into two disconnected sub networks?

Distributed Data Management, SoSe 2015, S. Michel 24

Partition 1 Partition 2

• What about clients that want to read/write data?

• Do we allow that (they pick one partition) or do we have to wait (how long?) for the problem to be fixed?

clients

Page 25: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

High Availability Aim

• Availability/Response Time (Latency)

• 100ms additional latency in Amazon results in 1% drop of sales*

• 500ms additional latency in Google causes 20% decrease in traffic*

Distributed Data Management, SoSe 2015, S. Michel 25

*) source: see references within Bailis et al.: Probabilistically Bounded Staleness for Practical Partial Quorums. PVLDB 5(8): 776-787 (2012)

Page 26: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

CAP Theorem (Brewer's Theorem) (Cont’d)• System cannot provide all 3 properties at the

same time:

– C onsistency

– A vailability

– P artition Tolerance

Distributed Data Management, SoSe 2015, S. Michel 26

C A

P

C+P A+P

• After Eric Brewer, as a conjecture presented at PODC 2000, later formally proved by Seth Gilbert and Nancy Lynch

S. Gilbert, N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33 Issue 2 (2002).

Page 27: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

CAP Theorem (Cont’d)

• Consistency: All nodes/clients see same data at same times.

• Availability: System is responding to requests

• Partition Tolerance: In a distributed system (scale-out) data is put on various nodes, network problems can create partitions of nodes. System has to be able to live with this.

Distributed Data Management, SoSe 2015, S. Michel 27

Page 28: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

In Different (Older) Words ….

• "Partitioning - When communication failures break all connections between two or more active segments of the network ... each isolated segment will continue … processing updates, but there is no way for the separate pieces to coordinate their activities. Hence … the database … will become inconsistent. This divergence is unavoidable if the segments are permitted to continue general updating operations and in many situations it is essential that these updates proceed.“

• [Rothnie & Goodman, VLDB 1977]

Distributed Data Management, SoSe 2015, S. Michel 28

From SIGMOD ’13 tutorial by P. Bernstein and S. Das: http://research.microsoft.com/en-us/projects/rethinkingconsistency/rethinkingconsistencysigmod2013.pdf

Page 29: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

CAP Theorem: Proof Idea

• Consider a system with multiple partitions

• Failure prevents sync between node1 and node2.

• Now?

– Prohibit reads until synced? (system not available)

– Or let clients read (system not consistent)

Distributed Data Management, SoSe 2015, S. Michel 29

node1 node2

D1 D0sync?

Page 30: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Consistent + Available

• No support of multiple partitions.

• Strong (ACID) consistency enforced.

• Example: Single-site database

Distributed Data Management, SoSe 2015, S. Michel 30

C A

Page 31: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Partition Tolerant + Available

• Example: Domain Name Service (DNS)

Distributed Data Management, SoSe 2015, S. Michel 31

A

P

A+P

Page 32: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Consistent + Partition Tolerant

• Example: Distributed Databases with distributed concurrency control

Distributed Data Management, SoSe 2015, S. Michel 32

C

P

C+P

Page 33: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Can’t do without “P”

• Large data => Scale out architecture => Partition Tolerance is a strict requirement

• Leaves: Trading off consistency

and availability

Distributed Data Management, SoSe 2015, S. Michel 33

C A

P

C+P A+P

Page 34: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

But: Is Weaker Consistency Acceptable?• Weaker “isolation” produces errors. Why is this

ok?

• Some guesses:– Anyway, DBs are inconsistent for many other reasons:

• Bad data entry, bugs, duplicate requests, disk errors, ….

– Maybe errors due to weaker isolation levels are infrequent

– When DB consistency matters a lot, there are external controls:

• People look closely at their paychecks

• Financial information is audited

• Retailers take inventory periodically

Distributed Data Management, SoSe 2015, S. Michel 34

http://research.microsoft.com/en-us/projects/rethinkingconsistency/rethinkingconsistencysigmod2013.pdf

Page 35: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Best effort: BASE

• Basically Available

• Soft State

• Eventual Consistency

Distributed Data Management, SoSe 2015, S. Michel 35

- http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

- W. Vogels. Eventually Consistent. ACM Queue vol. 6, no. 6, December 2008.

Page 36: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Base (Cont’d)

"the term "BASE" was first presented in the 1997 SOSP article that you cite. I came up with acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is "ACID" – much more than people realize, so we figured it was good enough. Jim Gray and I discussed these acronyms and he readily admitted that ACID was a stretch too – the A and D have high overlap and the C is ill-defined at best. But the pair connotes the idea of a spectrum, which is one of the points of the PODC lecture as you correctly point out."

Distributed Data Management, SoSe 2015, S. Michel 36

Anecdote: According to http://www.julianbrowne.com/article/viewer/brewers-cap-theorem Prof. Brewer said in personal communication:

• What Basically Available means is pretty clear• Eventual Consistency means data can be inconsistent but will

become eventually consistent (after some time).• Soft State means that data is not necessarily persistently

(durably) stored but needs to be maintained by users. Think: Time to Live (TTL)

Page 37: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Wide Consistency Spectrum

• Extreme points:

– Strong Consistency

– Eventual Consistency

• And various intermediate “levels” of consistency

• We will first have a look at eventual consistency

Distributed Data Management, SoSe 2015, S. Michel 37

Page 38: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Idea

• Trade off consistency and availability

• Read/Write Pattern:

– Guarantee write successful to subset of machines

– Read data from subsets of machines

• Maintain multiple versions per data item

• Resolve conflicts based on version mismatch once it occurs (be optimistic)

Distributed Data Management, SoSe 2015, S. Michel 38

Page 39: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

(Logical) Anatomy of a Write

Distributed Data Management, SoSe 2015, S. Michel 39

node1 node3node2

Client

Coordinator

write request with W=2

write requeststo all

OK, if received 2 ACKs

Page 40: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

(Logical) Anatomy of a Write (Cont’d)• Client sends write request

• Assume there are N copies (replicas) of data record/tuple/item

• Server does not wait until all (of the N) nodes (that keep the replicas) ack’ed the write but returns already after W acks. So, we (as a client) know, there are for sure W successful writes.

• A client’s read request is returned after reading from R out of the N nodes.

• Can lead to multiple conflicting versions of a data item, depending on configuration.

• Needs to detected and resolved

Distributed Data Management, SoSe 2015, S. Michel 40

Page 41: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

W < N

Distributed Data Management, SoSe 2015, S. Michel 41

node1 node2

D1 D1

wri

te

wri

tenode3 node4

D0 D1

wri

te

node1 node2

D1 D1node3 node4

D1 D1

time it takes for write (or sync to arrive at node3)

Page 42: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Eventual Consistency

• After a write, data can be at some nodes/machines inconsistent

• But will eventually(!) become consistent again

• By (background) sync protocol

Distributed Data Management, SoSe 2015, S. Michel 42

Client gets ACKafter write

All nodes havesame values for data

Time range of possibly inconsistent data

http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

Page 43: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Consistency by Write to All

• Write to all: W=N (i.e., wait until all writes are acknowledged)

• Read from one: R=1

• Guarantee to see at least one recent version

Distributed Data Management, SoSe 2015, S. Michel 43

node1 node2

D0 D0

wri

te

wri

te

node3 node4

D0 D0

wri

te

wri

te

read

Page 44: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Consistency by Read From All

• Wait for one write to be ack’ed: W=1

• Read from all: R=N

• Guarantee to see the last version

Distributed Data Management, SoSe 2015, S. Michel 44

node1 node2

D1 D0

wri

te

node3 node4

D0 D0

read

read

read

read

Page 45: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

WARS Model (of Amazon Dynamo)

Distributed Data Management, SoSe 2015, S. Michel 45

Coordinator Replica(s)send to N replicas

WRITE(W)

ACK(A)

READ(R)

RESPONSE(S)

wait for Wresponses

send to N replicas

time

wait for Rresponses

Page 46: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Quorums and Configurations• In general: Methodology gives space of

solutions.– Number of servers/replicas: N

– Simultaneous writes: W

– Simultaneous reads: R

• Problematic if sets of written-to and read-from servers do not overlap: – Partial quorum if W+R ≤ N

• What if W < (N+1)/2 ? Write sets do not overlap. What does this mean?

Distributed Data Management, SoSe 2015, S. Michel 46

Page 47: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Configurations

Distributed Data Management, SoSe 2015, S. Michel 47

R/W Configuration Kind of Consistency

W=N and R=1 Read optimized strong consistency.

W=1 and R=N Write optimized strong consistency.

W+R<=N Eventual consistency. Read might miss recent writes.

W+R>N Strong consistency. Read will see at least one most recent write.

Page 48: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

How Consistent is Eventual Consistency?

• Goal is to quantify eventual consistency

• Probability that we miss reading the last write.

• What does that mean for real systems? (Let’s put in some numbers …..)

Distributed Data Management, SoSe 2015, S. Michel 48

Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica: Probabilistically Bounded Staleness for Practical Partial Quorums. PVLDB 5(8): 776-787 (2012)

Page 49: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Putting in Some Numbers

N W R ps3 1 1 0.6676 1 1 0.883

30 5 5 0.40030 10 10 0.00614930650 20 20 6,37503E-07

100 30 30 1,884E-06

Distributed Data Management, SoSe 2015, S. Michel 49

• Does this look promising? Well, for some cases at least.

• Asymptotically is appears to work out well.

Page 50: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

More Probabilities

• How likely is it that we get a version that is not older than k versions (form now)?

Distributed Data Management, SoSe 2015, S. Michel 50

• When N=3, R=W=1, this means that the probability of returning a version within 2 versions is 0.556, within 3 versions, 0.703, 5 versions, 0.868, 10 versions, 0.98.

• Note that this is assuming that the set of updated replicas is not growing over time.

• There are more complex models in the paper to handle other cases.

Page 51: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Example Amazon’s Dynamo

• Key/Value store

• Simple operations: CRUD on a per key basis

• Aiming at high availability: “Never reject a write”– Leading to different versions of data (due to node

failures but also concurrent writes)

• Conflicting versions are reconciled at read/write (using vector clocks)

• Data placement: consistent hashing (with virtual nodes)

Distributed Data Management, SoSe 2015, S. Michel 51

Page 52: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Vector Clocks

• Idea: each node gets separate counter

• By C. Fidge and F. Mattern in 1988 (independently)

• Vector clock: Vector of counters

[c0, c1, …, cn] ci is counter for node i

Initialization: all ci are zero: [c0, c1, …, cn]

Upon event at local event at node i: node increments ci in its vector.

Sending clock: node i increments ci and sends vector

Distributed Data Management, SoSe 2015, S. Michel 52

Page 53: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Vector Clocks: Merging upon Receive

• When node i receives clock of other node

– node i merges its vector clock VC with the received one VCother

– as follows:

increment own counter ci, i.e., VC[i]=VC[i]+1

for each j do

VC [j] = max(VC[j], VCother[j])

end

Distributed Data Management, SoSe 2015, S. Michel 53

Page 54: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 54

tim

e

b [2,0,0]

a [1,0,0]

d [4,0,3]

l [0,1,0]

m [1,2,0]

c [3,0,0] n [1,3,2]

o [1,4,2]

p [3,5,2]

q [3,6,2]

v [0,0,1]

w [0,0,2]

x [0,0,3]

y [0,0,4]

z [2,4,5]

Process 1 Process 2 Process 3

Page 55: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Comparing Two Vector Clocks

• VC1 = VC2,

iff VC1[i] = VC2[i], for all i = 1, … , n

• VC1 ≤ VC2,

iff VC1[i] ≤ VC2[i], for all i = 1, … , n

• VC1 < VC2,

iff VC1 ≤ VC2 &

j (1 ≤ j ≤ n & VC1[j] < VC2 [j])

• VC1 is concurrent with VC2

iff (not VC1 ≤ VC2 AND not VC2 ≤ VC1)

Distributed Data Management, SoSe 2015, S. Michel 55

Page 56: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Example Story

• “Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave also exchanges email with Ben, and they decide on Tuesday. Also, Dave discuss alternatives with Cathy, and they decide on Thursday instead. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.”

Distributed Data Management, SoSe 2015, S. Michel 56

source: http://basho.com/why-vector-clocks-are-easy/

Page 57: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 57

date = Wednesdayvclock = Alice:1

http://basho.com/why-vector-clocks-are-hard/

“Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave also exchanges email with Ben, and they decide on Tuesday. Also, Dave discuss alternatives with Cathy, and they decide on Thursday instead. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.”

Start with Alice’s initial message

Page 58: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 58

date = Tuesdayvclock = Alice:1, Ben:1

http://basho.com/why-vector-clocks-are-hard/

Now Dave and Ben start talking. Ben

suggests Tuesday:

“Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave also exchanges email with Ben, and they decide on Tuesday. Also, Dave discuss alternatives with Cathy, and they decide on Thursday instead. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.”

Page 59: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 59

date = Tuesdayvclock = Alice:1, Ben:1, Dave:1

http://basho.com/why-vector-clocks-are-hard/

Dave replies, confirming Tuesday

“Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave also exchanges email with Ben, and they decide on Tuesday. Also, Dave discuss alternatives with Cathy, and they decide on Thursday instead. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.”

Page 60: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 60

date = Thursdayvclock = Alice:1, Cathy:1

Now Cathy is suggesting Thursday

“Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave also exchanges email with Ben, and they decide on Tuesday. Also, Dave discuss alternatives with Cathy, and they decide on Thursday instead. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.”

Page 61: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Comparing Vector Clocks

• Dave has the following clocks:

• Conflict because neither clocks descends from the other.

Distributed Data Management, SoSe 2015, S. Michel 61

date = Tuesdayvclock = Alice:1, Ben:1, Dave:1

date = Thursdayvclock = Alice:1, Cathy:1

Page 62: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Dave Resolves Conflict …

• … by choosing Thursday

• New clock tells that it is successor of the two previous clocks!

Distributed Data Management, SoSe 2015, S. Michel 62

date = Thursdayvclock = Alice:1, Ben:1, Cathy:1, Dave:2

Page 63: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 63

date = Thursdayvclock = Alice:1, Ben:1, Cathy:1, Dave:2

Page 64: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Distributed Data Management, SoSe 2015, S. Michel 64http://basho.com/why-vector-clocks-are-hard/

Ben informed Alice about Tuesday, Dave

informs Cathy and Alice about Thursday.

That means, Alice gets two different

proposals! A problem?

Page 65: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Again Conflict?

• Alice gets from Ben

• and from Cathy

• Conflict? No. Cathy’s clock is successor of Ben’s

Distributed Data Management, SoSe 2015, S. Michel 65

date = Tuesdayvclock = Alice:1, Ben:1, Dave:1

date = Thursday vclock = Alice:1, Ben:1, Cathy:1, Dave:2

Page 66: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

And we are Done

Distributed Data Management, SoSe 2015, S. Michel 66http://basho.com/why-vector-clocks-are-hard/

Page 67: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Conflicts and Their Resolution

• Assume two or more conflicting versions of the same object/item.

• What can the database do?

– Limited possibilities since application logic is not known. E.g., take most “recent” one

• What can client software do?

– Full-fledged resolution, since app logic is known.

Distributed Data Management, SoSe 2015, S. Michel 67

Page 68: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Conflict Resolution: Example

• Typical use case at Amazon

• Multiple versions of shopping cart – merged by a union of their contents

– what can go wrong? might put back a deleted item (but you wont miss any items=>don’t loose money)

• Let’s see how one would work with multiple versions in a real system ….

Distributed Data Management, SoSe 2015, S. Michel 68

Page 69: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Riak

• Key/Value store

• With namespaces (buckets)

• Queries:

– CRUD (create, retrieve, update, delete)

– MapReduce

– Riak Search (i.e., full text search engine)

– Support of secondary indices

Distributed Data Management, SoSe 2015, S. Michel 69

http://basho.com/riak/

Page 70: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Riak Architecture

• Set of equal nodes (no master)

• Placement of data: consistent hashing (will see later)

• Replication (default: 3 per object)

• Fault tolerant

• Various different setups (choices) for consistency: R, W, number of copies, etc.

Distributed Data Management, SoSe 2015, S. Michel 70

http://docs.basho.com/riak/1.2.1/references/appendices/concepts/Eventual-Consistency/

Page 71: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Riak: Parameters Last Write Wins and Allow Multiple

• multiple versions = false

– Last write wins = false: then use “time” annotations to objects and TAs for conflict resolution

– Last write wins = true: then just consider last write.

• multiple versions = true

– Last write wins = false: then retain even concurrent writes, client (application) has to resolve

– Last write wins = true: Don’t do this, unpredictable behavior…..

Distributed Data Management, SoSe 2015, S. Michel 71

See: http://docs.basho.com/riak/2.0.1/dev/using/conflict-resolution/

Page 72: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

API: GET

• curl -v http://127.0.0.1:8098/riak/test/doc

• Response: HTTP/1.1 200 OK

• Plus: Content of the document

• But could also end up with

• HTTP/1.1 300 Multiple Choices …

• Plus: a number of versions …

Distributed Data Management, SoSe 2015, S. Michel 72

http://docs.basho.com/riak/1.2.1/references/apis/http/HTTP-Fetch-Object/

Page 73: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Riak: Siblings – Different Versions

Siblings:

16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl

Distributed Data Management, SoSe 2015, S. Michel 73

http://docs.basho.com/riak/1.2.1/references/apis/http/HTTP-Fetch-Object/

Page 74: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Riak: Get Specific Version

• curl -v http://127.0.0.1:8098/riak/test/doc?vtag=16vic4eU9ny46o4KPiDz1f

Distributed Data Management, SoSe 2015, S. Michel 74

Page 75: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Get ALL Versions

--YinLMzyUR9feB17okMytgKsylvh Content-Type: application/json Link: </riak/test>; rel="up" Etag: 6nr5tDTmhxnwuAFJDd2s6G Last-Modified: Wed, 10 Mar 2010 17:58:08 GMT {"bar":"baz"}

--YinLMzyUR9feB17okMytgKsylvh Content-Type: application/json Link: </riak/test>; rel="up" Etag: 6zRSZFUJlHXZ15o9CG0BYl Last-Modified: Wed, 10 Mar 2010 17:55:03 GMT {"foo":"bar"}

.....Distributed Data Management, SoSe 2015, S. Michel 75

curl -v http://127.0.0.1:8098/riak/test/doc -H "Accept: multipart/mixed"

Page 76: Distributed Data Management - - TU Kaiserslauterndbis.informatik.uni-kl.de/files/teaching/ss15/ddm/... ·  · 2015-07-29Distributed Data Management Summer Semester 2015 TU Kaiserslautern

Comments

• Gives good overview of what it takes to work with Riak. Particularly in terms of managing conflicting (multiple) versions.

• https://github.com/basho/riak-java-client

• Next to vector clocks Riak supports also dotted version vectors

Distributed Data Management, SoSe 2015, S. Michel 76

More details on conflict resolution in a more recent version of Riak:http://docs.basho.com/riak/2.0.1/dev/using/conflict-resolution/