HS-201001-TA-Repl- 47
9.6 Multimaster replication
[Figure: five equal database nodes, each a master]
"All citizens are equal"
• May result in unresolvable conflicts
• Detected when updates are propagated
• Need auxiliary data which reflect the updates of object x at different sites

Not only an academic exercise: in some applications data may be updated according to geographic location: "Employees in Berlin / New York". Updates happen primarily at the home location.
Multimaster
Also a multimaster scenario, if the disconnected devices may update independently.

Master responsible for shipping updates it learns from replicas to all other replicas.
[Figure: master DB connected to several disconnected copies (DB)]
More general case: every replica can synchronize at any time with any other replica (node).
Versions and ordering
x → x' → x''     (independent updates and synchronisation)

No problem: versions of x follow each other
(happens-before, precedence relation)

[Figure: replicas R and S each derive a new version of x'' independently (x''', x'''')]
Update and sync anywhere at any time ⇒ conflict if there was a version which has been overwritten independently by two (replication) nodes.
System model
• Transactions read / write at arbitrary replicas
• No abort (simplifying assumption)
• Only the first update of an object x in a TA defines the version id
• Objects may be tables, files, rows, ...
• The version id identifies the last value of x written by a TA
• Since x and y updated by R are causally related: the version of x is the update count of R
  e.g. [R,7] is the version of x @ R, [R,8] the version of y, if x and y are updated subsequently
Version id
A version id [R, updateCount] for objects x is not sufficient for ordering:

Replica R: ..... x8  y9
Replica S: x7
How do y9 and x10 compare??

• Which order?
• How can conflicts be detected?
Multimaster Sync

Task: find data structures and sync algorithms which allow to detect conflicts, i.e. there are transactions T1 at R and T2 at S which have not seen the output of each other, but produced a new version.
Order of versions
a) xi directly precedes xj if there are TAs t1, t2 such that t1 reads x and writes xi, and t2 reads xi and writes xj
b) xi precedes xj if xi directly precedes xj, or there is a sequence of versions xi, xi+1, ....., xj and xi directly precedes xi+1, ... (transitive closure)
Multimaster: data structures
For each replica Ri:
• a version vector [[R1,c1], ...., [Rn,cn]]: the number of updates Ri has received from every other replica
• an update count [Ri, c]

If the nodes are ordered ⇒ version vector = vector of update counts,
e.g. nodes R, S, T:
  R: [8, 5, 10]
  S: [4, 5, 9]
  T: [5, 4, 11]
R has update count 8, has seen all but one update of T, and all of S, etc.

Each data item x has a version id [Ri, c]: x has been updated @ Ri with update count c.
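The version vector and its partial order can be sketched in a few lines of Python (an illustration of the slide's data structure; the function name is ours):

```python
def compare(vr, vs):
    """Return '<', '>', '=' or 'incomparable' for two version vectors."""
    le = all(a <= b for a, b in zip(vr, vs))   # vr <= vs componentwise
    ge = all(a >= b for a, b in zip(vr, vs))   # vr >= vs componentwise
    if le and ge:
        return '='
    if le:
        return '<'
    if ge:
        return '>'
    return 'incomparable'

# Slide example with nodes R, S, T:
VR = [8, 5, 10]
VS = [4, 5, 9]
VT = [5, 4, 11]

print(compare(VS, VR))   # '<'  : S's vector is componentwise <= R's
print(compare(VR, VT))   # 'incomparable': neither dominates
```

Incomparable vectors are exactly the situation in which the sync rules below report a conflict.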
Multimaster Sync
Let R, S be nodes with version vectors VR = [c1,...,cn], VS = [d1,...,dn].

If R wants to synchronize with S:
(1) R sends VR to S
(2) S sends VS and all updates of all objects x which satisfy: let VR[i] = k and the version id of x be [Ri,c]; then k < c
    ... because R has not seen the update of x made by Ri
(3) R updates its version vector and the objects x received, if no conflict!
Multimaster
Partial order on version vectors:
VR < VS if for all i: VR[i] ≤ VS[i]
VS < VR if for all i: VS[i] ≤ VR[i]
else incomparable.

Update rules
(1) TA t executes at R with update count [R,c]. Each modified x gets version [R,c]; then c++
(2) Sync: sending x from S to R ...
(3) Conflict?

Goal of the rules: if version xi overwrites version xj, then xj precedes xi.
Multimaster Sync
Update rules (cont.)
(2) x sent from S to R; let
      version id of x @ R = [Rk, d]
      version id of x @ S = [Ri, c]
    If VR[i] > c, then discard the version of x sent
      (since R has already received a 'higher' update from Ri)
    If VS[k] > d, then replace x with the version received from S, with version id [Ri,c]
      (since S received the version of x produced by Rk before overwriting it)
    Update the version vector.
(3) VR and VS incomparable: conflict
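Update rule (2) can be sketched as follows, using the strict inequalities exactly as stated on the slide (function and variable names are ours, not from any real system):

```python
def apply_update(VR, VS, local_vid, incoming_vid):
    """Decide what R does with the version of x received from S.

    VR, VS       : version vectors of R and S (lists of update counts)
    local_vid    : (k, d) -- x @ R was written by replica k, update count d
    incoming_vid : (i, c) -- x @ S was written by replica i, update count c
    """
    i, c = incoming_vid
    k, d = local_vid
    if VR[i] > c:        # R already received a 'higher' update from Ri
        return 'discard'
    if VS[k] > d:        # S saw the version R holds before overwriting it
        return 'replace'
    return 'conflict'    # independent overwrites: neither side dominates

print(apply_update([3, 2], [3, 4], (0, 2), (1, 1)))   # discard
print(apply_update([3, 1], [3, 4], (0, 2), (1, 4)))   # replace
print(apply_update([3, 1], [2, 4], (0, 2), (1, 4)))   # conflict
```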
Multimaster / version vectors
• Conflicts must be resolved by the application
  • except for some simple strategies: last update wins, ....
• Better solution: replicas retain conflicting updates (versions of x) and present them to the application.

Correctness of replica update?
Easy to see with version vectors for each object (!). More subtle with only the version of the object and the version vector of the replica. Show that the goal of the rules is achieved:
xi overwrites version xj only if xj precedes xi.
Example
Example by Bernstein / Newcomer.

Conflict situation: x has been updated independently by R1 and R2.
Example
R3 receives T2's update and can tell whether it ran before or after R2 received T1's update, provided version vectors are used.
9.7 Replication in the real world
Typically simpler solutions, oriented towards the most important scenarios.
Asynchronous mode; terminology of vendors differs.
Typical global architecture: data changes are captured and staged to the target for consumption.
[source: Oracle]
Important scenarios
High availability

[Figure: master ships log data (e.g. via msg. queue or redo log device) to a hot standby]

• Might be synchronous, but that would slow down TA processing at the master
• Transfer of log data, replay at the standby
• Insert into the message queue may be part of the TA at the primary ⇒ no TA lost
• Take-over within seconds, the time needed to replay pending TAs

Supported by most systems; Oracle: specific multi-master configuration (both [a]synchronous)
Scenario: scaling
Scaling of read workload

[Figure: primary copy/master propagates the TA / command log (log device, msg queue) to read-only copies/slaves]

MySQL replication:
- slaves read the command log @ master
- restart of a slave: uses the numbering of commands

Oracle: read-only materialized view

Low update traffic, unidirectional refresh; failure of a slave ⇒ slight read-performance decrease
Scenario: clients with update right
Typical situation: mobile clients

• Low update traffic, bidirectional refresh
• Frequently trigger-based update on both sides; acceptable if the update rate is low, e.g. msg-queue based communication
• Conflicts may have to be resolved manually

Oracle: updatable materialized view
Replication Manager

Dedicated server for coordinating replication-specific tasks
– IBM: "Data Propagator"
– Sybase: "Replication Server"
– MS: "SQL Server Synchronization Mgr"
– Oracle: "Replication Mgr" (Siebel)

Typically hierarchically structured: replication server ("staging server")

• Different types of data-refreshment policies
• Different kinds of technical data exchange, e.g. msg queues, publish/subscribe, etc.
Oracle 8i
Can replicate updates to table fragments or stored-procedure calls at the master copy.

Uses triggers to capture updates in a deferred queue
– Updates are row-oriented, identified by primary key
– Can optimize by sending keys and updated columns only

Groups updates by transaction, which are propagated:
– either serially in commit order, or
– in parallel with some dependent transaction ordering: each read reads the "commit number" of the data item; updates are ordered by dependent commit number

Snapshots (= materialized views) are updated in a batch refresh
– pushed from master to snapshots, using a queue scheduler
Oracle replication: overall picture

slide by G. Alonso, ETH

Very flexible solution, (nearly) everything allowed!
Not shown (and not required!?): replication manager
Multimaster Peer-to-Peer Replication
– keeps all copies up to date
– transactional guarantees

How? Conclusion from experiments, talks and personal communication: table locks (!)
May be OK in particular situations, but in general?
Multimaster replication: peer-to-peer
Multi-master replication without a primary: Wingman

Each row of a table has 4 additional columns:
– globally unique id (GUID)
– generation number, to determine which updates from other replicas have been applied
– version number = a count of the number of updates to this row
– array of [replica, version number] pairs, identifying the largest version number received for this row from every other replica

adapted from Phil Bernstein
Used in Microsoft Access 7.0 and Visual Basic 4.0
Multimaster replication: "MS-Wingman"
Each replica has a current generation number.
A replica updates a row's generation number whenever it updates the row.
A replica remembers, for every replica R', the generation number it had when it last exchanged updates with R'.
A replica increments its generation number every time it exchanges updates with another replica.
So, when exchanging updates with R', it sends all rows with a generation number larger than what it had when last exchanging updates with R'.

adapted from Phil Bernstein
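The generation-number bookkeeping above can be sketched as follows (a hypothetical illustration; class and method names are ours, not Wingman's actual API):

```python
class Replica:
    """Toy model of one Wingman replica's generation bookkeeping."""

    def __init__(self):
        self.generation = 1
        self.rows = {}        # row id -> generation at last local update
        self.last_sent = {}   # peer id -> own generation at last exchange

    def update_row(self, row_id):
        # A local update stamps the row with the current generation.
        self.rows[row_id] = self.generation

    def rows_to_send(self, peer):
        """All rows updated since the last exchange with this peer."""
        since = self.last_sent.get(peer, 0)
        send = [r for r, g in self.rows.items() if g > since]
        self.last_sent[peer] = self.generation
        self.generation += 1   # bump own generation after the exchange
        return send

r = Replica()
r.update_row('row-a')
print(r.rows_to_send('S'))   # first exchange: ['row-a']
print(r.rows_to_send('S'))   # nothing new since then: []
```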
Wingman update processing
Use Thomas' Write Rule to process an update from another replica:
– compare the update's and the row's version numbers
– the one with the larger version number wins (use replica id to break ties)
– yields the same result at both replicas, but maybe not serializable
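As described here, the rule reduces to a tuple comparison; a minimal sketch (names are ours):

```python
def thomas_write_rule(local, incoming):
    """Resolve an incoming row update; each side is (version number, replica id).
    The larger version number wins; the replica id breaks ties, so both
    replicas converge to the same value (though not necessarily serializably)."""
    return max(local, incoming)   # tuple comparison: version first, then id

# Replica A updated twice (version 3), B once (version 2): A wins,
# and B's update is lost (the non-serializable case on the next slide).
print(thomas_write_rule((3, 'A'), (2, 'B')))   # (3, 'A')
```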
Wingman: not serializable
Suppose two replicas perform updates to x:
– Replica A does 2 updates, incrementing the version number from 1 to 3
– Replica B does 1 update, incrementing the version number from 1 to 2
– When they exchange updates, replica A has the higher version number and wins, causing replica B's update to be lost

For this reason, rejected updates are retained in a conflict table for later analysis.
Wingman: rejecting duplicate updates
Some rejected updates are duplicates. To identify them:
– When applying an update to x, replace x's array of [replica, version#] pairs by the update's array.
– To avoid processing the same update via many paths, check the version number of an arriving update against the array.

Consider a rejected update to x at R from R', where [R', V] describes R' in x's array, and V' is the version number sent by R'.
– If V ≥ V', then R already saw R''s updates (duplicate)
– If V < V', then R didn't see R''s updates, so store it in the conflict table for later reconciliation
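The V vs. V' check can be sketched like this (a hypothetical illustration; the dictionary stands in for x's array of [replica, version#] pairs):

```python
def classify_rejected_update(x_array, sender, v_sent):
    """Was a rejected update from `sender` a duplicate or a real conflict?

    x_array : dict mapping replica id -> largest version number seen
              for this row from that replica
    v_sent  : version number carried by the rejected update
    """
    v_seen = x_array.get(sender, 0)
    if v_seen >= v_sent:
        return 'duplicate'   # we already saw the sender's updates
    return 'conflict'        # never seen: keep it for reconciliation

print(classify_rejected_update({'S': 5}, 'S', 4))   # duplicate
print(classify_rejected_update({'S': 3}, 'S', 4))   # conflict
```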
9.8 Replication and Consistency @ Google
GFS
– big chunks of data (64 MB blocks)
– heavily replicated
– controlled by a master, which is replicated as well
– important status data (e.g. who is primary) held in the master's data structures
– these data are persistently replicated

Chubby
Lock service based on Paxos consensus; locks according to reader-writer locking: n readers, or one writer and no readers.
Chubby
Use cases
– GFS: elect a master
– BigTable: master election, client discovery, table-service locking
– well-known location to bootstrap larger systems
– partition workloads

Locks should be coarse: held for hours or days; build your own fast locks on top.
Chubby

[Figure: one Chubby "cell" of five replicas; all client traffic goes to the master replica]

• Master: has all the information about chunks, node failures, locks, etc.
• Readers / writers have to lock chunks before read / write
• Loss of the master = disaster!
Chubby

• Typically 5 Chubby servers (one cell) in different racks
• Responsible for a data center
• Master election using Paxos
• Master lease: promise not to elect a new master for some time (see below)
• Clients will access the master or replicas found in DNS, but all reads / writes are forwarded to the master
• Write requests are propagated to the replicas by a consensus protocol
Fault-tolerant locking service

• Lock service used by GFS, BigTable, etc.
• Holds all kinds of metadata
• Replicated for fault tolerance, not performance

Locking: reader/writer model, many readers, at most one writer

[Figure: replica layers: Chubby protocol RPC (client network), replica network, local file system IO, Paxos, file transfer / snapshot]
Why leases?

Goal: make reads cheaper.

A read request would otherwise need the consensus of sufficiently many (3) replicas: a new master could have been elected, and the value to be read may differ between replicas.

Master lease: a promise not to elect a new master as long as the lease is valid.
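The effect of the lease can be illustrated with a toy sketch (ours, not Chubby's actual implementation): while the lease is valid, the master may answer reads locally without running consensus.

```python
import time

class MasterLease:
    """Toy lease: replicas promised not to elect a new master until `expires`,
    so the master can serve reads locally inside that window."""

    def __init__(self, duration_s):
        self.expires = time.monotonic() + duration_s

    def can_serve_read_locally(self):
        # Outside the lease window a new master may exist elsewhere,
        # so a local read would be unsafe.
        return time.monotonic() < self.expires

lease = MasterLease(duration_s=12.0)
print(lease.can_serve_read_locally())   # True within the lease window
```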
But writes...
Write requests

Write requests are performed on the master and propagated to the replicas using Paxos. In case of agreement (3 of 5 replicas alive): ack to the client.
Log entries are propagated for the values to be written; one instance of Paxos is started for each log entry
⇒ Multi-Paxos = agreement on a sequence of values

Many subtle engineering problems, see reader.
9.9 Mobile Databases, a brief overview
IBM DB2 Everyplace
Oracle 9i Lite
Sybase UltraLite
Tamino Mobile
Pointbase Micro
eXtremeDB http://www.mcobject.com/milaero.shtml

Differences:
– synchronisation with base stations
– application development (tools, platforms, …)

see: Mutschler, Specht: Mobile Datenbanksysteme, Springer 2004.
Application Architectures

Standard client-server: the DBS is separated from the application

[Figure: mobile clients, wireless network connection, middleware (synchronisation server), fixed network connection, DB server (application, DB)]

Systems
• IBM DB2 Everyplace
• Oracle 9i Lite

slides adapted from Mutschler / Specht
Application Architectures

Integrated mobile DB: DBS and application integrated; saves memory space, only the functions needed are linked

[Figure: mobile application with integrated database, wireless network connection, optional middleware, fixed network connection, DB server]

Systems
• Sybase UltraLite
• Pointbase Micro
• eXtremeDB (main-memory DB)
Database Engine

Typically relational. Functionality differs considerably.
Top end: DB2 Everyplace, a full-fledged DBS.
Systems configurable (100 to ~500 KB)
Synchronisation of ReplicaSynchronisationSynchronisation of Replicaof Replica
MobilerClient
Synchronisations-antwort
Mid-Tier-System
Source System
Spiegeldatenbank Quelldatenbank
InputQueue
Mirror Table
Source Table
Change Data Table1
2
3
4
DB2 Everyplace: Mirror-DB
Synchronisation of Replica

Oracle Lite
– snapshot based
– snapshot = materialized view

Full Refresh: transmit all tuples of the snapshot query
Fast Refresh: use snapshot logs
Force Refresh: mixed full / fast
Mobile Middleware

[Figure: client with Oracle Lite database; the Mobile Server (middleware) runs either standalone or as an MS module inside Oracle HTTP Server, Oracle 9i AS (WE), or Apache; Oracle 9i database on the server]

Middleware should be non-proprietary! How to connect a client with a server from a different vendor? → standards
SyncML Platform

Not only client/server sync, but also client/client.
Summary

• Replication is intended for availability rather than for throughput / response-time enhancement (more or less)
• Transactional guarantees are costly
• Atomicity and prevention of lost updates may be OK in many applications, i.e. isolation level Read Uncommitted ⇒ more update performance (e.g. asynchronous update propagation possible)
• Replication of tables with high-frequency updates does not make much sense (in general) ... except as backup
• Sophisticated (and confusing!) solutions by vendors
• Formidable task for the DB administrator: deciding when and what to replicate