HS-201001-TA-Repl- 47
9.6 Multimaster replication
[Figure: five equal database nodes, each a master]
"All citizens are equal"
• May result in unresolvable conflicts
• Detected when updates are propagated
• Need auxiliary data which reflect the updates of object x at different sites

Not only an academic exercise: in some applications data may be updated according to geographic location: "Employees in Berlin / New York". Updates happen primarily at the home location.
Multimaster
Also a multimaster scenario, if the disconnected devices may update independently.

Master responsible for shipping updates it learns from replicas to all other replicas.
[Figure: master DB connected to several disconnected copies (DB)]
More general case: every replica can synchronize at any time with any other replica (node).
Versions and ordering
x → x' → x''     (independent updates and synchronisation)

No problem: versions of x follow each other
(happens-before, precedence relation)

[Figure: replicas R and S each derive a new version of x'' independently (x''', x'''')]
Update and sync anywhere at any time ⇒ conflict if there was a version which has been overwritten independently by two (replication) nodes.
System model
• Transactions read / write at arbitrary replicas
• No abort (simplifying assumption)
• Only the first update of an object x in a TA defines the version id
• Objects may be tables, files, rows, ...
• The version id identifies the last value of x written by a TA
• Since x and y updated by R are causally related: the version of x is the update count of R
  e.g. [R,7] is the version of x @ R, [R,8] the version of y, if x and y are updated subsequently
Version id
A version id [R, updateCount] for objects x is not sufficient for ordering:

Replica R: ..... x8  y9
Replica S: x7
How do y9 and x10 compare??

• Which order?
• How can conflicts be detected?
Multimaster Sync

Task: find data structures and sync algorithms which allow to detect conflicts, i.e. there are transactions T1 at R and T2 at S which have not seen the output of each other, but produced a new version.
Order of versions
a) xi directly precedes xj if there are TAs t1, t2 such that t1 reads x and writes xi, and t2 reads xi and writes xj
b) xi precedes xj if xi directly precedes xj, or there is a sequence of versions xi, xi+1, ....., xj and xi directly precedes xi+1, ... (transitive closure)
Multimaster: data structures
For each replica Ri:
• a version vector [[R1,c1], ...., [Rn,cn]]: the number of updates Ri has received from every other replica
• an update count [Ri, c]

If the nodes are ordered ⇒ version vector = vector of update counts,
e.g. nodes R, S, T:
  R: [8, 5, 10]
  S: [4, 5, 9]
  T: [5, 4, 11]
R has update count 8, has seen all but one update of T, and all of S, etc.

Each data item x has a version id [Ri, c]: x has been updated @ Ri with update count c.
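The version vector and its partial order can be sketched in a few lines of Python (an illustration of the slide's data structure; the function name is ours):

```python
def compare(vr, vs):
    """Return '<', '>', '=' or 'incomparable' for two version vectors."""
    le = all(a <= b for a, b in zip(vr, vs))   # vr <= vs componentwise
    ge = all(a >= b for a, b in zip(vr, vs))   # vr >= vs componentwise
    if le and ge:
        return '='
    if le:
        return '<'
    if ge:
        return '>'
    return 'incomparable'

# Slide example with nodes R, S, T:
VR = [8, 5, 10]
VS = [4, 5, 9]
VT = [5, 4, 11]

print(compare(VS, VR))   # '<'  : S's vector is componentwise <= R's
print(compare(VR, VT))   # 'incomparable': neither dominates
```

Incomparable vectors are exactly the situation in which the sync rules below report a conflict.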
Multimaster Sync
Let R, S be nodes with version vectors VR = [c1,...,cn], VS = [d1,...,dn].

If R wants to synchronize with S:
(1) R sends VR to S
(2) S sends VS and all updates of all objects x which satisfy: let VR[i] = k and the version id of x be [Ri,c]; then k < c
    ... because R has not seen the update of x made by Ri
(3) R updates its version vector and the objects x received, if no conflict!
Multimaster
Partial order on version vectors:
VR < VS if for all i: VR[i] ≤ VS[i]
VS < VR if for all i: VS[i] ≤ VR[i]
else incomparable.

Update rules
(1) TA t executes at R with update count [R,c]. Each modified x gets version [R,c]; then c++
(2) Sync: sending x from S to R ...
(3) Conflict?

Goal of the rules: if version xi overwrites version xj, then xj precedes xi.
Multimaster Sync
Update rules (cont.)
(2) x sent from S to R; let
      version id of x @ R = [Rk, d]
      version id of x @ S = [Ri, c]
    If VR[i] > c, then discard the version of x sent
      (since R has already received a 'higher' update from Ri)
    If VS[k] > d, then replace x with the version received from S, with version id [Ri,c]
      (since S received the version of x produced by Rk before overwriting it)
    Update the version vector.
(3) VR and VS incomparable: conflict
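Update rule (2) can be sketched as follows, using the strict inequalities exactly as stated on the slide (function and variable names are ours, not from any real system):

```python
def apply_update(VR, VS, local_vid, incoming_vid):
    """Decide what R does with the version of x received from S.

    VR, VS       : version vectors of R and S (lists of update counts)
    local_vid    : (k, d) -- x @ R was written by replica k, update count d
    incoming_vid : (i, c) -- x @ S was written by replica i, update count c
    """
    i, c = incoming_vid
    k, d = local_vid
    if VR[i] > c:        # R already received a 'higher' update from Ri
        return 'discard'
    if VS[k] > d:        # S saw the version R holds before overwriting it
        return 'replace'
    return 'conflict'    # independent overwrites: neither side dominates

print(apply_update([3, 2], [3, 4], (0, 2), (1, 1)))   # discard
print(apply_update([3, 1], [3, 4], (0, 2), (1, 4)))   # replace
print(apply_update([3, 1], [2, 4], (0, 2), (1, 4)))   # conflict
```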
Multimaster / version vectors
• Conflicts must be resolved by the application
  • except for some simple strategies: last update wins, ....
• Better solution: replicas retain conflicting updates (versions of x) and present them to the application.

Correctness of replica update?
Easy to see with version vectors for each object (!). More subtle with only the version of the object and the version vector of the replica. Show that the goal of the rules is achieved:
xi overwrites version xj only if xj precedes xi.
Example
Example by Bernstein / Newcomer.

Conflict situation: x has been updated independently by R1 and R2.
Example
R3 receives T2's update and can tell whether it ran before or after R2 received T1's update, provided version vectors are used.
9.7 Replication in the real world
Typically simpler solutions, oriented towards the most important scenarios.
Asynchronous mode; terminology of vendors differs.
Typical global architecture: data changes are captured and staged to the target for consumption.
[source: Oracle]
Important scenarios
High availability

[Figure: master ships log data (e.g. via msg. queue or redo log device) to a hot standby]

• Might be synchronous, but that would slow down TA processing at the master
• Transfer of log data, replay at the standby
• Insert into the message queue may be part of the TA at the primary ⇒ no TA lost
• Take-over within seconds, the time needed to replay pending TAs

Supported by most systems; Oracle: specific multi-master configuration (both [a]synchronous)
Scenario: scaling
Scaling of read workload

[Figure: primary copy/master propagates the TA / command log (log device, msg queue) to read-only copies/slaves]

MySQL replication:
- slaves read the command log @ master
- restart of a slave: uses the numbering of commands

Oracle: read-only materialized view

Low update traffic, unidirectional refresh; failure of a slave ⇒ slight read-performance decrease
Scenario: clients with update right
Typical situation: mobile clients

• Low update traffic, bidirectional refresh
• Frequently trigger-based update on both sides; acceptable if the update rate is low, e.g. msg-queue based communication
• Conflicts may have to be resolved manually

Oracle: updatable materialized view
Replication Manager

Dedicated server for coordinating replication-specific tasks
– IBM: "Data Propagator"
– Sybase: "Replication Server"
– MS: "SQL Server Synchronization Mgr"
– Oracle: "Replication Mgr" (Siebel)

Typically hierarchically structured: replication server ("staging server")

• Different types of data-refreshment policies
• Different kinds of technical data exchange, e.g. msg queues, publish/subscribe, etc.
Oracle 8i
Can replicate updates to table fragments or stored-procedure calls at the master copy.

Uses triggers to capture updates in a deferred queue
– Updates are row-oriented, identified by primary key
– Can optimize by sending keys and updated columns only

Groups updates by transaction, which are propagated:
– either serially in commit order, or
– in parallel with some dependent transaction ordering: each read reads the "commit number" of the data item; updates are ordered by dependent commit number

Snapshots (= materialized views) are updated in a batch refresh
– pushed from master to snapshots, using a queue scheduler
Oracle replication: overall picture

slide by G. Alonso, ETH

Very flexible solution, (nearly) everything allowed!
Not shown (and not required!?): replication manager
Multimaster Peer-to-Peer Replication
– keeps all copies up to date
– transactional guarantees

How? Conclusion from experiments, talks and personal communication: table locks (!)
May be OK in particular situations, but in general?
Multimaster replication: peer-to-peer
Multi-master replication without a primary: Wingman

Each row of a table has 4 additional columns:
– globally unique id (GUID)
– generation number, to determine which updates from other replicas have been applied
– version number = a count of the number of updates to this row
– array of [replica, version number] pairs, identifying the largest version number received for this row from every other replica

adapted from Phil Bernstein
Used in Microsoft Access 7.0 and Visual Basic 4.0
Multimaster replication: "MS-Wingman"
Each replica has a current generation number.
A replica updates a row's generation number whenever it updates the row.
A replica remembers, for every replica R', the generation number it had when it last exchanged updates with R'.
A replica increments its generation number every time it exchanges updates with another replica.
So, when exchanging updates with R', it sends all rows with a generation number larger than what it had when last exchanging updates with R'.

adapted from Phil Bernstein
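The generation-number bookkeeping above can be sketched as follows (a hypothetical illustration; class and method names are ours, not Wingman's actual API):

```python
class Replica:
    """Toy model of one Wingman replica's generation bookkeeping."""

    def __init__(self):
        self.generation = 1
        self.rows = {}        # row id -> generation at last local update
        self.last_sent = {}   # peer id -> own generation at last exchange

    def update_row(self, row_id):
        # A local update stamps the row with the current generation.
        self.rows[row_id] = self.generation

    def rows_to_send(self, peer):
        """All rows updated since the last exchange with this peer."""
        since = self.last_sent.get(peer, 0)
        send = [r for r, g in self.rows.items() if g > since]
        self.last_sent[peer] = self.generation
        self.generation += 1   # bump own generation after the exchange
        return send

r = Replica()
r.update_row('row-a')
print(r.rows_to_send('S'))   # first exchange: ['row-a']
print(r.rows_to_send('S'))   # nothing new since then: []
```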
Wingman update processing
Use Thomas' Write Rule to process an update from another replica:
– compare the update's and the row's version numbers
– the one with the larger version number wins (use replica id to break ties)
– yields the same result at both replicas, but maybe not serializable
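As described here, the rule reduces to a tuple comparison; a minimal sketch (names are ours):

```python
def thomas_write_rule(local, incoming):
    """Resolve an incoming row update; each side is (version number, replica id).
    The larger version number wins; the replica id breaks ties, so both
    replicas converge to the same value (though not necessarily serializably)."""
    return max(local, incoming)   # tuple comparison: version first, then id

# Replica A updated twice (version 3), B once (version 2): A wins,
# and B's update is lost (the non-serializable case on the next slide).
print(thomas_write_rule((3, 'A'), (2, 'B')))   # (3, 'A')
```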
Wingman: not serializable
Suppose two replicas perform updates to x:
– Replica A does 2 updates, incrementing the version number from 1 to 3
– Replica B does 1 update, incrementing the version number from 1 to 2
– When they exchange updates, replica A has the higher version number and wins, causing replica B's update to be lost

For this reason, rejected updates are retained in a conflict table for later analysis.
Wingman: rejecting duplicate updates
Some rejected updates are duplicates. To identify them:
– When applying an update to x, replace x's array of [replica, version#] pairs by the update's array.
– To avoid processing the same update via many paths, check the version number of an arriving update against the array.

Consider a rejected update to x at R from R', where [R', V] describes R' in x's array, and V' is the version number sent by R'.
– If V ≥ V', then R already saw R''s updates (duplicate)
– If V < V', then R didn't see R''s updates, so store it in the conflict table for later reconciliation
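The V vs. V' check can be sketched like this (a hypothetical illustration; the dictionary stands in for x's array of [replica, version#] pairs):

```python
def classify_rejected_update(x_array, sender, v_sent):
    """Was a rejected update from `sender` a duplicate or a real conflict?

    x_array : dict mapping replica id -> largest version number seen
              for this row from that replica
    v_sent  : version number carried by the rejected update
    """
    v_seen = x_array.get(sender, 0)
    if v_seen >= v_sent:
        return 'duplicate'   # we already saw the sender's updates
    return 'conflict'        # never seen: keep it for reconciliation

print(classify_rejected_update({'S': 5}, 'S', 4))   # duplicate
print(classify_rejected_update({'S': 3}, 'S', 4))   # conflict
```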
9.8 Replication and Consistency @ Google
GFS
– big chunks of data (64 MB blocks)
– heavily replicated
– controlled by a master, which is replicated as well
– important status data (e.g. who is primary) held in the master's data structures
– these data are persistently replicated

Chubby
Lock service based on Paxos consensus; locks according to reader-writer locking: n readers, or one writer and no readers.
Chubby
Use cases
– GFS: elect a master
– BigTable: master election, client discovery, table-service locking
– well-known location to bootstrap larger systems
– partition workloads

Locks should be coarse: held for hours or days; build your own fast locks on top.
Chubby

[Figure: one Chubby "cell" of five replicas; all client traffic goes to the master replica]

• Master: has all the information about chunks, node failures, locks, etc.
• Readers / writers have to lock chunks before read / write
• Loss of the master = disaster!
Chubby

• Typically 5 Chubby servers (one cell) in different racks
• Responsible for a data center
• Master election using Paxos
• Master lease: promise not to elect a new master for some time (see below)
• Clients will access the master or replicas found in DNS, but all reads / writes are forwarded to the master
• Write requests are propagated to the replicas by a consensus protocol
Fault-tolerant locking service

• Lock service used by GFS, BigTable, etc.
• Holds all kinds of metadata
• Replicated for fault tolerance, not performance

Locking: reader/writer model, many readers, at most one writer

[Figure: replica layers: Chubby protocol RPC (client network), replica network, local file system IO, Paxos, file transfer / snapshot]
Why leases?

Goal: make reads cheaper.

A read request would otherwise need the consensus of sufficiently many (3) replicas: a new master could have been elected, and the value to be read may differ between replicas.

Master lease: a promise not to elect a new master as long as the lease is valid.
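The effect of the lease can be illustrated with a toy sketch (ours, not Chubby's actual implementation): while the lease is valid, the master may answer reads locally without running consensus.

```python
import time

class MasterLease:
    """Toy lease: replicas promised not to elect a new master until `expires`,
    so the master can serve reads locally inside that window."""

    def __init__(self, duration_s):
        self.expires = time.monotonic() + duration_s

    def can_serve_read_locally(self):
        # Outside the lease window a new master may exist elsewhere,
        # so a local read would be unsafe.
        return time.monotonic() < self.expires

lease = MasterLease(duration_s=12.0)
print(lease.can_serve_read_locally())   # True within the lease window
```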
But writes...
Write requests

Write requests are performed on the master and propagated to the replicas using Paxos. In case of agreement (3 of 5 replicas alive): ack to the client.
Log entries are propagated for the values to be written; one instance of Paxos is started for each log entry
⇒ Multi-Paxos = agreement on a sequence of values

Many subtle engineering problems, see reader.
9.9 Mobile Databases, a brief overview
IBM DB2 Everyplace
Oracle 9i Lite
Sybase UltraLite
Tamino Mobile
Pointbase Micro
eXtremeDB http://www.mcobject.com/milaero.shtml

Differences:
– synchronisation with base stations
– application development (tools, platforms, …)

see: Mutschler, Specht: Mobile Datenbanksysteme, Springer 2004.
Application Architectures

Standard client-server: the DBS is separated from the application

[Figure: mobile clients, wireless network connection, middleware (synchronisation server), fixed network connection, DB server (application, DB)]

Systems
• IBM DB2 Everyplace
• Oracle 9i Lite

slides adapted from Mutschler / Specht
Application Architectures

Integrated mobile DB: DBS and application integrated; saves memory space, only the functions needed are linked

[Figure: mobile application with integrated database, wireless network connection, optional middleware, fixed network connection, DB server]

Systems
• Sybase UltraLite
• Pointbase Micro
• eXtremeDB (main-memory DB)
Database Engine

Typically relational. Functionality differs considerably.
Top end: DB2 Everyplace, a full-fledged DBS.
Systems configurable (100 to ~500 KB)
Synchronisation of ReplicaSynchronisationSynchronisation of Replicaof Replica
MobilerClient
Synchronisations-antwort
Mid-Tier-System
Source System
Spiegeldatenbank Quelldatenbank
InputQueue
Mirror Table
Source Table
Change Data Table1
2
3
4
DB2 Everyplace: Mirror-DB
Synchronisation of Replica

Oracle Lite
– snapshot based
– snapshot = materialized view

Full Refresh: transmit all tuples of the snapshot query
Fast Refresh: use snapshot logs
Force Refresh: mixed full / fast
Mobile Middleware

[Figure: client with Oracle Lite database; the Mobile Server (middleware) runs either standalone or as an MS module inside Oracle HTTP Server, Oracle 9i AS (WE), or Apache; Oracle 9i database on the server]

Middleware should be non-proprietary! How to connect a client with a server from a different vendor? → standards
SyncML Platform

Not only client/server sync, but also client/client.
Summary

• Replication is intended for availability rather than for throughput / response-time enhancement (more or less)
• Transactional guarantees are costly
• Atomicity and prevention of lost updates may be OK in many applications, i.e. isolation level Read Uncommitted ⇒ more update performance (e.g. asynchronous update propagation possible)
• Replication of tables with high-frequency updates does not make much sense (in general) ... except as backup
• Sophisticated (and confusing!) solutions by vendors
• Formidable task for the DB administrator: deciding when and what to replicate