CS-556: Distributed Systems Manolis Marazakis Inter-process Communication (III)
CS-556: Distributed Systems
description
Transcript of CS-556: Distributed Systems
Fall Semester 2005 CS556: Distributed Systems
The issue of Time in distributed systems
A quantity that we often have to measure accurately
necessary to synchronize a node’s clock with an authoritative external source of time
Eg: timestamps for electronic transactions both at merchant’s & bank’s computers auditing
An important theoretical construct in understanding how distributed executions unfold
Algorithms for several problems depend upon clock synchronization
timestamp-based serialization of transactions for consistent updates of distributed data
Kerberos authentication protocol elimination of duplicate updates
Fall Semester 2005 CS556: Distributed Systems
Clock Synchronization
When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
Fall Semester 2005 CS556: Distributed Systems
Fundamental limits
The notion of physical time is problematic in distributed systems: - limitations in our ability to timestamp events at different nodes sufficiently accurately to know the order in which any pair of events occurred, or whether they occurred simultaneously.
Fall Semester 2005 CS556: Distributed Systems
History of Process pi
e i e’ total ordering of events at process
Assuming that process executes on a single processor
history(pi) = hi = <ei0, ei
1, ei2, ... >
series of events that take place within pi
Hi(t): hardware clock value (by oscillator)
Ci(t): software clock value (generated by OS) Ci(t) = a Hi(t) + b
Eg: # nsec’s elapsed at time t since a reference time clock resolution: period bet. updates of Ci(t)
limit on determining order of events
Fall Semester 2005 CS556: Distributed Systems
Clock skew & driftSkew: instantaneous difference bet. readingsDrift: different rates of counting time physical variations of underlying oscillators variance with temperature Even extremely small differences accumulate
over a large number of oscillations leading to observable difference in the counters
drift rate: difference in reading bet. a clock and a nominal “perfect clock” per unit of time measured by the reference clock 10-6 seconds/sec for quartz crystals 10-7 - 10-8 seconds/sec for high precision quartz
crystals
Fall Semester 2005 CS556: Distributed Systems
UTC: Coordinated Universal Time
Atomic oscillators: drift rate ~ 10-13 seconds/second International Atomic Time (since 1967)
1 standard sec = 9,192,631,770 periods of transition for Cs133
Astronomical Time: years, seconds, ...UTC: 1 leap sec is occasionally inserted, or more rarely deleted, to keep in step with Astronomical Time time signals broadcasted from land-based
radio stations (WWV) and satelites (GPS) accuracy: 0.1-10 millisec (land-based), 1
microsec (GPS)
Fall Semester 2005 CS556: Distributed Systems
Synchronization of physical clocks
D: synchronization boundS: source of UTC time, t IExternal synchronization: |S(t) - Ci(t)| < D Clocks are accurate within the bound D
Internal synchronization: |Ci(t) - Cj(t)| < D Clocks agree within the bound D
external sync internal sync
Fall Semester 2005 CS556: Distributed Systems
Correctness of clocksHardware correctness: (1 - p)(t’ - t) H(t’) - H(t) (1 + p)(t’ - t) There can be no jumps in the value of H/W
clocksMonotonicity: t’ > t C(t’) > C(t) A clock only ever advances Even if a clock is running fast, we only need to
change at which updates are made to the time given to apps can be achieved in software: Ci(t) = a Hi(t) + b
Hybrid: monotonicity + drift rate bounded bet. sync.
points (where clock value can jump ahead)
Fall Semester 2005 CS556: Distributed Systems
Synchronous systemsP1 sends its local clock value t to P2 P2 can set its clock value to (t + Ttransmit) Ttransmit can be variable or unknown
resource competition bet. processes network congestion
u = (max - min) uncertainty in Ttransmit
obtained if P2 sets its clock to (t + min) or (t + max) If P2 sets its clock value to t + (max+min)/2,
then skew <= u/2
Optimal bound for N processes: u (1 - )N
1
In asynchronous systems:Ttransmit = min + x, where x 0Only the distribution of x may be measurable, for a given installation
Fall Semester 2005 CS556: Distributed Systems
Clock Synchronization Algorithms
The relation between clock time and UTC when clocks tick at different rates.
Fall Semester 2005 CS556: Distributed Systems
Cristian's AlgorithmGetting the current time from a time server.
Fall Semester 2005 CS556: Distributed Systems
Time servers: Christian’s algorithm
mr
mtp Time server,S
Tround := total round-trip timet := time value in message mt
estimate := (t + Tround /2)
Receiver of UTC signals
Time by S’s clock when reply msg arrives [t+min, t+Tround-min]
Accuracy: (Tround/2 - min)
Fall Semester 2005 CS556: Distributed Systems
Limitations of Cristian’s algorithm
Variability in estimate of Tround can be reduced by repeated requests to S
& taking the minimum value of Tround
Single point of failure group of synchronized time servers
multicast request & use only 1st reply obtained
Faulty clocks: f: #faulty clocks, N: #servers
N > 3f, for the correct clocks to achieve agreement
Malicious interference Protection by authentication techniques
Fall Semester 2005 CS556: Distributed Systems
The Berkeley algorithm (I)Gusella & Zatti (1989) Co-ordinator (master) periodically polls slaves
estimates each slave’s local clock (based on RTT) averages the values obtained (incl. its own clock value) ignores any occasional readings with RTT higher than max
Slaves are notified of the adjustment required This amount can be positive or negative Sending the updated current time would introduce further
uncertainty, due to message transmit delay Elimination of faulty clocks
averaging over clocks that do not differ from one another more than a specified amount
Election of new master, in case of failure no guarantee for election to complete in bounded time
Fall Semester 2005 CS556: Distributed Systems
The Berkeley Algorithm (II)
a) The time daemon asks all the other machines for their clock valuesb) The machines answerc) The time daemon tells everyone how to adjust their clock
Fall Semester 2005 CS556: Distributed Systems
Averaging algorithmsDivide time into fixed-length re-synchronization intervals: [T0 + iR, T0 + (i+1)R] At the beginning of an interval, each machine
broadcasts the current time according to its clock … and starts a local timer to collect all incoming broadcasts
during a time interval S When the broadcasts have been received, a new
time value is computed Average Average after discarding the m lowest and the m highest
values … tolerate up to m faulty machines
May also correct each value based on estimate of propagation time from the source machine
Fall Semester 2005 CS556: Distributed Systems
NTP: An Internet-scale time protocol
Sync. Accuracy: ~10s of milliseconds over Internet paths~ 1 millisecond on LANs
Statistical filtering of timing data discrimination based on quality of data from
different servers
Re-configurable inter-server connections logical hierarchy
Scalable for both clients & servers Clients can re-sync. frequently to offset drift
Authentication of trusted servers … and also validation of return addresses
Fall Semester 2005 CS556: Distributed Systems
NTP Synchronization Subnets
1
2
3
2
3 3
Primary servers
stratum
High stratum # server more liable to be less accurate
Node root RTT as a quality criterion
3 modes of synchronization:•multicast: acceptable for high-speed LAN•procedure-call: similar to Cristian’s algorithm •symmetric: between a pair of serversAll modes rely on UDP messages.
Fall Semester 2005 CS556: Distributed Systems
Message pairs bet. NTP peers (I)
Ti
Ti-1Ti -2
Ti - 3
Server B
Server A
Time
m m'
Time
Each message contains the local times when the previous message was sent & received, and the local time when thecurrent message was sent. •There can be a non-negligible delay bet. the arrival of onemessage & the dispatch of the next.• Messages may be lost
Offset oi : estimate of the actual offset bet. two clocks, as computed from a pair of messagesDelay di : total transmission time for the message pair
Fall Semester 2005 CS556: Distributed Systems
Message pairs bet. NTP peers (II)
T i-2 = T i - 3 + t + o, where o is the true offset
T i = T i - 1 + t’ - o
di = t + t’ = T i-2 - T i - 3 + Ti - T i - 1
o = oi + (t’ - t)/2
oi = (T i-2 - T i - 3 - Ti + T i - 1 ) / 2
oi - di / 2 o oi + di /2
Delay di is a measure of the accuracy of the estimate of offset
Fall Semester 2005 CS556: Distributed Systems
NTP data filtering & peer selection
Retain 8 most recent <oi, di > pairs compute “filter dispersion” metric
higher values less reliable data The estimate of offset with min. delay is chosen
Examine values from several peers look for relatively unreliable values
May switch the peer used primarily for sync.Peers with low stratum # are more favored “closer” to primary time sources
Also favored are peers with lowest sync. dispersion: sum of filter dispersions bet. peer & root of sync. subnet
May modify local clock update frequency wrt observed drift rate
Fall Semester 2005 CS556: Distributed Systems
Lamport’s notion of logical time
For many purposes, it is sufficient that all machines agree on the same time … Emphasis on internal consistency
If two processes do not interact, lack of synchronization will not be observable … and thus will not cause problems
Ordering of events is needed to avoid ambiguities
Fall Semester 2005 CS556: Distributed Systems
Lamport Timestamps
• 3 processes, each with its own clock. The clocks run at different rates.• Lamport's algorithm corrects the clocks.
Fall Semester 2005 CS556: Distributed Systems
Space-Time diagram representation of a distributed computation
p1
p2
p3
a b
c d
e f
m1
m2
Physicaltime
a b
c d
b c
d f
a f
Fall Semester 2005 CS556: Distributed Systems
The “happened-before” relation
We cannot synchronize clocks perfectly across a distributed system cannot use physical time to find out event order Lamport, 1978: “happened-before” partial
order (potential) causal ordering e i e’, for process Pi e e’ send(m) receive(m), for any message m e e’ and e’ e’’ e e’’ concurrent events: a // b
occur at different processes & chain of messages intervening between them
Fall Semester 2005 CS556: Distributed Systems
Totally-Ordered Multicasting
Updating a replicated database & leaving it in an inconsistent state.
Solution via multicast:
•Each msg is multicast, with timestamp= current (logical) time
•Recipient ACKs each message, via multicast•Each process puts received messages in its local queue, sortedaccording to the timestamp
•A process only delivers a msg when it is at the head and it has been ACKed by all processes
Fall Semester 2005 CS556: Distributed Systems
Lamport’s Logical Clocks (I)Per-process monotonically increasing counters Li := Li + 1, before each event is recorded at Pi
Clock value, t, is piggy-backed with messages Upon receiving <m ,t>, Pj updates its clock:
Lj :=max {Lj , t}, Lj := Lj + 1
Total order by taking into account process ID: (Ti, i) < (Tj, j) iff (Ti < Tj or (Ti = Tj and i < j) )
e e’ L(e) < L(e’)
Fall Semester 2005 CS556: Distributed Systems
Lamport’s Logical Clocks (II)
p1
p2
p3
a b
c d
e f
m1
m2
Physicaltime
L(b) > L(e), but b // e
Fall Semester 2005 CS556: Distributed Systems
FIFO delivery causal delivery
p1
p2
p3
a b
c d
e f
m2
m3
Physical
timem1
Fall Semester 2005 CS556: Distributed Systems
Hidden channelsThe relation captures the flow of data intervening bet. eventsData can flow in ways other than message passing !
p1
p2
p3
a
b
c d
m2
Physicaltimem1
a: pipe rapture, detected by sensor #1b: pressure drop, detected by sensor #2Controller (P3) increases heat (to increase
pressure), then receives notification of rapture.
The pipe acts as comm. channel
Fall Semester 2005 CS556: Distributed Systems
Vector ClocksMattern, 1989 & Fidge, 1991: clock := vector of N numbers (one per process) Vi [i] := Vi [i] + 1, before Pi timestamps an event Clock vector is piggybacked with messages When Pi receives <m ,t> :
Vi [j] := max{ t[j], Vi [j] }, for j=1, …, N
Vi [j], j i: #events that have occurred at Pj and has a (potential) effect on Pi
Vi [i]: #events that Pi has timestamped
e e’ V(e) < V(e’)