Fundamentals

CS60002: Distributed Systems

Transcript of Fundamentals

Page 1: Fundamentals

CS60002: Distributed Systems

Page 2: Fundamentals

Textbook etc.

No single textbook. Will follow for some time:

“Advanced Concepts in Operating Systems” by Mukesh Singhal and Niranjan G. Shivaratri

supplemented by copies of papers.

Will give materials from other books, papers etc. from time to time.

Page 3: Fundamentals

Introduction

Page 4: Fundamentals

Distributed System

A broad definition: a set of autonomous processes that communicate among themselves to perform some task.

Modes of communication:
- Message passing
- Shared memory

This includes a single machine with multiple communicating processes as well.

Page 5: Fundamentals

A more common definition: a network of autonomous computers that communicate by message passing to perform some task.

A practical distributed system may have both:
- Computers that communicate by messages
- Processes/threads on a computer that communicate by messages or shared memory

Page 6: Fundamentals

Advantages

- Resource sharing
- Higher throughput
- Handling inherent distribution in the problem structure
- Fault tolerance
- Scalability

Page 7: Fundamentals

Representing Distributed Systems

Graph representation:
- Nodes = processes
- Edges = communication links
- Links can be bidirectional (undirected graph) or unidirectional (directed graph)
- Links can have weights to represent different things (e.g. delay, length, bandwidth, …)
- Links in the graph may or may not correspond to physical links

Page 8: Fundamentals

Why are They Harder to Design?

Lack of global shared memory:
- Hard to find the global system state at any point

Lack of global clock:
- Events cannot be started at the same time
- Events cannot be ordered in time easily

Hard to verify and prove:
- Arbitrary interleaving of actions makes the system hard to verify
- The same problem exists for multi-process programs on a single machine
- Harder here due to communication delays

Page 9: Fundamentals

Example: Lack of Global Memory

Problem of Distributed Search:
- A set of elements distributed across multiple machines
- A query comes at any one machine A for an element X
- Need to search for X in the whole system

The sequential algorithm is very simple:
- Search and update are done on a single array in a single machine
- The no. of elements is also known, in a single variable

Page 10: Fundamentals

A distributed algorithm has more hurdles to solve:
- How to send the query to all other machines? Do all machines even know all other machines?
- How to get back the result of the search at each m/c?
- Handling updates (both add/delete of elements at a machine and add/remove of machines) adds more complexity

Main problem: there is no one place (global memory) that a machine can look up to see the current system state (what machines, what elements, how many elements).

Page 11: Fundamentals

Example: Lack of Global Clock

Problem of Distributed Replication:
- 3 machines A, B, C have copies of a data item X, say initialized to 1
- Queries/updates can happen at any m/c
- Need to make the copies consistent within a short time in case of an update at any one machine

Naïve algorithm:
- On an update, a machine sends the updated value to the other replicas
- A replica, on receiving an update, applies it

Page 12: Fundamentals

[Figure: concurrent updates X=2 and X=3 propagate among the three replicas; in this scenario the receiving node accepts X=2.]

Page 13: Fundamentals

[Figure: a second scenario in which the same node receives the updates X=2 and X=3 in the same order.]

What should this node do now? Reject X=2, right?

But it has received exactly the same messages in the same order.

But then, consider the following scenario.

This could easily be solved if all nodes had a synchronized global clock.

Page 14: Fundamentals

Models for Distributed Algorithms

Informally, the guarantees that one can assume the underlying system will (or will not!) give:

- Topology: completely connected, ring, tree, arbitrary, …
- Communication: shared memory/message passing (Reliable? Delay? FIFO? Broadcast/multicast? …)
- Synchronous/asynchronous
- Failure possible or not; what all can fail? Failure models (crash, omission, Byzantine, timing, …)
- Unique ids
- Other knowledge: no. of nodes, diameter

Page 15: Fundamentals

Fewer assumptions => weaker model.

A distributed algorithm needs to specify the model on which it is supposed to work. The model may not always match the underlying physical system.

[Figure: the model assumed vs. the physical system available; the gap between the assumption and the system needs to be implemented with h/w-s/w.]

Page 16: Fundamentals

Complexity Measures

- Message complexity: total no. of messages sent
- Communication complexity/bit complexity: total no. of bits sent
- Time complexity: for synchronous systems, no. of rounds; for asynchronous systems, different definitions exist
- Space complexity: total no. of bits needed for storage at all the nodes

Page 17: Fundamentals

Example: Distributed Search Again

Assume that all elements are distinct. The network is represented by a graph G with n nodes and m edges.

Model 1: asynchronous, completely connected topology, reliable communication

Algorithm:
- Send the query to all neighbors
- Wait for a reply from all, or till one node says Found
- A node, on receiving a query for X, does a local search for X and replies Found/Not found

Worst case message complexity = 2(n – 1) per query

Page 18: Fundamentals

Model 2: asynchronous, completely connected topology, unreliable communication

Algorithm:
- Send the query to all neighbors
- Wait for a reply from all, or till one node says Found
- A node, on receiving a query for X, does a local search for X and replies Found/Not found
- If no reply within some time, send the query again

Problems!
- How long to wait? There is no bound on message delay!
- A message can be lost again and again, so this still does not solve the problem.
- In fact, impossible to solve (may not terminate)!!

Page 19: Fundamentals

Model 3: synchronous, completely connected topology, reliable communication

Maximum one-way message delay = α
Maximum search time at each m/c = β

Algorithm:
- Send the query to all neighbors
- Wait for a reply from all for T = 2α + β, or till one node says Found
- A node, on receiving a query for X, does a local search for X and replies Found if found; does not reply if not found
- If no reply is received within T, return “Not found”

Message complexity = n – 1 if not found, n if found
Message complexity is reduced, possibly at the cost of more time

Page 20: Fundamentals

Model 4: asynchronous, reliable communication, but not completely connected

How to send the query to all?

Algorithm (first attempt):
- The querying node A sends the query for X to all its neighbors
- Any other node, on receiving the query for X, first searches for X. If found, it sends back “Found” to A. If not, it sends back “Not found” to A, and also forwards the query to all its neighbors other than the one it received it from (flooding)
- Eventually all nodes get it and reply

Message complexity – O(nm) (why?)

Page 21: Fundamentals

But are we done? Suppose X is not there. A gets many “Not found” messages. How does it know that all nodes have replied? (Termination Detection)

Let’s change (strengthen) the model. Suppose A knows n, the total number of nodes:
- A can now count the number of messages received. Terminate on at least one “Found” message, or n “Not found” messages
- Message complexity – O(nm)

Suppose A knows an upper bound on the network diameter, and the system is synchronous:
- Can be done with O(m) messages only

Can you do it without changing the model?

Page 22: Fundamentals

So which model to choose? Ideally, one as close to the physical system available as possible:
- The algorithm can directly run on the system
- It should be implementable on the physical system by additional h/w-s/w, e.g., reliable communication (say TCP) over an unreliable physical system

Sometimes, start with a strong model, then weaken it:
- Easier to design algorithms on a stronger model (more guarantees from the system)
- Helps in understanding the behavior of the system
- Can use this knowledge to then design algorithms on the weaker model

Page 23: Fundamentals

Some Fundamental Problems

- Ordering events in the absence of a global clock
- Capturing the global state
- Mutual exclusion
- Leader election
- Clock synchronization
- Termination detection
- Building structures: spanning tree, shortest path tree, …

Page 24: Fundamentals

Ordering of Events and Logical Clocks

Page 25: Fundamentals

Ordering of Events

Lamport’s Happened Before relationship: for two events a and b, a → b (a happened before b) if
- a and b are events in the same process and a occurred before b, or
- a is a send event of a message m and b is the corresponding receive event at the destination process, or
- a → c and c → b for some event c

Page 26: Fundamentals

a → b implies a is a potential cause of b.

Causal ordering captures potential dependencies; the “Happened Before” relationship causally orders events:
- If a → b, then a causally affects b
- If neither a → b nor b → a, then a and b are concurrent (a || b)

Page 27: Fundamentals

Logical Clock

Each process i keeps a clock Ci.

- Each event a in i is timestamped C(a), the value of Ci when a occurred
- Ci is incremented by 1 for each event in i
- In addition, if a is a send of message m from process i to j, then on receive of m,
  Cj = max(Cj, C(a) + 1)

Page 28: Fundamentals

Points to note:

• The increment amount can be any positive number, not necessarily 1

• if a → b, then C(a) < C(b)

• → is an irreflexive partial order

• Total ordering is possible by arbitrarily ordering concurrent events by process numbers (assuming process numbers are unique)
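The clock rules above can be sketched in a few lines (a minimal illustration; the class and method names are ours, not from the slides):

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # every local event (including a send) increments the clock
        self.time += 1
        return self.time

    def receive(self, msg_ts):
        # receive rule from the slides: Cj = max(Cj, C(a) + 1)
        self.time = max(self.time, msg_ts + 1)
        return self.time

ci, cj = LamportClock(), LamportClock()
ci.tick()                # a local event at i: C_i = 1
ts = ci.tick()           # send event a at i: C_i = 2, message stamped 2
cj.receive(ts)           # receive at j: C_j = max(0, 2 + 1) = 3
print(ci.time, cj.time)  # 2 3
```

Note the receive rule only jumps the receiver's clock forward; it never moves it back.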

Page 29: Fundamentals

Limitation of Lamport’s Clock

a → b implies C(a) < C(b)

BUT

C(a) < C(b) doesn’t imply a → b !!

So not a true clock !!

Page 30: Fundamentals

Solution: Vector Clocks

Ci is a vector of size n (no. of processes). C(a) is similarly a vector of size n.

Update rules:
• Ci[i]++ for every event at process i
• if a is a send of message m from i to j with vector timestamp tm, on receive of m:
  Cj[k] = max(Cj[k], tm[k]) for all k

Page 31: Fundamentals

For events a and b with vector timestamps ta and tb:
• ta = tb iff for all i, ta[i] = tb[i]
• ta ≠ tb iff for some i, ta[i] ≠ tb[i]
• ta ≤ tb iff for all i, ta[i] ≤ tb[i]
• ta < tb iff (ta ≤ tb and ta ≠ tb)
• ta || tb iff (neither ta < tb nor tb < ta)

Page 32: Fundamentals

a → b iff ta < tb

Events a and b are causally related iff ta < tb or tb < ta, else they are concurrent
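A compact sketch of the update rules and comparisons above (function names are illustrative, not from the slides):

```python
def local_event(clock, i):
    clock[i] += 1                        # C_i[i]++ for every event at i

def send_event(clock, i):
    clock[i] += 1
    return list(clock)                   # the vector timestamp travels with m

def receive_event(clock, i, ts):
    clock[i] += 1                        # the receive is itself an event at i
    for k in range(len(clock)):
        clock[k] = max(clock[k], ts[k])  # Cj[k] = max(Cj[k], tm[k])

def before(ta, tb):                      # a -> b iff ta < tb
    return all(x <= y for x, y in zip(ta, tb)) and ta != tb

def concurrent(ta, tb):
    return not before(ta, tb) and not before(tb, ta)

c0, c1 = [0, 0, 0], [0, 0, 0]
t_a = send_event(c0, 0)        # event a: process 0 sends
receive_event(c1, 1, t_a)      # event b: process 1 receives
t_b = list(c1)
local_event(c0, 0)             # event c at process 0, unrelated to b
t_c = list(c0)

print(before(t_a, t_b))        # True:  a -> b
print(concurrent(t_b, t_c))    # True:  b || c
```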

Page 33: Fundamentals

Causal ordering of messages: Application of vector clocks

Delivery in Causal Order: if send(m1) → send(m2), then every recipient of both messages m1 and m2 must “deliver” m1 before m2.

“Deliver” – when the message is actually given to the application for processing.

Page 34: Fundamentals

Birman-Schiper-Stephenson Protocol

To broadcast m from process i, increment Ci[i], and timestamp m with VTm = Ci.

When j ≠ i receives m, j delays delivery of m until:
- Cj[i] = VTm[i] – 1, and
- Cj[k] ≥ VTm[k] for all k ≠ i

Delayed messages are queued at j, sorted by vector time. Concurrent messages are sorted by receive time.

When m is delivered at j, Cj is updated according to the vector clock rule.
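The delay condition can be written as a small predicate (a sketch; the function name is ours):

```python
def can_deliver(Cj, VTm, i):
    """BSS delivery test at j for a broadcast m from i (illustrative sketch)."""
    # m must be the very next broadcast from i ...
    if Cj[i] != VTm[i] - 1:
        return False
    # ... and j must already have seen everything m causally depends on
    return all(Cj[k] >= VTm[k] for k in range(len(Cj)) if k != i)

Cj = [2, 0, 1]                        # j has delivered 2 broadcasts from 0, 1 from 2
print(can_deliver(Cj, [3, 0, 1], 0))  # True: the next broadcast from process 0
print(can_deliver(Cj, [3, 0, 2], 0))  # False: depends on an undelivered message
```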

Page 35: Fundamentals

Problem of Vector Clock

Message size increases since each message needs to be tagged with the vector

Size can be reduced in some cases by only sending values that have changed

Page 36: Fundamentals

Capturing Global State

Page 37: Fundamentals

Global State Collection

Applications: checking “stable” properties, checkpointing & recovery, …

Issues:
- Need to collect both node and channel states
- The system cannot be stopped
- No global clock

But what is a global state??

Page 38: Fundamentals

Some Notations

- LSi : local state of process i
- send(mij) : send event of message mij from process i to process j
- rec(mij) : similar, receive instead of send
- time(x) : time at which state x was recorded
- time(send(m)) : time at which send(m) occurred

Page 39: Fundamentals

send(mij) ∈ LSi iff time(send(mij)) < time(LSi)

rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj)

transit(LSi, LSj) = { mij | send(mij) ∈ LSi and rec(mij) ∉ LSj }

inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi and rec(mij) ∈ LSj }

Page 40: Fundamentals

Global state: a collection of local states, GS = {LS1, LS2, …, LSn}

GS is consistent iff for all i, j, 1 ≤ i, j ≤ n, inconsistent(LSi, LSj) = Φ

GS is transitless iff for all i, j, 1 ≤ i, j ≤ n, transit(LSi, LSj) = Φ

GS is strongly consistent if it is consistent and transitless.

Note that channel state may be specified explicitly in a global state, or implicitly in node states using transit().
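To make the definitions concrete, here is a small sketch where each local state is modeled simply as the sets of ids of messages from i to j sent and received before it was recorded (the modeling choice is ours, not from the slides):

```python
def transit(LS_i, LS_j):
    sends_i, _ = LS_i
    _, recvs_j = LS_j
    return sends_i - recvs_j     # sent in LS_i but not yet received in LS_j

def inconsistent(LS_i, LS_j):
    sends_i, _ = LS_i
    _, recvs_j = LS_j
    return recvs_j - sends_i     # received in LS_j but never sent in LS_i

LS1 = ({"m1"}, set())            # process 1 recorded after sending m1
LS2 = (set(), {"m1", "m2"})      # process 2 recorded after receiving m1 and m2

print(inconsistent(LS1, LS2))    # {'m2'}: the global state {LS1, LS2} is not consistent
print(transit(LS1, LS2))         # set(): nothing from 1 to 2 in transit
```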

Page 41: Fundamentals

Chandy-Lamport’s Algorithm

Uses special marker messages.

One process acts as initiator and starts the state collection by following the marker sending rule below.

Marker sending rule for process P:
- P records its state; then, for each outgoing channel C from P on which a marker has not been sent already, P sends a marker along C before any further message is sent on C

Page 42: Fundamentals

When Q receives a marker along a channel C:

- If Q has not recorded its state, then Q records the state of C as empty; Q then follows the marker sending rule

- If Q has already recorded its state, it records the state of C as the sequence of messages received along C after Q’s state was recorded and before Q received the marker along C

Page 43: Fundamentals

Points to Note

- Markers sent on a channel distinguish the messages sent on the channel before the sender recorded its state from the messages sent after the sender recorded its state
- The state collected may not be any state that actually happened in reality, but rather a state that “could have” happened
- Requires FIFO channels
- The network should be strongly connected (it obviously also works for connected, undirected networks)
- Message complexity O(|E|), where E = set of links

Page 44: Fundamentals

Lai and Young’s Algorithm

- Similar to Chandy-Lamport’s, but does not require FIFO
- A boolean value X at each node: False indicates the state is not recorded yet, True indicates recorded
- The value of X is piggybacked on every application message
- The value of X distinguishes pre-snapshot and post-snapshot messages, similar to the marker
- Requires a log of all messages sent before the state is recorded

Page 45: Fundamentals

Mutual Exclusion

Page 46: Fundamentals

Mutual Exclusion

Very well understood in shared memory systems.

Requirements:
- at most one process in the critical section (safety)
- if more than one process is requesting, someone enters (liveness)
- a requesting process enters within a finite time (no starvation)
- requests are granted in order (fairness)

Page 47: Fundamentals

Classification of Distributed Mutual Exclusion Algorithms

Non-token based/permission based:
- A node takes permission from all/a subset of other nodes before entering the critical section
- Permission from all processes: e.g. Lamport, Ricart-Agrawala, Roucairol-Carvalho etc.
- Permission from a subset: e.g. Maekawa

Token based:
- A single token in the system
- A node enters the critical section if it has the token
- Algorithms differ in how the token is circulated, e.g. Suzuki-Kasami

Page 48: Fundamentals

Some Complexity Measures

- No. of messages per critical section entry
- Synchronization delay
- Response time
- Throughput

Page 49: Fundamentals

Lamport’s Algorithm

Every node i has a request queue qi, which keeps requests sorted by logical timestamps (total ordering enforced by including the process id in the timestamps).

To request the critical section:
- send a timestamped REQUEST (tsi, i) to all other nodes
- put (tsi, i) in its own queue

On receiving a request (tsi, i):
- send a timestamped REPLY to the requesting node i
- put the request (tsi, i) in the queue

Page 50: Fundamentals

To enter the critical section:
- i enters the critical section if (tsi, i) is at the top of its own queue, and i has received a message (any message) with timestamp larger than (tsi, i) from ALL other nodes

To release the critical section:
- i removes its request from its own queue and sends a timestamped RELEASE message to all other nodes
- On receiving a RELEASE message from i, i’s request is removed from the local request queue
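The entry condition can be checked with a small predicate (a sketch with illustrative data structures; timestamps are simplified to plain integers here):

```python
def can_enter(queue, i, latest_ts, n):
    """Lamport's entry test at node i (illustrative sketch).

    queue     : list of (timestamp, node) requests, kept sorted
    latest_ts : latest_ts[j] = highest timestamp of ANY message seen from j
    """
    if not queue or queue[0][1] != i:
        return False                        # own request not at the head
    my_ts = queue[0][0]
    return all(latest_ts[j] > my_ts for j in range(n) if j != i)

q = sorted([(3, 0), (5, 1)])
print(can_enter(q, 0, {1: 4, 2: 6}, 3))     # True: at head, everyone replied later
print(can_enter(q, 0, {1: 2, 2: 6}, 3))     # False: nothing newer from node 1 yet
```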

Page 51: Fundamentals

Some points to note

- The purpose of the REPLY messages from node i to j is to ensure that j knows of all requests of i prior to the sending of the REPLY (and therefore, possibly, any request of i with timestamp lower than j’s request)
- Requires FIFO channels
- 3(n – 1) messages per critical section invocation
- Synchronization delay = max. message transmission time
- Requests are granted in order of increasing timestamps

Page 52: Fundamentals

Ricart-Agrawala Algorithm

Improvement over Lamport’s. Main idea: node j need not send a REPLY to node i if j has a request with timestamp lower than the request of i (since i cannot enter before j anyway in this case).

- Does not require FIFO
- 2(n – 1) messages per critical section invocation
- Synchronization delay = max. message transmission time
- Requests are granted in order of increasing timestamps

Page 53: Fundamentals

To request the critical section:
- send a timestamped REQUEST message (tsi, i)

On receiving a request (tsi, i) at j:
- send a REPLY to i if j is neither requesting nor executing the critical section, or if j is requesting and i’s request timestamp is smaller than j’s request timestamp. Otherwise, defer the request.

To enter the critical section:
- i enters the critical section on receiving a REPLY from all nodes

To release the critical section:
- send a REPLY to all deferred requests
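The reply rule above is a three-way decision; as a minimal sketch (names are ours, not from the slides):

```python
def should_reply(state, own_req_ts, incoming_ts):
    """Ricart-Agrawala reply decision at node j (illustrative sketch).

    state: 'idle' | 'requesting' | 'in_cs'
    Timestamps are (logical_time, node_id) tuples, so ties break by node id.
    """
    if state == 'idle':
        return True                        # neither requesting nor executing
    if state == 'requesting':
        return incoming_ts < own_req_ts    # smaller timestamp has priority
    return False                           # in critical section: defer

print(should_reply('requesting', (5, 2), (3, 1)))  # True: incoming is older
print(should_reply('requesting', (3, 1), (5, 2)))  # False: defer
print(should_reply('in_cs', (3, 1), (5, 2)))       # False: defer until release
```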

Page 54: Fundamentals

Roucairol-Carvalho Algorithm

Improvement over Ricart-Agrawala. Main idea: once i has received a REPLY from j, it does not need to send a REQUEST to j again unless it sends a REPLY to j (in response to a REQUEST from j).

- The no. of messages required varies between 0 and 2(n – 1) depending on the request pattern
- Worst case message complexity is still the same

Page 55: Fundamentals

Maekawa’s Algorithm

Permission is obtained from only a subset of other processes, called the Request Set (or Quorum). There is a separate Request Set Ri for each process i.

Requirements:
- for all i, j: Ri ∩ Rj ≠ Φ
- for all i: i ∈ Ri
- for all i: |Ri| = K, for some K
- any node i is contained in exactly D Request Sets, for some D

K = D = sqrt(N) for Maekawa’s algorithm

Page 56: Fundamentals

A simple version

To request the critical section:
- i sends a REQUEST message to all processes in Ri

On receiving a REQUEST message:
- send a REPLY message if no REPLY message has been sent since the last RELEASE message was received, and update status to indicate that a REPLY has been sent; otherwise, queue up the REQUEST

To enter the critical section:
- i enters the critical section after receiving a REPLY from all nodes in Ri

Page 57: Fundamentals

To release the critical section:
- send a RELEASE message to all nodes in Ri

On receiving a RELEASE message:
- send a REPLY to the next node in the queue and delete that node from the queue; if the queue is empty, update status to indicate that no REPLY message has been sent

Page 58: Fundamentals

Message complexity: 3*sqrt(N)

Synchronization delay = 2 * (max. message transmission time)

Major problem: DEADLOCK possible

Three more types of messages (FAILED, INQUIRE, YIELD) are needed to handle deadlock. Message complexity can then be 5*sqrt(N).

Building the request sets?
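One simple way to build request sets with the required intersection property is a grid construction; note this gives |Ri| = 2*sqrt(N) − 1, i.e. O(sqrt(N)) rather than Maekawa's exact sqrt(N) (which uses finite projective planes), so it is a sketch of the idea, not Maekawa's exact sets:

```python
def grid_quorum(i, k):
    """Request set of node i: its row plus its column in a k x k grid."""
    row, col = divmod(i, k)
    return {row * k + c for c in range(k)} | {r * k + col for r in range(k)}

k = 3                                   # N = k*k = 9 nodes
R = [grid_quorum(i, k) for i in range(k * k)]

print(len(R[0]))                        # 5 = 2k - 1
# every pair of request sets intersects: row i meets column j in one cell
print(all(R[i] & R[j] for i in range(k * k) for j in range(k * k)))  # True
```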

Page 59: Fundamentals

Token based Algorithms

- A single token circulates; a node enters the CS when the token is present
- No FIFO required
- Mutual exclusion is obvious
- Algorithms differ in how to find and get the token
- Sequence numbers, rather than timestamps, are used to differentiate between old and current requests

Page 60: Fundamentals

Suzuki-Kasami Algorithm

Broadcast a request for the token. The process with the token sends it to the requestor if it does not need it.

Issues:
- Current vs. outdated requests
- Determining sites with pending requests
- Deciding which site to give the token to

Page 61: Fundamentals

The token:
- A FIFO queue Q of requesting processes
- LN[1..n], where LN[j] is the sequence number of the request that j executed most recently

The request message:
- REQUEST(i, k): request message from node i for its kth critical section execution

Other data structures:
- RNi[1..n] for each node i, where RNi[j] is the largest sequence number received so far by i in a REQUEST message from j

Page 62: Fundamentals

To request the critical section:
- If i does not have the token, increment RNi[i] and send REQUEST(i, RNi[i]) to all nodes
- If i already has the token, enter the critical section if the token is idle (no pending requests), else follow the rule to release the critical section

On receiving REQUEST(i, sn) at j:
- set RNj[i] = max(RNj[i], sn)
- if j has the token and the token is idle, send it to i if RNj[i] = LN[i] + 1; if the token is not idle, follow the rule to release the critical section

Page 63: Fundamentals

To enter the critical section:
- enter the CS if the token is present

To release the critical section:
- set LN[i] = RNi[i]
- for every node j not in Q (in the token), add node j to Q if RNi[j] = LN[j] + 1
- if Q is non-empty after the above, delete the first node from Q and send the token to that node
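The RN/LN comparison that separates current from outdated requests is just one line (a sketch; the function name is ours):

```python
def has_pending_request(RN, LN, j):
    # j's latest request (RN[j]) is exactly one past the request it last
    # executed (LN[j], carried in the token); anything else is old/served.
    return RN[j] == LN[j] + 1

RN = [0, 3, 1]     # largest request numbers seen from each node
LN = [0, 2, 1]     # request numbers last executed (token state)
print(has_pending_request(RN, LN, 1))   # True: request no. 3 still unserved
print(has_pending_request(RN, LN, 2))   # False: outdated/already served
```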

Page 64: Fundamentals

Points to note:

No. of messages: 0 if node holds the token already, n otherwise

Synchronization delay: 0 (node has the token) or max. message delay (token is elsewhere)

No starvation

Page 65: Fundamentals

Raymond’s Algorithm

Forms a directed tree (logical) with the token-holder as root

Each node has variable “Holder” that points to its parent on the path to the root. Root’s Holder variable points to itself

Each node i has a FIFO request queue Qi

Page 66: Fundamentals

To request the critical section:
- send a REQUEST to the parent on the tree, provided i does not currently hold the token and Qi is empty; then place the request in Qi

When a non-root node j receives a REQUEST from i:
- place the request in Qj
- send a REQUEST to the parent if no previous REQUEST has been sent

Page 67: Fundamentals

When the root r receives a REQUEST:
- place the request in Qr
- if the token is idle, follow the rule for releasing the critical section (shown later)

When a node receives the token:
- delete the first entry from the queue
- send the token to that node (maybe itself)
- set the Holder variable to point to that node
- if the queue is non-empty, send a REQUEST message to the parent (the node pointed at by the Holder variable)

Page 68: Fundamentals

To execute the critical section:
- enter if the token has been received and the own entry is at the top of the queue; delete the entry from the queue

To release the critical section:
- if the queue is non-empty, delete the first entry from the queue, send the token to that node and make the Holder variable point to that node
- if the queue is still non-empty, send a REQUEST message to the parent (the node pointed at by the Holder variable)

Page 69: Fundamentals

Points to note:

Avg. message complexity O(log n)

Sync. delay (T log n)/2, where T = max. message delay

Page 70: Fundamentals

Leader Election

Page 71: Fundamentals

Leader Election in Rings

Models:
- Synchronous or asynchronous
- Anonymous (no unique ids) or non-anonymous (unique ids)
- Uniform (no knowledge of n, the number of processes) or non-uniform (knows n)

Known impossibility result: there is no deterministic, synchronous, non-uniform leader election protocol for anonymous rings.

Page 72: Fundamentals

Election in Asynchronous Rings

Lelann-Chang-Roberts Algorithm:
- send own id to the node on the left
- if an id is received from the right, forward it to the left node only if the received id is greater than own id, else ignore it
- if own id is received, declare itself “leader”

Works on unidirectional rings.
Worst case message complexity = O(n^2)
Average case message complexity = O(n log n)
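A small simulation of the algorithm (the round-by-round loop is only a convenience for simulating message delivery; the algorithm itself is asynchronous):

```python
def lcr_leader(ids):
    """Simulate Lelann-Chang-Roberts on a unidirectional ring (sketch)."""
    n = len(ids)
    pending = [[ids[i]] for i in range(n)]   # ids each node sends this round
    while True:
        nxt = [[] for _ in range(n)]
        for i in range(n):
            for uid in pending[i]:
                dest = (i + 1) % n            # the node "on the left"
                if uid == ids[dest]:
                    return uid                # own id came back: leader
                if uid > ids[dest]:
                    nxt[dest].append(uid)     # forward only larger ids
                # smaller ids are ignored
        pending = nxt

print(lcr_leader([3, 7, 2, 5]))   # 7: the maximum id always wins
```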

Page 73: Fundamentals

Hirschberg-Sinclair Algorithm:
- Operates in phases; requires a bidirectional ring
- In the kth phase, a node sends its own id up to distance 2^k on both sides of itself (messages carry the id and k, and are relayed hop by hop; direct sends go only to the immediate neighbors)
- If an id is received, forward it if the received id is greater than own id, else ignore it
- The last process in the chain sends a reply to the originator if its own id is less than the received id
- Replies are always forwarded
- A process goes to the (k+1)th phase only if it receives a reply from both sides in the kth phase
- A process receiving its own id declares itself “leader”

Page 74: Fundamentals

Message complexity: O(n log n)

Lots of other algorithms exist for rings.

Lower bound result: any comparison-based leader election algorithm in a ring requires Ω(n log n) messages. What if it is not comparison-based?

Page 75: Fundamentals

Leader Election in Arbitrary Networks

FloodMax:
- Synchronous, round-based
- In each round, each process sends the max. id seen so far (not necessarily its own) to all its neighbors
- After a number of rounds equal to the diameter, if the max. id seen = own id, the process declares itself leader
- Complexity = O(d·m), where d = diameter of the network, m = no. of edges
- Does not extend to the asynchronous model trivially

Variations build different types of spanning trees with no pre-specified roots; the root chosen at the end is the leader.
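A round-based simulation of FloodMax (a sketch; the graph and ids are made up for illustration):

```python
def floodmax(adj, ids, diameter):
    """Simulate synchronous FloodMax; returns the self-declared leaders."""
    best = dict(ids)                            # max id seen so far per node
    for _ in range(diameter):                   # one iteration per round
        new = dict(best)
        for u in adj:
            for v in adj[u]:
                new[v] = max(new[v], best[u])   # u sends its current max to v
        best = new
    return [u for u in adj if best[u] == ids[u]]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # a path graph; diameter 3
ids = {0: 5, 1: 9, 2: 2, 3: 7}
print(floodmax(adj, ids, 3))    # [1]: only the node holding the max id declares
```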

Page 76: Fundamentals

Clock Synchronization

Page 77: Fundamentals

Clock Synchronization

- Multiple machines with physical clocks: how can we keep them more or less synchronized?
- Internal vs. external synchronization
- Perfect synchronization is not possible because of communication delays
- Even synchronization within a bound cannot be guaranteed with certainty, because of the unpredictability of communication delays
- But still useful!! Ex. – Kerberos, GPS

Page 78: Fundamentals

How clocks work

- Computer clocks are crystals that oscillate at a certain frequency
- Every H oscillations, the timer chip interrupts once (a clock tick); the no. of interrupts per second is typically 18.2, 50, 60, or 100, can be higher, and is settable in some cases
- The interrupt handler increments a counter that keeps track of the no. of ticks from a reference in the past (the epoch)
- Knowing the no. of ticks per second, we can calculate the year, month, day, time of day etc.

Page 79: Fundamentals

Clock Drift

Unfortunately, the period of crystal oscillation varies slightly. If it oscillates faster, there are more ticks per real second, so the clock runs faster; similarly for slower clocks.

For machine p, when the correct reference time is t, let the machine clock show the time as C = Cp(t).

Ideally, Cp(t) = t for all p, t. In practice,

1 – ρ ≤ dC/dt ≤ 1 + ρ

where ρ = max. clock drift rate, usually around 10^-5 for cheap oscillators.

Drift => skew between clocks (the difference in the clock values of two machines)

Page 80: Fundamentals

Resynchronization

Periodic resynchronization is needed to offset skew.

If two clocks are drifting in opposite directions, the max. skew after time t is 2ρt.

If the application requires that clock skew < δ, then the resynchronization period must satisfy

r < δ / (2ρ)

Usually ρ and δ are known.
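A quick worked example of the bound (the δ value is an assumed application requirement, not from the slides):

```python
rho = 1e-5      # max drift rate of a cheap oscillator (from the slides)
delta = 0.01    # assumed requirement: skew must stay below 10 ms

# two clocks drifting apart accumulate skew at rate 2*rho, so:
r = delta / (2 * rho)
print(r)        # ~500 seconds between resynchronizations
```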

Page 81: Fundamentals

Cristian’s Algorithm

- One m/c acts as the time server
- Each m/c sends a message periodically (within the resync. period r) asking for the current time
- The time server replies with its time
- The sender sets its clock to the reply

Problems:
- message delay
- the time server’s time may be less than the sender’s current time

Page 82: Fundamentals

Handling message delay: try to estimate the time the message carrying the time server’s time took to reach the sender:
- Measure the round trip time and halve it
- Make multiple measurements of the round trip time, discard values that are too high, take the average of the rest
- Make multiple measurements and take the minimum
- Use knowledge of the processing time at the server, if known

Handling fast clocks:
- Do not set the clock backwards; slow it down over a period of time to bring it in tune with the server’s clock
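The round-trip estimate can be sketched as follows (function and variable names are ours; the symmetric-delay assumption is the one discussed above):

```python
def estimate_time(t_send, t_recv, server_time):
    """Cristian-style estimate: assume the reply took half the round trip."""
    rtt = t_recv - t_send
    return server_time + rtt / 2

# client asked at local time 100.0 and got the reply at 100.8,
# carrying the server's time 104.0; the estimate is about 104.4
print(estimate_time(100.0, 100.8, 104.0))
```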

Page 83: Fundamentals

Berkeley Algorithm

- Centralized as in Cristian’s, but the time server is active
- The time server asks for the time of the other m/cs at periodic intervals
- The time server averages the times and sends the new time to the m/cs
- The m/cs set their time (advancing immediately or slowing down gradually) to the new time
- Transmission delay is estimated as before

Page 84: Fundamentals

External Synchronization

Clocks must be synchronized with real time

Cristian’s algorithm can be used if the time server is synchronized with real time somehow

Berkeley algorithm cannot be used

But what is “real time” anyway?

Page 85: Fundamentals

Measurement of time

Astronomical:
- traditionally used
- based on the earth’s rotation around its axis and around the sun
- solar day: the interval between two consecutive transits of the sun
- solar second: 1/86,400 of a solar day
- the period of the earth’s rotation varies, so the solar second is not stable
- mean solar second: average the length of a large no. of solar days, then divide by 86,400

Page 86: Fundamentals

Atomic:
- based on the transitions of the Cesium 133 atom
- 1 sec. = time for 9,192,631,770 transitions
- about 50+ labs maintain Cesium clocks
- International Atomic Time (TAI): mean no. of ticks of the clocks since Jan 1, 1958
- highly stable
- but slightly off-sync with the mean solar day (since the solar day is getting longer)
- a leap second is inserted occasionally to bring it in sync (so far 32, all positive)
- the resulting clock is called UTC – Universal Coordinated Time

Page 87: Fundamentals

UTC time is broadcast from different sources around the world, e.g.:

- National Institute of Standards & Technology (NIST) – runs radio stations, the most famous being WWV; anyone with a proper receiver can tune in
- United States Naval Observatory (USNO) – supplies time to all defense sources, among others
- National Physical Laboratory in the UK
- GPS satellites
- Many others

Page 88: Fundamentals

NTP: Network Time Protocol
- Protocol for time synchronization in the internet
- Hierarchical architecture
- Primary time servers (stratum 1) synchronize to national time standards via radio, satellite etc.
- Secondary servers and clients (stratum 2, 3, …) synchronize to primary servers in a hierarchical manner (stratum 2 servers sync. with stratum 1, stratum 3 with stratum 2, etc.)

Page 89: Fundamentals

- Reliability ensured by redundant servers
- Communication by multicast (usually among LAN servers), symmetric mode (usually among multiple geographically close servers), or client-server (to higher stratum servers)
- Complex algorithms to combine and filter times
- Synchronization possible to within tens of milliseconds for most machines
- But just a best-effort service, no guarantees
- See http://www.ntp.org for more details

Page 90: Fundamentals

Termination Detection

Page 91: Fundamentals

Termination Detection

Model:
- processes can be active or idle
- only active processes send messages
- an idle process can become active on receiving a computation message
- an active process can become idle at any time
- termination: all processes are idle and no computation messages are in transit
- A global snapshot can also be used to detect termination

Page 92: Fundamentals

Huang’s Algorithm

- One controlling agent, which has weight 1 initially
- All other processes are idle initially and have weight 0
- Computation starts when the controlling agent sends a computation message to a process
- An idle process becomes active on receiving a computation message
- B(DW) – computation message with weight DW; can be sent only by the controlling agent or an active process
- C(DW) – control message with weight DW, sent by active processes to the controlling agent when they are about to become idle

Page 93: Fundamentals

Let the current weight at a process = W.

1. Send of B(DW):
• Find W1, W2 such that W1 > 0, W2 > 0, W1 + W2 = W
• Set W = W1 and send B(W2)

2. Receive of B(DW):
• W += DW
• if idle, become active

3. Send of C(DW):
• send C(W) to the controlling agent
• become idle

4. Receive of C(DW) (at the controlling agent):
• W += DW
• if W = 1, declare “termination”
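The weight bookkeeping can be sketched with exact fractions, so the weights provably add back to 1 (the process names and the halving split are our illustrative choices; any W1, W2 > 0 with W1 + W2 = W would do):

```python
from fractions import Fraction

weights = {"agent": Fraction(1), "p": Fraction(0), "q": Fraction(0)}

def send_B(src, dst):
    # rules 1 and 2: the sender splits off part of its weight,
    # and the receiver absorbs it (becoming active)
    half = weights[src] / 2
    weights[src] -= half
    weights[dst] += half

def become_idle(proc):
    # rules 3 and 4: an idling process returns all its weight to the agent
    weights["agent"] += weights[proc]
    weights[proc] = Fraction(0)

send_B("agent", "p")          # computation spreads ...
send_B("p", "q")
become_idle("q")              # ... and winds down
become_idle("p")
print(weights["agent"] == 1)  # True: the agent declares termination
```

Exact rationals (rather than floats) matter here: termination is declared only when the agent's weight is exactly 1.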

Page 94: Fundamentals

Building Spanning Trees

Page 95: Fundamentals

Building Spanning Trees

Applications:
- Broadcast
- Convergecast
- Leader election

Two variations:
- from a given root r
- the root is not given a priori

Page 96: Fundamentals

Flooding Algorithm

- Starts from a given root r
- r initiates by sending message M to all neighbors, and sets its own parent to nil
- For all other nodes, on receiving M from i for the first time, set parent to i and send M to all neighbors except i; ignore any M received after that
- The tree built is an arbitrary spanning tree

Message complexity = 2m – (n – 1), where m = no. of edges
Time complexity ??
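A small simulation confirming the message count (the graph is made up for illustration; the BFS-style queue only simulates delivery order and does not affect the count):

```python
from collections import deque

def flood(adj, root):
    """Flooding spanning-tree construction; returns (parent map, messages)."""
    parent = {root: None}
    msgs = 0
    todo = deque([(root, None)])
    while todo:
        u, frm = todo.popleft()
        for v in adj[u]:                # send M to all neighbors except parent
            if v == frm:
                continue
            msgs += 1
            if v not in parent:         # first receipt: v joins the tree
                parent[v] = u
                todo.append((v, u))     # later receipts of M are ignored
    return parent, msgs

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # n = 4 nodes, m = 4 edges
parent, msgs = flood(adj, 0)
print(msgs, 2 * 4 - (4 - 1))   # 5 5: matches 2m - (n - 1)
```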

Page 97: Fundamentals

Constructing a DFS tree with a given root

A plain parallelization of the sequential algorithm by introducing synchronization:
- Each node i has a set unexplored, initially containing all neighbors of i
- A node i (initiated by the root) considers the nodes in unexplored one by one, sending a neighbor j a message M and then waiting for a response (parent or reject) before considering the next node in unexplored
- If j has already received M from some other node, j sends a reject to i

Page 98: Fundamentals

- Else, j sets i as its parent, and considers the nodes in its own unexplored set one by one
- j sends a parent message to i only when it has considered all nodes in its unexplored set
- i then considers the next node in its unexplored set
- The algorithm terminates when the root has received a parent or reject message from all its neighbours
- Worst case no. of messages = 4m
- Time complexity O(m)
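Because each node waits for a response before probing its next unexplored neighbor, the distributed algorithm builds exactly the tree a sequential DFS would; a sketch of that sequential analogue (graph and names are illustrative):

```python
def dfs_tree(adj, root):
    """Sequential analogue of the distributed DFS-tree construction."""
    parent = {root: None}
    def explore(u):
        for v in adj[u]:            # u's 'unexplored' set, in order
            if v not in parent:     # an already-visited v would send 'reject'
                parent[v] = u       # v sets u as its parent ...
                explore(v)          # ... and finishes before u continues
    explore(root)
    return parent

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(dfs_tree(adj, 0))   # {0: None, 1: 0, 2: 1, 3: 2}
```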