DISTRIBUTED COMPUTING


Page 1: DISTRIBUTED COMPUTING

DISTRIBUTED COMPUTING

Fall 2005

Page 2: DISTRIBUTED COMPUTING

ROAD MAP: OVERVIEW

• Why are distributed systems interesting?

• Why are they hard?

Page 3: DISTRIBUTED COMPUTING

GOALS OF DISTRIBUTED SYSTEMS

Take advantage of the cost/performance difference between microprocessors and shared memory multiprocessors.

Build systems:
1. with a single system image
2. with higher performance
3. with higher reliability
4. for less money than uniprocessor systems

In wide-area distributed systems, information and work are physically distributed, implying that computing needs should be distributed. Besides improving response time, this contributes to political goals such as local control over data.

Page 4: DISTRIBUTED COMPUTING

WHY SO HARD?

A distributed system is one in which each process has imperfect knowledge of the global state.

Reasons: Asynchrony and failures

We discuss problems that these two features raise and algorithms to address these problems.

Then we discuss implementation issues for real distributed systems.

Page 5: DISTRIBUTED COMPUTING

ANATOMY OF A DISTRIBUTED SYSTEM

A set of asynchronous computing devices connected by a network. Normally, no global clock.

Communication is either through messages or shared memory. Shared memory is usually harder to implement.

Page 6: DISTRIBUTED COMPUTING

ANATOMY OF A DISTRIBUTED SYSTEM (cont.)

EACH PROCESSOR HAS ITS OWN CLOCK + ARBITRARY NETWORK

BROADCAST MEDIUM

Special protocols will be possible for the broadcast medium.

Page 7: DISTRIBUTED COMPUTING

COURSE GOALS

1. To help you understand which system assumptions are important.

2. To present some interesting and useful distributed algorithms and methods of analysis, and then have you apply them under challenging conditions.

3. To explore the sources for distributed intelligence.

Page 8: DISTRIBUTED COMPUTING

BASIC COMMUNICATION PRIMITIVE: MESSAGE PASSING

Paradigm:
– Send message to destination
– Receive message from origin

Nice property: can make distribution transparent, since it does not matter whether destination is at a local computer or at a remote one (except for failures).

Clean framework: “Paradigms for Process Interaction in Distributed Programs,” G. R. Andrews, ACM Computing Surveys 23:1 (March 1991) pp. 49-90.

Page 9: DISTRIBUTED COMPUTING

BLOCKING (SYNCHRONOUS) VS. NON-BLOCKING (ASYNCHRONOUS) COMMUNICATION

For sender: Should the sender wait for the receiver to receive a message or not?

For receiver: When arriving at a reception point and there is no message waiting, should the receiver wait or proceed? Blocking receive is normal (i.e., the receiver waits).

Page 10: DISTRIBUTED COMPUTING

[Diagram: with a blocking send, the sender does no computation until the receiver's ACK arrives; with a non-blocking send, the sender proceeds immediately and the ACK is optional.]

Page 11: DISTRIBUTED COMPUTING

REMOTE PROCEDURE CALL

Client calls the server using a call: server(in parameters; out parameters). The call can appear anywhere that a normal procedure call can.

Server returns the result to the client.

Client blocks while waiting for the response from the server.

[Diagram: the client issues the call to the server; the server returns the result.]

Page 12: DISTRIBUTED COMPUTING

RENDEZVOUS FACILITY

– One process sends a message to another process and blocks at least until that process accepts the message.
– The receiving process blocks when it is waiting to accept a request.

Thus, the name: only when both processes are ready for the data transfer do they proceed.

We will see examples of rendezvous interactions in CSP and Ada.

[Diagram: the sender's send blocks until the receiver executes accept; the sender resumes once the message has been accepted.]

Page 13: DISTRIBUTED COMPUTING

Beyond send-receive: Conversations

Needed when a continuous connection is more efficient and/or only some data is available at a time.

Bob and Alice: Bob initiates, Alice responds, then Bob, then Alice, …

But what if Bob wants Alice to send messages as they arrive without Bob’s doing more than an ack?

Send-only or receive-only mode. Others?

Page 14: DISTRIBUTED COMPUTING

SEPARATION OF CONCERNS

Separation of concerns is the software engineering principle that each component should have a single small job to do so it can do it well.

In distributed systems, there are at least three concerns having to do with remote services: what to request, where to do it, how to ask for it.

Page 15: DISTRIBUTED COMPUTING

IDEAL SEPARATION

• What to request: application programmer must figure this out, e.g. access customer database.

• Where to do it: application programmer should not need to know where, because this adds complexity and, if the location changes, applications break.

• How to ask for it: want a uniform interface.

Page 16: DISTRIBUTED COMPUTING

WHERE TO DO IT: ORGANIZATION OF CLIENTS AND SERVERS

A service is a piece of work to do. Will be done by a server.

A client who wants a service sends a message to a service broker for that service. The server gets work from the broker and commonly responds directly to the client. A server is a process.

More basic approach: Each server has a port from which it can receive requests.

Difference: In client-broker-server model, many servers can offer the same service. In direct client-server approach, client must request a service from a particular server.

[Diagram: clients send requests to a service broker, which hands the work to one of several servers.]

Page 17: DISTRIBUTED COMPUTING

ALTERNATIVE: NAME SERVER

A service is a piece of work to do. Will be done by a server. The name server knows where services are done.

Example: Client requests the address of a server from the name server and then communicates directly with that server.

Difference: Client-server communication is direct, so it may be more efficient.

[Diagram: clients look up a server's address at the name server, then communicate with that server directly.]

Page 18: DISTRIBUTED COMPUTING

HOW TO ASK FOR IT: OBJECT-BASED

• Encapsulation of data behind functional interface.

• Inheritance is optional but interface is the contract.

• So need a technique for both synchronous and asynchronous procedure calls.

Page 19: DISTRIBUTED COMPUTING

REFERENCE EXAMPLE: CORBA OBJECT REQUEST BROKER

• Send operation to ORB with its parameters.

• ORB routes operation to proper site for execution.

• Arranges for the response to be sent to you directly or indirectly.

• Operations can be “events” so can allow interrupts from servers to clients.

Page 20: DISTRIBUTED COMPUTING

SUCCESSORS TO CORBA Microsoft Products

• COM: allow objects to call one another in a centralized setting: classes + objects of those classes. Can create objects and then invoke them.

• DCOM: COM + Object Request Broker.

• ActiveX: DCOM for the Web.

Page 21: DISTRIBUTED COMPUTING

SUCCESSORS TO CORBA Java RMI

• Remote Method Invocation (RMI): Define a service interface in Java.

• Register the server in RMI repository, i.e., an object request broker.

• Client may access Server through repository.

• Notion of distributed garbage collection

Page 22: DISTRIBUTED COMPUTING

SUCCESSORS TO CORBA Enterprise Java Beans

• Beans are again objects but can be customized at runtime.

• Support distributed transaction notion (later) as well as backups.

• So the transaction notion for persistent storage is another concern that it is nice to separate.

Page 23: DISTRIBUTED COMPUTING

REDUCING BUREAUCRACY: automatic registration

• SUN also developed an abstraction known as JINI.

• New device finds a lookup service (like an ORB), uploads its interface, and then everyone can access.

• No need to register.

• Requires a trusted environment.

Page 24: DISTRIBUTED COMPUTING

COOPERATING DISTRIBUTED SYSTEMS: LINDA

• Linda supports a shared data structure called a tuple space.

• Linda tuples, like database system records, consist of strings and integers. We will see that in the matrix example below.

[Diagram: processes surrounding a shared tuple space.]

Page 25: DISTRIBUTED COMPUTING

LINDA OPERATIONS

The operations are out (add a tuple to the space); in (read and remove a tuple from the space); and read (read but don’t remove a tuple from the tuple space).

A pattern-matching mechanism is used so that tuples can be extracted selectively by specifying values or data types of some fields.

in (“dennis”, ?x, ?y, ….)

– gets a tuple whose first field contains “dennis,” and assigns the values in the second and third fields of the tuple to x and y, respectively.

Page 26: DISTRIBUTED COMPUTING

EXAMPLE: MATRIX MULTIPLICATION

There are two matrices A and B. We store A’s rows and B’s columns as tuples.

(“A”, 1, A’s first row), (“A”, 2, A’s second row) ….

(“B”, 1, B’s first column), (“B”, 2, B’s second column) ….

(“Next”, 15)

There is a global counter called Next in the range 1 .. number of rows of A x number of columns of B.

A process performs an “in” on Next, records the value, and performs an “out” on Next+1, provided Next is still in its range.

Convert Next into the row number i and column number j such that Next = i x total number of columns + j.

Page 27: DISTRIBUTED COMPUTING

ACTUAL MULTIPLICATION

First find i and j.

in (“Next”, ?temp);

out (“Next”, temp +1);

convert (temp, i, j);

Given i and j, a process just reads the values and outputs the result.

read (“A”, i, ?row_values)

read (“B”, j, ?col_values)

out (“result”, i, j, Dotproduct(row_values, col_values)).

Page 28: DISTRIBUTED COMPUTING

LINDA IMPLEMENTATION OF SHARED TUPLE SPACE

The implementers assert that the work represented by the tuples is large enough so that there is no need for shared memory hardware.

The question is how to implement out, in, and read (as well as inp and readp).

Page 29: DISTRIBUTED COMPUTING

BROADCAST IMPLEMENTATION 1

Implement out by broadcasting the argument of out to all sites. (Use a negative acknowledgement protocol for the broadcast.)

To implement read, perform the read from the local memory.

To implement in, perform a local read and then attempt to delete the tuple from all other sites.

If several sites perform an in, only one site should succeed.

One approach is to have the site originating the tuple decide which site deletes.

Summary: good for reads and outs, not so good for ins.


Page 30: DISTRIBUTED COMPUTING

BROADCAST IMPLEMENTATION 2

Implement out by writing locally.

Implement in and read by a global query. (This may have to be repeated if the data is not present.)

Summary: better for out. Worse for read. Same for in.


Page 31: DISTRIBUTED COMPUTING

COMMUNICATION REVIEW

Basic distributed communication when no shared memory: send/receive.

Location transparency: broker or name server or tuple space.

Synchrony and asynchrony are both useful (e.g. real-time vs. informational sensors).

Other mechanisms are possible

Page 32: DISTRIBUTED COMPUTING

COMMUNICATION BY SHARED MEMORY: beyond locks

Framework: Herlihy, Maurice. “Impossibility and Universality Results for Wait-Free Synchronization,” ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), 1988.

In a system that uses mutual exclusion, it is possible that one process may stop while holding a critical resource and hang the entire system.

It is of interest to find “wait-free” primitives, in which no process ever waits for another one.

The primitive operations include test-and-set, fetch-and-add, and fetch-and-cons.

Herlihy shows that certain operations are strictly more powerful than others for wait-free synchronization.

Page 33: DISTRIBUTED COMPUTING

CAN MAKE ANYTHING WAIT-FREE (at a time price)

Don’t maintain the data structure at all. Instead, just keep a history of the operations.

enq(x):
    put enq(x) on end of history list (fetch-and-cons)
end enq(x)

deq:
    put deq on end of history list (fetch-and-cons)
    “replay the array” and figure out what to return
end deq

Not extremely practical: the deq takes O(number of deq’s + number of enq’s) time.

Suggestion is to have certain operations reconstruct the state in an efficient manner.

Page 34: DISTRIBUTED COMPUTING

GENERAL METHOD: COMPARE-AND-SWAP

Compare-and-swap takes two values: v and v’. If the register’s current value is v, it is replaced by v’, otherwise it is left unchanged. The register’s old value is returned.

temp := compare-and-swap (register, 0, i)
    if register = 0 then register := i
    else register is unchanged

Use this primitive to perform atomic updates to a data structure.

In the following figure, what should the compare-and-swap do?

Page 35: DISTRIBUTED COMPUTING

PERSISTENT DATA STRUCTURES AND WAIT-FREEDOM

One node added, one node removed. To establish change, change the current pointer. Old tree would still be available.

Important point: If process doing change should abort, then no other process is affected.

[Diagram: the original tree and the updated copy sharing unchanged nodes; the current pointer is switched to the new root to establish the change, while the old tree remains available.]

Page 36: DISTRIBUTED COMPUTING

LAMPORT “Time, Clocks” paper

• What is the proper notion of time for Distributed Systems?
• Time Is a Partial Order
• The Arrow Relation
• Logical Clocks
• Ordering All Events using a tie-breaking Clock
• Achieving Mutual Exclusion Using This Clock
• Correctness
• Criticisms
• Need for Physical Clocks
• Conditions for Physical Clocks
• Assumptions About Clocks and Messages
• How Do We Achieve Physical Clock Goal?

Page 37: DISTRIBUTED COMPUTING

ROAD MAP: TIME ACCORDING TO LAMPORT

[Road map diagram: how to model time in distributed systems; languages & constructs for synchronization.]

Page 38: DISTRIBUTED COMPUTING

TIME

Assuming there are no failures, the most important difference between distributed systems and centralized ones is that distributed systems have no natural notion of global time.

– Lamport was the first who built a theory around accepting this fact.

– That theory has proven to be surprisingly useful, since the partial order that Lamport proposed is enough for many applications.

Page 39: DISTRIBUTED COMPUTING

WHAT LAMPORT DOES

1. Paper (reference on next slide) describes a message-based criterion for obtaining a time partial order.

2. It converts this time partial order to a total order.

3. It uses the total order to solve the mutual exclusion problem.

4. It describes a stronger notion of physical time and gives an algorithm that sometimes achieves it (depending on quality of local clocks and message delivery).

Page 40: DISTRIBUTED COMPUTING

NOTIONS OF TIME IN DISTRIBUTED SYSTEMS

– Distributed system consists of a collection of distinct processes, which are spatially separated. (Each process has a unique identifier.)

– Communicate by exchanging messages.

– Messages arrive in the order they are sent. (Could be achieved by hand-shaking protocol.)

– Consequence: Time is partial order in distributed systems. Some events may not be ordered.

Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, vol. 21, no. 7 (July 1978).

Page 41: DISTRIBUTED COMPUTING

THE ARROW (partial order) RELATION

We say A happens before B, or A → B, if:

1. A and B are in the same process and A happens before B in that process. (Assume processes are sequential.)

2. A is the sending of a message at one process and B is the receiving of that message at another process; then A → B.

3. There is a C such that A → C and C → B.

In the jargon, → is an irreflexive partial ordering.

Page 42: DISTRIBUTED COMPUTING

LOGICAL CLOCKS

Clocks are a way of assigning a number to an event. Each process has its own clock.

For now, clocks will have nothing to do with real time, so they can be implemented by counters with no actual timing mechanism.

Clock condition: For any events A and B, if A → B, then C(A) < C(B).

Page 43: DISTRIBUTED COMPUTING

IMPLEMENTING LOGICAL CLOCKS

• Each process increments its local clock between any two successive events.

• Each process puts its local time on each message that it sends.

• Each process changes its clock C to C’ when it receives message m having timestamp T. Require that C’> max(C, T).

Page 44: DISTRIBUTED COMPUTING

IMPLEMENTATION OF LOGICAL CLOCKS

Receiver clock jumps to 14 because of timestamp on message received.

Receiver clock is unaffected by the timestamp associated with sent message, because receiver’s clock is already 18, so greater than message timestamp.

[Diagram: a message stamped 13 arrives at a process whose clock reads 8, so the receiver's clock jumps to 14; another message stamped 13 arrives at a process whose clock already reads 18, which simply advances to 19.]

Page 45: DISTRIBUTED COMPUTING

ORDERING ALL EVENTS

We want to define a total order ⇒.

Suppose two events occur in the same process; then they are ordered by the first condition.

Suppose A and B occur in different processes, i and j. Use process ids to break ties:

LC(A) = C(A)|i, A’s clock value concatenated with i. LC(B) = C(B)|j.

The total ordering is called the Lamport clock.

Page 46: DISTRIBUTED COMPUTING

ACHIEVING MUTUAL EXCLUSION USING THIS CLOCK

Goals:

1. Only one process can hold the resource at a time.

2. Requests must be granted in the order in which they are made.

Assumption: Messages arrive in the order they are sent. (Remember, this can be achieved by handshaking.)

Page 47: DISTRIBUTED COMPUTING

ALGORITHM FOR MUTUAL EXCLUSION

1. To request the resource, Pi sends the message “request resource” to all other processes along with Pi’s local Lamport timestamp T. It also puts that message on its own request queue.

2. When a process receives such a request, it acknowledges the message. (Unless it has already sent a message to Pi timestamped later than T.)

3. Releasing the resource is analogous to requesting, but doesn’t require an acknowledgement.

[Diagram: Pi broadcasts REQUEST to Pk and Pj; each replies with an Ack; Pi executes; Pi then broadcasts RELEASE, for which no acknowledgement is needed.]

Page 48: DISTRIBUTED COMPUTING

USING THE RESOURCE

Process Pi starts using the resource when:

i. its own request on its local request queue has the earliest Lamport timestamp T (consistent with ); and

ii. it has received a message (either an acknowledgement or some other message) from every other process with a timestamp larger than T.

Page 49: DISTRIBUTED COMPUTING

CORRECTNESS

Theorem: Mutual exclusion and first-requested, first-served are achieved.

Proof

Suppose Pi and Pj are both using the resource at the same time and have timestamps Ti and Tj.

Suppose Ti < Tj. Then Pj must have received i’s request, since it has received at least one message with a timestamp greater than Tj from Pi and since messages arrive in the order they are sent. But then Pj would not execute its request. Contradiction.

First-requested, first-served. If Pi requests the resource before Pj (in the → sense), then Ti < Tj, so Pi will win.

Page 50: DISTRIBUTED COMPUTING

CRITICISMS

• Many messages. If only one process is using the resource, it still must send messages to many other processes.

• If one process stops, then all processes hang (no wait freedom; could we achieve it?)

Page 51: DISTRIBUTED COMPUTING

Is there a Wait-Free Variant?

• Modify resource locally and then send to everyone. If nobody objects, then new resource value is good.

• Difficulty: how to make it so that a single atomic wait-free operation can install the update to the resource?

Page 52: DISTRIBUTED COMPUTING

NEED FOR PHYSICAL CLOCKS

Time as a partial order is the most frequent assumption in distributed systems; however, it is sometimes important to have a physical notion of time.

Example: Going outside the system. Person X starts a program A, then calls Y on the telephone, who then starts program B. We would like A → B.

But that may not be true for Lamport clocks, because they are sensitive only to inter-computer messages. Physical clocks try to account for event orderings that are external to the system.

[Diagram: X starts A and then calls Y; Y receives the call from X and starts B. We would like “starts A” → “starts B”, but this may not be true with → as defined so far.]

Page 53: DISTRIBUTED COMPUTING

CONDITIONS FOR PHYSICAL CLOCKS

• Suppose u is the smallest time, through internal or external means, that one process can be informed of an event occurring at another process. That is, u is the smallest transmission time. (Distance/speed of light?)

• Suppose we have a global time t (all processes are in same frame of reference) that is unknown to any process.

Goal for physical clocks: Ci(t + u) > Cj(t) for any i, j.

This ensures that if A happens before B, then the clock time for B will be after the clock time for A.

Page 54: DISTRIBUTED COMPUTING

ASSUMPTIONS ABOUT CLOCKS AND MESSAGES

1. Clock drift. In one unit of global time, Ci will advance between 1-k and 1+k time units. (k << 1)

2. A message can be sent in some minimum time v with a possible additional delay of at most e.

Page 55: DISTRIBUTED COMPUTING

HOW DO WE ACHIEVE PHYSICAL CLOCK GOAL?

• Can’t always do so, e.g., can’t synchronize quartz watches using the U.S. post office.

• Basic algorithm: Periodically (to be determined), each process sends out timestamped messages.

• Upon receiving a message from Pi timestamped Ti, process Pj sets its own timestamp to max(Ti + v, Tj).

Page 56: DISTRIBUTED COMPUTING

WHAT ALGORITHM ACCOMPLISHES

Simplifying to the essence of the idea, suppose there are two processes i and j and i sends a message that arrives at global time t.

After possibly resetting its timestamp, process j ensures that

Cj(t) ≥ Ci(t) + v – (e+v)x(1+k)

That is, since i sent its message at local time Ti, i’s clock may have advanced (e+v)x(1+k) time units to Ti+(e+v)x(1+k) time. At the least Cj(t) ≥ Ti+v.

How good can synchronization be, given e, v, k?

Page 57: DISTRIBUTED COMPUTING

ROAD MAP: SOME FUNDAMENTAL PROTOCOLS

[Road map diagram: languages & constructs for synchronization; time in distributed systems; protocols built on asynchronous networks (achieving a global state).]

Page 58: DISTRIBUTED COMPUTING

PROTOCOLS

Asynchrony and distributed centers of processing give rise to various problems:

1. Find a spanning tree in a network.

2. When does a collection of processes terminate?

3. Find a consistent state of a distributed system, i.e., some analogue to a photographic “snapshot”.

4. Establish a synchronization point. This will allow us to implement parallel algorithms that work in rounds on a distributed asynchronous system.

5. Find the shortest path from a given node s to every other node in the network.

Page 59: DISTRIBUTED COMPUTING

MODEL

Links are bidirectional. A message traverses a single link.

All nodes have distinct ids. Each node knows its immediate neighbors.

Messages incur arbitrary but finite delay.

FIFO discipline on links, i.e., messages are received in the order they are sent.

Page 60: DISTRIBUTED COMPUTING

PRELIMINARY: ESTABLISH A SPANNING TREE

Some node establishes itself as the leader (e.g. node x establishes a spanning tree for its broadcasts, so x is root).

That node sends out a “request for children” to its neighbors in the graph.

When a node n receives “request for children” from node m:

    if m is the first node that sent n this message, then n responds “ACK and you’re my parent” and sends “request for children” to its other neighbors; else n responds “ACK”.

Each node except the root has one parent, and every node is in the tree. A leaf is a node that only received ACKs from neighbors to which it sent requests.

Page 61: DISTRIBUTED COMPUTING

TERMINATING THE SPANNING TREE

A node that determines that it is a leaf sends up an “I’m all done” message to its parent.

Each non-root parent sends an “I’m all done” message to its parent once it has received such message from all its children.

When the root receives “I’m all done” from its children, then it is done.

Page 62: DISTRIBUTED COMPUTING

BROADCAST WITH FEEDBACK

A given node s would like to pass message X to all other nodes in the network and be informed that all nodes have received the message.

Algorithm:

Construct the spanning tree and then send X along the tree.

Have each node send an acknowledgement to its parent after it has received an acknowledgement from its children.

Page 63: DISTRIBUTED COMPUTING

PROCESS TERMINATION

Def: Detect the completion of a collection of non-interacting tasks, each of which is performed on a distinct processor.

When a leaf finishes its computation, it sends an “I am terminated” message to its parent.

When an internal node completes and has received an “I am terminated” message from all of its children, it sends such a message to its parent.

When the root completes and has received an “I am terminated” message from all of its children, all tasks have been completed.


Page 64: DISTRIBUTED COMPUTING

DISTRIBUTED SNAPSHOTS

Intuitively, a snapshot is a freezing of a distributed computation at the “same time.”

Given a snapshot, it is easy to detect stable conditions such as deadlock.

(A deadlock condition doesn’t go away. If a deadlock held in the past and nothing has been done about it, then it still holds. That makes it stable.)

Page 65: DISTRIBUTED COMPUTING

FORMAL NOTION OF SNAPSHOT

Assume that each processor has a local clock, which is incremented after the receipt of and processing of each incoming message, e.g., a Lamport clock. (Processing may include transmitting other messages.)

A collection of local times {tk | k ∈ N}, where N denotes the set of nodes, constitutes a snapshot if each message received by node j from node i prior to tj has been sent by i prior to ti.

A message sent by i before ti but not received by j before tj is said to be in transit.

The correctness criterion is that no message sent after the snapshot be received before the snapshot. (Such a thing could never happen if the snapshot time at every site were a single global time.)

Page 66: DISTRIBUTED COMPUTING

DISTRIBUTED SNAPSHOTS

[Diagram: three situations for a message from i to j, where ti and tj are the snapshot times for processes i and j. Situation I: sent before ti and received before tj (OK). Situation II: sent after ti but received before tj (BAD). Situation III: sent before ti but received after tj (in transit, OK).]

Page 67: DISTRIBUTED COMPUTING

ALGORITHM

Node i enters its snapshot time either spontaneously or upon receipt of a “flagged” message, whichever comes first. In either case, it sends out a “flagged” token and advances the clock to what becomes its snapshot time ti.

Messages sent later are after the snapshot.

This algorithm allows each node to determine when all messages in transit have been received.

That is, when a node receives a flagged token from all its neighbors, then it has received all messages in transit.

[Diagram: whether triggered spontaneously or by receiving a flagged token, a node sends out flagged tokens and sets its snapshot time.]

Page 68: DISTRIBUTED COMPUTING

SNAPSHOT PROTOCOL IS CORRECT

Remember that we must prevent a node i from receiving a message before its snapshot time, ti, that was sent by a node j after its snapshot time, tj.

But any message sent after the snapshot will follow the flagged token at the receiving site because of the FIFO discipline on links. So, bad case cannot happen.

Page 69: DISTRIBUTED COMPUTING

SYNCHRONIZER

It is often much easier to design a distributed protocol when the underlying system is synchronous.

In synchronous systems, computation proceeds in “rounds”. Messages are sent at the beginning of the round and arrive before the end of the round. The beginning of each round is determined by a global clock.

A synchronizer enables a protocol designed for a synchronous system to run on an asynchronous one.

Page 70: DISTRIBUTED COMPUTING

PROTOCOL FOR SYNCHRONIZER

1. Round manager broadcasts “round n begin”. Each node transmits the messages of the round.

2. Each node then sends its flagged token and records that time as the snapshot time. (Snapshot tokens are numbered to distinguish the different rounds.)

3. Each node receives messages along each link until it receives the flagged token.

4. Nodes perform non-interfering termination back to the manager after they have received a token from all neighbors.

Page 71: DISTRIBUTED COMPUTING

MINIMUM-HOP PATHS

The task is to obtain the paths with the smallest number of links from a given node s to each other node in the network.

Suppose the network is synchronous. In the first round s sends to its neighbors. In the second round, the neighbors send to their neighbors. And so it continues.

When node i receives id s for the first time, it designates the link on which s has arrived as the first link on the shortest path to s.

Use the synchronization protocol to simulate rounds.

[Diagram: a sample network on nodes S, A, B, C, D, E, F and one possible shortest-path tree rooted at S.]

Page 72: DISTRIBUTED COMPUTING

MINIMUM-HOP PATHS

The round-by-round approach takes a long time. Can you think of an asynchronous approach that takes less time?

How do you know when you’re done?


Page 73: DISTRIBUTED COMPUTING

CLUSTERING

K-means clustering:
Choose k centroids from the set of points at random.
Then assign each point to the cluster of the nearest centroid.
Then recompute the centroid of each cluster and start over.

Why does this converge?
A Lyapunov function is a function of the state of an algorithm that decreases whenever the state changes and that is bounded from below.

With sequential k-means the sum of the distances always decreases.

Can't get lower than zero.

Page 74: DISTRIBUTED COMPUTING

WHY DOES DISTANCE DECREASE?

Well, when you readjust the mean, it decreases for that set.
When you reassign, every distance gets smaller still.
So every step reduces the total distance.

How do we do this for a distributed asynchronous system?

What if you have rounds?

What if you don’t?

Page 75: DISTRIBUTED COMPUTING

SETI-at-Home Style Projects?

SETI stands for the search for extra-terrestrial intelligence. It consists of testing radio signal receptions for some regularity.

Ideal distributed system project: master sends out work. Servers do work.

Servers may crash. What to do?

Servers may be dishonest. What to do?

Page 76: DISTRIBUTED COMPUTING

BROADCAST PROTOCOLS

Often it is important to send a message to a group of processes in an all-or-nothing manner. That is, either all non-failing processes should receive the message or none should.

This is called atomic broadcast

Assumptions:

1. fail-stop processes
2. messages are received from one process to another in the order they are sent

Page 77: DISTRIBUTED COMPUTING

ATOMIC (UNORDERED) BROADCAST PROTOCOL

• Application: Update all copies of a replicated data item

• Initiator: Send message m to all destination processes

• Destination process: When receiving m for the first time, send it to all other destinations


Page 78: DISTRIBUTED COMPUTING

Fault-Tolerant Broadcasts

• Reference: “A Modular Approach to Fault-Tolerant Broadcasts and Related Problems” Vassos Hadzilacos and Sam Toueg.

• Describes reliable broadcast, FIFO broadcast, causal broadcast and ordered broadcast.

Page 79: DISTRIBUTED COMPUTING

Stronger Broadcasts

• FIFO broadcast: Reliable broadcast that guarantees that messages broadcast by the same sender are received in the order they were broadcast.

• A bit more precise: If a process broadcasts a message m before it broadcasts a message m’, then no correct process accepts m’ unless it has previously accepted m. (Might buffer a message before accepting it.)

Page 80: DISTRIBUTED COMPUTING

Problems with FIFO

• Network news application, where users distribute their articles with FIFO Broadcast. User A broadcasts an article.

• User B, at a different site, accepts that article and broadcasts a response that can only be understood by a user who has already seen the original article.

• User C accepts B’s response before accepting the original article from A and so misinterprets the response.

Page 81: DISTRIBUTED COMPUTING

Causal Broadcast

• Causal broadcast: If the broadcast of m causally precedes the broadcast of m’ (in the sense of Lamport ordering), then m must be accepted everywhere before m’

• Does this solve the previous problem?

Page 82: DISTRIBUTED COMPUTING

Problems with Causal Broadcast

• Consider a replicated database with two copies of a bank account x residing in different sites. Initially, x has a value of 100. A user deposits 20, triggering a broadcast of “add 20 to x” to the two copies of x.

• At the same time, at a different site, the bank initiates a broadcast of “add 10 percent interest to x”. Not causally related, so Causal Broadcast allows the two copies of x to accept these update messages in different orders.

Page 83: DISTRIBUTED COMPUTING

THE NEED FOR ORDERED BROADCAST

In the causal protocol, it is possible for two updaters at different sites to send their messages in a different order to the various processes, so the sequence won’t be consistent.

[Diagram: updaters U1 and U2 broadcast messages m and m' to sites A and B; A receives m before m' while B receives m' before m, so the copies diverge.]

Page 84: DISTRIBUTED COMPUTING

Total Order (Atomic Broadcast)

• If correct processes p and q both accept messages m and m’, then p accepts m before m’ if and only if q accepts m before m’.

Page 85: DISTRIBUTED COMPUTING

DALE SKEEN’S ORDERED BROADCAST PROTOCOL

Idea is to assign each broadcast a global logical timestamp and deliver messages in the order of timestamps.

• As before, the initiator sends message m to all receiving processes (maybe not all).

• Receiver process marks m as undelivered (keeps m in a buffer) and sends a proposed timestamp that is larger than any timestamp that the site has already proposed or received.

Timestamps are made unique by attaching the site’s identifier as low-order bits. Time advances at each process based on Lamport clocks.

Page 86: DISTRIBUTED COMPUTING

SKEEN’S ORDERED BROADCAST PROTOCOL (cont.)

[Diagram of the exchange:
1. The initiator sends message m to the receivers.
2. Each receiver replies with a proposed timestamp (e.g., 17), based on its local Lamport time.
3. The initiator takes the max of the proposals (e.g., 29) and sends it back as the final timestamp.
4. Each receiver forgets its proposed timestamp for m, waits until m’s final timestamp is the minimum of the proposed or final timestamps it holds, accepts m, and then forgets m’s timestamp.]

Page 87: DISTRIBUTED COMPUTING

CORRECTNESS

Theorem: m and m’ will be accepted in same order at all common sites.

Proof steps:

–Every two final timestamps will be different.

– If TS(m) < TS(m’), then any proposed timestamp for m is < TS(m’); TS(m) is the final timestamp for m.

Page 88: DISTRIBUTED COMPUTING

QUESTIONS TO ENSURE UNDERSTANDING

Find an example showing that changing the Skeen protocol in any one of the following ways would yield an incorrect protocol.

1. The timestamps at different sites could be the same.

2. The initiator chooses the minimum (instead of the maximum) proposed timestamp as the final timestamp.

3. Sites accept messages as soon as they become deliverable.

Page 89: DISTRIBUTED COMPUTING

ORDER-PRESERVING BROADCAST PROTOCOLS ON BROADCAST NET

• Proposes a virtual distributed system that implements ordered atomic broadcast and failure detection.

• Shows that this makes designing the rest of the system easier.
• Shows that implementing these two primitives isn’t so hard.

Paradigm: find an appropriate intermediate level of abstraction that can be implemented and that facilitates the higher functions.

Build Facilities that use Broadcast Network.

Implement Atomic Broadcast Network.

Framework: Chang, Jo-Mei. “Simplifying Distributed Database Systems Design by Using a Broadcast Network,” ACM SIGMOD, June 1984.

Page 90: DISTRIBUTED COMPUTING

RATIONALE

• Use property of current networks, which are naturally broadcast, although not so reliable.

• Common tasks of distributed systems: Send same information to many sites participating in a transaction (update all copies); reach agreement (e.g. transaction commitment).

Page 91: DISTRIBUTED COMPUTING

DESCRIPTION OF ABSTRACT MACHINE

Services and assurances it provides:

• Atomic broadcast: failure atomicity. If a message is received by an application program at one site, it will be received at all operational sites.

• System-wide clock and all messages are timestamped in sequence. This is the effective message order.

Assumptions: Failures are fail-stop, not malicious. So, for example, the token site will not lie about messages or sequence numbers.

Network failures require extra memory.

Page 92: DISTRIBUTED COMPUTING

CHANG SCHEME

Tools: Token-passing scheme + positive acknowledgments + negative acknowledgements.

[Diagram: the sender broadcasts a message; the token site increments its counter and returns an acknowledgement carrying the counter value, which commits the message.]

Page 93: DISTRIBUTED COMPUTING

BEAUTY OF NEGATIVE ACKNOWLEDGMENT

How does a site discover that it hasn’t received a message?

Non-token site knows that it has missed a message if there is a gap in the counter values that it has received. In that case, it requests that information from the token site (negative ack).

Overhead: one positive acknowledgment per broadcast message vs. one acknowledgment per site per message in naïve implementation.

Page 94: DISTRIBUTED COMPUTING

TOKEN TRANSFER

Token transfer is a standard message. The target site must acknowledge. To become a token site, the target site must guarantee that it has received all messages since the last time it was a token site.

Detect failure at a non-token site, when it fails to accept token responsibility.

[Diagram: the token site sends “here is the token”; the target site replies “I can take it” and becomes the token site.]

Page 95: DISTRIBUTED COMPUTING

REVISIT ASSUMPTIONS

Sites do not lie about their state (i.e., no malicious sites; could use authentication).

Sites tell you when they fail (e.g. through redundant circuitry) or by not responding.

If there is a network partition, then no negative ack would occur, so must keep message m around until everyone has acquired the token after m was sent.

Page 96: DISTRIBUTED COMPUTING

ROAD MAP: COMMIT PROTOCOLS

All-or-nothing commitment of transactions. Avoid partial updates. Fail-stop failures.

Recovery of data following fail-stop failures.

Page 97: DISTRIBUTED COMPUTING

THE NEED

Scenario: Transaction manager (representing user) communicates with several database servers.

Main problem is to make the commit atomic (i.e., either all sites commit the transaction or none do).

Page 98: DISTRIBUTED COMPUTING

NAÏVE (INCORRECT) ALGORITHM

RESULT: INCONSISTENT STATE

[Diagram: the TM sends Commit to both servers; one server replies Done (inventory incremented) while the other replies No (cash not decremented), leaving an inconsistent state.]

Page 99: DISTRIBUTED COMPUTING

TWO-PHASE COMMIT: PHASE 1

– Transaction manager asks all servers whether they can commit.
– Upon receipt, each able server saves all updates to stable storage and responds yes.

If server cannot say yes (e.g., because of a concurrency control problem), then it says no. In that case, it can immediately forget the transaction. Transaction manager will abort the transaction at all sites.

[Diagram: the TM sends Prepare to each server; each server replies Yes.]

Page 100: DISTRIBUTED COMPUTING

TWO-PHASE COMMIT: PHASE 2

– If all servers say yes, then the transaction manager writes a commit record to stable storage and tells them all to commit, but if some say no or don’t respond, the transaction manager tells them all to abort.

– Upon receipt, the server writes the commit record and then sends an acknowledgement. The transaction manager is done when it receives all acknowledgements.

If a database server fails during first step, all abort.

If a database server fails during second step, it can consult the transaction manager to see whether it should commit.

[Diagram: the TM sends Commit to each server; each server replies Ack.]

Page 101: DISTRIBUTED COMPUTING

ALL OF TWO-PHASE COMMIT

[Diagram: the full exchange between transaction manager and server (Prepare, Yes, Commit, Done) and the states of the server: Active, then Ready to Commit after Prepare, then Committed after Commit; a server that cannot prepare goes from Active to Aborted.]

Page 102: DISTRIBUTED COMPUTING

QUESTIONS AND ANSWERS

Q: What happens if the transaction manager fails?

A: A database server who said yes to the first phase but has received neither a commit nor abort instruction must wait until the transaction manager recovers. It is said to be blocked.

Q: How does a recovering transaction manager know whether it committed a given transaction before failing?

A: The transaction manager must write a commit T record to stable storage after it receives yes’s from all data base servers on behalf of T and before it sends any commit messages to them.

Q: Is there any way to avoid having a data base server block when the transaction manager fails?

A: A database server may consult other database servers who have participated in the transaction, if it knows who they are.

Page 103: DISTRIBUTED COMPUTING

OPTIMIZATION FOR READ-ONLY TRANSACTIONS

Read-only transactions.

Suppose a given server has done only reads (no updates) for a transaction.

• Instead of responding to the transaction manager that it can commit, it responds READ-only

• The transaction manager can thereby avoid sending that server a commit message

Page 104: DISTRIBUTED COMPUTING

THREE-PHASE COMMIT

A non-blocking protocol, assuming that:

• A process fail-stops and does not recover during the protocol

• The network delivers messages from A to B in the order they were sent

• Live processes respond within the timeout period

Non-blocking = surviving servers can decide what to do.

Page 105: DISTRIBUTED COMPUTING

PROTOCOL

[Diagram: message exchange between the transaction manager (initiator) and a server (agent): Willing?, Willing-Yes, Prepare, OK, Committed, Done.]

Page 106: DISTRIBUTED COMPUTING

STATES OF SERVER ASSUMING FIRST TM DOES NOT FAIL

[Diagram: server states Active, Willing (after sending Willing-Yes), Ready to Commit (after receiving Prepare), and Committed (after receiving Committed); answering No from Active leads to Abort.]

Page 107: DISTRIBUTED COMPUTING

INVARIANTS while first TM active

• No server can be in the willing state while any other server (live or failed) is in the committed state.

• No server can be in the aborted state while any other server (live or failed) is in the ready-to-commit state.

[Diagram: the state chain Active, Willing, Ready to Commit, Committed. While servers are in the Active or Willing states, some may have aborted but no one has committed; once servers are in Ready to Commit or Committed, no one has aborted and some may have committed.]

Page 108: DISTRIBUTED COMPUTING

CONTRAST WITH TWO-PHASE COMMIT

[Diagram: in two-phase commit the chain is Active, Ready to Commit, Committed; while a server is Ready to Commit, some others may have aborted and some may have committed.]

Page 109: DISTRIBUTED COMPUTING

RECOVERY IN THREE-PHASE COMMIT (after first TM fails or slows down too much)

What the newly elected TM does:

[Diagram: if any live server is Ready-to-Commit or Committed, send COMMITTED; if any is Aborted, send ABORT; if all live servers are Willing but some servers are dead, send ABORT; if all servers are alive and Willing, send PREPARE and then COMMITTED.]

Page 110: DISTRIBUTED COMPUTING

ROAD MAP: KNOWLEDGE LOGIC AND CONSENSUS

[Road map diagram: commitment with fail-stop sites only; knowledge logic and consensus with fail-stop failures and perhaps network failures.]

Page 111: DISTRIBUTED COMPUTING

EXAMPLE: COORDINATED ATTACK

Forget about computers. Think about a pair of allied generals A and B. They have previously agreed to attack simultaneously or not at all. Now, they can only communicate via carrier pigeon (or some other unreliable medium).

Suppose general A sends the message to B

“Attack at Dawn”

Now, general A won’t attack alone. A doesn’t know whether B has received the message. B understands A’s predicament, so B sends an acknowledgment.

“Agreed”

[Diagram: A sends “Attack” to B; B replies “Agreed” to A.]

Page 112: DISTRIBUTED COMPUTING

WILL IT EVER END?

[Diagram: A and B keep exchanging “ack your ack” and “ack your ack to my ack” messages.]

Page 113: DISTRIBUTED COMPUTING

IT NEVER ENDS

Theorem: Assume that communication is unreliable. Any protocol that guarantees that if one of the generals attacks, then the other does so at the same time, is a protocol in which necessarily neither general attacks.

Have you ever had this problem when making an appointment by electronic mail?

[Diagram: A proposes “10 AM?”; B replies “OK”; A still wonders, “But will he show up?”]

Page 114: DISTRIBUTED COMPUTING

BACK TO COMPUTERS

While ostensibly about military matters, the Two Generals problem and the Byzantine Agreement problem should remind you of the commit problem.

• In all three problems, there are two possibilities: commit (attack) and abort (don’t attack).
• In all three problems, all sites (generals) must agree.
• In all three problems, always aborting (not attacking) is not an interesting solution.

The theorem shows that no non-blocking commit protocol is possible when the network can drop messages.

Corollary: If the decision must be made within a fixed time period, then unbounded network delays prevent the sites from ever committing.

Page 115: DISTRIBUTED COMPUTING

BASIC MODEL FOR KNOWLEDGE LOGIC

• Each processor is in some local state. That is, it knows some things.

• The global state is just the set of all local states.

• Two global states are indistinguishable to a processor if the processor has the same local state in both global states.

Page 116: DISTRIBUTED COMPUTING

SOME USEFUL NOTATION FOR SUCH PROBLEMS

Ki – agent i knows.
CG – common knowledge among group G.

A statement x is common knowledge if

1. Every agent knows x: ∀i Ki x.
2. Every agent knows that every other agent knows x: ∀i ∀j Ki Kj x.
3. Every agent knows that every other agent knows that every other agent knows x,

and so on. I know x. You know that I know x. You know that I know that you know x… …

Page 117: DISTRIBUTED COMPUTING

EXAMPLES

In the coordinated attack problem, when A sends his message:

KA “A says attack at dawn”

When B receives that, then

KBKA “A says attack at dawn”

However, it is false that KAKBKA “A says attack at dawn”

This is remedied when A receives the first acknowledge, at which point

KAKBKA “A says attack at dawn”

However, it is false that

KBKAKBKA “A says attack at dawn”

More knowledge but never common knowledge.

Page 118: DISTRIBUTED COMPUTING

EXAMPLE: RELIABLE AND BOUNDED TIME COMMUNICATION

If A knows that B will receive any message that A sends within one minute of A’s sending it, then if A sends

“Attack at dawn”

A knows that within two minutes

CA,B “A says attack at dawn”

Page 119: DISTRIBUTED COMPUTING

CONCLUSIONS

• Common knowledge is unattainable in systems with unreliable communication (or with unbounded delay)

• Common knowledge is attainable in systems with reliable communication in bounded time

Page 120: DISTRIBUTED COMPUTING

ROAD MAP: KNOWLEDGE LOGIC AND TRANSMISSION

[Road map diagram: knowledge logic and consensus with failures of all types; knowledge logic and transmission, where common knowledge is unnecessary.]

Page 121: DISTRIBUTED COMPUTING

APPLYING KNOWLEDGE TO SEQUENCE TRANSMISSION PROTOCOLS

Problem: The two processes are the sender and the receiver.

Sender S has an input tape with an infinite sequence of data elements (0,1, blank). S tries to transmit these to receiver R. R writes these onto the output tape.

Correctness: Output tape should contain a prefix of input tape even in the face of errors (safety condition).

Given a sufficiently long correct transmission, output tape should make progress (liveness condition).

[Diagram: sender S reads an infinite input tape of 0s, 1s, and blanks and transmits to receiver R, which writes a prefix onto the output tape.]

Page 122: DISTRIBUTED COMPUTING

MODEL

• Messages are kept in order

• Sender and receiver are synchronous. This implies that sending a blank conveys information.

Three possible type of errors:

– Deletion errors: either a 0 or a 1 is sent, but a blank is received.

– Mutation errors: a 0 (resp. 1) is sent, but a 1 (resp. 0) is received. Blanks are transmitted correctly.

– Insertion errors: a blank is sent, but a 0 or 1 is received.

Question: Can we handle all three error types?

Page 123: DISTRIBUTED COMPUTING

POSSIBLE ERROR TYPES

If all error types are present, then a sent sequence can be transformed to any other sequence of the same length. So receiver R can gain no information about messages that sender S actually transmitted.

For any two of the three, the problem is solvable.

To show this, we will extend the transmission alphabet to consist of blank, 0, 1, ack, ack2, ack3.

Eventually, we will encode these into 0s, 1s and blanks.

Page 124: DISTRIBUTED COMPUTING

ERROR TYPE: DELETION ALONE

So a 1,0, or any acknowledgement can become a blank.

Suppose the input for S is 0,0,1…

For any symbol y, we want to achieve that the sender knows that the receiver has received (knows) symbol y.

Denote this Ks Kr (y).

Imagine the following protocol: If S doesn’t receive an acknowledgement, then it resends the symbol it just sent. If S receives an acknowledgement, S sends the next symbol on its tape.

Scenario: S sends y, R sends ack, S sends next symbol y’.

Is there a problem?

Page 125: DISTRIBUTED COMPUTING

GOAL OF PROTOCOL

Yes, there is a problem. Look at this from R’s point of view. It may be that y’ = y.

R doesn’t know whether S is resending y (because it didn’t receive R’s acknowledgement) or S is sending a new symbol.

So, R needs more knowledge. Specifically, R must know that S received its acknowledgement. S must know that R knows this.

We need Ks Kr Ks Kr y. To get this, S sends ack2 to R. Then R sends ack3 to S.

[Diagram: S sends y, after which Kr y holds; R sends ack, giving Ks Kr y; S sends ack2, giving Kr Ks Kr y; R sends ack3, giving Ks Kr Ks Kr y.]

Page 126: DISTRIBUTED COMPUTING

EXERCISE

Suppose that the symbol after y is y’ and y’ ≠ y.

Then can S send y’ as soon as it receives ack to y? (Assume R has a way of knowing that it received y and y’ correctly.)

S sends y

R sends ack

S sends y’

R sends ack …

[Diagram: S sends y, giving Kr y; R sends ack, giving Ks Kr y; S sends y’, after which R knows y’ and also Kr Ks Kr y.]

Page 127: DISTRIBUTED COMPUTING

ENCODING PROTOCOL IN 0’s and 1’s

Protocol Symbol   Encoding
Blank             11
0                 00
1                 01
ack               0
ack2              10
ack3              1

WHAT IS SENT

[Diagram: a sample exchange between S and R, showing each protocol symbol (0, 1, blank, ack, ack2, ack3) alongside the bit string actually transmitted (00, 01, 11, 0, 10, 1).]

Page 128: DISTRIBUTED COMPUTING

WHAT DO WE WANT FROM AN ENCODING?

1. Unique decodability. If e(x) is received uncorrupted, then the recipient knows that it is uncorrupted and is an encoding of x.

2. Corruption detectability. If e(x) is corrupted, the recipient knows that it is.

Thus, receiver knows when it receives good data and when it receives a garbled message.

Page 129: DISTRIBUTED COMPUTING

ENCODING FOR DELETIONS AND MUTATIONS

Recall that mutation means that a 0 can become a 1 or vice versa.

Encoding (b is blank):

Protocol Symbol   Encoding
Blank             bbb1
0                 1bbb
1                 b1bb
ack               1b
ack2              bb1b
ack3              b1

The same extended alphabet protocol will work.

Any insertion will result in two non-blank characters. A mutation can only change a 1 to a 0.


Page 130: DISTRIBUTED COMPUTING

Self-Stabilizing Systems

• A distributed system is self-stabilizing if, when started from an arbitrary initial configuration, it is guaranteed to reach a legitimate configuration as execution progresses, and once a legitimate configuration is achieved, all subsequent configurations remain legitimate.

Page 131: DISTRIBUTED COMPUTING

Self-Stabilizing Systems (using invariants)

• There is an invariant I which implies a safety condition S.

• When failures occur, S is maintained though I may not be.

• However when the failures go away, I returns.

• http://theory.lcs.mit.edu/classes/6.895/fall02/papers/Arora/masking.pdf

Page 132: DISTRIBUTED COMPUTING

Self-Stabilizing Systems: components

• A corrector returns a program from a state satisfying S to one satisfying I: e.g. error correction codes, exception handlers, database recovery.

• A detector sees whether there is a problem: e.g. acceptance tests, watchdog programs, parity ...

Page 133: DISTRIBUTED COMPUTING

Self-Stabilizing Systems: example

• Error model: messages may be dropped.

• Message sending protocol called the alternating bit protocol, which we explain in stages.

• Sender sends a message, Receiver acknowledges if message is received and uncorrupted (can use checksum).

• Sender sends next message.

Page 134: DISTRIBUTED COMPUTING

Alternating Bit Protocol continued

• If Sender receives no ack, then it resends.

• But: What if the receiver has received the message but the ack got lost?

• In that case, the receiver thinks of this as a new message.

Page 135: DISTRIBUTED COMPUTING

Alternating Bit Protocol-- we’ve arrived.

• Solution 1: Send a sequence number with the message so receiver knows whether a message is new or old.

• But: This number increases as the log of the number of messages.

• Better: Send the parity of the sequence number. This is the alternating bit protocol.

• Invariant: Output equals what was sent perhaps without the last message.

Page 136: DISTRIBUTED COMPUTING

Why is this Self-Stabilizing?

• Safety: output is a prefix of what was sent even in the face of failures (provided checksums are sufficient to detect corruption).

• Invariant: (Output equals what was sent perhaps without the last message) is a strong liveness guarantee.

Page 137: DISTRIBUTED COMPUTING

ROAD MAP: TECHNIQUES FOR REAL-TIME SYSTEMS

[Road map diagram of real-time topics: fault tolerance, clock synchronization, and operating system scheduling.]

Page 138: DISTRIBUTED COMPUTING

SYSTEMS THAT CANNOT OR SHOULD NOT WAIT

Time-sharing operating environments: concern for throughput.

Want to satisfy as many users as possible.

Soft real-time systems (e.g., telemarketing): concern for statistics of response time.

Want only a few disgruntled customers.

Firm real-time systems (e.g., obtain ticker information on Wall Street): concern to meet as many deadlines as possible.

If you miss, you lose the deal.

Hard real-time systems (e.g., airplane controllers): requirements to meet all deadlines.

If you miss, then airplane may crash.

Page 139: DISTRIBUTED COMPUTING

DISTINCTIVE CHARACTERISTICS OF REAL-TIME SYSTEMS

1. Predictability is essential – For hard, real-time systems the time to run a routine must be known in advance. Implication: Much programming is done in assembly language. Changes are done by “patching” machine code.

2. Fairness is considered harmful – We do not want an ambulance to wait for a taxi.

3. Implication: messages must be prioritized, FIFO queues are bad.

4. Preemptibility is essential – An emergency condition must be able to override a low-priority task immediately.

5. Implication: Task switching must be fast, so processes must reside in memory.

6. Scheduling is of major concern – The time budget of an application is as important as its monetary budget. Meeting time constraints is more than just a matter of fast hardware.

7. Implication: We must look at the approaches to scheduling.

Page 140: DISTRIBUTED COMPUTING

SCHEDULING APPROACHES

1. Cyclic executive – Divide processor time into endlessly repeating cycles where each cycle is some fixed length, say 1 second. During a cycle some periodic tasks may occur several times, others only once. Gaps allow sporadic tasks to enter.

2. Rate-monotonic – Give tasks priority based on the frequency with which they are requested.

3. Earliest-deadline first – Give the highest priority to the task with the earliest deadline.

Page 141: DISTRIBUTED COMPUTING

CYCLIC EXECUTIVE STRATEGY

A cycle design containing sub-intervals of different lengths. During a sub-interval, either a periodic task runs or a gap is permitted for sporadic tasks to run.

Note the task T1 runs three times during each cycle. In general, different periodic tasks may have to run with different frequencies.

[Diagram: one cycle divided into sub-intervals in which T1 runs three times, T2, T3, and T4 run once each, and gaps are left for sporadic tasks.]

Page 142: DISTRIBUTED COMPUTING

RATE MONOTONIC ASSIGNMENT

Rate monotonic would say that T1 should get highest priority (because its period is smallest and rate is highest), then T2, then T3.

Assume that all tasks are perfectly preemptable.

As the following figure shows, all tasks meet their deadlines. What happens if T3 is given highest priority “because it is the most important”?

Three tasks:

Task   Period   Compute Time
T1     100      20
T2     150      40
T3     350      100

Page 143: DISTRIBUTED COMPUTING

EXAMPLE OF RATE MONOTONIC SCHEDULING

Use of rate monotonic scheduler (higher rate gets higher priority) ensures that all tasks complete by their deadlines.

Notice that T3 completes earlier in its cycle the second time, indicating that the most-difficult-to-meet situation is the very initial one.

[Diagram: a timeline from 0 to 500 showing T1, T2, and T3 under rate-monotonic priorities; T3 is repeatedly interrupted by T1 and T2, completes its first instance late in its period, and completes its second instance earlier.]

Page 144: DISTRIBUTED COMPUTING

CASE STUDY

A group is designing a command and control system. Interrupts arrive at different rates; however, the maximum rate of each interrupt is predictable.

Computation time of task associated with each interrupt is predictable.

First implementation uses Ada and a special purpose operating system. The operating system handled interrupts in a round-robin fashion.

That is, first the OS checked for interrupts for task 1, then task 2, and so on.

System did not meet its deadlines, yet was grossly underutilized (about 50%).

Page 145: DISTRIBUTED COMPUTING

FIRST DECISION

Management decided that the problem was Ada.

Do you think they were right? (Assume that they could have shortened each task by 10% and that the tasks and times are those of the three-task system given previously.)

Page 146: DISTRIBUTED COMPUTING

CASE STUDY – SOLUTIONS

Switching from Ada probably would not have helped.

Consider using round-robin for the three task system given before. If task T3 is allowed to run to completion, then it will prevent task T1 from running for 100 time units (or 90 with the time improvement). That is not fast enough.

Change scheduler to give priority to task with smallest period, but tasks remain non-preemptable.

Helps, but not enough since the T3-T1 conflict would still prevent T1 from completing.

Change tasks so longer tasks are preemptable.

This would solve the problem in combination with rate monotonic priority assignment. (Show this.)

Motto: Look first at the scheduler.

Page 147: DISTRIBUTED COMPUTING

PRIORITY INVERSION AND PRIORITY INHERITANCE

[Timeline: low-priority T3 acquires a lock; high-priority T1 preempts T3 and then blocks waiting for the lock; medium-priority T2 preempts T3 before T3 releases the lock, so T1 is delayed by T2. This is a priority inversion.]

Priority inheritance addresses this by letting T3 run at T1's priority while it holds the lock, so T2 cannot preempt it.

Page 148: DISTRIBUTED COMPUTING

SPECIAL CONSIDERATIONS FOR DISTRIBUTED SYSTEMS

Since communication is unpredictable, most distributed non-shared memory real-time systems do no dynamic task allocation. Tasks are pre-allocated to specific processors.

Example: oil refineries where each chemical process is controlled by a separate computer.

Message exchange is limited to communicating data (e.g., in sensor applications) or status (e.g., timeout messages). Messages must be prioritized, and some messages should be datagrams.

Example: Command and control system has messages that take priority over all other messages, e.g., “hostilities have begun.”

Special processor architectures are possible that implement a global clock (hence require real-time clock synchronization) and guaranteed message deliveries.

Example application: airplane control with a token-passing network.

Page 149: DISTRIBUTED COMPUTING

OPEN PROBLEMS

The major open problem is to combine real-time algorithms with other needs, e.g., high-performance network protocols and distributed database technology.

• What is the place of carrier-sense detection circuits in a real-time system?

– If exponential back-off is used, then no guarantee is possible. (See text.)

– However, a tree-based conflict protocol, e.g., based on a site’s identifier, can guarantee message transmission.

• How should deadlocks be handled in a real-time transaction system?

– Aborting an arbitrary transaction is unacceptable.
– Aborting a low priority transaction may be acceptable.

Page 150: DISTRIBUTED COMPUTING

COMPONENTS OF SECURITY

• Authentication – Proving that you are who you say you are.

• Access Rights – Giving you the information for which you have clearance.

• Integrity – Protecting information from unauthorized modification.

• Prevention of Subversion – Guarding against replay attacks, Trojan horse attacks, covert-channel attacks, …

Page 151: DISTRIBUTED COMPUTING

AUTHENTICATION AND ZERO KNOWLEDGE PROOFS

The parable of the Amazing Sand Counter:

Person S makes the following claim:

• You fill a bucket with sand. I can tell, just by looking at it, how many grains of sand there are. However, I won’t tell you.

• You may test me, if you like, but I won’t answer any question that will teach you anything about the number of grains in the bucket.

• The test may include your asking me to leave the room.

What do you do?

Page 152: DISTRIBUTED COMPUTING

SAND MAGIC

The Amazing Sand Counter claims to know how many grains of sand there are in a bucket just by looking at it.

How can you put him to the test?

Page 153: DISTRIBUTED COMPUTING

AUTHENTICATING THE AMAZING SAND COUNTER

Answer:

1. Tester T tells S to leave the room.
2. T removes a few grains from the bucket, counts them, and keeps them in T's pocket.
3. T asks S to return and say how many grains have been removed.
4. T repeats until convinced, or until T shows that S lies.

Are there any problems left? Can the tester use the Amazing Sand Counter’s knowledge to masquerade as the Amazing Sand Counter?

Page 154: DISTRIBUTED COMPUTING

MIGHT A COUNTERFEIT AMAZING SAND COUNTER SUCCEED?

Can the tester use the Amazing Sand Counter’s knowledge to masquerade as the Amazing Sand Counter?

Page 155: DISTRIBUTED COMPUTING

REPLAY ATTACKS AND TIME

Tester T can use a replay technique:

1. T claims to U that T is an Amazing Sand Counter.
2. U presents a bucket to T.
3. T removes the bucket and shows it to S, pretending to engage S in yet another test.
4. U asks T to leave the room and removes some sand.
5. T returns, but asks for a little time.
6. T then shows the bucket to S.
7. S says how many grains have been removed.
8. T repeats what S says to U.

To prevent this, a site must distinguish replayed messages from current ones, perhaps by signatures.

Page 156: DISTRIBUTED COMPUTING

Crypto puzzle 1: Fagin/Vardi management dilemma

Sometimes one doesn’t need zero knowledge, but just the answer to a yes/no question.

An employee E complains to boss B1 about some person P1 and to boss B2 about P2.

B1 and B2 confer and want to determine whether P1 = P2. If not, though, neither wants to reveal its Pi, nor ask any more of E.

Is there a good non-computational technique to figure this out? Assume the set of possible people is just the list of people in E’s group and this is known to both B1 and B2.

Page 157: DISTRIBUTED COMPUTING

Crypto puzzle 1: Hints

Paper cups.

Pieces of paper.

Pencils.

Remember: We just need to know if the same person is being complained about.

Page 158: DISTRIBUTED COMPUTING

PUBLIC-KEY ENCRYPTION

Motivation: Eliminate need for shared keys when keeping secrets.

The new idea is to put some information in a public place.

Here are the details:

• Associated with each user u is an encryption function E_u and a decryption function D_u.

• For all messages m, E_u(m) is unintelligible even knowing E_u. However, D_u(E_u(m)) = E_u(D_u(m)) = m.

• E_u is public.

• Given E_u and E_u(m), it is computationally infeasible to figure out D_u.
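As an illustration of these properties (an addition, not from the slides), here is a toy textbook-RSA sketch in Python. The tiny primes and the absence of padding are for readability only; with parameters this small, D_u could of course be recovered from E_u by factoring n, so the infeasibility claim relies on large parameters.

    # Toy RSA sketch: illustrates D_u(E_u(m)) = E_u(D_u(m)) = m.
    p, q = 61, 53                  # small primes (insecure, illustrative only)
    n = p * q                      # modulus, part of both keys
    phi = (p - 1) * (q - 1)
    e = 17                         # public exponent: E_u = (e, n)
    d = pow(e, -1, phi)            # private exponent: D_u = (d, n)  (Python 3.8+)

    def E_u(m):                    # public encryption function
        return pow(m, e, n)

    def D_u(c):                    # private decryption function
        return pow(c, d, n)

    m = 1234                       # message, encoded as a number < n
    assert D_u(E_u(m)) == m        # decrypting an encryption recovers m
    assert E_u(D_u(m)) == m        # the functions commute, which enables signatures
    print("round trip ok")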

Page 159: DISTRIBUTED COMPUTING

USING PUBLIC-KEY ENCRYPTION

• To send a message m to u that only u can read:

Send E_u(m)

• For user t to sign and send a message m to u, t can:

Send D_t(E_u(m)).

The receiver u can prove that t sent this message, because only t knows D_t.

Further, only u (and t) can read the message.
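Continuing in the same toy-RSA style (again an added sketch, with invented helper names and deliberately tiny keys), signing amounts to applying the sender's private function around the recipient's public encryption. The toy moduli here are chosen so the intermediate value still fits; real systems handle that detail differently.

    # Toy sign-then-encrypt sketch: t sends D_t(E_u(m)) to u.
    def make_keys(p, q, e=17):
        n, phi = p * q, (p - 1) * (q - 1)
        return (e, n), (pow(e, -1, phi), n)      # (public key, private key)

    def apply_key(m, key):
        exp, n = key
        return pow(m, exp, n)

    pub_u, priv_u = make_keys(61, 53)            # receiver u
    pub_t, priv_t = make_keys(67, 71)            # sender t

    m = 1234
    c = apply_key(apply_key(m, pub_u), priv_t)   # D_t(E_u(m)), sent to u

    # u undoes D_t with t's public key (verifying t signed it),
    # then decrypts with u's own private key.
    recovered = apply_key(apply_key(c, pub_t), priv_u)
    print(recovered == m)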

Page 160: DISTRIBUTED COMPUTING

SSL (Secure Socket Layer)

• Effect: client knows it is talking to a certain server; client remains anonymous to server; communication is secure.

• Useful for purchases, even anonymous ones.

Page 161: DISTRIBUTED COMPUTING

SSL Method -- simplified

• The client uses the server's well-known public encryption key to express its desire to communicate; nobody can eavesdrop. The client also generates a session key and sends it encrypted under the server's public key.

• The server recovers the session key with its private key, and communication then proceeds under that session key.
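A minimal sketch of this simplified handshake (an illustration only; real SSL/TLS adds certificates, nonces, and strong symmetric ciphers): the client encrypts a freshly chosen session key under the server's public key, the server recovers it with its private key, and both sides then use the session key, simulated here by a toy XOR cipher.

    import os

    # Server's long-lived key pair (toy RSA parameters, insecure by design).
    p, q, e = 61, 53, 17
    n, d = p * q, pow(e, -1, (p - 1) * (q - 1))

    # Client side: pick a random session key and send it encrypted.
    session_key = int.from_bytes(os.urandom(1), "big")    # toy-sized key
    handshake_msg = pow(session_key, e, n)                 # E_server(session_key)

    # Server side: recover the session key with its private key.
    server_session_key = pow(handshake_msg, d, n)
    assert server_session_key == session_key

    # Both sides now share the session key (toy XOR "cipher" for illustration).
    def xor_crypt(data: bytes, key: int) -> bytes:
        return bytes(b ^ (key & 0xFF) for b in data)

    ciphertext = xor_crypt(b"order: 1 book", session_key)      # client encrypts
    print(xor_crypt(ciphertext, server_session_key))            # server decrypts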

Page 162: DISTRIBUTED COMPUTING

One Way Functions

• Given a function like triple, if I tell you that 15 has been produced by triple(x), you can infer that x is 5.

• By contrast, given certain hash functions, if I tell you h(x) it is very hard to infer x. Such (hard to find the inverse) functions are called one-way.

• Ex: Given x, compute x^2 and then take the middle 20 digits. Given those digits, it is hard to find x.

• There is a standard one-way hash function called SHA-1.
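For example (an added sketch), Python's standard hashlib module exposes SHA-1; computing h(x) is immediate, while recovering x from the digest is believed to require an infeasible search.

    import hashlib

    x = b"some secret input"
    digest = hashlib.sha1(x).hexdigest()    # easy to compute in one direction
    print(digest)                           # 160-bit value; inverting it to find x is infeasible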

Page 163: DISTRIBUTED COMPUTING

One Way Functions for Spies

• You have a bunch of spies. They are good people, very trustworthy.

• They go into enemy territory. When they return, they must say something to the guards so they don't get shot. For security reasons, passwords can't be reused.

• Also guards might go into a bar and be tempted to reveal secrets.

• Could we use one way functions in some way?

Page 164: DISTRIBUTED COMPUTING

Should I get a file from this site?

• Suppose you want to download some file (say a special kind of player) from a web site.

• You want to be sure that web site is to be trusted.

• Otherwise, you might be getting a “Trojan horse” (something that looks good but can do you harm).

• You believe that trusted entities will keep their private keys secret.

Page 165: DISTRIBUTED COMPUTING

Secure File System Protocol (David Mazieres)

• Basically, the downloader uses an SSL interaction with the server that downloader trusts.

• That is, downloader encrypts a session private key using that server’s public key.

• So, nobody else can know what downloader requested.

• Not yet solved: how does downloader know this server is validated by author?

Page 166: DISTRIBUTED COMPUTING

Secure File System Protocol: key insight

• Each file name is of the form: /sfs/nyu/cs5349874628/dennis/foobar

The funny number is the SHA-1 hash of the public key of the proper server.

So, this is the proper server for the file iff the public key of the server Pk when hashed gives cs5349874628.

Downloader checks this before sending the SSL message which will use Pk.
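The check can be written in a few lines. In this sketch (an illustration; the real SFS encoding of self-certifying path names differs in detail), the embedded component is taken to be the hex SHA-1 digest of the server's public key and is verified before any SSL exchange begins.

    import hashlib

    def name_component(public_key: bytes) -> str:
        # Hypothetical encoding: hex SHA-1 of the server's public key.
        return hashlib.sha1(public_key).hexdigest()

    def server_is_legitimate(path: str, claimed_public_key: bytes) -> bool:
        # e.g. path = "/sfs/nyu/<hash>/dennis/foobar"; the hash is the third component.
        embedded = path.split("/")[3]
        return embedded == name_component(claimed_public_key)

    pk = b"-----server public key bytes-----"
    path = "/sfs/nyu/" + name_component(pk) + "/dennis/foobar"
    print(server_is_legitimate(path, pk))                   # True for the honest server
    print(server_is_legitimate(path, b"attacker's key"))    # False: hash mismatch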

Page 167: DISTRIBUTED COMPUTING

Secure File System Protocol: questions

• What if you lie about your public key?

• How do I discover these strange names?

Page 168: DISTRIBUTED COMPUTING

Secure File System Protocol: answers

• What if you lie about your public key?
– You can, but then you won't have the associated private key, so you won't be able to respond properly to my SSL request.

• How do I discover these strange names?
– There does have to be one site you trust which has names you believe.

Page 169: DISTRIBUTED COMPUTING

Secure File System Protocol: summary

• Downloader goes to a trusted name server site.

• Gets the name (including the funny hashed part) and maybe other information, e.g., a hash of the contents.

• Goes to a server that allegedly holds that file. If server’s public key when hashed equals the funny part of the name, then that server is legitimate.

• So, downloader engages that server in SSL and downloads the file confidentially.

Page 170: DISTRIBUTED COMPUTING

Spy Border puzzle (McCarthy/Rabin)

• Spies go across border.

• When they return, they don’t want to be shot so they want to give a password.

• Spies are clever and professional, but border guards have been known to have loose tongues, so a conventional password might get leaked.

• What to do? (Hint: one-way functions)

Page 171: DISTRIBUTED COMPUTING

Courier Problem

• Capture a courier and you may change history (e.g., Hannibal and his brother Mago).

• So we want to send n couriers, each carrying part of a message. If a majority of the couriers arrive, the message can be reconstructed; any minority reveals nothing.

• How to do this? (Hint 1: start with three couriers. Hint 2: a polynomial of degree 2 is determined by three points.)
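Following the hints, here is a sketch of polynomial secret sharing in the style of Shamir (an addition, not from the slides; shown with five couriers and a threshold of three to match the degree-2 hint). The message is the constant term of a random degree-2 polynomial over a prime field; each courier carries one point, any three points reconstruct the message, and fewer reveal nothing.

    import random

    P = 2**31 - 1                       # a prime; all arithmetic is mod P

    def make_shares(secret: int, n: int = 5):
        # random used for brevity; a real implementation would use a CSPRNG
        a1, a2 = random.randrange(P), random.randrange(P)
        f = lambda x: (secret + a1 * x + a2 * x * x) % P     # degree-2 polynomial
        return [(x, f(x)) for x in range(1, n + 1)]          # one point per courier

    def reconstruct(shares):            # Lagrange interpolation at x = 0; needs any 3 shares
        total = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P
        return total

    shares = make_shares(42)
    print(reconstruct(shares[:3]), reconstruct(shares[2:]))  # any 3 shares recover 42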

Page 172: DISTRIBUTED COMPUTING

Assuring honesty in SETI-at-home

• I want a site to execute a function f on x1, x2, …, xn.

• How do I know it's doing so?

• Possibility: Compute some of the values myself, e.g., f(xi) and f(xj), and then check whether the site's answers for i and j match.

• Still some possibility of cheating, e.g., once those two indices are found, the site could give guesses for the other f(xk)s.
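A small sketch of the spot-checking idea (an illustration; the function f and the honest remote_compute stand-in are invented): the verifier secretly precomputes f at a couple of indices and compares them against the site's reply. As the last bullet notes, a site that somehow learned which indices are checked could answer those honestly and guess the rest.

    import random

    def f(x):                                   # the function we want computed remotely
        return x * x + 1

    def remote_compute(xs):                     # stand-in for the (here honest) remote site
        return [f(x) for x in xs]

    xs = list(range(1000))
    i, j = random.sample(range(len(xs)), 2)     # secret indices, chosen by the verifier
    precomputed = {i: f(xs[i]), j: f(xs[j])}    # computed locally before sending the work out

    results = remote_compute(xs)

    ok = all(results[k] == v for k, v in precomputed.items())
    print("spot checks passed" if ok else "site caught cheating")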

Page 173: DISTRIBUTED COMPUTING

Better Solution

• Also send some value y such that for NO j is it the case that y = f(xj). (Don’t tell of course)

• Now the function producer must state that y is not the result of f on any input.

• Function producer could guess this, but then risks getting caught.

• Ref: Radu Sion