SEMINAR 236825 OPEN PROBLEMS IN DISTRIBUTED COMPUTING
Transcript of SEMINAR 236825 OPEN PROBLEMS IN DISTRIBUTED COMPUTING
Winter 2013-14
Hagit Attiya & Faith Ellen
INTRODUCTION
Distributed Systems
• Distributed systems are everywhere:
  – share resources
  – communicate
  – increase performance (speed & fault tolerance)
• Characterized by:
  – independent activities (concurrency)
  – loosely coupled parallelism (heterogeneity)
  – inherent uncertainty
E.g.:
• operating systems
• (distributed) database systems
• software fault-tolerance
• communication networks
• multiprocessor architectures
Main Admin Issues
• Goal: read some interesting papers related to open problems in the area
• Mandatory (active) participation
  – 1 absence w/o explanation
• Tentative list of papers already published
  – First come, first served
• Lectures in English
Course Overview: Basic Models
(Figure: the basic models along two dimensions — message passing vs. shared memory, and synchronous vs. asynchronous; the PRAM sits at the synchronous shared-memory corner.)
Message-Passing Model
• Processors p0, p1, …, pn-1 are nodes of the graph. Each is a state machine with a local state.
• Bidirectional point-to-point channels are the undirected edges of the graph.
• The channel from pi to pj is modeled in two pieces:
  – outbuf variable of pi (physical channel)
  – inbuf variable of pj (incoming message queue)
(Figure: a four-node example graph on p0, p1, p2, p3 with weighted edges.)
Modeling Processors and Channels
(Figure: p1's and p2's local variables, connected by the channel buffers inbuf[1], outbuf[1], inbuf[2], outbuf[2]; the bullets of the previous slide are repeated alongside.)
Configuration
A snapshot of the entire system: accessible processor states (local variables & incoming message queues) as well as communication channels.
Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.
Deliver Event
Moves a message from the sender's outbuf to the receiver's inbuf; the message will be available the next time the receiver takes a step.
(Figure: message m1 moves from p1's outbuf, which holds m3, m2, m1, into p2's inbuf.)
Computation Event
Occurs at one processor:
• Start with old accessible state (local vars + incoming messages)
• Apply the processor's state machine transition function; handle all incoming messages
• End with new accessible state with empty inbufs & new outgoing messages
(Figure: a transition from the old local state to the new local state, consuming incoming messages c, d, e and producing outgoing messages a, b.)
Execution

  configuration, event, configuration, event, configuration, …

• In the first configuration: each processor is in its initial state and all inbufs are empty
• For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
  – if a delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
  – if a computation event: the specified processor's state (including outbufs) changes according to the transition function
Asynchronous Executions
• An execution is admissible in the asynchronous model if
  – every message in an outbuf is eventually delivered
  – every processor takes an infinite number of steps
• No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
• Models a reliable system (no message is lost and no processor stops working)
Example: Simple Flooding Algorithm
• Each processor's local state consists of a variable color, either red or green
• Initially:
  – p0: color = green, all outbufs contain M
  – others: color = red, all outbufs empty
• Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
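A minimal Python sketch of this flooding algorithm (the adjacency-dict graph representation and FIFO delivery order are illustrative assumptions; a real asynchronous scheduler may deliver pending messages in any order):

```python
from collections import deque

def flood(adj, root=0):
    """Simulate flooding: every node starts red except the root, whose
    outbufs contain M; a red node receiving M turns green and sends M
    on all of its edges. Returns the final colors and message count."""
    color = {v: "red" for v in adj}
    color[root] = "green"
    # each pending message is (sender, receiver); the root's outbufs
    # initially contain M for every neighbour
    pending = deque((root, v) for v in adj[root])
    messages = 0
    while pending:
        sender, receiver = pending.popleft()  # one deliver + computation event
        messages += 1
        if color[receiver] == "red":
            color[receiver] = "green"
            pending.extend((receiver, v) for v in adj[receiver])
    return color, messages
```

On a connected graph every node eventually turns green, and exactly one message crosses each edge in each direction, matching the 2m message complexity derived below.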
Example: Flooding
(Figures, spanning two slides: a three-node example on p0, p1, p2 — a deliver event at p1 from p0, a computation event by p1, a deliver event at p2 from p1, a computation event by p2, a deliver event at p1 from p2, a computation event by p1, a deliver event at p0 from p1, etc., until the rest of the msgs are delivered, with M propagating over every edge in both directions.)
(Worst-Case) Complexity Measures
• Message complexity: maximum number of messages sent in any admissible execution
• Time complexity: maximum "time" until all processes terminate in any admissible execution
• How to measure time in an asynchronous execution?
  – Produce a timed execution by assigning non-decreasing real times to events so that the time between sending and receiving any message is at most 1
  – Time complexity: maximum time until termination in any timed admissible execution
Complexities of Flooding Algorithm
A state is terminated if color = green.
• One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
• A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units
Synchronous Message-Passing Systems
An execution is admissible for the synchronous model if it is an infinite sequence of rounds
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor.
This captures the lockstep behavior of the model. It also implies
– every message sent is delivered
– every processor takes an infinite number of steps.
Time is the number of rounds until termination.
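The round structure above can be sketched in Python for the flooding algorithm (function name and graph representation are illustrative): each iteration delivers all messages in transit, then lets every affected processor take its computation step.

```python
def flood_sync(adj, root=0):
    """Round-based flooding: in each round, all messages in transit are
    delivered, then processors apply their transition function."""
    color = {v: "red" for v in adj}
    color[root] = "green"
    in_transit = [(root, v) for v in adj[root]]  # root's initial outbufs
    rounds = 0
    while in_transit:
        rounds += 1
        delivered, in_transit = in_transit, []
        for sender, receiver in delivered:
            if color[receiver] == "red":         # transition: turn green,
                color[receiver] = "green"        # forward M on all edges
                in_transit.extend((receiver, v) for v in adj[receiver])
    return color, rounds
```

On a path p0–p1–p2 (diameter 2), the last message is delivered in round 3, illustrating the diameter + 1 time bound.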
Example: Flooding in the Synchronous Model
(Figure: the three-node example on p0, p1, p2 — round 1 events deliver M from p0 to its neighbors; round 2 events deliver the remaining messages.)
Time complexity is diameter + 1. Message complexity is 2m.
Broadcast Over a Rooted Spanning Tree
• Processors have information about a rooted spanning tree of the communication topology
  – parent and children local variables at each processor
• root initially sends M to its children
• when a processor receives M from its parent
  – sends M to its children
  – terminates (sets a local Boolean to true)
• Complexities (synchronous and asynchronous models)
  – time is the depth of the spanning tree, which is at most n - 1
  – number of messages is n - 1, since one message is sent over each spanning tree edge
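A short Python sketch of this broadcast (the children-dict tree representation is an assumption), counting one message per tree edge and one round per tree level:

```python
def broadcast(children, root):
    """Broadcast M down a rooted spanning tree given each node's
    children; returns (message count, number of rounds)."""
    msgs, rounds = 0, 0
    frontier = [root]                      # nodes holding M at round start
    while any(children[v] for v in frontier):
        rounds += 1
        nxt = []
        for v in frontier:
            for c in children[v]:          # one message per tree edge
                msgs += 1
                nxt.append(c)
        frontier = nxt                     # children receive M, forward next round
    return msgs, rounds
```

For a 4-node tree of depth 2, this yields n - 1 = 3 messages in 2 rounds, matching the stated complexities.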
Finding a Spanning Tree from a Root
• root sends M to all its neighbors
• when a non-root first gets M
  – set the sender as its parent
  – send a "parent" msg to the sender
  – send M to all other neighbors (if no other neighbors, then terminate)
• when a node gets M otherwise
  – send a "reject" msg to the sender
• use "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
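The parent-setting rule above can be sketched in Python (the `scheduler` hook, which picks the index of the next in-transit message to deliver, is an illustrative device for modeling asynchrony; "reject" bookkeeping is omitted):

```python
def spanning_tree(adj, root=0, scheduler=None):
    """Build a spanning tree by flooding M from the root: the first copy
    of M a node receives fixes its parent; later copies are rejected.
    The default FIFO scheduler delivers messages in sending order."""
    scheduler = scheduler or (lambda msgs: 0)      # FIFO by default
    parent = {root: None}
    in_transit = [(root, v) for v in adj[root]]
    while in_transit:
        sender, receiver = in_transit.pop(scheduler(in_transit))
        if receiver not in parent:                 # first M: adopt the sender
            parent[receiver] = sender
            in_transit.extend((receiver, v)
                              for v in adj[receiver] if v != sender)
        # otherwise a "reject" msg would go back to the sender (omitted)
    return parent
```

With FIFO delivery every node's parent lies on a shortest path (a BFS tree, as in the synchronous model); a scheduler that delivers the newest message first can produce a non-BFS tree on the very same graph.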
Execution of Spanning Tree Algorithm
(Figure: an 8-node graph on a, b, c, d, e, f, g, h rooted at one node, with the resulting trees in both models.)
Synchronous: always gives a breadth-first search (BFS) tree.
Asynchronous: not necessarily a BFS tree.
Both models: O(m) messages, O(diam) time.
Execution of Spanning Tree Algorithm (cont'd)
(Figure: two asynchronous executions on the same 8-node graph and the trees they produce.)
An asynchronous execution gave a depth-first search (DFS) tree. Is the DFS property guaranteed? No! Another asynchronous execution results in a tree that is neither BFS nor DFS.
Shared Memory Model
Processors (also called processes) communicate via a set of shared variables. Each shared variable has a type, defining a set of primitive operations (performed atomically):
• read, write
• compare&swap (CAS)
• LL/SC, DCAS, kCAS, …
• read-modify-write (RMW), kRMW
(Figure: processes p0, p1, p2 applying read, write, and RMW operations to shared variables X and Y.)
Changes from the Message-Passing Model
• no inbuf and outbuf state components
• a configuration includes values for the shared variables
• one event type: a computation step by a process
  – pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
  – the shared variable's value in the new configuration changes according to the primitive's semantics
  – pi's state in the new configuration changes according to its old state and the result of the primitive
An execution is admissible if every processor takes an infinite number of steps.
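To make "changes according to the primitive's semantics" concrete, here is a sketch of one such primitive, compare&swap, with shared memory modeled as a Python dict (an illustrative assumption); in the model the whole function body is a single atomic computation step:

```python
def compare_and_swap(shared, var, expected, new):
    """CAS semantics: atomically read shared[var]; if it equals
    `expected`, overwrite it with `new`; either way return the old value."""
    old = shared[var]
    if old == expected:
        shared[var] = new
    return old
```

A process can tell from the return value whether its CAS took effect: success returns the expected value, failure returns whatever another process wrote first.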
Abstract Data Types
• Abstract representation of data & a set of methods (operations) for accessing it
• Implemented using primitives on base objects
• Sometimes, a hierarchy of implementations: primitive operations implemented from more low-level ones
Executing Operations
(Figure: processes P1, P2, P3 issue overlapping operations on a queue — enq(1) returning ok, enq(2), and deq returning 1 — each drawn as an invocation/response interval.)
Interleaving Operations, or Not
(Figure: the same operations arranged sequentially: enq(1) ok, enq(2), deq 1.)
Sequential behavior: invocations & responses alternate and match (on process & object).
Sequential specification: all legal sequential behaviors.
Correctness: Sequential Consistency [Lamport, 1979]
• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the sequential specification)
  – preserves the order of operations by the same process
Example 1: Multi-Writer Registers
Using (multi-reader) single-writer registers. Add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write v,TSi

Read(X):   (read only own value)
  read v,TSi
  return v

Once in a while, read TS1, …, TSn and write to TSi — needed to ensure writes are eventually visible.
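A single-threaded Python sketch of this construction (class and method names are illustrative; the single-writer registers become one slot per process, and the periodic propagation step is omitted):

```python
class SeqConsistentMWRegister:
    """Sequentially consistent multi-writer register from single-writer
    registers: process i writes (value, timestamp) into its own slot;
    a read by process i returns only its own latest value."""
    def __init__(self, n, initial=None):
        # slot i holds (value, (ts, pid)); pid breaks timestamp ties
        self.regs = [(initial, (0, i)) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for _, (ts, _) in self.regs)  # collect TS1..TSn
        self.regs[i] = (v, (max_ts + 1, i))           # write v with max+1

    def read(self, i):
        return self.regs[i][0]                        # read only own value
```

Note that a read by one process may miss another process's completed write; that is allowed by sequential consistency (only per-process order must be preserved) but not by linearizability, which motivates the next example.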
Timestamps
1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp
Multi-Writer Registers: Proof
Create a sequential execution:
– Place writes in timestamp order
– Insert reads after the appropriate write
Legality is immediate.
Per-process order is preserved, since a read returns a value (with timestamp) larger than the preceding write by the same process.
Correctness: Linearizability [Herlihy & Wing, 1990]
• For every concurrent execution there is a sequential execution that
  – contains the same operations
  – is legal (obeys the specification of the ADTs)
  – preserves the real-time order of non-overlapping operations
• Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)
Example 2: Linearizable Multi-Writer Registers
Using (multi-reader) single-writer registers [Vitanyi & Awerbuch, 1987]. Add logical time to values.

Write(v,X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write v,TSi

Read(X):
  read TS1, …, TSn
  return value with max TS
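The same single-threaded sketch, changed only in the read: it now collects every slot and returns the value with the maximum timestamp (ties broken by process id), which is what makes the construction linearizable. Class and method names are illustrative:

```python
class LinearizableMWRegister:
    """Vitanyi–Awerbuch-style multi-writer register sketch: writes
    timestamp their values; reads collect all slots and return the
    value carrying the maximum (ts, pid) timestamp."""
    def __init__(self, n, initial=None):
        self.regs = [(initial, (0, i)) for i in range(n)]

    def write(self, i, v):
        max_ts = max(ts for _, (ts, _) in self.regs)  # collect TS1..TSn
        self.regs[i] = (v, (max_ts + 1, i))

    def read(self, i):
        # return the value with the maximum timestamp, regardless of reader
        return max(self.regs, key=lambda slot: slot[1])[0]
```

Unlike the sequentially consistent version, every read now sees the globally latest completed write.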
Multi-Writer Registers: Linearization Order
Create a linearization:
– Place writes in timestamp order
– Insert each read after the appropriate write
Legality is immediate.
Real-time order is preserved, since a read returns a value (with timestamp) larger than those of all preceding operations.
Example 3: Atomic Snapshot
• n components
• Update a single component
• Scan all the components "at once" (atomically)
Provides an instantaneous view of the whole memory.
(Figure: update returns ok; scan returns v1, …, vn.)
Atomic Snapshot Algorithm
[Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]

Update(v,k):
  A[k] = v,seqi,i

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]   (double collect)
    if equal return A[1, …, n]

Linearize:
• updates with their writes
• scans inside the double collects
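A single-threaded Python sketch of this double-collect algorithm (class name is illustrative; with no concurrent updaters the loop always exits on its first iteration, whereas in a real concurrent execution it may repeat):

```python
class Snapshot:
    """Double-collect snapshot: update writes (value, seq, pid) so every
    write is distinguishable; scan repeats until two consecutive
    collects are identical."""
    def __init__(self, n):
        self.A = [(None, 0, i) for i in range(n)]
        self.seq = [0] * n

    def update(self, k, v):
        self.seq[k] += 1                  # fresh seq# marks this write as new
        self.A[k] = (v, self.seq[k], k)

    def scan(self):
        while True:
            c1 = list(self.A)             # first collect
            c2 = list(self.A)             # second collect
            if c1 == c2:                  # no write in between: safe to return
                return [v for v, _, _ in c1]
```

The seq# is what makes equality of the two collects meaningful: without it, a value written, overwritten, and rewritten between the collects could go undetected.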
Atomic Snapshot: Linearizability
Double collect (read a set of values twice). If the collects are equal, there is no write between them
– assuming each write has a new value (seq#)
This creates a "safe zone" where the scan can be linearized.
(Figure: two collects of A[1], …, A[n]; a write to A[j] falling between them would make them differ.)
Liveness Conditions
• Wait-free: every operation completes within a finite number of (its own) steps ⇒ no starvation for mutex
• Nonblocking: some operation completes within a finite number of (some other process's) steps ⇒ deadlock-freedom for mutex
• Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
  – also called solo termination

wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free
Wait-Free Atomic Snapshot
[Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
• Embed a scan within the Update.

Update(v,k):
  V = scan
  A[k] = v,seqi,i,V

Scan():
  repeat
    read A[1], …, A[n]
    read A[1], …, A[n]
    if equal return A[1, …, n]   (direct scan)
    else record diff
    if some pj changed twice, return Vj   (borrowed scan)

Linearize:
• updates with their writes
• direct scans as before
• borrowed scans in place
Atomic Snapshot: Borrowed Scans
(Figure: a scanner's repeated reads of A[j] interleaved with writes to A[j] by pj; the second interfering write carries an embedded scan.)
Interference by process pj, and then another one: pj performed a scan in between, so its second write contains an embedded scan that lies entirely within the scanner's interval. Linearizing with the borrowed scan is OK.
List of Topics (Indicative)
• Atomic snapshots
• Space complexity of consensus
• Dynamic storage
• Vector agreement
• Renaming
• Maximal independent set
• Routing
and possibly others…