Distributed Systems

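Advantages

Resource Sharing: when processes at different sites are connected, user A can use user B's resources.

Computation Speedup: a hard, complex problem can be split across several processors and solved jointly.

Reliability: each processor has its own independent memory, so when one processor fails the others keep working, and they can help one another recover.

Communication: any connected users can exchange messages and consult each other over the network.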

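Types of operating system

Data Migration: Site A ─Data→ Site B. The data transferred depends on what is needed, but both sites must agree on its format so that nothing is lost.

Computation Migration: the user sends a command over the network to a remote processor, which runs it with its Local Resources and returns the result to the user.

Process Migration: a process is sent over the network to run at a remote site. Reasons for doing so:

  Load Balancing
  Computation Speedup
  Hardware / Software Preference
  Data Access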

C Socket for Windows

C Socket for Windows Server.c

#include <winsock2.h>
#include <stdio.h>

int main() {
    SOCKET server_sockfd, client_sockfd;
    int server_len, client_len;
    struct sockaddr_in server_address, client_address;

    // Initialize the Winsock DLL (request version 1.1)
    WSADATA wsadata;
    if (WSAStartup(0x101, (LPWSADATA)&wsadata) != 0) {
        printf("WSAStartup failed\n");
        return 1;
    }

    // Create the server socket: AF_INET (IPv4), SOCK_STREAM with protocol 0 selects TCP
    server_sockfd = socket(AF_INET, SOCK_STREAM, 0);

    server_address.sin_family = AF_INET;
    server_address.sin_addr.s_addr = inet_addr("127.0.0.1");
    server_address.sin_port = htons(1234); // port must be in network byte order
    server_len = sizeof(server_address);
    bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
    listen(server_sockfd, 5); // 5 is the backlog (length of the pending-connection queue)

    while (1) {
        char ch;
        printf("Server waiting...\n");
        client_len = sizeof(client_address);
        client_sockfd = accept(server_sockfd, (struct sockaddr *)&client_address, &client_len);
        recv(client_sockfd, &ch, 1, 0); // receive 'A'
        ch++;                           // 'A' -> 'B'
        send(client_sockfd, &ch, 1, 0); // send 'B'
        closesocket(client_sockfd);
    }

    // Not reached while the loop runs forever. Cleanup belongs here, not inside
    // the loop, where it would shut Winsock down after the first client.
    closesocket(server_sockfd);
    WSACleanup();
    return 0;
}

C Socket for Windows Client.c

#include <winsock2.h>
#include <stdio.h>
#include <stdlib.h>   // system()

int main() {
    SOCKET sockfd;
    int len, result;
    struct sockaddr_in address;
    char ch = 'A';

    // Initialize the Winsock DLL (request version 2.2)
    WSADATA wsadata;
    WSAStartup(0x202, (LPWSADATA)&wsadata);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);

    address.sin_family = AF_INET;
    address.sin_addr.s_addr = inet_addr("127.0.0.1");
    address.sin_port = htons(1234); // same port as the server, in network byte order
    len = sizeof(address);

    result = connect(sockfd, (struct sockaddr *)&address, len);
    if (result == SOCKET_ERROR) {
        printf("connect failed\n");
        WSACleanup();
        return 1;
    }

    send(sockfd, &ch, 1, 0);
    recv(sockfd, &ch, 1, 0);
    printf("char from server = %c\n", ch);

    closesocket(sockfd);
    WSACleanup();
    system("pause");
    return 0;
}

Client and server with threads (figure from Distributed Systems: Concepts and Design)

[Figure: in the client, thread 1 generates results and thread 2 makes requests to the server. In the server, an input-output thread handles receipt and queuing of requests, which are served by N threads.]

Alternative server threading architectures (figure from Distributed Systems: Concepts and Design)

[Figure: a. Thread-per-request: an I/O thread hands requests to workers that act on remote objects. b. Thread-per-connection: per-connection threads act on remote objects. c. Thread-per-object: an I/O thread hands requests to per-object threads, one for each remote object.]

C Thread

Link with -lpthreadGC2 (the pthreads-win32 library built for GCC)

C Thread pthread.c

#include <stdio.h>
#include <stdlib.h>   // system()
#include <unistd.h>   // sleep()
#include <pthread.h>

void *thread_func(void *arg);

char message[] = "Hello World";

int main() {
    pthread_t thread;
    void *thread_result;

    // Start the thread, passing the message as its argument
    pthread_create(&thread, NULL, thread_func, (void *)message);
    printf("Waiting for thread to finish...\n");

    // Block until the thread exits and collect the value it passed to pthread_exit
    pthread_join(thread, &thread_result);
    printf("Thread joined, it returned %s\n", (char *)thread_result);
    system("pause");
    return 0;
}

void *thread_func(void *arg) {
    printf("thread %s is running\n", (char *)arg);
    sleep(3);
    pthread_exit("Thank you for the CPU time\n");
}

Java TCP Socket (per-connection threads) Client.java

import java.net.*;

import java.io.*;

public class Client {

public static void main (String args[]) {

Socket s = null;

try{

int serverPort = 1234;

s = new Socket("localhost", serverPort);

DataInputStream in = new DataInputStream( s.getInputStream());

DataOutputStream out = new DataOutputStream( s.getOutputStream());

out.writeUTF("Hello");

String data = in.readUTF();

System.out.println("Received: "+ data) ;

s.close();

}catch (IOException e){

System.out.println(e.getMessage());

}finally {

if(s!=null)

try {s.close();}

catch (IOException e){}

}

}

}

Java TCP Socket (per-connection threads) Server.java

import java.net.*;

import java.io.*;

public class Server {

public static void main(String args[]) {

try{

int serverPort = 1234;

ServerSocket listenSocket = new ServerSocket(serverPort);

while(true) {

Socket clientSocket = listenSocket.accept();

Connection c = new Connection(clientSocket);

}

} catch(IOException e) {

System.out.println("Listen: " + e.getMessage());

}

}

}

Java TCP Socket (per-connection threads) Connection.java

import java.net.*;

import java.io.*;

class Connection extends Thread {

DataInputStream in;

DataOutputStream out;

Socket clientSocket;

public Connection (Socket ClientSocket) {

try {

clientSocket = ClientSocket;

in = new DataInputStream( clientSocket.getInputStream());

out = new DataOutputStream( clientSocket.getOutputStream());

this.start();

} catch(IOException e){ System.out.println(e.getMessage());}

}

public void run(){

try {

String data = in.readUTF();

out.writeUTF("client data is " + data);

} catch(IOException e) {

System.out.println("IO: " + e.getMessage());

} finally {

try {

clientSocket.close();

} catch (IOException e) {}

}

}

}

Types of time synchronization

External: synchronize all clocks against a single one, usually the one with external, accurate time information.

Internal: synchronize all clocks among themselves.

At least time monotonicity must be preserved.

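Types of time synchronization

External (accuracy): clocks are synchronized to an authoritative time source. Each system clock Ci differs by at most Dext from an external UTC source S at every point in the synchronization interval:

|S - Ci| < Dext for all i

[Figure: clocks C1, C2 and C3, each synchronized to the external source S]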

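Types of time synchronization

Internal (agreement): the clocks cooperate to synchronize with one another. Any two system clocks Ci and Cj differ by at most Dint from each other at every point in the synchronization interval:

|Cj - Ci| < Dint for all i and j

[Figure: clocks C1, C2 and C3 synchronized with one another]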

Types of time synchronization

Dext and Dint are synchronization bounds:

Dint <= 2Dext

Max-Synch-interval = Dint / 2Dext

It means:

If two events have single-value timestamps which differ by less than some value, we CAN'T SAY in which order the events occurred.

With interval timestamps, when intervals overlap, we CAN'T SAY in which order the events occurred.

Clock synchronization in a synchronous system

[Figure: A sends its clock time TA to B. The message takes Ttrans in real time, with Tmin < Ttrans < Tmax, so it arrives when A's clock reads TA + Ttrans. The midpoint (Tmin + Tmax)/2 lies within (Tmax - Tmin)/2 of either end of the interval.]

Ttrans is unknown, but Tmin < Ttrans < Tmax.

The estimate Ttrans = (Tmin + Tmax)/2 is wrong by at most (Tmax - Tmin)/2.

If A sends its clock time TA to B, B can set its clock to TA + (Tmin + Tmax)/2. A and B are then synchronized with bound (Tmax - Tmin)/2.
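The midpoint rule above is plain arithmetic, and a few lines make it concrete. This is a minimal sketch; the class and method names are illustrative and not part of the slides. It assumes the delay really does lie between the two limits.

```java
// Sketch: midpoint estimate of an unknown transmission delay.
// SyncBound, estimate and bound are hypothetical names chosen for this example.
final class SyncBound {
    /** Clock value B adopts after receiving A's time tA, assuming tMin < Ttrans < tMax. */
    static double estimate(double tA, double tMin, double tMax) {
        return tA + (tMin + tMax) / 2.0;
    }

    /** Worst-case error of that estimate: half the width of the delay window. */
    static double bound(double tMin, double tMax) {
        return (tMax - tMin) / 2.0;
    }
}
```

With TA = 100, Tmin = 2 and Tmax = 6, B sets its clock to 104 and is off by at most 2.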

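Clock synchronization in an asynchronous system

In an asynchronous system, we have no Tmax.

How can A synchronize with B? By using the round-trip time Tround = T'A - TA in Cristian's algorithm: A sends a request at TA, B replies with its clock time TB, the reply arrives at T'A, and A sets its clock to

TB + Tround/2

[Figure: A's request leaves at TA and B's reply, carrying TB, arrives at T'A. Tround is the interval between the two on A's clock, and A adopts TB + Tround/2.]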

JAVA RMI (External Clock Synchronize)

JAVA RMI (External Clock Synchronize) Clock.java

import java.rmi.*;

public interface Clock extends Remote {

String getTime() throws RemoteException;

}

JAVA RMI (External Clock Synchronize) ClockImpl.java

import java.rmi.*;

import java.rmi.server.*;

import java.util.*;

public class ClockImpl extends UnicastRemoteObject implements Clock {

public ClockImpl() throws RemoteException {

super();

}

public String getTime() {

Date d = new Date();

return d.toString();

}

}

JAVA RMI (External Clock Synchronize) ClockServer.java

import java.rmi.*;

public class ClockServer {

public ClockServer() {

try {

Clock c = new ClockImpl();

Naming.rebind("//localhost/ClockService",c);

} catch (Exception e) {

System.out.print(e.getMessage());

}

}

public static void main(String args[]) {

new ClockServer();

}

}

JAVA RMI (External Clock Synchronize) ClockClient.java

import java.rmi.*;

import java.net.*;

public class ClockClient {

public static void main(String args[]) {

try {

Clock c = (Clock)Naming.lookup("//localhost/ClockService");

System.out.println(c.getTime());

} catch (Exception e) {

System.out.print(e.getMessage());

}

}

}

Logical time

One aspect of clock synchronization is to provide a mechanism whereby systems can assign sequence numbers ("timestamps") to messages upon which all cooperating processes can agree.

Leslie Lamport (1978) showed that clock synchronization need not be absolute. His two important points lead to "causality".

First point: if two processes do not interact, it is not necessary that their clocks be synchronized; they can operate concurrently without fear of interfering with each other.

Second (critical) point: it is not important that all processes agree on time, but rather that they agree on the order in which events occur.

Such "clocks" are referred to as Logical Clocks. Logical time is based on the happens-before relationship.
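The slides name logical clocks without giving the update rule. The sketch below uses the standard Lamport rule: increment on every local or send event, and on receive jump past the sender's timestamp. The class and method names are illustrative assumptions.

```java
// Sketch of a Lamport logical clock (standard rule; names are hypothetical).
final class LamportClock {
    private long time = 0;

    /** Internal event or send: advance the clock and return the new timestamp. */
    synchronized long tick() {
        return ++time;
    }

    /** Receive: move past the sender's timestamp, counting the receive event itself. */
    synchronized long onReceive(long senderTimestamp) {
        time = Math.max(time, senderTimestamp) + 1;
        return time;
    }

    /** Current clock value, without advancing it. */
    synchronized long now() {
        return time;
    }
}
```

A sender attaches the value returned by tick() to its message, and the receiver passes that value to onReceive(). Every receive then carries a larger timestamp than the matching send, which is what happens-before requires.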

Event Ordering

Happens-before and concurrent events illustrated

[Figure: events e1 to e8 on the timelines of processes P1, P2 and P3]

There is no causal path from e1 to e2 nor from e2 to e1, so e1 and e2 are concurrent. The same holds between e1 and e6, and between e2 and e6.

Types of events: Send, Receive, Internal (change of state)

Co-ordination: why it is hard in distributed systems

Centralised solutions are not appropriate: communications bottleneck

Fixed master-slave arrangements are not appropriate: process crashes

Varying network topologies: ring, tree, arbitrary; connectivity problems

Failures must be tolerated if possible: link failures, process crashes

Impossibility results in the presence of failures, especially in the asynchronous model

Mutual Exclusion requirements

Safety: at most one process may execute in the critical section (CS) at any time

Liveness: every request to enter or exit a CS is eventually granted

Ordering (desirable): requests to enter are granted according to causality order (FIFO)

Synchronization schemes

                             Centralized                    Distributed
Based on mutual exclusion    Central process                Circulating token
No mutual exclusion          Physical clock, event count    Physical clocks, logical clocks

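Mutual Exclusion: three families of implementation

Centralized Approach: when P1 wants to enter the critical section it sends a Request message to the coordinator C. If the critical section is free, C sends back a Reply and P1 enters. If P2 now also asks to enter, C places P2's request in its waiting queue. When P1 leaves the critical section it sends a Release message to C, and C sends a Reply to the process that owns the next request in the waiting queue.

Distributed Approach (timestamp comparison): every node must know the names of all nodes on the network and announce its own name to the others, so nodes should be added as rarely as possible. When a node fails, the system must notify the other nodes at once and repair it, so the nodes need regular upkeep. A process that has not entered the critical section is interrupted again and again to wait on the operations of other processes.

Token Passing Approach: choose a suitable path so that no node starves. If the token is lost, the system must create a new one. If a node on the path fails, the system must rebuild the best new path.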
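The centralized approach described above comes down to a holder plus a FIFO waiting queue. The sketch below models only the coordinator's bookkeeping and leaves message transport out. The class name, the integer process ids and the return conventions are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the coordinator C in the centralized approach (names are hypothetical).
final class MutexCoordinator {
    private Integer holder = null;                          // process in the critical section, if any
    private final Queue<Integer> waiting = new ArrayDeque<>();

    /** Handles Request from pid; returns true if a Reply (grant) goes out immediately. */
    boolean request(int pid) {
        if (holder == null) {
            holder = pid;
            return true;
        }
        waiting.add(pid);                                   // critical section busy: queue the request
        return false;
    }

    /** Handles Release from the holder; returns the next process to receive a Reply, or null. */
    Integer release(int pid) {
        if (holder == null || holder != pid) {
            throw new IllegalStateException("release from a process that is not the holder");
        }
        holder = waiting.poll();                            // next owner in FIFO order, or null
        return holder;
    }
}
```

Serving the queue in arrival order is what gives the centralized scheme its ordering property. The single coordinator is also its bottleneck and single point of failure.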

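Atomicity

Each site Si has its own local transaction coordinator Ci.

If a transaction is started at Si, Ci splits it into parts and hands them to other suitable sites for execution. Whether the transaction ends in success (committed) or failure (aborted), it is Ci that decides how it terminates.

Topics: the two-phase commit protocol and failure handling

[Figure: site Si with coordinator Ci, connected to sites S1, S2, S3, S4, …]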

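Two-Phase Commit Protocol, Phase 1

The coordinator Ci asks every node on the network whether it is prepared to take part in executing T.

Flow:

Ci sends the message prepare(T) to the nodes and records <prepare T>.

On receiving prepare(T), each node checks whether it is ready. If so, it immediately sends ready(T) to Ci and records <ready T>. If not, it sends abort(T) to Ci and records <no T>.

[Figure: Ci at site Si records <prepare T> and sends prepare(T) to S1, S2, S3, S4, …; each site answers ready(T) and records <ready T>, or answers abort(T) and records <no T>.]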

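Two-Phase Commit Protocol, Phase 2

Once every node has answered that it is ready, the coordinator sends a Commit message to each node. The nodes carry out T and report the result back to the coordinator.

Flow:

If Ci receives ready(T) from all nodes, it immediately sends commit(T) to every node, asking each to execute T, and records <commit T>.

If Ci receives abort(T) from any node, or some node has still not answered when the time limit expires, it immediately sends abort(T) to every node, asking each to stop executing T, and records <abort T>.

When a node has acted on the decision it sends acknowledge(T). When Ci has received the acknowledgement from every node, it records <complete T>.

[Figure: Ci at site Si sends commit(T) and records <commit T>, or sends abort(T) and records <abort T>. Sites S1, S2, S3, S4, … answer acknowledge(T), after which Ci records <complete T>.]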
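The coordinator's Phase 2 decision is a single rule: commit only if every participant answered ready(T). A missing answer at the timeout counts the same as abort(T). The sketch below shows that rule only; the enum and class names are illustrative, and logging and messaging are omitted.

```java
import java.util.List;

// Sketch of the coordinator's decision in two-phase commit (names are hypothetical).
final class TwoPhaseCommit {
    /** What the coordinator knows about each participant when Phase 1 ends. */
    enum Vote { READY, ABORT, TIMEOUT }

    enum Decision { COMMIT, ABORT }

    /** Commit only if every participant voted READY; anything else aborts T. */
    static Decision decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.READY) {
                return Decision.ABORT;
            }
        }
        return Decision.COMMIT;
    }
}
```

In a fuller implementation the coordinator would write <commit T> or <abort T> to stable storage before sending the decision, which is what lets recovery work from the log.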

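Failure Handling in 2PC: Failure of a Participating Site

When the failed site recovers, it examines its log:

If the last record is <commit T>, it executes redo(T) and carries on with T.

If the last record is <abort T>, it executes undo(T) and stops T.

If the last record is <ready T>, it must consult the coordinator Ci (if Ci is up) to learn the outcome: redo(T) if T was committed, undo(T) if T was aborted.

If the log holds no control records for T, it executes undo(T) and stops T.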

Failure Handling in 2PC: Failure of the Coordinator

The active sites decide the fate of T from their logs:

If an active site has <commit T> in its log, T must be committed.

If an active site has <abort T> in its log, T must be aborted.

If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T. Rather than wait for Ci to recover, it is preferable to abort T.

If all active sites have a <ready T> record in their logs but no additional control records, we must wait for the coordinator to recover.

Blocking problem: T is blocked pending the recovery of site Si.

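Deadlock Prevention and Avoidance

Resources Ordering Algorithm

  Impose a Global Resources-ordering on all resources in the network and give each resource a unique number.

  A process that currently holds resource i may not request any resource numbered lower than i, so circular wait cannot form.

  Simple to implement; requires little overhead

Banker's Algorithm

  The distributed system picks the most suitable process to act as the Banker, which manages all resources on the network and decides the allocation to every process.

(New) Timestamp Priority Algorithm

  Every process's timestamp (TS) serves as its Priority Number.

  The smaller the TS, the higher the priority (the earlier the process started).

  Only a higher-priority process may request a resource from a lower-priority one.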

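Timestamp Priority Algorithm

Process Pi requests resource R while R is in use by Pj (Pi → Pj).

If Pi's TR is smaller than Pj's TR, Pi may wait for Pj to release R. Otherwise Pi is rolled back.

There is always one process with the lowest priority, so no cycle can exist. Because TR serves as the Priority Number, starvation cannot occur.

Example with TR(P1) = 5, TR(P2) = 10 and TR(P3) = 15, where P1 and P3 each request a resource held by P2:

Passive: Wait-Die Scheme (Nonpreemptive). P1 will wait; P3 will be rolled back.

Aggressive: Wound-Wait Scheme (Preemptive). P1 immediately takes over P2's resource; P3 will wait.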
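Both schemes compare the same two timestamps and differ only in who is rolled back. The sketch below encodes each as a pure function of the requester's and the holder's timestamp; a smaller timestamp means an older, higher-priority process. The class, enum and method names are illustrative assumptions.

```java
// Sketch of the wait-die and wound-wait decisions (names are hypothetical).
final class TimestampPriority {
    enum Action { REQUESTER_WAITS, REQUESTER_ROLLS_BACK, HOLDER_ROLLS_BACK }

    /** Wait-die (nonpreemptive): an older requester waits, a younger one is rolled back. */
    static Action waitDie(long requesterTs, long holderTs) {
        return requesterTs < holderTs ? Action.REQUESTER_WAITS : Action.REQUESTER_ROLLS_BACK;
    }

    /** Wound-wait (preemptive): an older requester preempts the holder, a younger one waits. */
    static Action woundWait(long requesterTs, long holderTs) {
        return requesterTs < holderTs ? Action.HOLDER_ROLLS_BACK : Action.REQUESTER_WAITS;
    }
}
```

With the timestamps 5, 10 and 15 used for P1, P2 and P3: under wait-die, P1 requesting from P2 waits and P3 requesting from P2 is rolled back. Under wound-wait, P1 forces P2 to roll back and P3 waits.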

Deadlock Detection

Centralized Approach

Distributed Approach

[Figure: two Local Wait For Graphs, one over P1, P2, P5, P3 and one over P2, P4, P3, and the Global Wait For Graph over P1 to P5 obtained by combining them]
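A deadlock shows up as a cycle in a wait-for graph, and a cycle may exist in the global graph even when every local graph is acyclic. The sketch below merges local graphs and runs a depth-first search for a cycle. The class and method names and the string process labels are assumptions for illustration, not the slides' notation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of wait-for graph merging and cycle detection (names are hypothetical).
final class WaitForGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    /** Records that `waiter` is blocked on a resource held by `holder`. */
    void addWait(String waiter, String holder) {
        edges.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    /** Copies another site's local graph into this one to build the global graph. */
    void merge(WaitForGraph other) {
        other.edges.forEach((w, hs) -> hs.forEach(h -> addWait(w, h)));
    }

    /** Returns true if some process transitively waits for itself. */
    boolean hasCycle() {
        Set<String> visited = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String p : edges.keySet()) {
            if (dfs(p, visited, onPath)) return true;
        }
        return false;
    }

    private boolean dfs(String p, Set<String> visited, Set<String> onPath) {
        if (onPath.contains(p)) return true;      // back edge: cycle found
        if (!visited.add(p)) return false;        // already fully explored
        onPath.add(p);
        for (String q : edges.getOrDefault(p, Collections.<String>emptySet())) {
            if (dfs(q, visited, onPath)) return true;
        }
        onPath.remove(p);
        return false;
    }
}
```

In the centralized approach one site would collect every local graph, call merge for each, and then run hasCycle on the result.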

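Basic distributed algorithms

Complexity measures

Computational Rounds
  Synchronous algorithms: rounds are counted by the clock.
  Asynchronous algorithms: rounds are counted by the number of waves of events that propagate through the network.

Local Running Time

Space
  Global: the total space used by all computers
  Local: the space each computer needs

Message Complexity
  The total number of messages the computers send.
  If a message M travels over p edges, its message complexity is p|M|, where |M| is the length of M.

Basic distributed algorithms: Ring Leader, Tree Leader, BFS, MST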

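Ring Leader

Each process first sends its id to the next process in the ring. In every later round, each process:

  receives an id from the previous process,
  compares it with its own id,
  sends the smaller of the two values to the next process in the ring.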

Algorithm RingLeader(id):

  Input: The unique identifier, id, for the processor running this algorithm
  Output: The smallest identifier of a processor in the ring

  M ← [Candidate is id]
  Send message M to the successor processor in the ring
  done ← false
  repeat
    Get message M from the predecessor processor in the ring
    if M = [Candidate is i] then
      if i = id then
        M ← [Leader is id]
        done ← true
      else
        m ← min{i, id}
        M ← [Candidate is m]
    else
      {M is a "Leader is" message}
      done ← true
    Send message M to the next processor in the ring
  until done
  return M
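To see the RingLeader pseudocode run, the sketch below simulates all processors in lock-step rounds inside one program. It is a simulation under stated assumptions, not a networked implementation: the ids are unique, processor k sends to processor (k + 1) mod n, and the array-based inbox stands in for the channels. All names are illustrative.

```java
// Sketch: synchronous simulation of RingLeader (names are hypothetical).
final class RingLeaderSim {
    /** Returns, for each ring position, the leader id that processor ends up with. */
    static int[] run(int[] ids) {
        int n = ids.length;
        Integer[] inbox = new Integer[n];        // value arriving at each processor this round
        boolean[] leaderMsg = new boolean[n];    // true if that message is "Leader is"
        boolean[] done = new boolean[n];
        int[] learned = new int[n];
        for (int k = 0; k < n; k++) {
            inbox[(k + 1) % n] = ids[k];         // initial [Candidate is id] to the successor
        }
        int remaining = n;
        while (remaining > 0) {
            Integer[] nextInbox = new Integer[n];
            boolean[] nextLeaderMsg = new boolean[n];
            for (int k = 0; k < n; k++) {
                if (done[k] || inbox[k] == null) continue;
                int i = inbox[k];
                // A processor finishes on a "Leader is" message or when its own id comes back.
                boolean leader = leaderMsg[k] || i == ids[k];
                int out = leader ? i : Math.min(i, ids[k]);
                if (leader) {
                    done[k] = true;
                    learned[k] = i;
                    remaining--;
                }
                nextInbox[(k + 1) % n] = out;    // forward to the next processor in the ring
                nextLeaderMsg[(k + 1) % n] = leader;
            }
            inbox = nextInbox;
            leaderMsg = nextLeaderMsg;
        }
        return learned;
    }
}
```

Only the smallest id can survive a full trip around the ring, because every other candidate is replaced by a smaller value somewhere along the way. So the first processor to see its own id come back is the leader.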

Analysis

Computational Rounds: O(2N)
Local Running Time: O(N)
Local Space: O(1)
Message Complexity: O(N²)

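Tree Leader

Assume the network is a free tree.

The natural starting points are the external nodes.

Asynchronous version: a Message Check tests whether a message has been sent along a given edge and has reached the node.

Two phases

  Accumulation Phase: ids flow in from the external nodes of the tree; each node records the smallest id seen, which identifies the Leader.

  Broadcast Phase: the Leader id is broadcast out to the external nodes.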

Algorithm TreeLeader(id):

  Input: The unique identifier, id, for the processor running this algorithm
  Output: The smallest identifier of a processor in the tree

  {Accumulation Phase}
  Let d be the number of neighbors of processor id
  m ← 0 {counter for messages received}
  ℓ ← id {tentative leader}
  repeat
    {begin a new round}
    for each neighbor j do
      check if a message from processor j has arrived
      if a message M = [Candidate is i] from j has arrived then
        ℓ ← min{i, ℓ}
        m ← m + 1
  until m ≥ d − 1
  if m = d then
    M ← [Leader is ℓ]
    for each neighbor j do
      send message M to processor j
    return M {M is a "Leader is" message}
  else
    M ← [Candidate is ℓ]
    send M to the neighbor k that has not sent a message yet

  {Broadcast Phase}
  repeat
    {begin a new round}
    check if a message from processor k has arrived
    if a message M from k has arrived then
      m ← m + 1
      if M = [Candidate is i] then
        ℓ ← min{i, ℓ}
        M ← [Leader is ℓ]
        for each neighbor j do
          send message M to processor j
      else
        {M is a "Leader is" message}
        for each neighbor j ≠ k do
          send message M to processor j
  until m = d
  return M {M is a "Leader is" message}

Analysis (di is the number of neighbors of processor i)

Computational Rounds: O(D)
Local Running Time: O(di D)
Local Space: O(di)
Message Complexity: O(N)

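Tree Leader (synchronous)

In the synchronous model the algorithm behaves like the ripples raised by a stone thrown into a pond.

The Diameter is the length of the longest path between any two nodes of the graph.

The number of rounds equals the Diameter.

Two phases

  Accumulation Phase: toward the center
  Broadcast Phase: spreading outward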

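Breadth-first Search (synchronous)

Node s is designated the source node.

The search spreads outward as a wave, building the BFS Tree level by level from the top down.

Each node v sends a message to every neighbor that has not yet been in contact with v.

Every node v other than s must choose another node as its parent.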

Algorithm SynchronousBFS(v, s):

  Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
  Output: For each node v, its parent in a BFS tree rooted at s

  repeat
    {begin a new round}
    if v = s or v has received a message from one of its neighbors then
      set parent(v) to be a node requesting v to become its child (or null, if v = s)
      for each node w adjacent to v that has not contacted v yet do
        send a message to w asking w to become a child of v
  until v = s or v has received a message

Analysis (n nodes, m edges)

Measures: Computational Rounds, Local Running Time, Local Space, Message Complexity

Message Complexity: O(n + m)

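Breadth-first Search (asynchronous)

Every processor must know the total number of processes in the network.

The root s sends out a "pulse" message that triggers the other processes to begin the next round of the overall computation.

Pulses are combined along the tree:

  a pulse-down travels from the root s down the BFS Tree,
  a pulse-up travels from the external nodes of the BFS Tree all the way up to the root s.

A new pulse-down is sent only after the pulse-up has been received.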

Algorithm AsynchronousBFS(v, s):

  Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
  Output: For each node v, its parent in a BFS tree rooted at s

  C ← ø {verified BFS children for v}
  set A to be the set of neighbors of v
  repeat
    {begin a new round}
    if parent(v) is defined or v = s then
      if parent(v) is defined then
        wait for a pulse-down message from parent(v)
      if C is not empty then
        {v is an internal node in the BFS tree}
        send a pulse-down message to all nodes in C
        wait for a pulse-up message from all nodes in C
      else
        {v is an external node in the BFS tree}
        for each node u in A do
          send a make-child message to u
        for each node u in A do
          get a message M from u and remove u from A
          if M is an accept-child message then
            add u to C
      send a pulse-up message to parent(v)
    else
      {v ≠ s has no parent yet}
      for each node w in A do
        if w has sent v a make-child message then
          remove w from A {w is no longer a candidate child for v}
          if parent(v) is undefined then
            parent(v) ← w
            send an accept-child message to w
          else
            send a reject-child message to w
  until (v has received message done) or (v = s and has pulsed-down n-1 times)
  send a done message to all the nodes in C

Analysis (n nodes, m edges)

Measures: Computational Rounds, Local Running Time, Local Space, Message Complexity

Message Complexity: O(n² + m)

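Minimum Spanning Tree

The MST is found with Borůvka's algorithm, an efficient sequential algorithm.

Distributed Borůvka's algorithm in the synchronous model:

  determine all connected components,
  for each connected component, find the minimum-weight edge that joins it to another component and add that edge.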

Algorithm KruskalMST(G):

  Input: A simple connected weighted graph G with n vertices and m edges
  Output: A minimum spanning tree T for G

  for each vertex v in G do
    define an elementary cluster C(v) ← {v}
  initialize a priority queue Q to contain all edges in G, using the weights as keys
  T ← ø
  while T has fewer than n-1 edges do
    (u, v) ← Q.removeMin()
    Let C(v) be the cluster containing v, and let C(u) be the cluster containing u
    if C(v) ≠ C(u) then
      Add edge (v, u) to T
      Merge C(v) and C(u) into one cluster, that is, union C(v) and C(u)
  return tree T

Analysis (n nodes, m edges)

Computational Rounds: O(log n)
Local Running Time, Local Space: O(m)
Message Complexity: O(m log n)

Time synchronization algorithms

Synchronization Algorithms

Multicast

Uses a central time server to synchronize clocks:

  Cristian's algorithm (centralised)
  Berkeley algorithm (centralised)

The Network Time Protocol (decentralised)

Cristian's Algorithm (1989)

Uses a time server to synchronize clocks; the server holds the reference time.

Clients ask the time server for the time. The polling period depends on the maximum clock drift and the accuracy required.

Clients receive the value and may:

  use it as it is,
  add the known minimum network delay,
  add half the time between this send and receive.

For links with symmetrical latency:

RTT = response-received-time - request-sent-time

adjusted-local-time = server-timestamp + minimum network delay
  or server-timestamp + (RTT / 2)
  or server-timestamp + (RTT - server-latency) / 2

local-clock-error = adjusted-local-time - local-time
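The formulas above translate directly into code. The sketch below implements the third variant, which subtracts the server's processing latency before halving. It is a sketch assuming symmetric link latency. The class, method and parameter names are illustrative, and all values are in milliseconds.

```java
// Sketch of the client-side arithmetic in Cristian's algorithm (names are hypothetical).
final class CristianClient {
    /** adjusted-local-time = server-timestamp + (RTT - server-latency) / 2 */
    static long adjustedTime(long requestSent, long responseReceived,
                             long serverTimestamp, long serverLatency) {
        long rtt = responseReceived - requestSent;
        return serverTimestamp + (rtt - serverLatency) / 2;
    }

    /** local-clock-error = adjusted-local-time - local-time */
    static long clockError(long adjustedLocalTime, long localTime) {
        return adjustedLocalTime - localTime;
    }
}
```

If the request leaves at 1000, the response arrives at 1020, the server stamped 5000 and spent 4 ms handling the request, the adjusted time is 5000 + (20 - 4) / 2 = 5008. Passing a server latency of 0 gives the simpler server-timestamp + RTT / 2 variant.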

Berkeley algorithm (Gusella & Zatti, 1989)

If no machines have receivers, …

The Berkeley algorithm uses a designated server to synchronize.

The designated server polls or broadcasts to all machines for their time, adjusts the times received for RTT and latency, averages the times, and tells each machine how to adjust.

Polling is done using Cristian's algorithm.

The average time is more accurate, but still drifts.
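The averaging step can be shown in a few lines. The sketch below assumes the readings have already been corrected for RTT and latency, includes the designated server's own reading, and returns the adjustment each machine should apply rather than an absolute time. The class and method names are illustrative.

```java
// Sketch of the averaging step of the Berkeley algorithm (names are hypothetical).
final class BerkeleyMaster {
    /** Given every machine's clock reading, returns the adjustment to send to each one. */
    static long[] adjustments(long[] readings) {
        long sum = 0;
        for (long r : readings) {
            sum += r;
        }
        long average = sum / readings.length;
        long[] delta = new long[readings.length];
        for (int k = 0; k < readings.length; k++) {
            delta[k] = average - readings[k];   // positive: move forward, negative: move back
        }
        return delta;
    }
}
```

For readings of 180, 205 and 170 minutes the average is 185, so the machines are told to adjust by +5, -20 and +15. Sending a relative adjustment instead of the average itself avoids adding another transmission delay to the result.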

Network Time Protocol

NTP is the best known and most widely implemented decentralised algorithm. It is used for time synchronization on the Internet.

[Figure: a three-level hierarchy. Level 1: primary server, direct synchronization. Level 2: secondary servers, synchronized by the primary server. Level 3: tertiary servers, synchronized by the secondary servers.]

www.ntp.org

Mutual exclusion algorithms

Assumptions

  Each pair of processes is connected by reliable channels (such as TCP).
  Messages are eventually delivered to the recipient's input buffer.
  Processes will not fail.
  There is agreement on how a resource is identified: pass the identifier with requests.

Exclusive Access Algorithms

  Centralized Algorithm
  Token Ring Algorithm
  Lamport Algorithm (Timestamp Approach)
  Ricart & Agrawala Algorithm

Leader Election Algorithms

  Bully Algorithm
  Ring Algorithm
  Chang & Roberts Algorithm
  Itai & Rodeh Algorithm

Centralized Algorithm

Operations

1. Request resource: send a request to the coordinator to enter the CS

2. Wait for response

3. Receive grant: the coordinator grants permission to enter the CS and keeps a queue of requests to enter the CS

4. Access resource

5. Release resource: send a release message to inform the coordinator

Safety, liveness and order are guaranteed.

Delay

  Client: one round trip time (request + grant)
  Synchronization: one round trip time (release + grant)

[Figure: process P sends Request(R) to coordinator C, receives Grant(R) and later sends Release(R). A second panel shows P1 to P4 around the coordinator, whose queue of requests holds 4 and 2, with Request, Grant and Release messages.]

Token Ring Algorithm

Operations

  For each CS a token is used.
  Only the process holding the token can enter the CS.
  To exit the CS, the process sends the token on to its neighbor.
  If a process does not need to enter the CS when it receives the token, it forwards the token to the next neighbor.

Only one process holds the token at any time, which guarantees mutual exclusion.

The order around the ring is well defined, so starvation cannot occur.

If the token is lost (e.g. a process died), it must be regenerated.

Safety and liveness are guaranteed, but ordering is not.

Delay

  Client: 0 to N message transmissions.
  Synchronization: between one process's exit from the CS and the next process's entry there are between 1 and N message transmissions.

Lamport Algorithm

A total ordering of requests is established by logical timestamps. Each process maintains a request queue (of mutual exclusion requests).

Requesting the CS, Pi:

  multicasts "request" (i, Ti) to all processes (Ti is the local Lamport time),
  places the request on its own queue,
  waits until all processes "reply".

Entering the CS: Pi enters once its own request is at the head of its queue and it has received a message (ack or release) from every other process with a timestamp larger than Ti.

Releasing the CS, Pi:

  removes the request from its queue,
  sends a timestamped release message.

A process that receives the release removes that request from its queue. This may cause its own entry to have the earliest timestamp in the queue, enabling it to access the critical section.

Ricart & Agrawala Algorithm

Uses reliable multicast and logical clocks.

When a process wants to enter the critical section:

  it composes a message containing its identifier (machine ID, process ID), the name of the resource and the current time,
  it sends the request to all processes and waits until everyone gives permission.

When a process receives a request:

  If the receiver is not interested: send OK to the sender.
  If the receiver is in the critical section: do not reply; add the request to the queue.
  If the receiver has just sent a request as well: compare the timestamps of the received and sent messages; the earliest wins. If the receiver is the loser, it sends OK. If the receiver is the winner, it does not reply and queues the request.

When done with the critical section: send OK to all queued requests.

Ricart & Agrawala Algorithm

On initialization
  state := RELEASED;

To enter the critical section
  state := WANTED;
  Multicast request to all processes;    {request processing deferred here}
  T := request's timestamp;
  Wait until (number of replies received = (N – 1));
  state := HELD;

On receipt of a request <Ti, pi> at pj (i ≠ j)
  if (state = HELD) or ((state = WANTED) and ((T, pj) < (Ti, pi)))
  then queue request from pi without replying;
  else reply immediately to pi;

To exit the critical section
  state := RELEASED;
  reply to any queued requests;
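The heart of the pseudocode is the test applied on receipt of a request. The sketch below isolates that test as a pure function. The pair comparison is assumed to order by timestamp first and process id second, which is the usual tie-break. The class, enum and parameter names are illustrative.

```java
// Sketch of the "queue or reply" rule in Ricart & Agrawala (names are hypothetical).
final class RicartAgrawala {
    enum State { RELEASED, WANTED, HELD }

    /**
     * Returns true if process pj, in the given state with own request timestamp ownT,
     * must queue the request <ti, pi> instead of replying at once.
     */
    static boolean defer(State state, long ownT, int pj, long ti, int pi) {
        if (state == State.HELD) {
            return true;                       // pj is inside the critical section
        }
        if (state != State.WANTED) {
            return false;                      // RELEASED: reply immediately
        }
        // (T, pj) < (Ti, pi): earlier timestamp wins, process id breaks ties.
        return ownT < ti || (ownT == ti && pj < pi);
    }
}
```

With P2 requesting at timestamp 78 and P1 at 87, P2 defers P1's request while P1 replies to P2 at once, so P2 enters first.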

Ricart & Agrawala Algorithm

Safety, liveness, and ordering are guaranteed.

It takes 2(N-1) messages per entry operation (N-1 multicast requests + N-1 replies), or N messages if the underlying network supports multicast. [Lamport's algorithm takes 3(N-1).]

Delay

  Client: one round-trip time
  Synchronization: one message transmission time

[Figure: processes P1, P2 and P3 exchanging Reply messages. P2's message has timestamp 78 and P1's message has timestamp 87. P2 changes to "held"; P1 remains in "wanted" until P2 sends "reply".]

P2 cannot send a Reply to P1, because P1's timestamp is larger than P2's.

Leader Election Algorithms

The problem

  N processes, which may or may not have unique IDs (UIDs)
  For simplicity, assume no crashes
  Must choose a unique master coordinator amongst the processes

Requirements

  Every process knows P, the identity of the leader, where P is a unique process id (usually the maximum) or is as yet undefined.
  All processes participate and eventually discover the identity of the leader (it cannot remain undefined).
  When a coordinator fails, the algorithm must elect the active process with the largest priority number.

Two types of algorithm

  Bully: "the biggest guy in town wins"
  Ring: a logical, cyclic grouping

Bully Algorithm

Assumptions

  Synchronous system.
  All messages arrive within Ttrans units of time.
  A reply is dispatched within Tprocess units of time of the receipt of a message.
  If no response is received within 2Ttrans + Tprocess, the node is assumed to be dead.

A process that knows it has the highest id elects itself coordinator and sends a coordinator message to all processes with lower ids.

When a process P notices that the coordinator has not answered requests for too long, it starts an election.

To hold the election, P sends an election message to the processes with higher ids.

  If nobody responds, P becomes the coordinator.
  If a higher-numbered process answers, P's job is done.

Bully Algorithm Performance

Best case scenario: the process with the second highest id notices the failure of the coordinator and elects itself.

  N-2 coordinator messages are sent.
  Turnaround time is one message transmission time.

Worst case scenario: the process with the lowest id detects the failure.

  N-1 processes altogether begin elections, each sending messages to processes with higher ids.
  The message overhead is O(N²).
  Turnaround time is approximately 5 message transmission times.

Ring Algorithm

No token is used in this algorithm.

When the algorithm ends, every process holds an active list (consisting of all the priority numbers of all active processes in the system).

If process Pi detects a coordinator failure, it creates a new, initially empty active list, sends the message elect(i) to its right neighbor, and adds the number i to its active list.

If Pi receives a message elect(j) from the process on its left, it must respond in one of three ways:

  If this is the first elect message it has seen or sent, Pi creates a new active list with the numbers i and j, then sends the message elect(i) followed by the message elect(j).

  If i ≠ j, Pi adds j to its active list and forwards the message to its right neighbor.

  If i = j, that is, Pi receives the message elect(i), the active list for Pi now contains the numbers of all the active processes in the system. Pi can now determine the largest number in the active list to identify the new coordinator process.

Chang & Roberts Algorithm

Assume

  Unidirectional ring
  Asynchronous system
  Each process has a UID

Election

  Initially each process is a non-participant.

  Determine the leader (election message):
    the initiator becomes a participant and passes its own UID on to its neighbour;
    when a non-participant receives an election message, it forwards the maximum of its own and the received UID and becomes a participant;
    a participant does not forward an election message that carries a UID smaller than its own.

  Announce the winner (elected message):
    when a participant receives an election message with its own UID, it becomes the leader and a non-participant, and forwards its UID in an elected message;
    otherwise, a process records the leader's UID, becomes a non-participant and forwards the elected message.
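The election phase is easy to trace with a small simulation. The sketch below follows one election message around a unidirectional ring for a single initiator. Under that assumption the participant flags never suppress a message, so they are left out. UIDs are assumed unique, and all names are illustrative.

```java
// Sketch: single-initiator run of the Chang & Roberts election phase (names are hypothetical).
final class ChangRoberts {
    /** Returns the UID elected when the process at ring position `initiator` starts. */
    static int elect(int[] uids, int initiator) {
        int n = uids.length;
        int msg = uids[initiator];       // UID carried by the election message
        int at = (initiator + 1) % n;    // ring position about to receive the message
        while (msg != uids[at]) {
            if (msg < uids[at]) {
                msg = uids[at];          // receiver substitutes its own, larger UID
            }
            at = (at + 1) % n;           // forward to the next neighbour
        }
        return msg;                      // a process received its own UID: it is the leader
    }
}
```

The message needs at most two trips around the ring: one to pick up the maximum UID and one to carry it back to its owner. The elected message that announces the winner would add one more trip.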

Itai & Rodeh Algorithm

Assume

  Unidirectional ring
  Synchronous system
  Processes do not have UIDs

Election

  Each process selects an ID at random from the set {1, ..., K}: non-unique, but fast.
  Processes pass all IDs around the ring.
  After one round, if there exists a unique ID, elect the maximum unique ID; otherwise, repeat.

How do we know the algorithm terminates?

  From probabilities: if you keep flipping a fair coin, then after several heads you will eventually get tails, with probability 1.
