Operating System Concepts (作業系統原理), CHAPTER 15: System Security (系統安全性)
Transcript of 分散式系統 (Distributed Systems)
Advantages
Resource Sharing: when processes at different sites are connected, USER A can use USER B's resources.
Computation Speedup: a hard, complex problem is partitioned across multiple processors and solved jointly.
Reliability: each processor has its own independent memory, so when one processor fails, the work of the other processors is unaffected; the remaining processors can also help with recovery.
Communication: any connected users can communicate and consult one another over the network.
Types of Distributed Operating Systems
Data Migration: Site A sends data to Site B; the data sent can depend on the request, but the format must be consistent to avoid losing data.
Computation Migration: the user sends a command over the network to a remote processor, which executes it with its local resources and returns the result to the user.
Process Migration: a process is sent over the network to execute remotely. Reasons for doing so: Load Balancing, Computation Speedup, Hardware/Software Preference, Data Access.
C Socket for Windows
Server.c

#include <winsock2.h>
#include <stdio.h>

int main() {
    SOCKET server_sockfd, client_sockfd;
    int server_len, client_len;
    struct sockaddr_in server_address, client_address;

    // Register the Winsock DLL
    WSADATA wsadata;
    WSAStartup(0x101, (LPWSADATA)&wsadata);

    // Create the server socket
    // AF_INET (use IPv4); SOCK_STREAM; 0 (i.e., TCP)
    server_sockfd = socket(AF_INET, SOCK_STREAM, 0);

    server_address.sin_family = AF_INET;
    server_address.sin_addr.s_addr = inet_addr("127.0.0.1");
    server_address.sin_port = htons(1234);   // port in network byte order
    server_len = sizeof(server_address);
    bind(server_sockfd, (struct sockaddr *)&server_address, server_len);
    listen(server_sockfd, 5);   // 5 = backlog queue length

    while (1) {
        char ch;
        printf("Server waiting...\n");
        client_len = sizeof(client_address);
        client_sockfd = accept(server_sockfd, (struct sockaddr *)&client_address, &client_len);
        recv(client_sockfd, &ch, 1, 0);   // receive 'A'
        ch++;                             // 'A' -> 'B'
        send(client_sockfd, &ch, 1, 0);   // send 'B'
        closesocket(client_sockfd);
    }
    // WSACleanup() belongs after the server stops accepting connections,
    // not inside the per-client loop.
    closesocket(server_sockfd);
    WSACleanup();
    return 0;
}
Client.c

#include <winsock2.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    SOCKET sockfd;
    int len;
    struct sockaddr_in address;
    char ch = 'A';

    WSADATA wsadata;
    WSAStartup(0x202, (LPWSADATA)&wsadata);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = inet_addr("127.0.0.1");
    address.sin_port = htons(1234);   // must match the server's port
    len = sizeof(address);
    connect(sockfd, (struct sockaddr *)&address, len);

    send(sockfd, &ch, 1, 0);
    recv(sockfd, &ch, 1, 0);
    printf("char from server = %c\n", ch);

    closesocket(sockfd);
    WSACleanup();
    system("pause");
    return 0;
}
Distributed Systems: Concepts and Design
[Figure: Client and server with threads. A client thread (T1) generates requests and receives results; on the server, a receipt-and-queuing thread (Thread 1) queues incoming requests for N worker threads that perform the input-output. Thread 2 makes requests to the server.]
[Figure: Alternative server threading architectures. (a) Thread-per-request: per-request worker threads perform remote I/O. (b) Thread-per-connection: per-connection threads serve remote objects. (c) Thread-per-object: per-object threads serve remote objects.]
C Thread (link with -lpthreadGC2)

pthread.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void *thread_func(void *arg);
char message[] = "Hello World";

int main() {
    pthread_t thread;
    void *thread_result;
    pthread_create(&thread, NULL, thread_func, (void *)message);
    printf("Waiting for thread to finish...\n");
    pthread_join(thread, &thread_result);
    printf("Thread joined, it returned %s\n", (char *)thread_result);
    system("pause");
    return 0;
}

void *thread_func(void *arg) {
    printf("thread %s is running\n", (char *)arg);
    sleep(3);
    pthread_exit("Thank you for the CPU time\n");
}
Java TCP Socket (per-connection threads) Client.java
import java.net.*;
import java.io.*;
public class Client {
public static void main (String args[]) {
Socket s = null;
try{
int serverPort = 1234;
s = new Socket("localhost", serverPort);
DataInputStream in = new DataInputStream( s.getInputStream());
DataOutputStream out = new DataOutputStream( s.getOutputStream());
out.writeUTF("Hello");
String data = in.readUTF();
System.out.println("Received: "+ data) ;
s.close();
}catch (IOException e){
System.out.println(e.getMessage());
}finally {
if(s!=null)
try {s.close();}
catch (IOException e){}
}
}
}
Java TCP Socket (per-connection threads) Server.java
import java.net.*;
import java.io.*;
public class Server {
public static void main(String args[]) {
try{
int serverPort = 1234;
ServerSocket listenSocket = new ServerSocket(serverPort);
while(true) {
Socket clientSocket = listenSocket.accept();
Connection c = new Connection(clientSocket);
}
} catch(IOException e) {
System.out.println(e.getMessage());
}
}
}
Java TCP Socket (per-connection threads) Connection.java
import java.net.*;
import java.io.*;
class Connection extends Thread {
DataInputStream in;
DataOutputStream out;
Socket clientSocket;
public Connection (Socket aClientSocket) {
try {
clientSocket = aClientSocket;
in = new DataInputStream( clientSocket.getInputStream());
out = new DataOutputStream( clientSocket.getOutputStream());
this.start();
} catch(IOException e){ System.out.println(e.getMessage());}
}
public void run(){
try {
String data = in.readUTF();
out.writeUTF("client data is " + data);
} catch(IOException e) {
System.out.println(e.getMessage());
} finally {
try {
clientSocket.close();
} catch (IOException e) {}
}
}
}
Types of Time Synchronization
External: synchronize all clocks against a single one, usually the one with external, accurate time information.
Internal: synchronize all clocks among themselves.
At least time monotonicity must be preserved.
Types of Time Synchronization: External (accuracy)
Synchronize against an authoritative time source. Each system clock Ci differs by at most Dext at every point in the synchronization interval from an external UTC source S: |S - Ci| < Dext for all i.
Types of Time Synchronization: Internal (agreement)
The clocks synchronize with one another. Any two system clocks Ci and Cj differ by at most Dint at every point in the synchronization interval: |Cj - Ci| < Dint for all i and j.
Types of Time Synchronization: Bounds
Dext and Dint are synchronization bounds, with Dint <= 2 Dext. The maximum resynchronization interval is Dint / (2 rho), where rho is the maximum clock drift rate.
It means: if two events have single-value timestamps which differ by less than some value, we CANNOT SAY in which order the events occurred.
With interval timestamps, when intervals overlap, we CANNOT SAY in which order the events occurred.
Synchronizing Clocks in Synchronous Systems
[Figure: A sends its clock time TA to B; the message arrives after a transmission delay Ttrans, with Tmin < Ttrans < Tmax on the real-time axis, so B receives it at real time TA + Ttrans.]
Ttrans is unknown, but Tmin < Ttrans < Tmax, so the estimate Ttrans = (Tmin + Tmax)/2 is wrong by at most (Tmax - Tmin)/2.
If A sends its clock time TA to B, B can set its clock to TA + (Tmin + Tmax)/2; then A and B are synchronized within the bound (Tmax - Tmin)/2.
Synchronizing Clocks in Asynchronous Systems
In an asynchronous system, we have no Tmax. How can A synchronize with B? By using the round-trip time Tround = T'A - TA in Cristian's algorithm: A sends a request at TA, B replies with its clock time TB, A receives the reply at T'A, and A sets its clock to TB + Tround/2.
[Figure: A's request leaves at TA on A's clock, B replies at TB on B's clock, A receives the reply at T'A; the adjusted time is TB + Tround/2.]
JAVA RMI (External Clock Synchronization)

Clock.java

import java.rmi.*;
public interface Clock extends Remote {
    String getTime() throws RemoteException;
}

ClockImpl.java

import java.rmi.*;
import java.rmi.server.*;
import java.util.*;
public class ClockImpl extends UnicastRemoteObject implements Clock {
    public ClockImpl() throws RemoteException {
        super();
    }
    public String getTime() {
        Date d = new Date();
        return d.toString();
    }
}
JAVA RMI (External Clock Synchronization): ClockServer.java
import java.rmi.*;
public class ClockServer {
public ClockServer() {
try {
Clock c = new ClockImpl();
Naming.rebind("//localhost/ClockService",c);
} catch (Exception e) {
System.out.print(e.getMessage());
}
}
public static void main(String args[]) {
new ClockServer();
}
}
JAVA RMI (External Clock Synchronization): ClockClient.java
import java.rmi.*;
import java.net.*;
public class ClockClient {
public static void main(String args[]) {
try {
Clock c = (Clock)Naming.lookup("//localhost/ClockService");
System.out.println(c.getTime());
} catch (Exception e) {
System.out.print(e.getMessage());
}
}
}
Logical Time
One aspect of clock synchronization is to provide a mechanism whereby systems can assign sequence numbers ("timestamps") to messages upon which all cooperating processes can agree.
Leslie Lamport (1978) showed that clock synchronization need not be absolute. Lamport's two important points lead to "causality".
First point: if two processes do not interact, it is not necessary that their clocks be synchronized; they can operate concurrently without fear of interfering with each other.
Second (critical) point: it is not important that all processes agree on time, but rather that they agree on the order in which events occur.
Such "clocks" are referred to as Logical Clocks. Logical time is based on the happens-before relationship.
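A logical clock in Lamport's sense can be sketched as below; this is a minimal illustration (the class name and API are not from the slides), showing the usual rules: increment on a local or send event, and take max(local, received) + 1 on receive.

```java
// Minimal Lamport logical clock sketch (illustrative only).
public class LamportClock {
    private long time = 0;

    // Internal event or send event: advance the local clock.
    long tick() { return ++time; }

    // Receive event: jump past the sender's timestamp, then advance.
    long receive(long msgTime) {
        time = Math.max(time, msgTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock p1 = new LamportClock(), p2 = new LamportClock();
        long sendTs = p1.tick();          // p1 sends a message stamped 1
        p2.tick(); p2.tick();             // p2 has local time 2
        long recvTs = p2.receive(sendTs); // max(2, 1) + 1 = 3
        System.out.println(recvTs);
    }
}
```

The receive rule is what makes the timestamps consistent with happens-before: a message's receive event is always numbered after its send event.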
Event Ordering
Happens-before and concurrent events illustrated: there is no causal path from e1 to e2 nor from e2 to e1, so e1 and e2 are concurrent; there is no causal path from e1 to e6 nor from e6 to e1, so e1 and e6 are concurrent; there is no causal path from e2 to e6 nor from e6 to e2, so e2 and e6 are concurrent.
Types of events: send, receive, internal (change of state).
[Figure: events e1..e8 on the timelines of processes P1, P2, P3.]
Co-ordination: Difficulties in Distributed Systems
Centralised solutions are not appropriate: communications bottleneck.
Fixed master-slave arrangements are not appropriate: process crashes.
Varying network topologies: ring, tree, arbitrary; connectivity problems.
Failures must be tolerated if possible: link failures, process crashes.
Impossibility results in the presence of failures, especially in the asynchronous model.
Mutual Exclusion: Requirements
Safety: at most one process may execute in the CS at any time.
Liveness: every request to enter and exit a CS is eventually granted.
Ordering (desirable): requests to enter are granted according to causality order (FIFO).
Synchronization schemes, centralized vs. distributed:
Based on mutual exclusion: a central process (centralized) or a circulating token (distributed).
No mutual exclusion, based on an event count: physical clocks or logical clocks.
Mutual Exclusion: Three Classes of Implementation
Centralized Approach: when P1 wants to enter the critical section, it sends a Request message; the coordinator C receives the Request, and if entry to the critical section is allowed, C sends a Reply message and P1 enters. If P2 also wants the critical section at this time, C places P2's request in a Waiting Queue. When P1 leaves the critical section, it sends a Release message to C, and C sends a Reply to the process owning the next request in the Waiting Queue.
Distributed Approach: compares timestamps. Every node must know the names of all nodes on the network and announce its own name to the other nodes, which discourages adding nodes frequently. When a node fails, the system must immediately notify the other nodes and repair it, so every node must be regularly kept in working order. A process not yet in the critical section must repeatedly pause and wait on the operations of the other processes.
Token Passing Approach: needs a suitable path so that no node suffers starvation; if the token is lost, the system must generate a replacement token; if a node on the path fails, the system must reconstruct a best new path.
Atomicity
Each site Si has its own local transaction coordinator Ci. When a transaction T originates at Si, Ci divides T into parts and hands them to the other appropriate sites for execution; finally, whether execution committed or aborted, Ci decides how to end the transaction.
Two-Phase Commit Protocol and failure handling.
[Figure: coordinator Ci at site Si communicating with sites S1, S2, S3, S4, ...]
Two-Phase Commit Protocol, Phase 1:
The coordinator Ci asks every node on the network whether it is prepared to participate in the execution.
Flow: Ci sends the message prepare(T) to the nodes and writes the record <prepare T> to its log. When a node receives prepare(T), it checks whether it is ready; if so, it immediately sends ready(T) to Ci and logs <ready T>; if not, it sends abort(T) to Ci and logs <no T>.
[Figure: Ci sends prepare(T) to S1..S4 and logs <prepare T>; each site replies ready(T) and logs <ready T>, or replies abort(T) and logs <no T>.]
Two-Phase Commit Protocol, Phase 2:
Once every node has replied that it is ready, the coordinator sends a Commit message to the nodes; they begin executing and return their results to the coordinator.
Flow: when ready(T) has been received from every node, Ci immediately sends commit(T) to the nodes, asking them to start executing T, and logs <commit T>. If any node replies abort(T), or some node has not replied when the time limit expires, Ci immediately sends abort(T) to the nodes, asking them to stop executing T, and logs <abort T>. When a node has executed and produced its result, it sends acknowledge(T); once Ci has received the responses from all nodes, it logs <complete T>.
[Figure: Ci sends commit(T) and logs <commit T>, or sends abort(T) and logs <abort T>; sites reply acknowledge(T) and Ci logs <complete T>.]
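The coordinator's commit-or-abort rule described above is small enough to sketch directly. This is an illustrative sketch, not the protocol's full logging and messaging; the names are hypothetical.

```java
import java.util.List;

// Sketch of the coordinator's Phase-1 decision rule in two-phase commit:
// commit only if every participant voted ready(T); any abort vote, or a
// missing vote at the time limit (modeled as NO_REPLY), forces abort(T).
public class TwoPhaseCommit {
    enum Vote { READY, ABORT, NO_REPLY }

    static String decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.READY) return "abort(T)"; // one dissenter aborts all
        }
        return "commit(T)";
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(Vote.READY, Vote.READY)));    // commit(T)
        System.out.println(decide(List.of(Vote.READY, Vote.NO_REPLY))); // abort(T)
    }
}
```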
Failure Handling in 2PC: Failure of a Participating Site
If the last log record is <commit T>, perform redo(T) after recovery and continue the work of T.
If the last log record is <abort T>, perform undo(T) after recovery and stop the work of T.
If the last log record is <ready T>, after recovery consult the coordinator (which may still be waiting on replies from the sites) about T's outcome: if T committed, perform redo(T) and continue; otherwise perform undo(T) and stop.
If the log contains no control records for T, perform undo(T) after recovery and stop the work of T.

Failure Handling in 2PC: Failure of the Coordinator
If the last log record is <commit T>, the work of T must already have completed. If the last log record is <abort T>, the work of T must already have stopped.
If some active site does not contain the record <ready T> in its log, then the failed coordinator Ci cannot have decided to commit T; rather than wait for Ci to recover, it is preferable to abort T.
If all active sites have a <ready T> record in their logs but no additional control records, we must wait for the coordinator to recover. Blocking problem: T is blocked pending the recovery of site Si.
Deadlock Prevention and Avoidance
Resource Ordering Algorithm: impose a global resource ordering on all resources on the network for the anticipated work, giving each resource a unique number. A process that currently holds resource i may not request any resource numbered lower than i; this reduces the chance of circular wait. Simple to implement; requires little overhead.
Banker's Algorithm: the distributed system elects the most suitable process to act as the Banker, managing all resources on the network and making the most appropriate resource allocation to each process.
(New) Timestamp Priority Algorithm: the timestamp TS of every process on the network is set as that process's priority number; the smaller the TS, the higher the priority (the earlier the process started). Only a higher-priority process may make resource requests of a lower-priority one.
Timestamp Priority Algorithm: suppose Pi requests resource R while R is in use by Pj. If Pi's timestamp is smaller than Pj's, Pi may wait for R to be released; otherwise Pi is rolled back.
Because there is always one lowest-priority process, no cycle can exist; because the priority numbers are timestamps, starvation cannot occur.
Wait-Die Scheme (passive, nonpreemptive): an older requester waits; a younger requester is rolled back (dies). Example: P1 (TS = 5) requesting from P2 (TS = 10) waits.
Wound-Wait Scheme (aggressive, preemptive): an older requester immediately seizes the resource (wounds the holder); a younger requester waits. Example: P2 (TS = 10) requesting from P3 (TS = 15) preempts.
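The two schemes differ only in what the older (smaller-timestamp) requester does, which can be sketched as a pair of decision functions. The class and method names are illustrative, not from the slides.

```java
// Sketch of the two timestamp schemes: "requester" asks for a resource held
// by "holder"; a smaller timestamp means older, i.e., higher priority.
public class TimestampSchemes {
    // Wait-Die (nonpreemptive): an older requester waits; a younger one dies.
    static String waitDie(int requesterTs, int holderTs) {
        return requesterTs < holderTs ? "wait" : "rollback";
    }

    // Wound-Wait (preemptive): an older requester wounds (preempts) the
    // holder; a younger requester waits.
    static String woundWait(int requesterTs, int holderTs) {
        return requesterTs < holderTs ? "preempt holder" : "wait";
    }

    public static void main(String[] args) {
        System.out.println(waitDie(5, 10));    // P1 (TS=5) vs P2 (TS=10): wait
        System.out.println(woundWait(10, 15)); // P2 (TS=10) vs P3 (TS=15): preempt holder
    }
}
```

In both schemes the older process never restarts, which is why no cycle (and no starvation) can arise.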
Deadlock Detection
Centralized Approach and Distributed Approach.
Local Wait-For Graph: each site keeps a graph of the wait-for relationships among its own processes.
Global Wait-For Graph: the union of the local wait-for graphs; a cycle in it indicates a deadlock.
[Figure: local wait-for graphs over P1..P5 at two sites, and the combined global wait-for graph.]
Basic Distributed Algorithms: Complexity Measures
Computational Rounds: synchronous algorithms count rounds with a timer; asynchronous algorithms count rounds by the number of waves, i.e., events propagated through the network.
Local Running Time.
Space: Global is the total space used across all machines; Local is how much space each machine needs.
Message complexity: the total number of messages the machines send. A message M transmitted over p edges contributes message complexity p|M|, where |M| is the length of M.
Basic Distributed Algorithms: Ring Leader, Tree Leader, BFS, MST
Ring Leader
Each process sends its id to the next process in the ring. In subsequent rounds, every process performs the following computation: receive an identifier id from the previous process; compare id with its own identifier; send the smaller of the two values to the next process in the ring.
Algorithm RingLeader(id):
    Input: The unique identifier, id, for the processor running the algorithm
    Output: The smallest identifier of a processor in the ring
    M <- [Candidate is id]
    Send message M to the successor processor in the ring
    done <- false
    repeat
        Get message M from the predecessor processor in the ring
        if M = [Candidate is i] then
            if i = id then
                M <- [Leader is id]
                done <- true
            else
                m <- min{i, id}
                M <- [Candidate is m]
        else    {M is a "Leader is" message}
            done <- true
        Send message M to the next processor in the ring
    until done
    return M
Analysis
Computational rounds: O(2N)
Local running time: O(N)
Local space: O(1)
Message complexity: O(N^2)
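A round-based simulation of the pseudocode above can be condensed into a few lines: in each synchronous round, every process forwards the minimum of its own id and the candidate received from its predecessor, and the process whose own id comes back is the leader. The class is an illustrative sketch, not from the slides.

```java
// Round-by-round simulation sketch of RingLeader: the smallest id wins.
public class RingLeaderSim {
    static int elect(int[] ids) {
        int n = ids.length;
        int[] msg = ids.clone();   // the candidate each process sent last round
        while (true) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                int from = msg[(i - 1 + n) % n];   // message from the predecessor
                if (from == ids[i]) return ids[i]; // own id came back: leader
                next[i] = Math.min(from, ids[i]);  // forward the smaller candidate
            }
            msg = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[]{7, 3, 9, 4})); // 3
    }
}
```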
Tree Leader
Assume the network is a free tree. The natural starting points are the external nodes. Asynchronous; uses a message check: has a message been sent along a given edge and reached this node yet? Two phases:
Accumulation Phase: ids flow in from the external nodes of the tree; each node records the minimum id, which identifies the Leader.
Broadcast Phase: the Leader's id is broadcast back out to the external nodes.
Algorithm TreeLeader(id):
    Input: The unique identifier, id, for the processor running the algorithm
    Output: The smallest identifier of a processor in the tree
    {Accumulation Phase}
    Let d be the number of neighbors of processor id
    m <- 0    {counter for messages received}
    l <- id   {tentative leader}
    repeat    {begin a new round}
        for each neighbor j do
            check if a message from processor j has arrived
            if a message M = [Candidate is i] from j has arrived then
                l <- min{i, l}
                m <- m + 1
    until m >= d - 1
    if m = d then
        M <- [Leader is l]
        for each neighbor j != k do
            send message M to processor j
        return M    {M is a "Leader is" message}
    else
        M <- [Candidate is l]
        send M to the neighbor k that has not sent a message yet
    {Broadcast Phase}
    repeat    {begin a new round}
        check if a message from processor k has arrived
        if a message M from k has arrived then
            m <- m + 1
            if M = [Candidate is i] then
                l <- min{i, l}
                M <- [Leader is l]
                for each neighbor j do
                    send message M to processor j
            else    {M is a "Leader is" message}
                for each neighbor j != k do
                    send message M to processor j
    until m = d
    return M    {M is a "Leader is" message}
Analysis (di is the number of processes adjacent to processor i)
Computational rounds: O(D)
Local running time: O(di D)
Local space: O(di)
Message complexity: O(N)
Tree Leader, Synchronous
Like the ripples spreading after a stone is dropped into a pond. The Diameter of a graph is the length of the longest path between any two nodes; the number of rounds equals the diameter. Two phases:
Accumulation Phase: toward the center.
Broadcast Phase: outward.
Breadth-First Search
Designate s as the source node. Synchronous: the traversal spreads outward in waves, constructing the BFS tree level by level from the top down. Each node v sends a message to every neighbor that has not yet contacted v; any node v other than s must choose one contacting node as its parent.
Algorithm SynchronousBFS(v, s):
    Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
    Output: For each node v, its parent in a BFS tree rooted at s
    repeat    {begin a new round}
        if v = s or v has received a message from one of its neighbors then
            set parent(v) to be a node requesting v to become its child (or null, if v = s)
            for each node w adjacent to v that has not contacted v yet do
                send a message to w asking w to become a child of v
    until v = s or v has received a message
Analysis (n nodes, m edges)
Message complexity: O(n + m)
Breadth-First Search, Asynchronous
Requires every processor to know the total number of processes in the network. A "pulse" message sent from the root node s triggers the other processes to begin the next round of the global computation. The two pulses are combined: a pulse-down travels from the root s down the BFS tree, and a pulse-up travels from the external nodes of the BFS tree back up to the root s. Only after receiving the pulse-up signals does s issue a new pulse-down signal.
Algorithm AsynchronousBFS(v, s):
    Input: The identifier v of the node (processor) executing this algorithm and the identifier s of the start node of the BFS traversal
    Output: For each node v, its parent in a BFS tree rooted at s
    C <- {}    {verified BFS children for v}
    set A to be the set of neighbors of v
    repeat    {begin a new round}
        if parent(v) is defined or v = s then
            if parent(v) is defined then
                wait for a pulse-down message from parent(v)
            if C is not empty then
                {v is an internal node in the BFS tree}
                send a pulse-down message to all nodes in C
                wait for a pulse-up message from all nodes in C
            else
                {v is an external node in the BFS tree}
                for each node u in A do
                    send a make-child message to u
                for each node u in A do
                    get a message M from u and remove u from A
                    if M is an accept-child message then
                        add u to C
            send a pulse-up message to parent(v)
        else
            {v != s has no parent yet}
            for each node w in A do
                if w has sent v a make-child message then
                    remove w from A    {w is no longer a candidate child for v}
                    if parent(v) is undefined then
                        parent(v) <- w
                        send an accept-child message to w
                    else
                        send a reject-child message to w
    until (v has received message done) or (v = s and has pulsed down n - 1 times)
    send a done message to all the nodes in C
Analysis (n nodes, m edges)
Message complexity: O(n^2 + m)
Minimum Spanning Tree
Start from an efficient sequential algorithm for finding an MST; a distributed version of Borůvka's algorithm in the synchronous model then works as follows: determine all connected components; for each connected component, find the minimum-weight outgoing edge; add it, merging the component into another component.
Kruskal's Algorithm

Algorithm KruskalMST(G):
    Input: A simple connected weighted graph G with n vertices and m edges
    Output: A minimum spanning tree T for G
    for each vertex v in G do
        define an elementary cluster C(v) <- {v}
    initialize a priority queue Q to contain all edges in G, using the weights as keys
    T <- {}
    while T has fewer than n - 1 edges do
        (u, v) <- Q.removeMin()
        let C(v) be the cluster containing v, and let C(u) be the cluster containing u
        if C(v) != C(u) then
            add edge (v, u) to T
            merge C(v) and C(u) into one cluster, that is, union C(v) and C(u)
    return tree T
Analysis (n nodes, m edges)
Computational rounds: O(log n)
Local space: O(m)
Message complexity: O(m log n)
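The KruskalMST pseudocode above becomes runnable once the clusters C(v) are kept in a union-find structure. This is an illustrative sketch; the edge encoding {u, v, weight} and all names are assumptions, not from the slides.

```java
import java.util.Arrays;

// Runnable sketch of KruskalMST: clusters as a union-find forest.
public class Kruskal {
    // Find the cluster representative of x, with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    }

    // Returns the total weight of an MST of n vertices; edges are {u, v, w}.
    static int mstWeight(int n, int[][] edges) {
        Arrays.sort(edges, (a, b) -> a[2] - b[2]); // priority queue by weight
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i; // elementary clusters C(v)
        int total = 0, used = 0;
        for (int[] e : edges) {
            int ru = find(parent, e[0]), rv = find(parent, e[1]);
            if (ru != rv) {          // different clusters: the edge joins the tree
                parent[ru] = rv;     // merge C(u) and C(v)
                total += e[2];
                if (++used == n - 1) break;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] edges = {{0,1,4}, {1,2,2}, {0,2,5}, {2,3,3}};
        System.out.println(mstWeight(4, edges)); // 2 + 3 + 4 = 9
    }
}
```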
Time Synchronization Algorithms
Cristian's algorithm (centralised): uses a central time server to synchronize clocks.
Berkeley algorithm (centralised).
The Network Time Protocol (decentralised).
Cristian's Algorithm (1989)
Uses a time server, kept as the reference, to synchronize clocks. Clients ask the time server for the time; the polling period depends on the maximum clock drift and the accuracy required. Clients receive the value and may: use it as it is; add the known minimum network delay; or add half the time between this send and receive.
For links with symmetrical latency:
RTT = response-received-time - request-sent-time
adjusted-local-time = server-timestamp + minimum network delay
    or server-timestamp + (RTT / 2)
    or server-timestamp + (RTT - server-latency) / 2
local-clock-error = adjusted-local-time - local-time
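The client-side arithmetic of the formulas above can be sketched as follows; times are in milliseconds, and the class, method names and values are illustrative assumptions.

```java
// Sketch of the client-side computation in Cristian's algorithm,
// using the RTT/2 variant of the formulas above.
public class Cristian {
    // RTT = response-received-time - request-sent-time
    static long rtt(long reqSent, long respReceived) {
        return respReceived - reqSent;
    }

    // adjusted-local-time = server-timestamp + RTT / 2
    static long adjust(long serverTs, long rtt) {
        return serverTs + rtt / 2;
    }

    public static void main(String[] args) {
        long reqSent = 1000, respReceived = 1040, serverTs = 5000;
        long adjusted = adjust(serverTs, rtt(reqSent, respReceived)); // 5020
        long error = adjusted - respReceived; // local-clock-error at receipt
        System.out.println(adjusted + " " + error);
    }
}
```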
Berkeley Algorithm (Gusella & Zatti, 1989)
If no machines have receivers, the Berkeley algorithm uses a designated server to synchronize. The designated server polls or broadcasts to all machines for their time, adjusts the times received for RTT and latency, averages the times, and tells each machine how to adjust. Polling is done using Cristian's algorithm. The averaged time is more accurate, but still drifts.
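The master's averaging step can be sketched in a few lines. This is an illustrative sketch only (the RTT adjustment is assumed to have already been applied to the readings); the names are not from the slides.

```java
// Sketch of the Berkeley master's step: average the (already RTT-adjusted)
// clock readings, then compute the offset each machine should apply.
public class Berkeley {
    static long[] offsets(long[] clocks) {
        long sum = 0;
        for (long c : clocks) sum += c;
        long avg = sum / clocks.length;          // the agreed average time
        long[] adj = new long[clocks.length];
        for (int i = 0; i < clocks.length; i++)
            adj[i] = avg - clocks[i];            // how clock i should adjust
        return adj;
    }

    public static void main(String[] args) {
        // Readings 3000, 3100, 2900 average to 3000: offsets 0, -100, +100.
        for (long a : offsets(new long[]{3000, 3100, 2900}))
            System.out.println(a);
    }
}
```

Sending each machine an offset rather than the absolute average avoids adding another network delay error on the way back.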
Network Time Protocol
NTP is the best known and most widely implemented decentralised algorithm, used for time synchronization on the Internet (www.ntp.org).
[Figure: the NTP stratum hierarchy: stratum 1, primary servers with direct synchronization; stratum 2, secondary servers synchronized by the primary servers; stratum 3, tertiary servers synchronized by the secondary servers.]
Mutual Exclusion Algorithms
Assumptions: each pair of processes is connected by reliable channels (such as TCP); messages are eventually delivered to the recipient's input buffer; processes do not fail; there is agreement on how a resource is identified (pass the identifier with requests).
Exclusive Access Algorithms: Centralized Algorithm, Token Ring Algorithm, Lamport Algorithm (Timestamp Approach), Ricart & Agrawala Algorithm
Leader Election Algorithms: Bully Algorithm, Ring Algorithm, Chang & Roberts Algorithm, Itai & Rodeh Algorithm
Centralized Algorithm: Operations
1. Request resource: send a request to the coordinator to enter the CS.
2. Wait for response.
3. Receive grant: the coordinator grants permission to enter the CS and keeps a queue of requests to enter the CS.
4. Access resource.
5. Release resource: send a release message to inform the coordinator.
Safety, liveness and order are guaranteed.
Delay per client and synchronization delay: one round-trip time (release + grant).
[Figure: a process P exchanges Request(R), Grant(R) and Release(R) with coordinator C; the coordinator holds a queue of requests from P1..P4.]
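The coordinator's bookkeeping in the centralized algorithm, one holder at a time plus a FIFO queue of waiting processes, can be sketched as below; the class and its API are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the centralized mutual-exclusion coordinator's state.
public class Coordinator {
    private Integer holder = null;                  // process currently in the CS
    private final Queue<Integer> waiting = new ArrayDeque<>();

    // Request: returns true if Grant is sent immediately, false if queued.
    boolean request(int pid) {
        if (holder == null) { holder = pid; return true; }
        waiting.add(pid);
        return false;
    }

    // Release: grant the CS to the next waiting process, if any; returns
    // the id of the newly granted process, or null if the CS is now free.
    Integer release() {
        holder = waiting.poll();
        return holder;
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        System.out.println(c.request(1)); // true: CS free, granted at once
        System.out.println(c.request(2)); // false: queued behind process 1
        System.out.println(c.release());  // 2: the queued process is granted
    }
}
```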
Token Ring Algorithm: Operations
For each CS a token is used; only the process holding the token can enter the CS. To exit the CS, the process sends the token on to its neighbor. If a process does not need to enter the CS when it receives the token, it forwards the token to the next neighbor.
Only one process holds the token at any time, which guarantees mutual exclusion; the order is well defined, so starvation cannot occur; if the token is lost (e.g. a process died), a new one must be regenerated. Safety and liveness are guaranteed, but ordering is not.
Delay per client: 0 to N message transmissions. Synchronization delay between one process's exit from the CS and the next process's entry: between 1 and N message transmissions.
Lamport Algorithm
A total ordering of requests is established by logical timestamps. Each process maintains a request queue of mutual-exclusion requests.
Requesting the CS: Pi multicasts "request" (i, Ti) to all processes (Ti is the local Lamport time), places the request on its own queue, and waits until all processes reply.
Entering the CS: Pi has received a message (ack or release) from every other process with a timestamp larger than Ti.
Releasing the CS: Pi removes the request from its queue and sends a timestamped release message. This may cause another process's own entry to have the earliest timestamp in the queue, enabling it to access the critical section.
Ricart & Agrawala Algorithm
Uses reliable multicast and logical clocks. When a process wants to enter the critical section, it composes a message containing its identifier (machine ID, process ID), the name of the resource, and the current time; it sends the request to all processes and waits until everyone gives permission.
When a process receives a request: if the receiver is not interested, it sends OK to the sender; if the receiver is in the critical section, it does not reply and adds the request to a queue; if the receiver has just sent a request as well, it compares the timestamps of the received and sent messages, and the earliest wins: if the receiver is the loser, it sends OK; if the receiver is the winner, it does not reply and queues the request. When done with the critical section, a process sends OK to all queued requests.
Ricart & Agrawala Algorithm
On initialization:
    state := RELEASED;
To enter the critical section:
    state := WANTED;
    multicast request to all processes;    {request processing deferred here}
    T := request's timestamp;
    wait until (number of replies received = (N - 1));
    state := HELD;
On receipt of a request <Ti, pi> at pj (i != j):
    if (state = HELD) or ((state = WANTED) and ((T, pj) < (Ti, pi)))
        then queue request from pi without replying;
        else reply immediately to pi;
To exit the critical section:
    state := RELEASED;
    reply to any queued requests;
Ricart & Agrawala Algorithm: Properties
Safety, liveness, and ordering are guaranteed. It takes 2(N-1) messages per entry operation (N-1 multicast requests plus N-1 replies), or N messages if the underlying network supports multicast. [3(N-1) in Lamport's algorithm.]
Delay per client: one round-trip time. Synchronization delay: one message transmission time.
[Figure: P1 (message timestamp 87) and P2 (message timestamp 78) request concurrently while P3 replies to both. P2 cannot send a Reply to P1 because P1's timestamp is greater than P2's; P1 remains in WANTED until P2 sends its Reply, and P2 changes to HELD.]
Leader Election Algorithms
The problem: N processes, which may or may not have unique IDs (UIDs); for simplicity, assume no crashes; a unique master coordinator must be chosen from among the processes.
Requirements: every process knows P, the identity of the leader, where P is a unique process id (usually the maximum) or is as yet undefined; all processes participate and eventually discover the identity of the leader (it cannot remain undefined); when a coordinator fails, the algorithm must elect the active process with the largest priority number.
Two families of algorithms: Bully ("the biggest guy in town wins") and Ring (a logical, cyclic grouping).
Bully Algorithm
Assumptions: a synchronous system; all messages arrive within Ttrans units of time; a reply is dispatched within Tprocess units of time of the receipt of a message; if no response is received within 2 Ttrans + Tprocess, the node is assumed to be dead.
If a process knows it has the highest id, it elects itself coordinator and sends a coordinator message to all processes with lower ids.
When a process P notices that the coordinator has gone too long without responding to requests, it initiates an election.
When process P holds an election, it sends an election message to the other processes with higher ids. If no one responds, P becomes the coordinator. If some higher-numbered process answers, it takes over and P's job is done.
Bully Algorithm: Performance
Best case: the process with the second-highest id notices the failure of the coordinator and elects itself; N-2 coordinator messages are sent, and the turnaround time is one message transmission time.
Worst case: the process with the lowest id detects the failure; N-1 processes altogether begin elections, each sending messages to processes with higher ids. The message overhead is O(N^2), and the turnaround time is approximately 5 message transmission times.
Ring Algorithm
No token is used in this algorithm. When the algorithm ends, every process holds an active list consisting of the priority numbers of all active processes in the system.
If process Pi detects a coordinator failure, it creates an initially empty active list, sends the message elect(i) to its right neighbor, and adds the number i to its active list.
If Pi receives the message elect(j) from the process on its left, it must respond in one of three ways:
If this is the first elect message it has seen or sent, Pi creates a new active list with the numbers i and j, and sends the message elect(i) followed by the message elect(j).
If i != j, Pi adds j to its active list and forwards elect(j).
If i = j (Pi receives its own elect(i)), the active list for Pi now contains the numbers of all the active processes in the system, and Pi can determine the largest number in the active list to identify the new coordinator process.
Chang & Roberts Algorithm
Assumptions: a unidirectional ring; an asynchronous system; each process has a UID.
Election: initially each process is a non-participant. Determining the leader (election message): the initiator becomes a participant and passes its own UID on to its neighbour; when a non-participant receives an election message, it forwards the maximum of its own and the received UID and becomes a participant; a participant does not forward an election message carrying a smaller UID than its own.
Announcing the winner (elected message): when a participant receives an election message with its own UID, it becomes the leader and a non-participant, and forwards its UID in an elected message; otherwise, a process records the leader's UID, becomes a non-participant and forwards the elected message.
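The election phase can be simulated round by round: each process forwards the maximum of its own UID and the one received from its predecessor, and the process whose own UID comes back is the leader. This sketch condenses the message passing and omits the participant flag (which only suppresses duplicate messages, not the outcome); the names are illustrative.

```java
// Round-based simulation sketch of the Chang & Roberts election:
// the largest UID survives one full trip around the unidirectional ring.
public class ChangRoberts {
    static int elect(int[] uids) {
        int n = uids.length;
        int[] msg = uids.clone();   // election message each process sent last
        while (true) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                int from = msg[(i - 1 + n) % n];     // from the predecessor
                if (from == uids[i]) return uids[i]; // own UID back: leader
                next[i] = Math.max(from, uids[i]);   // forward the larger UID
            }
            msg = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[]{7, 3, 9, 4})); // 9
    }
}
```

Note the symmetry with RingLeader earlier in this deck: the same circulation pattern, with max in place of min.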
Itai & Rodeh Algorithm
Assumptions: a unidirectional ring; a synchronous system; processes do not have UIDs.
Election: each process selects an ID at random from the set {1,..,K}. The IDs are not unique, but selection is fast. The processes pass all IDs around the ring; after one round, if a unique ID exists, elect the maximum unique ID; otherwise, repeat.
How do we know the algorithm terminates? From probabilities: if you keep flipping a fair coin, then after several heads you must eventually get tails.