Post on 17-Jan-2017
Constructing Distributed Doubly Linked List
without Distributed Locking
IEEE Peer-to-Peer Conference 2015 Sep 23rd–24th, 2015
Kota Abe, Osaka City University / NICT, Japan
Mikio Yoshida, BBR Inc., Japan
Outline
Background
  What is a distributed doubly linked list
  Conventional approaches
The DDLL algorithm
  Procedure for node insertion, deletion and traversal
  Procedure for recovery from failure
Evaluation
  Comparison with conventional algorithms
Conclusion
Distributed Doubly Linked List
Also known as a bidirectional ring
Commonly used in structured P2P networks
  Chord, Chord#, Skip Graph, SkipNet, etc.
Structure
  Each node holds a pointer (e.g. an IP address) to the next (successor) node and to the previous (predecessor) node
    We call these the right and left pointers
  Sorted by node-specific key
  Circular
[Figure: a ring of nodes with keys 0, 10, 20, 30, 40, 50, 60, 70]
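As a minimal sketch of the structure described above, the following builds the left and right pointers of a sorted, circular list from a set of keys. It is a centralized illustration only (the real list is distributed across nodes), and `build_ring` is a hypothetical helper name, not part of DDLL.

```python
def build_ring(keys):
    """Return {key: {"l": left key, "r": right key}} for a sorted circular list."""
    ordered = sorted(keys)
    n = len(ordered)
    return {
        k: {"l": ordered[i - 1], "r": ordered[(i + 1) % n]}  # index -1 wraps to the last key
        for i, k in enumerate(ordered)
    }


ring = build_ring([0, 10, 20, 30, 40, 50, 60, 70])
print(ring[0])   # {'l': 70, 'r': 10}
print(ring[70])  # {'l': 60, 'r': 0}
```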
Maintaining Distributed Doubly Linked List
Challenges
  Nodes are distributed and may be inserted and deleted simultaneously and independently
  Nodes may fail
[Figure: four panels on nodes p, u, q (and r): Insertion, Deletion, Recovery, Traversal]
Conventional Approaches (1/2)
Eventual Consistency Approach
  Node insertion and deletion temporarily break the list structure
  A stabilizing procedure recovers it
  Example: Chord
Distributed Locking Approach
  Use a lock 🔒 to mutually exclude node insertion / deletion
  Example: Atomic Ring Maintenance (Ghodsi)
[Figure: u is inserted between p and q. With Chord, the list is temporarily inconsistent; with Atomic Ring Maintenance, the messages JoinReq, JoinPoint, NewSucc, NewSuccAck, and JoinDone are exchanged while locks are held]
Conventional Approaches (2/2)
Eventual Consistency Approach
  Pros 👍
    Easy to recover from failure
  Cons 👎
    No lookup consistency: lookup results may differ depending on the querying node
Distributed Locking Approach
  Pros 👍
    Lookup consistency
  Cons 👎
    A lock blocks other node insertions / deletions
    When a node fails, the locking duration may be quite long
    The recovery procedure is rather complicated
      Locks are released by timeout, which may be premature
→ Locks should not be used if possible
Our Contribution — DDLL Algorithm
DDLL = Distributed algorithm for constructing distributed doubly linked lists
  Acronym of “Distributed Doubly Linked List”
Guarantees lookup consistency without using distributed locking (in the absence of failure)
Simple and efficient
Proved correctness (insertion and deletion procedures)
Practical
  Works with non-FIFO channels (e.g. UDP)
  Used in our PIAX P2P platform as a foundation of Skip Graph and Chord# implementations
Node Insertion
u is going to be inserted between p and q
  (1) u.l := p, u.r := q
  (2) Update right link: change p’s right link to u
  (3) Update left link: change q’s left link to u
[Figure: topology of p, u, q before insertion and after each of the three steps]
Updating Right Link (1/3)
We want to change p’s right link only if there is no conflict
Conflicts
  Another node has been inserted between p and q
  p has been deleted
  q has been deleted
A SetR message is used for updating a right link
A SetR message contains:
  the new right node
  the expected right node of the recipient node
When a SetR request is accepted, p returns a SetRAck message
Otherwise, p returns a SetRNak message
Updating Right Link (2/3)
u → p: SetR(u, q)
  “Please change your right link to me (u) if your right link still points to q and you have not initiated deletion”
p → u: SetRAck
  “Ok!”
Right links are always correct without using locking
Updating Right Link (3/3)
Conflict case example: another node v has been inserted between p and q
u → p: SetR(u, q)
  p.r != q, so the request is rejected
p → u: SetRNak
  “Sorry!”
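The right-link update behaves like a compare-and-set on p's right pointer. The sketch below illustrates only that check; the `Node` class, the `deleting` flag, and the function name `handle_set_r` are assumptions made for illustration, and sequence numbers are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    right: Optional[int] = None  # key of the current right node
    deleting: bool = False       # True once this node has initiated its own deletion


def handle_set_r(p: Node, new_right: int, expected_right: int) -> str:
    """Apply SetR(new_right, expected_right) at p and return the name of the reply."""
    if p.deleting or p.right != expected_right:
        return "SetRNak"         # conflict: p.r has changed, or p is leaving
    p.right = new_right
    return "SetRAck"


p = Node(key=10, right=40)
print(handle_set_r(p, new_right=30, expected_right=40))  # SetRAck: v = 30 is inserted first
print(handle_set_r(p, new_right=20, expected_right=40))  # SetRNak: p.r is now 30, not 40
```

Because the check and the update happen in a single step at p, two concurrent inserters cannot both succeed with the same expected right node.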
Updating Left Link (1/3)
Problem:
  Multiple SetL messages may arrive from different nodes in arbitrary order (because we do not want to use locking)
  A node must determine which SetL message is newer
[Figure: message sequence and topology change when u and then v are inserted; messages shown: SetR(u, q), SetRAck, SetL(u), SetR(v, q), SetRAck, SetL(v)]
Updating Left Link (2/3)
Solution:
  A SetL message contains a sequence number (seq)
  Each node holds a sequence number for its right node (rseq)
    rseq is transferred using SetRAck
  Each node holds the max sequence number of the SetL messages received so far (lseq)
  A SetL message is accepted only if msg.seq > lseq
[Figure: sequence-number example. Initially p.rseq = 0 and q.lseq = 0; inserting u produces SetRAck(1) and SetL(u, 1), after which u.rseq = 1 and q.lseq = 1; a further insertion produces SetRAck(2) and a SetL with sequence number 2, after which q.lseq = 2]
Updating Left Link (3/3)
How our scheme solves the previous case
  u → p: SetR(u, q, 0); p → u: SetRAck(1); p → q: SetL(u, 1)
  v → u: SetR(v, q); u → v: SetRAck(2); u → q: SetL(v, 2)
  q receives SetL(v, 2) first, and q.lseq changes from 0 to 2
  SetL(u, 1) arrives later: this SetL message is stale and is ignored
Lock is not necessary!
[Figure: message sequence and topology change with the rseq and lseq values of p, u, v, and q]
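The acceptance rule `msg.seq > lseq` can be sketched as follows. The `LeftLink` class is a hypothetical stand-in for the left-link state of node q; the example delivers SetL(u, 1) and SetL(v, 2) in both possible orders.

```python
from itertools import permutations


class LeftLink:
    """Left-link state of one node: the pointer plus the largest seq accepted so far."""

    def __init__(self, left, lseq=0):
        self.left, self.lseq = left, lseq

    def on_set_l(self, new_left, seq):
        """Accept SetL(new_left, seq) only if it is newer than anything accepted so far."""
        if seq > self.lseq:
            self.left, self.lseq = new_left, seq
            return True
        return False


messages = [("u", 1), ("v", 2)]
for order in permutations(messages):
    q = LeftLink(left="p", lseq=0)
    for new_left, seq in order:
        q.on_set_l(new_left, seq)
    print(order, "->", q.left, q.lseq)
```

In both delivery orders q ends with its left link pointing to v and lseq = 2, so no lock or FIFO channel is needed to order the two updates.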
Node Insertion Sequence
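Message sequence (u is inserted between p and q; initially p.rseq = q.lseq = i)
  u → p: SetR(u, q, 0)
  p → u: SetRAck(i+1)
  p → q: SetL(u, i+1)
[Figure: topology change; afterwards p.rseq = u.lseq = 0 and u.rseq = q.lseq = i+1]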
Node Deletion Sequence
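Message sequence (u, located between p and q, is deleted; initially p.rseq = u.lseq = i1 and u.rseq = q.lseq = i2)
  u → p: SetR(q, u, i2+1)
  p → u: SetRAck(i1+1) (i1+1 is not used)
  p → q: SetL(p, i2+1)
[Figure: topology change; afterwards p.rseq = q.lseq = i2+1]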
Insertion and Deletion
3 messages are required for insertion/deletion
A node is atomically inserted/deleted when the SetR message is accepted
If a SetRNak message is received, the application retries insertion/deletion
Right links are always correct
Left links are correct when there is no SetL message in transit
No distributed locking
Does not require FIFO channels (UDP friendly)
Traversals
Every inserted node can be looked up either rightward or leftward
Traversing rightward: easy
Traversing leftward: left links are not always correct
  1. Node X visits q and fetches q.l (= p)
  2. X visits p and fetches p.l and p.r (= u)
  3. X detects that u has been skipped (because p.r != q) and X visits u
[Figure: X traversing leftward from q; q's left link incorrectly points to p instead of u]
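The three steps above can be sketched as a loop that trusts right links and uses them to correct a stale left link. This is a local, in-memory illustration; the `Node` class and `traverse_left` are hypothetical names, and each dictionary lookup stands in for visiting a remote node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    key: str
    l: str  # left link (may be stale)
    r: str  # right link (assumed always correct)


def traverse_left(nodes, start):
    """Return the keys visited when walking leftward from start around the ring."""
    visited = [start]
    cur = start
    while True:
        cand = nodes[cur].l
        # If cand.r != cur, some node was skipped: follow right links until
        # reaching the node whose right link points to cur.
        while nodes[cand].r != cur:
            cand = nodes[cand].r
        if cand == start:
            return visited
        visited.append(cand)
        cur = cand


# u was inserted between p and q, but SetL has not reached q yet, so q.l is still p.
nodes = {
    "p": Node("p", l="q", r="u"),
    "u": Node("u", l="p", r="q"),
    "q": Node("q", l="p", r="p"),
}
print(traverse_left(nodes, "q"))  # ['q', 'u', 'p']
```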
Insertion Retry Optimization
Insertion requires pointers to the immediate left and right nodes
When an inserting node receives SetRNak, the node retries
Optimization: SetRNak contains the pointer to the right node
  Extra messages can be eliminated if p has not initiated deletion AND u ∈ (p, p.r)
[Figure: message sequences among q, p, v, and u when v and u insert concurrently. Unoptimized: after SetRNak, u sends GetR, receives MyR(v), and then sends SetR(u, v). Optimized: u receives SetRNak(v) and immediately sends SetR(u, v)]
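A minimal sketch of the optimized retry follows. It assumes that SetRNak carries p's current right node only when p has not initiated deletion (modelled as a `None` hint otherwise); `LeftNode`, `between`, and `insert_after` are hypothetical names, and sequence numbers are omitted.

```python
def between(x, a, b):
    """True if key x lies in the open circular interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


class LeftNode:
    """The node p that receives SetR requests."""

    def __init__(self, key, r):
        self.key, self.r, self.deleting = key, r, False

    def on_set_r(self, new_r, expected_r):
        if self.deleting or self.r != expected_r:
            # Optimization: the NAK carries p's current right node as a hint.
            return ("SetRNak", None if self.deleting else self.r)
        self.r = new_r
        return ("SetRAck", None)


def insert_after(p, u, expected_right, max_tries=3):
    """Return the number of SetR messages u sent to p, or None if a fresh lookup is needed."""
    for sent in range(1, max_tries + 1):
        reply, hint = p.on_set_r(u, expected_right)
        if reply == "SetRAck":
            return sent
        if hint is None or not between(u, p.key, hint):
            return None            # u no longer belongs right after p
        expected_right = hint      # retry at once with SetR(u, hint)
    return None


p = LeftNode(key=10, r=40)
p.on_set_r(30, 40)                 # v = 30 is inserted first
print(insert_after(p, 20, 40))     # 2: SetR(20, 40) is rejected, SetR(20, 30) is accepted
```

Without the hint, u would need an extra request/reply pair (GetR and MyR in the figure) before it could send the second SetR.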
Handling Failure
So far, no failure is assumed
The DDLL algorithm considers:
  Crash failure
  Omission failure
  Timing failure
  (Part of the above is omitted in this presentation)
In an asynchronous network, it is impossible to distinguish slow nodes from failed nodes
  Erroneously suspected nodes are temporarily removed but eventually recovered
Recovery | Basic
Each node maintains a neighbor node set N
  N contains a sufficient number of left-side nodes
Each node u periodically finds the closest live left-side node v
  u obtains v.r and v.rseq
  If (v = u.l) ∧ (v.r = u) ∧ (v.rseq = u.lseq) then OK
  Otherwise, start recovery
[Figure: nodes A, B, C; B fails, and C recovers its left link by sending SetR(C, B, ?) to A]
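The periodic check can be sketched as a small predicate. The `Node` fields mirror the slide's notation, while the `alive` flag (standing in for a failure detector), the closest-first ordering of the neighbor list, and the name `check_left` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str
    l: Optional[str] = None
    r: Optional[str] = None
    lseq: int = 0
    rseq: int = 0
    alive: bool = True  # stand-in for a failure detector's verdict


def check_left(u, neighbors):
    """Return (v, ok): the closest live left-side node and whether u's left link is consistent.

    neighbors must list u's left-side nodes ordered closest first.
    """
    v = next((n for n in neighbors if n.alive), None)
    if v is None:
        raise LookupError("no live left-side node in the neighbor set")
    ok = v.key == u.l and v.r == u.key and v.rseq == u.lseq
    return v, ok


A = Node("A", r="B", rseq=3)
B = Node("B", l="A", r="C", lseq=3, rseq=7)
C = Node("C", l="B", lseq=7)
print(check_left(C, [B, A])[1])  # True: B is alive and consistent with C
B.alive = False
print(check_left(C, [B, A])[1])  # False: A is now the closest live node, so C starts recovery
```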
Recovery | Sequence Number (1)
Let’s consider the sequence number of the recovered link
  Assigning C.lseq + 1?
[Figure: B, located between A and C, fails; C recovers its left link by sending SetR(C, B, i+1) to A, where i is the previous value of C.lseq]
Recovery | Sequence Number (2)
Subtle case
  X inserts between B and C
  B fails while the SetL to C is still in transit
  C starts recovery without noticing X, sending SetR(C, B, i+1) to A
  Both A and X have the same right node (C) and the same rseq (i+1)
  C’s left link may roll back!
[Figure: nodes A, B, X, C with their sequence numbers before and after the recovery]
Recovery | Sequence Number (3)
Solution:
  Extend the sequence number to (recovery-number, seq)
  The recovery number is increased only on recovery
  Left links do not roll back!
[Figure: the same scenario with extended sequence numbers; X holds (0, i+1) while C recovers with SetR(C, B, (1, 0))]
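The effect of the extended sequence number can be illustrated with Python tuples, which compare lexicographically. The sketch assumes that (recovery-number, seq) pairs are ordered recovery number first, and it models the recovery as just another left-link update at C; both are simplifications for illustration, and `final_left` is a hypothetical helper.

```python
from itertools import permutations


def final_left(initial, updates):
    """Apply left-link updates in the given order, accepting only strictly newer ones."""
    left, lseq = initial
    for new_left, seq in updates:
        if seq > lseq:
            left, lseq = new_left, seq
    return left


i = 5  # arbitrary example value for C's previous lseq

# Plain integers: the delayed SetL for X and the recovered link to A both carry i+1.
plain = [("X", i + 1), ("A", i + 1)]
# Extended numbers: the recovered link carries (1, 0), the delayed SetL carries (0, i+1).
extended = [("X", (0, i + 1)), ("A", (1, 0))]

plain_results = {final_left(("B", i), order) for order in permutations(plain)}
ext_results = {final_left(("B", (0, i)), order) for order in permutations(extended)}
print(sorted(plain_results))  # ['A', 'X']: the outcome depends on arrival order
print(sorted(ext_results))    # ['A']: the recovered link always wins
```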
Evaluation
Comparison
  DDLL (without optimization)
  DDLL (with optimization)
  Atomic Ring Maintenance (distributed locking)
    A. Ghodsi, “Distributed k-ary System: Algorithms for distributed hash tables,” PhD Dissertation, KTH Royal Institute of Technology, 2006.
  Li’s algorithm (distributed locking, no finger table)
    X. Li, et al., “Concurrent maintenance of rings,” Distributed Comp., vol. 19, no. 2, pp. 126–148, 2006.
  Chord (eventual consistency, no finger table)
    I. Stoica, et al., “Chord: A scalable peer-to-peer lookup protocol for internet applications,” IEEE/ACM Trans. on Net., vol. 11, no. 1, pp. 17–32, 2003.
Eval | Insertion Sequence
[Figure: insertion message sequences for u between p and q. Li’s: Join(u), Ack(p, q), Grant(u), Done, with locks held. Atomic Ring Maintenance: JoinReq, JoinPoint, NewSucc, NewSuccAck, JoinDone, with locks held. DDLL: SetR, SetRAck, SetL, without locks]
Eval | Time for Concurrent Insertion
Simulated on a discrete event simulator
  Insert an initial node
  Insert n nodes in parallel (n = 1 to 100)
  Measured the time required for all links to converge
    Time includes lookup messages for searching the node insertion position
[Figure: time to converge (time unit = one-way message transmission time) vs. # of simultaneously inserting nodes, for DDLL(Opt), DDLL(NoOpt), Atomic, Li's, and Chord]
DDLL(Opt) converges quickly
Eval | # of Msgs for Concurrent Insertion
Measured # of messages required for all links to converge
[Figure: # of messages to converge (x1000) vs. # of simultaneously inserting nodes, for DDLL(Opt), DDLL(NoOpt), Atomic, Li's, and Chord]
DDLL(Opt) uses fewer messages
Conclusion
DDLL algorithm for constructing distributed doubly linked lists
  No distributed locking
  Right links are always correct, left links converge quickly
  Maintains lookup consistency (in the absence of failure)
  More efficient than conventional algorithms
  Recovery procedure is provided
  No FIFO channel is required
  Correctness proofs for the insertion and deletion procedures
DDLL is suitable for ring-based structured P2P networks
  Real example: DDLL is used as a foundation of Skip Graph and Chord# implementations in the PIAX P2P platform
Spare Slides
Recovery | Sequence Number (4)
X is excluded from the linked list but eventually returns
[Figure: nodes A, B, X, C with extended sequence numbers; X rejoins the list by sending SetR(X, C, (0, 0)) and receiving SetRAck((1, 1))]
DDLL pseudo code
process u
var s: {out, ins, in, del}
    l, r: {pointer to a node or nil}
    lseq, rseq: {integer or nil}
init s = out; l = r = nil; lseq = 0; rseq = nil
begin
    {Create a linked list}
    (A1) receive Create() from app →
        l, r, s, lseq, rseq := u, u, in, 0, 0
    {Insert between p and q}
[]  (A2) receive Insert(p, q) from app →
        if (s ≠ out ∨ u ∉ (p, q)) then error; fi
        l, r, s := p, q, ins
        send SetR(u, r, lseq) to l
    {Delete}
[]  (A3) receive Delete() from app →
        if (s ≠ in) then error
        else if (u = r) then {in case of the last node}
            s := out
        else s := del; send SetR(r, u, rseq + 1) to l; fi
[]  (A4) receive SetR(rnew, rcur, rnewseq) from v →
        if (s = in ∧ r = rcur) then
            if (rnew = v) then {insertion case}
                send SetL(rnew, rseq + 1) to r
            else {deletion case}
                send SetL(u, rnewseq) to rnew; fi
            send SetRAck(rseq + 1) to v
            r, rseq := rnew, rnewseq
        else send SetRNak() to v; fi
[]  (A5) receive SetRAck(rnewseq) from v →
        if (s = ins) then
            s, rseq := in, rnewseq
        else if (s = del) then
            s := out; fi
[]  (A6) receive SetRNak() from v →
        if (s = ins) then
            s := out; error {app retries insertion later}
        else if (s = del) then
            s := in; error; fi {app retries deletion later}
[]  (A7) receive SetL(lnew, seq) from v →
        if (lseq < seq) then l, lseq := lnew, seq; fi
end
Fig. 1: DDLL algorithm (without optimization)
are executed.
(A2) u sets u’s left link and right link to p and q, respectively. u also sets u.s to ins to indicate that u is inserting. u sends a SetR message to p, which contains u (as the new right node), q (as the expected current right node, or rcur), and zero (as the new right sequence number, or rnewseq).
(A4) On receiving the SetR message, p checks whether its status is in and rcur equals p.r. If the former is false, either p has not received a SetRAck message after its insertion (as we describe next, a SetRAck message informs the sender that node insertion or deletion has succeeded), or p has started its deletion. If the latter is false, it indicates either that another node has been inserted at the right side of p, or that q has been deleted. In either case, p rejects the request and sends a SetRNak message to u to notify that the insertion failed. Otherwise, p sends a SetL message to p’s right node (q in this case) to update its left link to u. The SetL message contains u (as the new left node) and p.rseq+1 (= i+1) (as the sequence number of the SetL message). Next, p sends a SetRAck message to u to notify that the insertion was successful. Because left(q) is changed from p to u, the incremented right sequence number for q should be transferred from p to u. For this purpose, the SetRAck message contains p.rseq+1 (= i+1). Finally, p changes p.r to u and p.rseq to 0 (rnewseq). Because u’s right link has already been set to q, the rightward linked list is never interrupted, even for a moment. Note that at this moment, p.rseq = u.lseq holds.
(A5) On receiving the SetRAck message, u confirms that u is successfully inserted. Node u updates u.s to in to indicate that u is inserted, and sets u.rseq to i+1.
(A7) On receiving the SetL message, q compares the sequence number of the SetL message with q.lseq. If the former is larger (we assume this case), q updates q.l to u and q.lseq to i+1. Otherwise, q ignores the message.
In the scenario above, it is assumed that a SetRAck message is sent to u in A4. If a SetRNak message is sent (i.e., in the case of insertion failure), then (A6) u.s is reverted to out and u retries the insertion procedure, starting from locating its insertion position.
Note that a node u might receive a SetL message before receiving a SetRAck message. This happens, for example, when another node is inserted between p and u while the SetRAck message from p to u is still in transit. This is normal and the algorithm can handle this situation. In fact, we consider that a node u becomes inserted at the moment when a SetRAck message is sent to u (see Section V).
Figure 3 depicts the situation where two nodes send a SetL message to the same node. There are 4 nodes A, B, C and D (A < B < C < D), and nodes A and D are initially inserted. A.rseq and D.lseq are i. Nodes B and C are then inserted in this order. When D receives the SetL message from C, its left link is updated to C and its left sequence number is updated to i+2. When D later receives the SetL message from B, D ignores it because its sequence number (i+1) is smaller than D’s left sequence number (i+2). Thus, the receiving order of the SetL messages does not affect the final result.
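The Figure 3 scenario can be replayed with a small in-memory translation of Fig. 1. This is a sketch under stated assumptions, not the PIAX implementation: the `Net` class is a hypothetical stand-in for the network, node keys 10, 20, 30, 40 play the roles of A, B, C, D, application-level errors are reduced to exceptions, and i happens to be 0 in this run.

```python
def between(x, a, b):
    """True if key x lies in the open circular interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)

class Net:
    """In-memory stand-in for the network; messages may be delivered in any order."""
    def __init__(self):
        self.nodes, self.queue = {}, []

    def send(self, src, dst, kind, *args):
        self.queue.append((src, dst, kind, args))

    def deliver(self, kind=None, src=None):
        """Deliver the oldest queued message matching the given kind and sender."""
        for i, (s, dst, k, args) in enumerate(self.queue):
            if kind in (None, k) and src in (None, s):
                del self.queue[i]
                getattr(self.nodes[dst], "on_" + k)(s, *args)
                return
        raise LookupError("no matching message in flight")

    def run(self):
        while self.queue:
            self.deliver()

class Node:
    def __init__(self, key, net):
        self.key, self.net = key, net
        self.s, self.l, self.r = "out", None, None
        self.lseq, self.rseq = 0, None
        net.nodes[key] = self

    def create(self):                                  # (A1)
        self.l = self.r = self.key
        self.s, self.lseq, self.rseq = "in", 0, 0

    def insert(self, p, q):                            # (A2)
        if self.s != "out" or not between(self.key, p, q):
            raise ValueError("cannot insert here")
        self.l, self.r, self.s = p, q, "ins"
        self.net.send(self.key, p, "SetR", self.key, q, self.lseq)

    def delete(self):                                  # (A3)
        if self.s != "in":
            raise ValueError("not inserted")
        if self.r == self.key:                         # last node in the list
            self.s = "out"
        else:
            self.s = "del"
            self.net.send(self.key, self.l, "SetR", self.r, self.key, self.rseq + 1)

    def on_SetR(self, v, rnew, rcur, rnewseq):         # (A4)
        if self.s == "in" and self.r == rcur:
            if rnew == v:                              # insertion case
                self.net.send(self.key, self.r, "SetL", rnew, self.rseq + 1)
            else:                                      # deletion case
                self.net.send(self.key, rnew, "SetL", self.key, rnewseq)
            self.net.send(self.key, v, "SetRAck", self.rseq + 1)
            self.r, self.rseq = rnew, rnewseq
        else:
            self.net.send(self.key, v, "SetRNak")

    def on_SetRAck(self, v, rnewseq):                  # (A5)
        if self.s == "ins":
            self.s, self.rseq = "in", rnewseq
        elif self.s == "del":
            self.s = "out"

    def on_SetRNak(self, v):                           # (A6): the app retries later
        if self.s == "ins":
            self.s = "out"
        elif self.s == "del":
            self.s = "in"

    def on_SetL(self, v, lnew, seq):                   # (A7)
        if self.lseq < seq:
            self.l, self.lseq = lnew, seq

net = Net()
a, b, c, d = (Node(k, net) for k in (10, 20, 30, 40))
a.create()
d.insert(10, 10)
net.run()                            # the list is now A <-> D with A.rseq = D.lseq = 0
for newcomer, (p, q) in ((b, (10, 40)), (c, (20, 40))):
    newcomer.insert(p, q)
    net.deliver("SetR")              # accepted by the left node
    net.deliver("SetRAck")           # the SetL to D stays in flight
net.deliver("SetL", src=20)          # SetL(C, 2), sent by B, arrives first
net.deliver("SetL", src=10)          # SetL(B, 1), sent by A, is stale and ignored
print(d.l, d.lseq)                   # 30 2
print([n.r for n in (a, b, c, d)])   # [20, 30, 40, 10]
```

Note that the SetL message naming C as the new left node is actually sent by C's left node B, and the one naming B is sent by A, which is why the sketch selects messages in flight by sender.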
E. Deletion
Let us assume that node u, which is inserted between p and q, is going to be deleted. We also assume that both p.rseq and u.lseq are i1 and that both u.rseq and q.lseq are i2 (Fig. 4). To delete node u, the application sends a Delete() message to u. Then, the following actions are executed.
(A3) If u.s is not in, deletion is rejected because it is uncertain whether u is inserted. If u is the last node (i.e.,