-
Service Computing
http://xuwang.tech
-
• Large-scale Internet services
– Facebook
– Google
• Everything as a Service (XaaS), Internet of Services
– Amazon
• Location-based services (LBS)
-
• Scalability: handling ever-growing data and traffic
– Facebook: billions of users
– 2013: Google's index exceeded 100 PB
– November 11, 2012 (Singles' Day): peak transaction volumes at 100019.2 /
– 2014 figures continued the trend
-
Scalability
-
• Availability: 7x24 service
– Service Level Agreements (SLAs):
• Amazon EC2 99.95%, Google Cloud Storage 99.9%, Microsoft Windows Azure 99.9%
– Well-known outages:
1. Amazon EC2: April 21-22, 2011; again June 10, 2012
2. July 26, 2012: Azure down for 2.5 hours
3. Google GAE: October 26, 2012, about 4 hours
4. Hotmail / Outlook.com / SkyDrive: March 12, 2013, about 17 hours
5. Twitter: June 3, 2013, tweets unavailable for 45 minutes
6. Google: August 17, 2013, global web traffic dropped 40%
7. Adobe: May 25, 2014, Adobe Creative Cloud down for 24 hours
8. June 23-24, 2014: Lync instant messaging service and Outlook / Exchange Online
-
[Figure: Replication supports both Scalability (via Load Balance) and Availability (via Failover & Recovery). A Load Balancer spreads requests across replicas R1, R2, ..., Rk, ..., Rn. After Write(A,2) reaches one replica (A=2), the others still hold A=1, so a Read(A) routed elsewhere may not return 2: the consistency problem.]
-
Theory (1)
• CAP theorem (Brewer, PODC 2000 keynote)
– Under a network Partition, a system can guarantee only one of Consistency and Availability
• PACELC (Abadi, 2012, IEEE Computer)
– If there is a Partition (P), trade Availability (A) against Consistency (C); Else (E), trade Latency (L) against Consistency (C)
-
Scalability and the CAP theorem
• PODC 2000: a distributed system can provide at most two of C, A, P
• 2012, IEEE Computer: "The Growing Impact of the CAP Theorem"
• Related notions: ACID, BASE, PACELC
-
CAP
[Figure: servers S1 (Data Center A) and S2 (Data Center B) behind a Load Balancer; when the link between the data centers fails, the system must choose between answering and staying consistent.]
• C (Consistency): all nodes have the same data at any time
• A (Availability): the system allows operations all the time
• P (Partition tolerance): the system continues to work in spite of network partitions
-
Industry positions (2)
• Jim Gray: transactions and strong consistency
• Amazon: Dynamo favors availability with eventual consistency
• Yahoo: PNUTS offers timeline (per-record) consistency
-
• Strong consistency
– Linearizability; ACID transactions
• From strong to weak consistency
– Eventual consistency
– ACID -> BASE (Basically Available, Soft state, Eventual consistency)
• Costs of weak consistency
– Write conflicts
– Read staleness: reads may return out-of-date values
-
Mechanisms for consistency
• Atomic commitment: two-phase commit (2PC), three-phase commit (3PC)
• Consensus: Paxos
• Quorum: N replicas with a write quorum W and a read quorum R
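A standard fact behind quorum tuning (not derived in the slides): read and write quorums are guaranteed to overlap iff W + R > N, and two write quorums overlap iff 2W > N. A minimal brute-force sketch, with illustrative names:

import itertools

# Sketch: verify quorum overlap exhaustively over all quorum pairs.
def overlap_guaranteed(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(replicas, w)
               for rs in itertools.combinations(replicas, r))

# The combinatorial condition W + R > N gives the same answer instantly.
for n, w, r in [(3, 2, 2), (3, 2, 1), (5, 3, 3)]:
    print((n, w, r), overlap_guaranteed(n, w, r), w + r > n)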
-
• Strong consistency systems
– SMATER, POSTGRES-R, HBase
• Weak consistency models
– Eventual consistency
– Timeline consistency
– Session consistency
– Causal consistency
– Snapshot isolation
-
Measuring consistency
• Version-based bounds: K-Versions (a read returns one of the latest K versions)
• Time-based bounds
• Continuous consistency: TACT
• Probabilistic bounds: PBS (Probabilistically Bounded Staleness)
-
• Quorum-based systems
– Amazon Dynamo
– Cassandra
– Tunable W, R parameters
• Choosing W, R trades consistency against latency and availability
• Quorum placement
– Consistent hashing ring (Ring)
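A sketch of the ring placement mentioned above, in the Dynamo/Cassandra style (assumptions: MD5-style hashing, replicas on the N clockwise successors; all names are illustrative):

import hashlib
from bisect import bisect_right

# Sketch: place N replicas on a consistent-hashing ring.
def ring_position(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

nodes = sorted(ring_position("node%d" % i) for i in range(8))

def replicas_for(key, n=3):
    # The key's coordinator is the first node clockwise from its hash;
    # the remaining n-1 replicas are the next successors on the ring.
    start = bisect_right(nodes, ring_position(key)) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]

print(replicas_for("user:42"))  # three consecutive ring positions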
-
Paxos variants
• Fast Paxos
• Generalized Paxos
• IP multicast-based designs
• Mencius, Multi-Ring Paxos
• SSD-based designs (Spinnaker, CORFU)
-
• Latency matters to revenue
– Industry measurements:
• Bing: +2 s -> queries/user -1.8% and revenue/user -4.3%
• Google: +500 ms -> traffic -25%
• Amazon: +100 ms -> sales -1%
-
[Figure: three replication protocols compared. (1) Non-Uniform Total Order Broadcast: the coordinator COMMITs after its own write only. (2) Distributed Consensus (Paxos): the coordinator COMMITs after ([n/2]+1) ACKs. (3) Uniform Total Order Broadcast: the coordinator COMMITs after n ACKs. In each, the client's WRITE is logged at Replica1..Replican, then committed and acknowledged; the required ACK count is 1, [n/2]+1, and n respectively.]
-
RSM-d
[Figure: the client sends WRITE to the coordinator; the coordinator PROPOSEs to Replica1..Replican and waits for d ACKs (Agreement Phase), then COMMITs, REPLYs, and FINISHes (Commit Phase). 1 <= d <= n.]
(1) Fault assumption: at most f replica failures, f <= [n/2]
(2) If the coordinator fails, a new coordinator recovers from the log history and the commit history
• d = 1: Non-Uniform Total Order Broadcast
• d = [n/2]+1: Paxos
• d = n: Uniform Total Order Broadcast
• Otherwise: intermediate points in the consistency/latency trade-off
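A schematic sketch of the RSM-d commit rule just described: propose to all n replicas, commit after the first d ACKs, so d interpolates between non-uniform broadcast (d=1), Paxos (d=[n/2]+1), and uniform broadcast (d=n). Message passing is faked with random delays; all names are illustrative.

import random

# Sketch of the RSM-d agreement phase: PROPOSE to all n replicas,
# COMMIT after the first d ACKs (1 <= d <= n).
def rsm_d_write(n, d, value, ack_delay=lambda: random.random()):
    assert 1 <= d <= n
    acks = sorted(ack_delay() for _ in range(n))  # each replica logs, then ACKs
    commit_time = acks[d - 1]                     # wait only for the d-th ACK
    return {"value": value, "logged_at_commit": d, "latency": commit_time}

random.seed(0)
print(rsm_d_write(n=5, d=1, value="A=2"))  # non-uniform broadcast: fastest, weakest
print(rsm_d_write(n=5, d=3, value="A=2"))  # [n/2]+1: Paxos-equivalent durability
print(rsm_d_write(n=5, d=5, value="A=2"))  # uniform broadcast: slowest, strongest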
-
Write consistency of RSM-d
• For d >= [n/2]+1 (Paxos and beyond): Pwc = 0
• Modeling assumptions:
1. Messages are delivered reliably (100%);
2. The coordinator fails, if at all, only after its PROPOSE has reached d replicas;
3. A new coordinator is elected only when the old one has actually failed.
• For 1 <= d <= [n/2], Pwc is refined by relaxing the assumptions case by case:
– Write Log Lost: Pwc = Pwll
– Non-uniform Write: Pwc = Pwll * Pwnu
– Write Lost: Pwc = Pwl
– Write Duplication: Pwc = Pwl + Pwd
-
Write Log Lost
[Figure: client WRITE -> coordinator PROPOSEs to Replica1 ... Replican and commits after d ACKs (1 <= d <= [n/2]).]
• Under assumptions 1-3, a committed write is lost only if the coordinator fails and the d (d <= f) replicas holding the log entry all fail before it propagates
• Pwll: probability that all d logged copies are lost:
  Pwc = Pwll = (Pc)^d
  where Pc is the failure probability of a single replica
-
Non-uniform Write
[Figure: the coordinator sends COMMIT to only k replicas before failing, leaving a non-uniform commit history; the rest never learn the outcome.]
• Pwnu: probability that at most [n/2]-d of the remaining n-d replicas also fail (keeping within the failure bound f <= [n/2]):
  Pwnu = sum_{k=0}^{[n/2]-d} C(n-d, k) (Pc)^k (1-Pc)^(n-d-k)
• Combined: Pwc = Pwll * Pwnu
-
Write Lost
• Relaxing assumption 1: PROPOSE messages may be lost
• D: the number of replicas that actually logged the write; Pelw(D): its distribution
• Under assumptions 1 and 2, D = d; in general D = m with d <= m <= [n/2]
• Probability that all m logged copies are lost (Pwl):
  Q(x) = (Pc)^x * sum_{k=0}^{[n/2]-x} C(n-x, k) (Pc)^k (1-Pc)^(n-x-k)
  Pwc = Q(d);  Pwc(D=m) = Q(m)
• Overall:
  Pwc = Pwl = sum_{m=d}^{[n/2]} Pelw(m) * Q(m)
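A numeric sketch of Q(x) and Pwl as reconstructed above. The distribution Pelw is not recoverable from the slides, so a toy uniform one is assumed; treat the output as illustrative only.

from math import comb

def Q(x, n, pc):
    # Reconstructed: all x logged copies fail, while at most [n/2]-x of the
    # remaining n-x replicas may fail as well (the failure bound must hold).
    return pc**x * sum(comb(n - x, k) * pc**k * (1 - pc)**(n - x - k)
                       for k in range(0, n // 2 - x + 1))

def P_wl(d, n, pc, p_elw):
    # Pwl = sum over m = d..[n/2] of Pelw(m) * Q(m), with p_elw(m) the
    # (assumed) distribution of how many replicas logged the write.
    return sum(p_elw(m) * Q(m, n, pc) for m in range(d, n // 2 + 1))

uniform = lambda m: 1.0 / (5 // 2)          # toy Pelw for n=5, d=1
print(P_wl(d=1, n=5, pc=0.01, p_elw=uniform))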
-
Write Duplication
• The coordinator is falsely suspected while still alive
– The client retries the WRITE through a new coordinator
– Both copies of the write may eventually commit
[Figure: client WRITE -> suspected coordinator collects d ACKs; retry -> new coordinator collects d ACKs (1 <= d <= [n/2]).]
• Pno: probability that the two writes overlap at some replica, so the duplicate is detected and suppressed ("no duplication"); counting the replica placements gives
  Pno = 1 - C(n-d-1, d-1) / C(n-1, d-1)
-
False suspicion
[Figure: the coordinator sends periodic heartbeats; each heartbeat takes Te to arrive, and the replica trusts the coordinator for To after a heartbeat, moving from Trust to Suspicion when the next one is late.]
• f(t): distribution of the heartbeat delay T; Thb1, Thb2: send times of consecutive heartbeats; To: timeout
• Pfs: probability of a false suspicion:
  Pfs = Pr(Thb2 + Te > Thb1 + To) = Pr(Thb2 - Thb1 > To - Te)
-
Probability of duplication
• Among the n-1 followers, f have crashed and at least one of the survivors falsely suspects the coordinator:
  Pg = sum_{f=0}^{[n/2]} C(n-1, f) (Pc)^f (1-Pc)^(n-1-f) * (1 - (1-Pfs)^(n-1-f))
• Overall duplication probability:
  Pwd = Pg * (1 - Pno)
-
• Write lost (the committed write disappears from all surviving replicas)
• Write duplication (the same write is committed twice)
• Overall write-inconsistency probability:
  Pwc = Pwl + Pwd
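A Monte Carlo sketch in the spirit of the simulation used for validation later in the deck (counting lost and duplicated writes over many trials). The failure model is a simplification: crashes are independent with probability pc, and the whole false-suspicion/non-overlap chain is collapsed into a single assumed probability pdup.

import random

# Sketch: estimate Pwc = Pwl + Pwd by simulation. A write is "lost" if all
# d logged copies crash before propagating; it is "duplicated" if a false
# suspicion triggers a retry that is never deduplicated.
def estimate_pwc(d=2, pc=0.001, pdup=0.0005, writes=1_000_000, seed=1):
    rng = random.Random(seed)
    lost = dup = 0
    for _ in range(writes):
        if all(rng.random() < pc for _ in range(d)):
            lost += 1
        elif rng.random() < pdup:
            dup += 1
    return (lost + dup) / writes

print("estimated Pwc = %.2e" % estimate_pwc())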
-
Write latency Lw(d)
[Figure: WRITE -> PROPOSE -> d ACKs -> COMMIT -> REPLY -> FINISHED, with per-step delays tp (propose), tlog (logging), ta (ACK), tcom (commit), and tc, tf for the remaining legs; 1 <= d <= n.]
• LAi = tp(i) + tlog(i) + ta(i): time for replica i's ACK to reach the coordinator; LA1, LA2, ..., LAn
• Order statistics (coordinator excluded): LA(1) <= LA(2) <= ... <= LA(n-1)
• Write latency: Lw(d) = LA(d-1) + min(LCi)
-
Analysis of Lw(d)
• LAi = tp(i) + ta(i) has distribution G(t); its density g(t) is the convolution
  g(t) = integral_{-inf}^{+inf} f(x) f(t-x) dx
• The k-th order statistic LA(k) has density h(k)(t) and distribution H(k)(t), 1 <= k <= n-1
• Marginal latency cost of increasing d:
  E[Lw(d+1)] - E[Lw(d)] = E[LA(d)] - E[LA(d-1)] = dLA(d)
  dLA(d) = C(n-1, d) * integral_0^{+inf} (G(t))^d (1 - G(t))^(n-d-1) dt > 0
• So write latency grows strictly with d
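A sampling sketch of the order-statistic argument above. The deck does not fix a delay distribution, so exponential delays are assumed here; d starts at 2 because d=1 waits for no follower ACK.

import random

# Sketch: estimate E[Lw(d)] = E[LA_(d-1)] + E[min LC_i] by sampling.
def mean_lw(n=5, d=3, trials=100_000, seed=2):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        la = sorted(rng.expovariate(1.0) for _ in range(n - 1))  # ACK latencies
        lc = min(rng.expovariate(1.0) for _ in range(n))         # commit leg
        total += la[d - 2] + lc   # (d-1)-th order statistic, 1-indexed
    return total / trials

for d in range(2, 6):
    print(d, round(mean_lw(d=d), 3))   # increases monotonically with d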
-
• Bandwidth cost B(d): coordinator-side bandwidth Bc and replica-side bandwidth Bl
-
• Choosing d as an optimization problem
– Parameters: w (write rate), b (bandwidth cost), rb (stale-read penalty), rl (tolerated stale-read rate)
-
• Simulation: count write-lost events Nwl and write-duplication events Nwd over 10,000,000 writes
– Estimated Pwc = (Nwl + Nwd) / 10,000,000
• Consistency model validation: n in [2,9], d in [1,[n/2]]
– RMSE = 0.0009%, std. dev. = 0.0052%
• Latency model validation: n in [2,9], d in [1,n]
– RMSE = 0.013 ms, std. dev. = 0.019 ms
-
[Figure: left, impact of d on consistency: lg Pwc (0 down to -6) vs. n = 2..9 for d = 1..4; right, impact of d on latency: Lw(d) - Lw(1) in ms (0-250) vs. n = 2..9 for Lw(2)..Lw(5).]
-
Consistency vs. latency
[Figure: consistency (1-10^-1 down to 1-10^-6) against latency relative to Lw(1) (0-200 ms) for configurations (n,d) with n = 3, 5, 7: (3,1), (5,1), (7,1), (3,2), (5,2), (7,2), (5,3), (7,3), (7,4).]
-
Choosing (n, d) from service requirements
• Given a service S and its value V, pick the cheapest configuration
– rb: penalty per stale read
– Reported stale-read tolerances: Microsoft Bing rl = 0.0009%, Amazon.com rl = 0.01%, Google Search rl = 0.05%
– Utility function v(d)
-
Choosing the optimal d
-
-
Quorum
  p_rs = C(N-W, R) / C(N, R)
(probability that a read quorum of R replicas misses all W replicas holding the latest write)
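A direct evaluation of p_rs above; the counting argument is the same one used by PBS, and the helper name is illustrative:

from math import comb

# p_rs: probability that a read quorum of R replicas misses all W
# replicas that hold the latest write (out of N total).
def p_stale(n, w, r):
    return comb(n - w, r) / comb(n, r)   # comb returns 0 when r > n-w

print(p_stale(3, 2, 1))   # 1/3 : R=1 can miss both fresh replicas
print(p_stale(3, 2, 2))   # 0.0 : W + R > N guarantees overlap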
-
Choosing W/R also affects availability
[Figure: quorum write with N=3, W=2, R=1. The coordinator sends the WRITE to all N replicas and waits for W responses before COMMIT. Under a network partition that cuts the coordinator off from W replicas, it keeps waiting, times out, and becomes unavailable; with the partition healed, the same write succeeds.]
-
• Quorum availability depends on the data center network (DCN) topology
– 2/3-tier basic tree, fat tree, folded Clos
– Switch layers: top-of-rack (ToR) switches, aggregation switches (AS), core switches (CS)
-
• Quorum system model QS(DCN, PM, W/R)
– DCN in {2-tier basic tree (bt2), 3-tier basic tree (bt3), K-ary fat tree (ft), folded Clos (fc)}
– PM: replica placement, a 0/1-style vector over racks (PM_tor gives per-ToR replica counts)
– W/R: write quorum and read quorum sizes
• Failure probabilities: (1) core switch Pc; (2) aggregation switch Pa; (3) ToR switch Pt; (4) server Ps
-
2-tier basic tree (bt2)
[Figure: a core switch (CS) over ToR switches over servers; the coordinator and replicas are spread across racks.]
• Ai: number of replicas surviving under ToRi; Q_bt2-tor(x): probability that exactly x replicas survive in total:
  Q_bt2-tor(x) = sum_{m1+m2+...=x} prod_i P(Ai = mi)
• Availability:
  Avail(QS(bt2, PM, W)) = (1 - Pc) * sum_{x=W}^{N} Q_bt2-tor(x)
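A sketch of the bt2 computation under explicit assumptions: rack i holds pm[i] replicas, its ToR fails independently with probability pt, each server with ps, and the core switch with pc; Q is assembled by convolving the per-rack survivor distributions. All parameter values are made up.

from math import comb

# Sketch: Avail(QS(bt2, PM, W)) = (1 - Pc) * sum_{x >= W} Q(x).
def rack_dist(r, pt, ps):
    # P(A_i = m): the ToR must be up and m of the r servers alive;
    # m = 0 also covers the ToR itself being down.
    dist = [pt + (1 - pt) * ps ** r]
    dist += [(1 - pt) * comb(r, m) * (1 - ps) ** m * ps ** (r - m)
             for m in range(1, r + 1)]
    return dist

def availability(pm, w, pc=1e-4, pt=1e-3, ps=1e-2):
    q = [1.0]                    # P(total survivors = x), built by convolution
    for r in pm:
        d = rack_dist(r, pt, ps)
        q = [sum(d[m] * q[x - m]
                 for m in range(len(d)) if 0 <= x - m < len(q))
             for x in range(len(q) + len(d) - 1)]
    return (1 - pc) * sum(q[w:])

print(availability([1, 1, 1], w=2))   # spread placement: one replica per rack
print(availability([3, 0, 0], w=2))   # all replicas in one rack (ToR is a single point of failure)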
-
3-tier basic tree (bt3)
[Figure: CS over aggregation switches (AS) over ToRs over servers; each AS subtree contributes a Q_bt2-tor(x) term.]
• X_bt3-as: number of replicas reachable through the surviving aggregation switches
• Availability:
  Avail(QS(bt3, PM, W)) = (1 - Pc) * sum_{x'=W}^{N} Pr(X_bt3-as = x')
-
K-ary fat tree (ft)
[Figure: k pods (Pod1 ... Podk) of ToRs and aggregation switches over servers; the k/2 groups of core switches (CS) form GCS (Group1 ... Groupk/2), and each pod's aggregation switches form an SAS.]
K
-
AS
ToRS
CS
…
Pod1 Podk
Server
WRITE
… … … …
… …
…
Coordinator ReplicaReplica
GCS1 GCSx
SAS1 SASk
K
x GCS 3
å=
- ==N
Wxsasft xXWPMftQSAvail
')'Pr()),,((
-
Folded Clos (fc)
[Figure: a single group of core switches (GCS) over aggregation switches SAS1 ... SAS_DI/2 over ToRs and servers.]
• Availability:
  Avail(QS(fc, PM, W)) = (1 - P'gc) * sum_{x'=W}^{N} Pr(X_fc-sas = x')
  where P'gc is the failure probability of the GCS group
-
• Setup: data center with 8K servers, N = 3 replicas
• Placements: PM1_tor = <1,1,1>, PM2_tor = <2,1,0>, PM3_tor = <3,0,0>
– <1,1,1> and <2,1,0> correspond to Hadoop- and Cassandra-style placements
-
Availability vs. performance
• α: weight between write and read performance
• Optimal (W, R): maximize availability while minimizing the weighted quorum size αW + (1 - α)R
-
Evaluated over N in [3,9], W in [1,N]
-
Impact of W
• N = 3: as W grows, the nines of availability drop; fat tree (ft) and folded Clos (fc) retain the most
-
Impact of N
• With W fixed, availability (in nines) improves as N grows
-
Impact of replica placement (PM)
• Folded Clos, N = 3, six placements:
  PM1_tor = <1,0,0,1,0,0,1,0,0>   PM2_tor = <2,0,0,1,0,0,0,0,0>
  PM3_tor = <1,1,0,1,0,0,0,0,0>   PM4_tor = <2,1,0,0,0,0,0,0,0>
  PM5_tor = <1,1,1,0,0,0,0,0,0>   PM6_tor = <3,0,0,0,0,0,0,0,0>
• Hadoop's default rack-aware placement uses two racks (the <2,1,0> pattern)
-
-
Recommended configuration (α = 0.05)
• Availability target 99.9% -> (W,R) = (2,1)
• Availability target 99% -> (W,R) = (3,1)
-
Replicating Web services
• Web services are invoked over HTTP
– eBay Web services
– AWS
• Replicating a stateful service requires agreement on the order of requests
• Candidate protocols: two-phase commit (2PC), Paxos
-
Rep4WS
-
Paxos-based replication in Rep4WS
[Figure: the client sends REQUEST to WS Replica1, the leader, which assigns a sequence NUMBER (1, 2, 3, ...) and PROPOSEs it to the followers WS Replica2 and WS Replica3; every replica LOGs the proposal and ACKs; once n/2+1 ACKs arrive the leader announces LEARN, all replicas EXECUTE and RESPOND, and the leader REPLYs to the client.]
• AGREEMENT PHASE: pipelined, with concurrency degree α (Pipeline Concurrency)
• EXECUTION PHASE: requests execute in an order constrained by the request dependency graph (RDG)
-
Commutability of operations
• Analyzed at the API level
– Commutable operations can execute in either order
– Non-commutable operations are ordered through the request dependency graph (RDG)
• Example: WebShopService
– Operations: getMyCart, addToCart, searchProduct, orderProduct, modifyMyProfile
– getMyCart / searchProduct: commutable
– getMyCart / addToCart: cannot commute
-
Commutability matrix for WebShopService operations (√ = commutable, × = not):

        OP1  OP2  OP3  OP4  OP5
  OP1    ×    ×    ×    √    ×
  OP2         √    ×    √    √
  OP3              √    ×    √
  OP4                   ×    ×
  OP5                        √

Example pipeline: requests n+1 ... n+8 with operation types OP2, OP1, OP2, OP4, OP4, OP5, OP2, OP3.
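A sketch of RDG construction from the matrix above: each new request depends on every earlier request whose operation type does not commute with its own (edges are not transitively reduced here; names are illustrative).

# Sketch: build the RDG for the example pipeline from the matrix above.
COMMUTE = {  # symmetric closure of the upper-triangular matrix
    ("OP1", "OP1"): False, ("OP1", "OP2"): False, ("OP1", "OP3"): False,
    ("OP1", "OP4"): True,  ("OP1", "OP5"): False,
    ("OP2", "OP2"): True,  ("OP2", "OP3"): False, ("OP2", "OP4"): True,
    ("OP2", "OP5"): True,
    ("OP3", "OP3"): True,  ("OP3", "OP4"): False, ("OP3", "OP5"): True,
    ("OP4", "OP4"): False, ("OP4", "OP5"): False,
    ("OP5", "OP5"): True,
}

def commutes(a, b):
    return COMMUTE[(a, b)] if (a, b) in COMMUTE else COMMUTE[(b, a)]

def build_rdg(pipeline):
    # pipeline: [(request_id, op_type)] in sequence-number order.
    # Returns dependency edges (pred, succ); no transitive reduction.
    edges = []
    for i, (rid, op) in enumerate(pipeline):
        edges += [(prid, rid) for prid, pop in pipeline[:i]
                  if not commutes(op, pop)]
    return edges

pipeline = [("n+1", "OP2"), ("n+2", "OP1"), ("n+3", "OP2"), ("n+4", "OP4"),
            ("n+5", "OP4"), ("n+6", "OP5"), ("n+7", "OP2"), ("n+8", "OP3")]
for e in build_rdg(pipeline):
    print(e)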
-
Executing with the RDG
[Figure: RDG of pipeline [n+1, n+8]. Requests are LEARNed in the order (n+1)(n+5)(n+3)(n+8)(n+4)(n+6)(n+2)(n+7); each request EXECUTEs as soon as all its RDG predecessors have executed, and Waits otherwise.]
• Execution time with RDG concurrency: 3+4+4+1 = 12 units
• Execution time without the RDG (strict total order): 5+4+4+2+1 = 16 units
-
Maintaining the RDG
• The RDG is updated incrementally as requests are learned
– Each newly learned request gets edges from the earlier, non-commutable requests still in the pipeline
[Figure: inserting n+8 into the RDG over n ... n+7.]
(1) Start from an empty RDG
(2) When request i+1 is learned, link it to its non-commutable predecessors and prune requests that have already executed
-
Fault handling
• Leader failure
– A new leader is elected through Paxos
• Follower failure
– Restart from a checkpoint
– Catch up on missed requests from the leader's log
-
[Figure: left, response time (ms, 0-120) vs. replica number (3-7) for 2PC, GC-TCP, and Rep4WS/Basic Paxos; right, response time (ms, 0-350) vs. load (10-100 reqs/sec, replica number = 3) for 2PC, GC-TCP, Basic Paxos, and Rep4WS.]
-
Impact of the RDG
[Figure: three request mixes RDG1-RDG3 over pipelines n ... n+14 with different interleavings of read and write operations, and response time (ms, 0-250) vs. load (10-100 reqs/sec) for RDG1, RDG2, RDG3, and No RDG.]
• RDG-based execution cuts response time by 15%-66% across RDG1-RDG3 (26% for the middle case)
-
Causal consistency: motivation
• If operation O causally precedes operation P (O ⇝ P), replicas must apply O before P
• Example
1. User A posts an update
2. User B sees it and replies
– B's reply must never become visible before A's update
-
Capturing causality
• Explicit causality: the application declares dependencies
• Implicit (potential) causality: track everything a client has observed
• Lamport's happened-before relation formalizes potential causality
-
Session guarantees
• Monotonic reads
– Later reads never return older values than earlier ones
• Monotonic writes
– A session's writes take effect in the order issued
• Read your writes
– Reads observe the session's own preceding writes
• Writes follow reads
– If a write b is issued after reading a, then b is ordered after a at every replica
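A toy client-side sketch of two of these guarantees (monotonic reads and read-your-writes) using per-replica versions; all names are hypothetical:

# Sketch: enforce monotonic reads and read-your-writes with a session
# high-water mark over replica versions.
class Session:
    def __init__(self, replicas):
        self.replicas = replicas    # each: {"version": int, "value": str}
        self.min_version = 0        # highest version this session has seen

    def write(self, replica, value):
        replica["version"] += 1
        replica["value"] = value
        self.min_version = replica["version"]   # read-your-writes floor

    def read(self, replica):
        if replica["version"] < self.min_version:
            raise RuntimeError("replica too stale for this session; retry elsewhere")
        self.min_version = replica["version"]   # monotonic-reads floor
        return replica["value"]

r1, r2 = {"version": 0, "value": ""}, {"version": 0, "value": ""}
s = Session([r1, r2])
s.write(r1, "x=1")
print(s.read(r1))   # ok: sees its own write
# s.read(r2) would raise: r2 has not yet applied version 1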
-
Causal consistency for reads and writes
• Reads respect causal order
• Writes respect causal order
• Example
– If w(x=1) ⇝ w(x=2), then once x=2 is visible no read may return x = 1
-
Implementation sketch
• Each write is tagged with a pair <value, ts>
• The timestamp ts encodes causal order: a causally later write carries a larger ts
• A replica overwrites its copy only when the incoming ts is larger, so causally stale values never resurface
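A minimal stand-in for the <value, ts> rule above, using a Lamport clock so that a causally later write always carries the larger timestamp. This is a generic sketch, not the actual CoCaCo design:

# Sketch: last-writer-wins register keyed by a Lamport timestamp, so a
# write that causally follows another always carries the larger ts.
class CausalRegister:
    def __init__(self):
        self.clock = 0
        self.store = {}            # key -> (ts, value)

    def write(self, key, value, seen_ts=0):
        self.clock = max(self.clock, seen_ts) + 1   # Lamport tick
        ts, _ = self.store.get(key, (0, None))
        if self.clock > ts:                         # ignore causally older writes
            self.store[key] = (self.clock, value)
        return self.clock

    def read(self, key):
        return self.store.get(key, (0, None))

r = CausalRegister()
t1 = r.write("x", 1)
r.write("x", 2, seen_ts=t1)   # w(x=1) ⇝ w(x=2): ts grows, x=1 cannot resurface
print(r.read("x"))            # (2, 2)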
-
Causal-violation rates (at 20% / 40% / 60% / 80%):

                       20%     40%     60%     80%
  Cassandra-Eventual   0.16%   0.08%   0.03%   0.02%
  Cassandra-RYW        >0      >0      >0      >0
  CoCaCo               0       0       0       0

CoCaCo: Cassandra extended with causal consistency.
-
Q&A