Self healing data
-
Upload
uwe-friedrichsen -
Category
Technology
-
view
3.259 -
download
2
description
Transcript of Self healing data
![Page 1: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/1.jpg)
Self-healing data An introduction to consistency models, Quorums and CRDTs
Uwe Friedrichsen, codecentric AG, 2014
![Page 2: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/2.jpg)
@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com
![Page 3: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/3.jpg)
Why NoSQL?
• Scalability
• Easier schema evolution
• Availability on unreliable OTS hardware
• It‘s more fun ...
![Page 4: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/4.jpg)
Why NoSQL?
• Scalability
• Easier schema evolution
• Availability on unreliable OTS hardware
• It‘s more fun ...
![Page 5: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/5.jpg)
Challenges • Giving up ACID transactions
• (Temporal) anomalies and inconsistencies short-term due to replication or long-term due to partitioning
It might happen. Thus, it will happen!
![Page 6: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/6.jpg)
C Consistency
A Availability
P Partition Tolerance
Strict Consistency
ACID / 2PC
Strong Consistency
Quorum R&W / Paxos
Eventual Consistency
CRDT / Gossip / Hinted Handoff
![Page 7: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/7.jpg)
Strict Consistency (CA) • Great programming model
no anomalies or inconsistencies need to be considered
• Does not scale well best for single node databases
„We know ACID – It works!“
„We know 2PC – It sucks!“
Use for moderate data amounts
![Page 8: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/8.jpg)
And what if I need more data?
• Distributed datastore
• Partition tolerance is a must
• Need to give up strict consistency (CP or AP)
![Page 9: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/9.jpg)
Strong Consistency (CP) • Majority based consistency model
can tolerate up to N nodes failing out of 2N+1 nodes
• Good programming model Single-copy consistency
• Trades consistency for availabilityin case of partitioning
Paxos (for sequential consistency)
Quorum-based reads & writes
![Page 10: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/10.jpg)
Quorum-based reads & writes • N – number of nodes (replicas)
• R – number of reads delivering the same result required
• W – number of confirmed writes required
W > N / 2
R + W > N
![Page 11: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/11.jpg)
Example 1 • 3 replica nodes
This is: N = 3
• W > N / 2 à W > 3 / 2 à W > 1,5
Let‘s pick: W = 2 (can tolerate failure of 1 node)
• R + W > N à R + 2 > 3 à R > 1
Let‘s pick: R = 2 (can tolerate failure of 1 node)
![Page 12: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/12.jpg)
Client
Node 1 Node 3 Node 2
![Page 13: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/13.jpg)
W W W
Client
Node 1 Node 3 Node 2
![Page 14: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/14.jpg)
Client
Node 1 Node 3 Node 2
✔ ✖ ✔
W = 2 ✔
![Page 15: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/15.jpg)
Client
Node 1 Node 3 Node 2
![Page 16: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/16.jpg)
R R R
Client
Node 1 Node 3 Node 2
✔ ✔ ✔
![Page 17: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/17.jpg)
Client
Node 1 Node 3 Node 2
2x / 1x ✔
R = 2 ✔
![Page 18: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/18.jpg)
Client
Node 1 Node 3 Node 2
![Page 19: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/19.jpg)
R R R
Client
Node 1 Node 3 Node 2
✔ ✔ ✖
![Page 20: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/20.jpg)
Client
Node 1 Node 3 Node 2
1x / 1x ✖
R = 1 ✖
![Page 21: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/21.jpg)
Example 2 • 3 replica nodes
This is: N = 3
• W > N / 2 à W > 3 / 2 à W > 1,5
Let‘s pick: W = 3 (strict consistency – does not tolerate any failure)
• R + W > N à R + 3 > 3 à R > 0
Let‘s pick: R = 1 (single read from any node sufficient)
![Page 22: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/22.jpg)
Example 3 • 5 replica nodes
This is: N = 5
• W > N / 2 à W > 5 / 2 à W > 2,5
Let‘s pick: W = 3 (can tolerate failure of 2 nodes)
• R + W > N à R + 3 > 5 à R > 2
Let‘s pick: R = 3 (can tolerate failure of 2 nodes)
![Page 23: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/23.jpg)
Limitations of QB R&W • Requires fixed set of replica nodes
Dynamic adding and removal of nodes not possible
• Client-centric consistency Consistency only perceived on client, not on nodes
• Requires additional measures on nodes Nodes need to implement at least eventual consistency
Use if client-perceived strong consistency is
sufficient and simplicity trumps formal precision
![Page 24: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/24.jpg)
And what if I need more availability?
• Need to give up strong consistency (CP)
• Relax required consistency properties even more
• Leads to eventual consistency (AP)
![Page 25: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/25.jpg)
Eventual Consistency (AP) • Gives up some consistency guarantees
no sequential consistency, anomalies become visible
• Maximum availability possiblecan tolerate up to N-1 nodes failing out of N nodes
• Challenging programming model anomalies usually need to be resolved explicitly
Gossip / Hinted Handoffs
CRDT
![Page 26: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/26.jpg)
Conflict-free Replicated Data Types • Eventually consistent, self-stabilizing data structures
• Designed for maximum availability
• Tolerates up to N-1 out of N nodes failing
State-based CRDT: Convergent Replicated Data Type (CvRDT)
Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
![Page 27: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/27.jpg)
A bit of theory first ...
![Page 28: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/28.jpg)
Convergent Replicated Data Type State-based CRDT – CvRDT
• All replicas (usually) connected
• Exchange state between replicas, calculate new state on target replica
• State transfer at least once over eventually-reliable channels
• Set of possible states form a Semilattice • Partially ordered set of elements where all subsets have a Least Upper Bound (LUB)
• All state changes advance upwards with respect to the partial order
![Page 29: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/29.jpg)
Commutative Replicated Data Type Operation-based CRDT - CmRDT
• All replicas (usually) connected
• Exchange update operations between replicas, apply on target replica
• Reliable broadcast with ordering guarantee for non-concurrent updates
• Concurrent updates must be commutative
![Page 30: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/30.jpg)
That‘s been enough theory ...
![Page 31: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/31.jpg)
Counter
![Page 32: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/32.jpg)
Op-based Counter Data
Integer i
Init i ≔ 0
Query return i
Operations increment(): i ≔ i + 1 decrement(): i ≔ i - 1
![Page 33: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/33.jpg)
State-based G-Counter (grow only) (Naïve approach)
Data
Integer i
Init i ≔ 0
Query return i
Update increment(): i ≔ i + 1
Merge( j) i ≔ max(i, j)
![Page 34: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/34.jpg)
State-based G-Counter (grow only) (Naïve approach)
R1
R3
R2
i = 1
U
i = 1
U
i = 0
i = 0
i = 0
I
I
I
M
i = 1 i = 1
M
![Page 35: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/35.jpg)
State-based G-Counter (grow only) (Vector-based approach)
Data
Integer V[] / one element per replica set
Init V ≔ [0, 0, ... , 0]
Query return ∑i V[i]
Update increment(): V[i] ≔ V[i] + 1 / i is replica set number
Merge(V‘) ∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V‘[i])
![Page 36: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/36.jpg)
State-based G-Counter (grow only) (Vector-based approach)
R1
R3
R2
V = [1, 0, 0]
U
V = [0, 0, 0]
I
I
I
V = [0, 0, 0]
V = [0, 0, 0]
U
V = [0, 0, 1]
M
V = [1, 0, 0]
M
V = [1, 0, 1]
![Page 37: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/37.jpg)
State-based PN-Counter (pos./neg.) • Simple vector approach as with G-Counter does not work
• Violates monotonicity requirement of semilattice
• Need to use two vectors • Vector P to track incements
• Vector N to track decrements
• Query result is ∑i P[i] – N[i]
![Page 38: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/38.jpg)
State-based PN-Counter (pos./neg.) Data
Integer P[], N[] / one element per replica set
Init P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0]
Query Return ∑i P[i] – N[i]
Update increment(): P[i] ≔ P[i] + 1 / i is replica set number decrement(): N[i] ≔ N[i] + 1 / i is replica set number
Merge(P‘, N‘) ∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P‘[i]) ∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N‘[i])
![Page 39: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/39.jpg)
Non-negative Counter Problem: How to check a global invariant with local information only? • Approach 1: Only dec when local state is > 0
• Concurrent decs could still lead to negative value • Approach 2: Externalize negative values as 0
• inc(negative value) == noop(), violates counter semantics • Approach 3: Local invariant – only allow dec if P[i] - N[i] > 0
• Works, but may be too strong limitation • Approach 4: Synchronize
• Works, but violates assumptions and prerequisites of CRDTs
![Page 40: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/40.jpg)
Sets
![Page 41: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/41.jpg)
Op-based Set (Naïve approach)
Data
Set S
Init S ≔ {}
Query(e) return e ∈ S
Operations add(e) : S ≔ S ∪ {e} remove(e): S ≔ S \ {e}
![Page 42: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/42.jpg)
Op-based Set (Naïve approach)
R1
R3
R2
S = {e}
add(e)
S = {}
S = {}
S = {}
I
I
I
S = {e}
S = {}
rmv(e)
add(e)
add(e) add(e) rmv(e)
add(e)
S = {e}
S = {e} S = {e} S = {}
![Page 43: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/43.jpg)
State-based G-Set (grow only) Data
Set S
Init S ≔ {}
Query(e) return e ∈ S
Update add(e): S ≔ S ∪ {e}
Merge(S‘) S = S ∪ S‘
![Page 44: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/44.jpg)
State-based 2P-Set (two-phase) Data
Set A, R / A: added, R: removed
Init A ≔ {}, R ≔ {}
Query(e) return e ∈ A ∧ e ∉ R
Update add(e): A ≔ A ∪ {e} remove(e): (pre query(e)) R ≔ R ∪ {e}
Merge(A‘, R‘) A ≔ A ∪ A‘, R ≔ R ∪ R‘
![Page 45: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/45.jpg)
Op-based OR-Set (observed-remove) Data
Set S / Set of pairs { (element e, unique tag u), ... }
Init S ≔ {}
Query(e) return ∃u : (e, u) ∈ S
Operations add(e): S ≔ S ∪ { (e, u) } / u is generated unique tag remove(e):
pre query(e) R ≔ { (e, u) | ∃u : (e, u) ∈ S } /at source („prepare“) S ≔ S \ R /downstream („execute“)
![Page 46: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/46.jpg)
Op-based OR-Set (observed-remove)
R1
R3
R2
S = {ea}
add(e)
S = {}
S = {}
S = {}
I
I
I
S = {eb}
S = {}
rmv(e)
add(e)
add(eb) add(ea) rmv(ea)
add(eb)
S = {eb}
S = {eb} S = {ea, eb} S = {eb}
![Page 47: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/47.jpg)
More datatypes • Register
• Dictionary (Map)
• Tree
• Graph
• Array
• List
plus more representations for each datatype
![Page 48: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/48.jpg)
Garbage collection • Sets could grow infinitely in worst case
• Garbage collection possible, but a bit tricky
• Only remove updates that can be deleted safely
and were received by all replicas
• Usually implemented using vector clocks
• Can tolerate up to n-1 crashes
• Live only while no replicas are crashed
• Can induce surprising behavior sometimes
• Sometimes stronger consensus is needed (Paxos, ...)
![Page 49: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/49.jpg)
Limitations of CRDTs • Very weak consistency guarantees
Strives for „quiescent consistency“
• Eventually consistentNot suitable for high-volume ID generator or alike
• Not easy to understand and model
• Not all data structures representable
Use if availability is extremely important
![Page 50: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/50.jpg)
Further reading 1. Shapiro et al., Conflict-free Replicated
Data Types, Inria Research report, 2011
2. Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types, Inria Research report, 2011
3. Basho Tag Archives: CRDT, https://basho.com/tag/crdt/
4. Leslie Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM (1978)
![Page 51: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/51.jpg)
Wrap-up
• CAP requires rethinking consistency
• Strict Consistency ACID / 2PC
• Strong Consistency Quorum-based R&W, Paxos
• Eventual Consistency CRDT, Gossip, Hinted Handoffs
Pick your consistency model based on your consistency and availability requirements
![Page 52: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/52.jpg)
The real world is not ACID Thus, it is perfectly fine to go for a relaxed consistency model
![Page 53: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/53.jpg)
@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com
![Page 54: Self healing data](https://reader034.fdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/54.jpg)