Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014
Redis Cluster: design tradeoffs
@antirez - Pivotal
What is performance?
• Low latency.
• IOPS.
• Operations quality and data model.
Go Cluster
• Redis Cluster must serve the same use cases Redis serves.
• Tradeoffs are inherently needed in distributed systems.
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
CP systems
[Diagram: a client writes to S1, which replicates the write to S2, S3 and S4]
CAP: the price of consistency is added latency
CP systems
[Diagram: S2, S3 and S4 send ACKs back for the replicated write]
Reply to the client after a majority ACKs
And… there is the disk
[Diagram: S1, S2 and S3 each write to their own disk]
CP algorithms may require fsync-before-ack. Durability and consistency are not always orthogonal.
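The majority-ACK rule sketched in these slides can be written down in a few lines of Python (an illustrative sketch, not Redis code; `quorum` and `can_reply` are invented helper names):

```python
# Sketch of the CP write path: the write is acknowledged to the client
# only after a majority of nodes have confirmed it (and, if the
# algorithm requires it, fsync'ed it to disk before replying).

def quorum(n_nodes: int) -> int:
    """Smallest majority out of n_nodes."""
    return n_nodes // 2 + 1

def can_reply(acks_received: int, n_nodes: int) -> bool:
    """True once enough nodes have acknowledged the write."""
    return acks_received >= quorum(n_nodes)
```

Waiting for the quorum on every write is exactly where the added latency of consistency comes from.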
AP systems
[Diagram: during a partition the client keeps writing to S1, while S2 is on the other side]
Eventual consistency with merges? (note: merge is not strictly part of EC)
[Diagram: two clients on different sides of the partition see diverged values of the same key: A = {1,2,3,8,12,13,14} vs. A = {2,3,8,11,12,1}]
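For the diverged sets above, a merge-based AP store could reconcile the two sides with a set union (a hypothetical sketch; as the talk notes later, Redis Cluster itself does not merge values):

```python
def merge_sets(a: set, b: set) -> set:
    # Set union is commutative, associative and idempotent, so the
    # merged result is the same no matter in which order the two
    # sides exchange and apply each other's updates.
    return a | b

# The two diverged values of A from the slide:
side1 = {1, 2, 3, 8, 12, 13, 14}
side2 = {2, 3, 8, 11, 12, 1}
merged = merge_sets(side1, side2)
```

Merges like this only work for data types with a natural conflict-free merge; for arbitrary values the system must pick a winner instead.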
Many kinds of consistency
• The “C” of CAP is strong consistency.
• It is not the only available tradeoff of course.
• Consistency is the set of liveness and safety properties a given system provides.
• “Eventual consistency” alone says almost nothing at all. What liveness/safety properties does the system provide, if not “C”?
Redis Cluster
[Diagram: the client talks to two shards; one master serves slots A,B,C and another serves slots D,E,F, each with two replicas]
Sharding and replication (asynchronous).
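Sharding in Redis Cluster maps every key to one of 16384 hash slots via CRC16(key) mod 16384. A minimal Python sketch of that mapping, including the `{...}` hash tag rule used for multi-key operations:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots.

    If the key contains a non-empty {...} hash tag, only the tag is
    hashed, so keys sharing the same tag land in the same slot.
    """
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end != start + 1:  # ignore empty tags like "{}"
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Because `{user:1000}.following` and `{user:1000}.followers` share the tag `user:1000`, they hash to the same slot.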
Asynchronous replication
[Diagram: the client writes to the master for A,B,C; the master replies to the client immediately and replicates to its replicas, which ACK asynchronously]
Full Mesh
[Diagram: all nodes, masters and slaves, are connected to each other in a full mesh]
• Heartbeats.
• Nodes gossip.
• Failover auth.
• Config update.
No proxy, but redirections
[Diagram: masters serve slot ranges A,B,C / D,E,F / G,H,I / L,M,N / O,P,Q / R,S,T; clients asking “A?” or “D?” are redirected to the node serving that slot]
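A cluster-aware client follows the redirection itself. A sketch of that loop, assuming a hypothetical `send` transport function and the `MOVED <slot> <host:port>` reply that Redis Cluster uses for redirections:

```python
def execute(send, node, command, max_redirects=5):
    """Run a command, following MOVED redirections to the right node.

    `send(node, command)` is a hypothetical transport function that
    returns either the command reply or a "MOVED <slot> <node>" string.
    """
    for _ in range(max_redirects):
        reply = send(node, command)
        if isinstance(reply, str) and reply.startswith("MOVED"):
            # e.g. "MOVED 3999 127.0.0.1:6381" -> retry on that node
            _, _slot, node = reply.split()
            continue
        return reply
    raise RuntimeError("too many redirections")
```

Real clients also cache the slot-to-node map learned from redirections, so subsequent commands go straight to the right node.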
Failure detection
• Failure reports within window of time (via gossip).
• Trigger for actual failover.
• Two main states: PFAIL -> FAIL.
Failure detection
[Diagram: S1 stops responding; S2, S3 and S4 each independently flag S1 = PFAIL]
Failure detection
[Diagram: the PFAIL state propagates via gossip; S3 now has S1 = PFAIL reported by S2 and S4]
Failure detection
[Diagram: with a majority of masters reporting PFAIL within the window, S3 promotes its view to S1 = FAIL]
Failure detection
[Diagram: S3 broadcasts the FAIL state, forcing S2 and S4 to set S1 = FAIL as well]
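The PFAIL -> FAIL promotion shown in these slides can be sketched as a quorum check over the gossip reports received within the time window (names and data shapes here are illustrative, not the actual implementation):

```python
def should_mark_fail(reports, n_masters, now, window):
    """Decide whether a PFAIL node should be promoted to FAIL.

    `reports` is an iterable of (reporting_node, timestamp) entries
    collected via gossip for the suspected node. The node is marked
    FAIL once a majority of masters reported PFAIL recently enough.
    """
    fresh = {node for node, ts in reports if now - ts <= window}
    return len(fresh) >= n_masters // 2 + 1
```

Expiring stale reports is what makes this a "window of time" check: an old report from a healed node does not count toward the quorum.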
Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
Raft and failover
• Config propagation is solved using ideas from the Raft algorithm (just a subset).
• Raft is a consensus algorithm built on top of different “layers”.
• Raft paper is already a classic (highly recommended).
• Full Raft not needed for Redis Cluster slots config.
Failover and config
[Diagram: a master has failed; one of its slaves increments the Epoch (a logical clock) and asks the other masters: “Vote for me!”]
Too easy?
• Why don’t we need full Raft?
• Because our config is idempotent: when the partition heals we can throw away old slot configs in favor of newer versions.
• The same algorithm is used in Sentinel v2 and works well.
Config propagation
• After a successful failover, new slot config is broadcasted.
• If there are partitions, the config gets updated when they heal (it is broadcast from time to time, plus stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
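The “greater Epoch always wins” rule can be sketched as a small update function over a slot table (an illustrative sketch, not the actual implementation; the table layout is an assumption):

```python
def apply_config(table, slot, owner, epoch):
    """Apply a received slot config only if its epoch is newer.

    `table` maps slot -> (owner, epoch). Since higher epochs always
    win, every node converges to the same owner for each slot no
    matter in which order the broadcasts arrive.
    """
    current = table.get(slot)
    if current is None or epoch > current[1]:
        table[slot] = (owner, epoch)
    return table
```

Idempotence is the point: replaying the same broadcasts in any order, any number of times, leaves the table in the same final state.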
Redis Cluster consistency?
• Eventually consistent: the last failover wins.
• In the vanilla design the number of lost writes is unbounded.
• There are mechanisms to avoid unbounded data loss.
Failure mode… #1
[Diagram: the client writes to the master for A,B,C; the master fails before replicating, a replica is promoted, and the write is lost]
Failure mode #2
[Diagram: a partition splits the cluster into a minority side and a majority side; the client keeps writing to the old A,B,C master on the minority side]
Bounded divergences
[Diagram: after node-timeout the minority-side master stops accepting writes, so the divergence from the majority side is bounded]
More data safety?
• OP logging until the async ACK is received.
• The log is replayed to the new master when the node turns into a slave.
• “Safe” connections, on demand.
• Example: SADD (idempotent + commutative).
• SET-LWW foo bar <wall-clock>.
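The SET-LWW idea, a write carrying a wall-clock timestamp so that the most recent write wins on conflict, can be sketched as follows (hypothetical names; this is the last-write-wins policy, not actual Redis code):

```python
def lww_set(store, key, value, wall_clock):
    """Apply a timestamped write only if it is newer than what we have.

    `store` maps key -> (value, wall_clock). Replaying writes in any
    order converges to the value with the highest timestamp, which is
    what makes LWW a usable (if lossy) conflict resolution policy.
    """
    current = store.get(key)
    if current is None or wall_clock > current[1]:
        store[key] = (value, wall_clock)

store = {}
lww_set(store, "foo", "bar", 100)
lww_set(store, "foo", "old", 90)   # stale write, ignored
```

The price of LWW is that concurrent writes with older clocks are silently discarded, so it suits values where "latest wins" is acceptable.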
Multi key ops
• Hey hashtags!
• {user:1000}.following {user:1000}.followers.
• Unavailable for small windows, but no data exchange between nodes.
Multi key ops (availability)
• Single key ops: always available during resharding.
• Multi key ops, available if:
• No manual resharding of this hash slot in progress.
• Resharding of the slot is in progress, but the source or destination node has all the keys.
• Otherwise we get a -TRYAGAIN error.
{User:1}.key_A {User:2}.Key_B -> SUNION key_A key_B -> -TRYAGAIN
{User:1}.key_A {User:1}.Key_B -> SUNION key_A key_B -> … output …
{User:1}.key_A {User:1}.Key_B -> SUNION key_A key_B -> … output …
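On the client side, -TRYAGAIN is a transient condition: the natural handling is simply to retry the multi-key command after a short delay, since it clears once the slot migration completes (a sketch; `send` and the retry parameters are assumptions):

```python
import time

def run_multikey(send, command, retries=10, delay=0.01):
    """Retry a multi-key command while the cluster answers TRYAGAIN.

    `send(command)` is a hypothetical transport function returning
    either the command's reply or the string "TRYAGAIN".
    """
    for _ in range(retries):
        reply = send(command)
        if reply == "TRYAGAIN":
            time.sleep(delay)   # slot migration in progress, back off
            continue
        return reply
    raise RuntimeError("slot still migrating after all retries")
```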
Redis Cluster ETA
• Release Candidate available.
• We’ll go stable in Q1 2015.
• Ask me anything.