Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014
Redis Cluster: design tradeoffs
@antirez - Pivotal
What is performance?
• Low latency.
• IOPS.
• Operations quality and data model.
Go Cluster
• Redis Cluster must serve the same use cases Redis serves.
• Tradeoffs are inherently needed in distributed systems.
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
CP systems
[Diagram: a client writes to S1, which replicates the write to S2, S3 and S4]
CAP: the price of consistency is added latency
CP systems
[Diagram: S2, S3 and S4 send ACKs back for the replicated write]
Reply to the client after a majority ACKs
And… there is the disk
[Diagram: S1, S2 and S3 each write to their own disk]
CP algorithms may require fsync-before-ack. Durability and consistency are not always orthogonal.
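The majority-ACK rule sketched in these slides can be written down in a few lines of Python (an illustrative sketch, not Redis code; `quorum` and `can_reply` are invented helper names):

```python
# Sketch of the CP write path: the write is acknowledged to the client
# only after a majority of nodes have confirmed it (and, if the
# algorithm requires it, fsync'ed it to disk before replying).

def quorum(n_nodes: int) -> int:
    """Smallest majority out of n_nodes."""
    return n_nodes // 2 + 1

def can_reply(acks_received: int, n_nodes: int) -> bool:
    """True once enough nodes have acknowledged the write."""
    return acks_received >= quorum(n_nodes)
```

Waiting for the quorum on every write is exactly where the added latency of consistency comes from.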
AP systems
[Diagram: during a partition the client keeps writing to S1, while S2 is on the other side]
Eventual consistency with merges? (note: merge is not strictly part of EC)
[Diagram: two clients on different sides of the partition see diverged values of the same key: A = {1,2,3,8,12,13,14} vs. A = {2,3,8,11,12,1}]
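For the diverged sets above, a merge-based AP store could reconcile the two sides with a set union (a hypothetical sketch; as the talk notes later, Redis Cluster itself does not merge values):

```python
def merge_sets(a: set, b: set) -> set:
    # Set union is commutative, associative and idempotent, so the
    # merged result is the same no matter in which order the two
    # sides exchange and apply each other's updates.
    return a | b

# The two diverged values of A from the slide:
side1 = {1, 2, 3, 8, 12, 13, 14}
side2 = {2, 3, 8, 11, 12, 1}
merged = merge_sets(side1, side2)
```

Merges like this only work for data types with a natural conflict-free merge; for arbitrary values the system must pick a winner instead.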
Many kinds of consistency
• The “C” of CAP is strong consistency.
• It is not the only available tradeoff of course.
• Consistency is the set of liveness and safety properties a given system provides.
• “Eventual consistency” alone says almost nothing at all. What liveness/safety properties does the system provide, if not “C”?
Redis Cluster
[Diagram: the client talks to two shards; one master serves slots A,B,C and another serves slots D,E,F, each with two replicas]
Sharding and replication (asynchronous).
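Sharding in Redis Cluster maps every key to one of 16384 hash slots via CRC16(key) mod 16384. A minimal Python sketch of that mapping, including the `{...}` hash tag rule used for multi-key operations:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots.

    If the key contains a non-empty {...} hash tag, only the tag is
    hashed, so keys sharing the same tag land in the same slot.
    """
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end != start + 1:  # ignore empty tags like "{}"
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Because `{user:1000}.following` and `{user:1000}.followers` share the tag `user:1000`, they hash to the same slot.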
Asynchronous replication
[Diagram: the client writes to the master for A,B,C; the master replies to the client immediately and replicates to its replicas, which ACK asynchronously]
Full Mesh
[Diagram: all nodes, masters and slaves, are connected to each other in a full mesh]
• Heartbeats.
• Nodes gossip.
• Failover auth.
• Config update.
No proxy, but redirections
[Diagram: masters serve slot ranges A,B,C / D,E,F / G,H,I / L,M,N / O,P,Q / R,S,T; clients asking “A?” or “D?” are redirected to the node serving that slot]
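A cluster-aware client follows the redirection itself. A sketch of that loop, assuming a hypothetical `send` transport function and the `MOVED <slot> <host:port>` reply that Redis Cluster uses for redirections:

```python
def execute(send, node, command, max_redirects=5):
    """Run a command, following MOVED redirections to the right node.

    `send(node, command)` is a hypothetical transport function that
    returns either the command reply or a "MOVED <slot> <node>" string.
    """
    for _ in range(max_redirects):
        reply = send(node, command)
        if isinstance(reply, str) and reply.startswith("MOVED"):
            # e.g. "MOVED 3999 127.0.0.1:6381" -> retry on that node
            _, _slot, node = reply.split()
            continue
        return reply
    raise RuntimeError("too many redirections")
```

Real clients also cache the slot-to-node map learned from redirections, so subsequent commands go straight to the right node.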
Failure detection
• Failure reports within window of time (via gossip).
• Trigger for actual failover.
• Two main states: PFAIL -> FAIL.
Failure detection
[Diagram: S1 stops responding; S2, S3 and S4 each independently flag S1 = PFAIL]
Failure detection
[Diagram: the PFAIL state propagates via gossip; S3 now has S1 = PFAIL reported by S2 and S4]
Failure detection
[Diagram: with a majority of masters reporting PFAIL within the window, S3 promotes its view to S1 = FAIL]
Failure detection
[Diagram: S3 broadcasts the FAIL state, forcing S2 and S4 to set S1 = FAIL as well]
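The PFAIL -> FAIL promotion shown in these slides can be sketched as a quorum check over the gossip reports received within the time window (names and data shapes here are illustrative, not the actual implementation):

```python
def should_mark_fail(reports, n_masters, now, window):
    """Decide whether a PFAIL node should be promoted to FAIL.

    `reports` is an iterable of (reporting_node, timestamp) entries
    collected via gossip for the suspected node. The node is marked
    FAIL once a majority of masters reported PFAIL recently enough.
    """
    fresh = {node for node, ts in reports if now - ts <= window}
    return len(fresh) >= n_masters // 2 + 1
```

Expiring stale reports is what makes this a "window of time" check: an old report from a healed node does not count toward the quorum.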
Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
Raft and failover
• Config propagation is solved using ideas from the Raft algorithm (just a subset).
• Raft is a consensus algorithm built on top of different “layers”.
• Raft paper is already a classic (highly recommended).
• Full Raft not needed for Redis Cluster slots config.
Failover and config
[Diagram: a master has failed; one of its slaves increments the Epoch (a logical clock) and asks the other masters: “Vote for me!”]
Too easy?
• Why don’t we need full Raft?
• Because our config is idempotent: when the partition heals we can throw away old slot configs in favor of newer versions.
• The same algorithm is used in Sentinel v2 and works well.
Config propagation
• After a successful failover, new slot config is broadcasted.
• If there are partitions, the config gets updated when they heal (it is broadcast from time to time, plus stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
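The “greater Epoch always wins” rule can be sketched as a small update function over a slot table (an illustrative sketch, not the actual implementation; the table layout is an assumption):

```python
def apply_config(table, slot, owner, epoch):
    """Apply a received slot config only if its epoch is newer.

    `table` maps slot -> (owner, epoch). Since higher epochs always
    win, every node converges to the same owner for each slot no
    matter in which order the broadcasts arrive.
    """
    current = table.get(slot)
    if current is None or epoch > current[1]:
        table[slot] = (owner, epoch)
    return table
```

Idempotence is the point: replaying the same broadcasts in any order, any number of times, leaves the table in the same final state.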
Redis Cluster consistency?
• Eventually consistent: the last failover wins.
• In the vanilla design the number of lost writes is unbounded.
• There are mechanisms to avoid unbounded data loss.
Failure mode… #1
[Diagram: the client writes to the master for A,B,C; the master fails before replicating, a replica is promoted, and the write is lost]
Failure mode #2
[Diagram: a partition splits the cluster into a minority side and a majority side; the client keeps writing to the old A,B,C master on the minority side]
Bounded divergences
[Diagram: after node-timeout the minority-side master stops accepting writes, so the divergence from the majority side is bounded]
More data safety?
• OP logging until the async ACK is received.
• The log is replayed to the new master when the node turns into a slave.
• “Safe” connections, on demand.
• Example: SADD (idempotent + commutative).
• SET-LWW foo bar <wall-clock>.
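The SET-LWW idea, a write carrying a wall-clock timestamp so that the most recent write wins on conflict, can be sketched as follows (hypothetical names; this is the last-write-wins policy, not actual Redis code):

```python
def lww_set(store, key, value, wall_clock):
    """Apply a timestamped write only if it is newer than what we have.

    `store` maps key -> (value, wall_clock). Replaying writes in any
    order converges to the value with the highest timestamp, which is
    what makes LWW a usable (if lossy) conflict resolution policy.
    """
    current = store.get(key)
    if current is None or wall_clock > current[1]:
        store[key] = (value, wall_clock)

store = {}
lww_set(store, "foo", "bar", 100)
lww_set(store, "foo", "old", 90)   # stale write, ignored
```

The price of LWW is that concurrent writes with older clocks are silently discarded, so it suits values where "latest wins" is acceptable.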
Multi key ops
• Hey hashtags!
• {user:1000}.following {user:1000}.followers.
• Unavailable for small windows, but no data exchange between nodes.
Multi key ops (availability)
• Single key ops: always available during resharding.
• Multi key ops, available if:
• No manual resharding of this hash slot in progress.
• Resharding of the slot is in progress, but the source or destination node has all the keys.
• Otherwise we get a -TRYAGAIN error.
{User:1}.key_A {User:2}.Key_B -> SUNION key_A key_B -> -TRYAGAIN
{User:1}.key_A {User:1}.Key_B -> SUNION key_A key_B -> … output …
{User:1}.key_A {User:1}.Key_B -> SUNION key_A key_B -> … output …
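On the client side, -TRYAGAIN is a transient condition: the natural handling is simply to retry the multi-key command after a short delay, since it clears once the slot migration completes (a sketch; `send` and the retry parameters are assumptions):

```python
import time

def run_multikey(send, command, retries=10, delay=0.01):
    """Retry a multi-key command while the cluster answers TRYAGAIN.

    `send(command)` is a hypothetical transport function returning
    either the command's reply or the string "TRYAGAIN".
    """
    for _ in range(retries):
        reply = send(command)
        if reply == "TRYAGAIN":
            time.sleep(delay)   # slot migration in progress, back off
            continue
        return reply
    raise RuntimeError("slot still migrating after all retries")
```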
Redis Cluster ETA
• Release Candidate available.
• We’ll go stable in Q1 2015.
• Ask me anything.