Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014

Redis Cluster design tradeoffs @antirez - Pivotal


Page 1:

Redis Cluster design tradeoffs @antirez - Pivotal

Page 2:

What is performance?

• Low latency.

• IOPS.

• Operations quality and data model.

Page 3:

Go Cluster

• Redis Cluster must serve the same use cases as Redis.

• Tradeoffs are inherent in distributed systems (DS).

• CAP? Merge values? Strong consistency and consensus? How to replicate values?

Page 4:

CP systems

[Diagram: Client and nodes S1, S2, S3, S4]

CAP: consistency price is added latency

Page 5:

CP systems

[Diagram: Client and nodes S1, S2, S3, S4]

Reply to client after majority ACKs
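The rule on this slide can be sketched in a few lines. This is a toy model and not a real consensus protocol: `Replica` and `cp_write` are hypothetical names, and the sketch ignores what happens to entries already stored when no majority is reached.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def replicate(self, entry) -> bool:
        """Store the entry and ACK, or fail to ACK when unreachable."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True

def cp_write(replicas, entry) -> bool:
    """Reply OK to the client only after a majority of replicas ACK."""
    acks = sum(1 for r in replicas if r.replicate(entry))
    return acks >= len(replicas) // 2 + 1

nodes = [Replica("S1"), Replica("S2"), Replica("S3"),
         Replica("S4", reachable=False)]
ok_with_three = cp_write(nodes, "SET foo bar")   # 3 of 4 ACKs: majority
nodes[2].reachable = False
ok_with_two = cp_write(nodes, "SET foo baz")     # 2 of 4 ACKs: no majority
```

The client waits for the slowest ACK inside the majority, which is where the added latency of the previous slide comes from.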

Page 6:

And… there is the disk

[Diagram: nodes S1, S2, S3, each with its own disk]

CP algorithms may require fsync-before-ack. Durability and consistency are not always orthogonal.
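A minimal sketch of fsync-before-ack on a single node, assuming an append-only log file. The function name `durable_append` and the file name are made up for the example.

```python
import os
import tempfile

def durable_append(path: str, record: bytes) -> str:
    """Append a record and fsync it before acknowledging."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to the disk
    return "ACK"               # acknowledge only after the fsync returned

with tempfile.TemporaryDirectory() as d:
    log_path = os.path.join(d, "replica.log")
    reply = durable_append(log_path, b"SET foo bar")
    with open(log_path, "rb") as f:
        stored = f.read()
```

Every acknowledged write pays for a disk flush, so the cost adds to the network round trips of the majority ACK.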

Page 7:

AP systems

Eventual consistency with merges? (note: merge is not strictly part of EC)

[Diagram: two clients and two nodes, S1 and S2, holding diverged copies of A: A = {1,2,3,8,12,13,14} and A = {2,3,8,11,12,1}]
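One possible merge for the two diverged copies of A is a plain set union. This is only an illustration: the slide does not say which merge function an AP system would use, and a union cannot express removals.

```python
copy_1 = {1, 2, 3, 8, 12, 13, 14}
copy_2 = {2, 3, 8, 11, 12, 1}

def merge(a: set, b: set) -> set:
    """Merge two diverged copies by keeping every element seen on either side."""
    return a | b

merged = merge(copy_1, copy_2)
same_in_any_order = merge(copy_1, copy_2) == merge(copy_2, copy_1)
same_when_repeated = merge(merged, copy_1) == merged
```

Here `merged` is {1, 2, 3, 8, 11, 12, 13, 14}, and the result does not depend on the order or number of merges.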

Page 8:

Many kinds of consistencies

• “C” of CAP is strong consistency.

• It is not the only available tradeoff of course.

• Consistency is the set of liveness and safety properties a given system provides.

• Eventual consistency: on its own, it says almost nothing. Which liveness/safety properties does the system provide, if not “C”?

Page 9:

Redis Cluster

Sharding and replication (asynchronous).

[Diagram: Client; three nodes holding keys A,B,C and three nodes holding keys D,E,F]

Page 10:

Asynchronous replication

[Diagram: Client and nodes holding A,B,C; label: async ACK]

Page 11:

Full Mesh

[Diagram: four nodes, two holding A,B,C and two holding D,E,F, connected as a full mesh]

• Heartbeats.

• Nodes gossip.

• Failover auth.

• Config update.

Page 12:

No proxy, but redirections

[Diagram: six nodes holding A,B,C / D,E,F / G,H,I / L,M,N / O,P,Q / R,S,T; two clients, one asking “A?” and one asking “D?”]
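A toy model of the redirection idea: the client asks any node, and a node that does not own the key answers with the owner instead of proxying the request. The classes and the ("MOVED", node) tuple are simplifications for this sketch; the real cluster redirects by hash slot and address, not by key name.

```python
class Node:
    def __init__(self, name, keys):
        self.name = name
        self.data = {k: f"value-of-{k}" for k in keys}
        self.owner_of = {}   # filled in below: key -> name of the owning node

    def get(self, key):
        if key in self.data:
            return ("OK", self.data[key])
        return ("MOVED", self.owner_of[key])   # tell the client where to go

nodes = {
    "n1": Node("n1", "ABC"),
    "n2": Node("n2", "DEF"),
    "n3": Node("n3", "GHI"),
}
owner_of = {k: n.name for n in nodes.values() for k in n.data}
for n in nodes.values():
    n.owner_of = owner_of

def client_get(key, first_node="n1", max_redirects=3):
    """Ask a node; follow the redirection if it does not own the key."""
    target, hops = first_node, []
    for _ in range(max_redirects + 1):
        hops.append(target)
        status, payload = nodes[target].get(key)
        if status == "OK":
            return payload, hops
        target = payload   # redirected: retry against the indicated node
    raise RuntimeError("too many redirections")

value_a, hops_a = client_get("A")   # n1 owns A: answered directly
value_d, hops_d = client_get("D")   # n1 redirects the client to n2
```

A client that remembers the owner after the first redirection can go straight to the right node next time.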

Page 13:

Failure detection

• Failure reports within window of time (via gossip).

• Trigger for actual failover.

• Two main states: PFAIL -> FAIL.
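The PFAIL -> FAIL promotion can be sketched as one node's bookkeeping of failure reports. Assumptions in this sketch: a report counts only if it is recent enough, only masters are counted, and FAIL requires a majority of masters. The class name and the window value are made up.

```python
class FailureDetector:
    """One node's view of failure reports about its peers."""

    def __init__(self, masters, window):
        self.masters = set(masters)
        self.window = window      # how long a report stays valid (seconds)
        self.reports = {}         # suspect -> {reporter: time of report}
        self.state = {}           # suspect -> "PFAIL" or "FAIL"

    def report(self, reporter, suspect, now):
        self.reports.setdefault(suspect, {})[reporter] = now
        if self.state.get(suspect) == "FAIL":
            return                # FAIL is final in this sketch
        fresh = [r for r, t in self.reports[suspect].items()
                 if r in self.masters and now - t <= self.window]
        majority = len(self.masters) // 2 + 1
        self.state[suspect] = "FAIL" if len(fresh) >= majority else "PFAIL"

masters = ["S1", "S2", "S3", "S4"]

quick = FailureDetector(masters, window=10)
quick.report("S2", "S1", now=0)
quick.report("S4", "S1", now=2)
state_after_two = quick.state["S1"]      # two reports: still PFAIL
quick.report("S3", "S1", now=3)
state_after_three = quick.state["S1"]    # three fresh reports: FAIL

slow = FailureDetector(masters, window=10)
slow.report("S2", "S1", now=0)
slow.report("S4", "S1", now=2)
slow.report("S3", "S1", now=20)          # the first two reports have expired
state_slow = slow.state["S1"]            # only one fresh report: PFAIL
```

The time window keeps old, unrelated reports from adding up to a failover.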

Page 14:

Failure detection

S1 is not responding?

[Diagram: nodes S1, S2, S3, S4; each of the three other nodes marks S1 = PFAIL]

Page 15:

Failure detection

PFAIL state propagates

[Diagram: nodes S1, S2, S3, S4; the other nodes hold S1 = PFAIL, and one of them also records “Reported by: S2, S4”]

Page 16:

Failure detection

PFAIL state propagates

[Diagram: nodes S1, S2, S3, S4; one node now holds S1 = FAIL, the other two still hold S1 = PFAIL]

Page 17:

Failure detection

Force FAIL state

[Diagram: nodes S1, S2, S3, S4; all three other nodes now hold S1 = FAIL]

Page 18:

Global slots config

• A master FAIL state triggers a failover.

• Cluster needs a coherent view of configuration.

• Who is serving this slot currently?

• Slots config must eventually converge.

Page 19:

Raft and failover

• Config propagation is solved using ideas from the Raft algorithm (just a subset).

• Raft is a consensus algorithm built on top of different “layers”.

• Raft paper is already a classic (highly recommended).

• Full Raft not needed for Redis Cluster slots config.

Page 20:

Failover and config

[Diagram: a failed node, slaves, and three masters]

Epoch = Epoch+1 (logical clock)

Vote for me!
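A sketch of the vote request under simple assumptions: each master grants at most one vote per epoch, and a slave wins when it collects a majority of all masters, the failed one included. Class and method names are hypothetical, and election delays and retries are left out.

```python
class Master:
    def __init__(self, name):
        self.name = name
        self.last_vote_epoch = 0

    def request_vote(self, candidate, epoch):
        """Grant at most one vote per epoch."""
        if epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = epoch
        return True

class Slave:
    def __init__(self, name, epoch):
        self.name = name
        self.epoch = epoch

    def start_election(self, reachable_masters, total_masters):
        self.epoch += 1   # Epoch = Epoch+1, used as a logical clock
        votes = sum(m.request_vote(self.name, self.epoch)
                    for m in reachable_masters)
        return votes >= total_masters // 2 + 1

# Four masters in total: one has failed, three can still vote.
voters = [Master("M1"), Master("M2"), Master("M3")]
slave_a = Slave("slave-a", epoch=5)
slave_b = Slave("slave-b", epoch=5)

a_won = slave_a.start_election(voters, total_masters=4)   # 3 votes in epoch 6
b_won = slave_b.start_election(voters, total_masters=4)   # epoch 6 already used
```

Because a master votes only once per epoch, two slaves cannot both win the same epoch.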

Page 21:

Too easy?

• Why don’t we need full Raft?

• Because our config is idempotent: when the partition heals we can throw away old slots config in favor of newer versions.

• Same algorithm is used in Sentinel v2 and works well.

Page 22:

Config propagation

• After a successful failover, the new slot config is broadcast.

• If there are partitions, when they heal, config will get updated (broadcast from time to time, plus stale config detection and UPDATE messages).

• Config with greater Epoch always wins.
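The “greater Epoch wins” rule can be sketched as a per-slot merge. The data layout (slot -> (epoch, owner)) and the node names are assumptions for this example, and the sketch assumes two different configs never carry the same epoch.

```python
def merge_slot_config(local, incoming):
    """Per slot, keep the (epoch, owner) pair with the greater epoch."""
    merged = dict(local)
    for slot, (epoch, owner) in incoming.items():
        if slot not in merged or epoch > merged[slot][0]:
            merged[slot] = (epoch, owner)
    return merged

# A node that sat on the wrong side of a partition holds a stale view.
stale = {0: (3, "node-a"), 1: (3, "node-a"), 2: (4, "node-b")}
fresh = {0: (5, "node-c"), 1: (5, "node-c"), 2: (4, "node-b")}

healed = merge_slot_config(stale, fresh)       # stale node learns the new owner
unchanged = merge_slot_config(fresh, stale)    # fresh node ignores old config
repeated = merge_slot_config(healed, fresh)    # applying it again changes nothing
```

Receiving the same config twice, or in a different order, gives the same result, which is the idempotence the previous slide relies on.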

Page 23:

Redis Cluster consistency?

• Eventually consistent: last failover wins.

• In the “vanilla” design, the window for losing writes is unbounded.

• Mechanisms to avoid unbounded data loss.

Page 24:

Failure mode #1

[Diagram: Client and nodes holding A,B,C; one node has failed; label: lost write…]
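A minimal sketch of how a write gets lost with asynchronous replication: the master replies before the slave has the data, then fails. `AsyncMaster` and its backlog are hypothetical stand-ins for the replication stream.

```python
class AsyncMaster:
    def __init__(self):
        self.data = {}
        self.backlog = []   # writes acknowledged but not yet sent to the slave

    def write(self, key, value):
        self.data[key] = value
        self.backlog.append((key, value))
        return "OK"         # reply immediately, before the slave has the write

    def flush_to(self, slave):
        for key, value in self.backlog:
            slave[key] = value
        self.backlog.clear()

master, slave = AsyncMaster(), {}
master.write("A", 1)
master.flush_to(slave)          # replication catches up
reply = master.write("B", 2)    # acknowledged to the client...
# ...but the master fails here, before flush_to runs again.
promoted = dict(slave)          # the slave is promoted with what it has
```

The client received “OK” for B, yet the promoted node only has A.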

Page 25:

Failure mode #2

[Diagram: Client and nodes holding A,B,C / A,B,C / D,E,F / G,H,I, split by a partition into a minority side and a majority side]

Page 26:

Bound divergences

[Diagram: Client and nodes holding A,B,C / D,E,F / G,H,I, split into a minority side and a majority side; label: After node-timeout]
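One way to read this slide: a master that has not seen the majority for longer than node-timeout stops accepting writes, so the divergence window is bounded. The sketch below models only that rule; the class, the timestamps, and the "REJECTED" reply are placeholders, not actual server behaviour or error text.

```python
class PartitionAwareMaster:
    def __init__(self, node_timeout):
        self.node_timeout = node_timeout
        self.last_majority_contact = 0.0
        self.data = {}

    def saw_majority(self, now):
        """Record the last time the majority of masters was reachable."""
        self.last_majority_contact = now

    def write(self, key, value, now):
        if now - self.last_majority_contact > self.node_timeout:
            return "REJECTED"   # placeholder reply: stop diverging further
        self.data[key] = value
        return "OK"

m = PartitionAwareMaster(node_timeout=15)
m.saw_majority(now=100)                  # partition starts right after this
during = m.write("A", 1, now=105)        # inside the window: still accepted
after = m.write("A", 2, now=116)         # past node-timeout: refused
```

Writes accepted on the minority side can still be lost, but only those made within the timeout window.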

Page 27:

More data safety?

• OP logging until async ACK received.

• Re-played to master when node turns into slave.

• “Safe” connections, on demand.

• Example: SADD (idempotent + commutative).

• SET-LWW foo bar <wall-clock>.
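The two ideas above can be sketched together: replaying logged SADD operations is safe because order and repetition do not matter, and a last-write-wins SET keeps the value with the greatest timestamp. Both functions are hypothetical models of the slide's proposal, not existing commands.

```python
def apply_sadd(state, ops):
    """Apply SADD operations, given as (key, member) pairs, to a copy of state."""
    result = {k: set(v) for k, v in state.items()}
    for key, member in ops:
        result.setdefault(key, set()).add(member)
    return result

def set_lww(store, key, value, wall_clock):
    """Keep the value with the greatest timestamp (last write wins)."""
    current = store.get(key)
    if current is None or wall_clock > current[1]:
        store[key] = (value, wall_clock)

new_master = {"followers": {"alice"}}
op_log = [("followers", "bob"), ("followers", "carol")]   # not yet ACKed

once = apply_sadd(new_master, op_log)
twice = apply_sadd(once, op_log)                    # replayed a second time
reordered = apply_sadd(new_master, op_log[::-1])    # replayed in another order

store = {}
set_lww(store, "foo", "bar", 1000)
set_lww(store, "foo", "old", 900)    # an older write arrives late: ignored
```

Last write wins depends on wall clocks, so clock skew between clients can pick the wrong winner.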

Page 28:

Multi key ops

• Hey hashtags!

• {user:1000}.following {user:1000}.followers.

• Unavailable for small windows, but no data exchange between nodes.
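A sketch of why hash tags keep related keys together. It follows the rule in the Redis Cluster specification (CRC16 of the key, or of the text between the first “{” and the next “}” if non-empty, modulo 16384); those details come from the specification, not from the slide, and the function names are made up.

```python
import binascii

def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that gets hashed."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # ignore an empty "{}"
            return key[start + 1:end]
    return key

def key_slot(key: bytes) -> int:
    # binascii.crc_hqx with initial value 0 computes CRC16-CCITT (XMODEM)
    return binascii.crc_hqx(hash_tag(key), 0) % 16384

slot_following = key_slot(b"{user:1000}.following")
slot_followers = key_slot(b"{user:1000}.followers")
slot_tag_only = key_slot(b"user:1000")
```

All three slots are equal, so both keys live on the same node and a multi-key operation needs no data exchange between nodes.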

Page 29:

Multi key ops (availability)

• Single key ops: always available during resharding.

• Multi key ops, available if:

• No manual resharding of this hash slot in progress.

• Resharding in progress, but the source or destination node has all the keys.

• Otherwise we get a -TRYAGAIN error.
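The availability rules above, written as a decision function for one node. This follows the slide's simplified rules only; the function and its return values are illustrative, and the real server has further cases.

```python
def multi_key_reply(keys, local_keys, slot_is_resharding):
    """Decide how a node answers a multi-key command on one hash slot."""
    if not slot_is_resharding:
        return "EXECUTE"
    if all(k in local_keys for k in keys):
        return "EXECUTE"      # this node holds every key involved
    return "-TRYAGAIN"        # keys are split between source and destination

keys = ["{User:1}.key_A", "{User:1}.Key_B"]

stable = multi_key_reply(keys, set(keys), slot_is_resharding=False)
split = multi_key_reply(keys, {"{User:1}.key_A"}, slot_is_resharding=True)
moved = multi_key_reply(keys, set(keys), slot_is_resharding=True)
```

A client that gets -TRYAGAIN can simply retry after a short pause, since the keys finish moving within a small window.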

Page 30:

{User:1}.key_A {User:2}.Key_B

SUNION key_A key_B -> -TRYAGAIN

{User:1}.key_A {User:1}.Key_B

SUNION key_A key_B -> … output …

{User:1}.key_A {User:1}.Key_B

SUNION key_A key_B -> … output …

Page 31:

Redis Cluster ETA

• Release Candidate available.

• We’ll go stable in Q1 2015.

• Ask me anything.