C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

59
#Cassandra13 Rethinking Topology in Cassandra Cassandra Summit June 11, 2013 Eric Evans [email protected] @jericevans

description

A discussion of the recent work to transition Cassandra from its naive 1-partition-per-node distribution, to a proper virtual nodes implementation.

Transcript of C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

Page 1: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Rethinking Topology in Cassandra

Cassandra SummitJune 11, 2013

Eric [email protected]

@jericevans

Page 2: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101

Page 3: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101partitioning

AZ

Page 4: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101partitioning

AZ

BY

C

Page 5: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101partitioning

AZ

BY

C

Key = Aaa

Page 6: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101replica placement

AZ

BY

C

Key = Aaa

Page 7: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101consistency

Consistency

Availability

Partition tolerance

Page 8: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101scenario: consistency level = one

A

?

?

W

Page 9: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101scenario: consistency level = all

A

?

?

R

Page 10: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101scenario: quorum write

A

B

?

W

Page 11: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

DHT 101scenario: quorum read

A

B

?R

Page 12: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Awesome, yes?

Page 13: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Well...

Page 14: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Problem:Poor request/stream distribution

Page 15: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

AZ

BY

C

M

Page 16: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

AZ

BY

C

M

Page 17: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

AZ

BY

C

M

Page 18: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

AZ

BY

C

M

Page 19: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

Z A

BY

C

M

Page 20: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BY

C

M

A1Z

Page 21: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BY

C

M

A1Z

Page 22: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BY

C

M

A1Z

Page 23: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Problem:Poor data distribution

Page 24: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BD

C

Page 25: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BD

C

E

Page 26: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

E

A

D B

C

Page 27: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

E

A

D B

C

Page 28: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BD

C

H E

FG

Page 29: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Distribution

A

BD

C

H E

FG

Page 30: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Virtual Nodes

Page 31: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

In a nutshell...

Page 32: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Benefits

● Operationally simpler (no token management)

● Better distribution of load

● Concurrent streaming (all hosts)

● Smaller partitions mean greater reliability

● Better supports heterogeneous hardware

Page 33: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Strategies

● Automatic sharding

● Fixed partition assignment

● Random token assignment

Page 34: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Strategyautomatic sharding

● Partitions are split when data exceeds a threshold

● Newly created partitions are relocated to a host with less data

● Similar to Bigtable, or Mongo auto-sharding

Page 35: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Strategyfixed partition assignment

● Namespace divided into Q evenly-sized partitions

● Q/N partitions assigned per host (where N is number of hosts)

● Joining hosts “steal” partitions evenly from existing hosts

● Used by Dynamo/Voldemort (“strategy 3” in Dynamo paper)

Page 36: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Strategyrandom token assignment

● Each host assigned T random tokens

● T random tokens generated for joining hosts; New tokens divide

existing ranges

● Similar to libketama; Identical to Classic Cassandra when T=1

Page 37: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Considerations

1.Number of partitions

2.Partition size

3.How 1 changes with more nodes and data

4.How 2 changes with more nodes and data

Page 38: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Evaluating

Strategy No. Partitions Partition size

Random O(N) O(B/N)

Fixed O(1) O(B)

Auto-sharding O(B) O(1)

Page 39: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Evaluating

Automatic sharding● Partition size is constant (great)

● Number of partitions scales linearly with data size (bad)

Page 40: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Evaluating

Fixed partition assignment● Number of partitions is constant (good)

● Partition size scales linearly with data size (bad)

● Greater operational complexity (bad)

Page 41: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Evaluating

Random token assignment● Number of partitions scales linearly with number of hosts (OK)

● Partition size increases with more data; Decreases with more

hosts (good)

Page 42: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Evaluating

● Automatic sharding

● Fixed partition assignment

● Random token assignment

Page 43: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Cassandra

Page 44: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationconf/cassandra.yaml

# Comma separated list of tokens, (new# installs only).initial_token:<token>,<token>,<token>

or

# Number of tokens to generate.num_tokens: 256

Page 45: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationnodetool info

Token : (invoke with -T/--tokens to see all 256 tokens)ID : 6a8dc22c-1f37-473f-8f7e-47742f4b83a5Gossip active : trueThrift active : trueLoad : 42.92 MBGeneration No : 1370016307Uptime (seconds) : 221Heap Memory (MB) : 998.72 / 1886.00Data Center : datacenter1Rack : rack1Exceptions : 0Key Cache : size 1128 (bytes), capacity 98566144 (bytes), 42 hits, 54 re...Row Cache : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN ...

Page 46: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationnodetool ring

Datacenter: datacenter1==========Replicas: 0

Address Rack Status State Load Owns Token 3074457345618258602127.0.0.1 rack1 Up Normal 42.92 MB 33.33% -9223372036854775808127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3098476543630901247127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3122495741643543892127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3146514939656186537127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3170534137668829183127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3194553335681471828127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 321857253369411447127.0.0.1 rack1 Up Normal 42.92 MB 33.33% 3242591731706757118...

Page 47: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationnodetool status

Datacenter: datacenter1=======================Status=Up/Down|/ State=Normal/Leaving/Joining/Moving-- Address Load Tokens Owns Host ID RackUN 127.0.0.1 42.92 MB 256 33.3% 6a8dc22c-1f37-473f-8f7e-47742f4b83a5 rack1UN 127.0.0.2 60.17 MB 256 33.3% 26263a2b-768e-4a79-8d41-3624a14b13a8 rack1UN 127.0.0.3 56.85 MB 256 33.3% 5b3e208f-6d36-4c7b-b2bb-b7c476a1af66 rack1

Page 48: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationnodetool status

Datacenter: datacenter1=======================Status=Up/Down|/ State=Normal/Leaving/Joining/Moving-- Address Load Tokens Owns Host ID RackUN 127.0.0.1 42.92 MB 256 33.3% 6a8dc22c-1f37-473f-8f7e-47742f4b83a5 rack1UN 127.0.0.2 60.17 MB 256 33.3% 26263a2b-768e-4a79-8d41-3624a14b13a8 rack1UN 127.0.0.3 56.85 MB 256 33.3% 5b3e208f-6d36-4c7b-b2bb-b7c476a1af66 rack1

Page 49: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Configurationnodetool status

Datacenter: datacenter1=======================Status=Up/Down|/ State=Normal/Leaving/Joining/Moving-- Address Load Tokens Owns Host ID RackUN 127.0.0.1 42.92 MB 256 33.3% 6a8dc22c-1f37-473f-8f7e-47742f4b83a5 rack1UN 127.0.0.2 60.17 MB 256 33.3% 26263a2b-768e-4a79-8d41-3624a14b13a8 rack1UN 127.0.0.3 56.85 MB 256 33.3% 5b3e208f-6d36-4c7b-b2bb-b7c476a1af66 rack1

Page 50: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Migration

A

BD

Page 51: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Migrationedit conf/cassandra.yaml and restart

# Number of tokens to generate.num_tokens: 256

Page 52: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Migrationconvert to T contiguous tokens in existing ranges

B

AAAAAA

AA

A

A A A

AAAA

C

A

B

Page 53: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Migrationshuffle

B

AAAAAA

AA

A

A A A

AAAA

C

A

B

Page 54: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Shuffle

● Range transfers are queued on each host

● Hosts initiate transfer to self

● Pay attention to the logs!

Page 55: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

ShuffleUsage: shuffle [options] <sub-command>

Sub-commands: create Initialize a new shuffle operation ls List pending relocations clear Clear pending relocations en[able] Enable shuffling dis[able] Disable shuffling

Options: -dc, --only-dc Apply only to named DC (create only) -u, --username JMX username -tp, --thrift-port Thrift port number (Default: 9160) -p, --port JMX port number (Default: 7199) -tf, --thrift-framed Enable framed transport for Thrift (Default: false) -en, --and-enable Immediately enable shuffling (create only) -pw, --password JMX password -H, --help Print help information -h, --host JMX hostname or IP address (Default: localhost) -th, --thrift-host Thrift hostname or IP address (Default: JMX host)

Page 56: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

Performance

Page 57: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

removenode

Cassandra 1.2 Cassandra 1.10

50

100

150

200

250

300

350

400

450

Page 58: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

bootstrap

Cassandra 1.2 Cassandra 1.10

100

200

300

400

500

600

Page 59: C* Summit 2013: Virtual Nodes: Rethinking Topology in Cassandra by Eric Evans

#Cassandra13

The End

● Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels “Dynamo: Amazon’s Highly Available Key-value Store” Web.

● Low, Richard. “Improving Cassandra's uptime with virtual nodes” Web.

● Overton, Sam. “Virtual Nodes Strategies.” Web.

● Overton, Sam. “Virtual Nodes: Performance Results.” Web.

● Jones, Richard. "libketama - a consistent hashing algo for memcache clients” Web.