Jonathan Ellis, "Apache Cassandra 2.0 and 2.1". Talk at Cassandra conf 2013

Post on 14-Dec-2014


©2013 DataStax Confidential. Do not distribute without consent.

Modern Apache Cassandra

Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra

Five years of Cassandra

[Timeline figure, Jul-08 to Jul-13 (ticks: Jul-09, May-10, Feb-11, Dec-11, Oct-12, Jul-13): releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2, ..., 2.0, plus DSE]

Application/Use Case
• Social Signals: like/want/own features for eBay product and item pages
• Hunch taste graph for eBay users and items
• Many time series use cases

Why Cassandra?
• Multi-datacenter
• Scalable
• Write performance
• Distributed counters
• Hadoop support

ACE

Time series data

Multi-datacenter support

Distributed counters

Hadoop support

Application/Use Case
• Adobe AudienceManager: web analytics, content management, and online advertising

Why Cassandra?
• Low-latency
• Scalable
• Multi-datacenter
• Tuneable consistency

ACE

Bootstrapping


Tuneable consistency
• (We’ll come back to this)

Application/Use Case
• Logging
• Notifications

Why Cassandra?
• Efficient writes
• Durable
• Scalable
• High availability

ACE

Durable + efficient writes

[Animated diagram: the memtable lives in memory; the commit log and SSTables live on the hard drive]

• write(k1, c1:v): appended to the commit log as "k1 c1:v" and applied to the memtable, where row k1 holds c1:v
• write(k1, c2:v): appended to the commit log as "k1 c2:v"; memtable row k1 now holds c1:v c2:v
• write(k2, c1:v c2:v): appended to the commit log; a new memtable row k2 holds c1:v c2:v
• write(k1, c1:v c3:v): appended to the commit log; memtable row k1 now holds c1:v c2:v c3:v
• flush: the memtable is written to the hard drive as an SSTable (k1 c1:v c2:v c3:v; k2 c1:v c2:v) together with its index / BF (Bloom filter)
• cleanup: the commit log entries covered by the flush are no longer needed
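The write path on these slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the class name `TinyStore`, the JSON file formats, and the per-write `fsync` are all invented for illustration (Cassandra's own commit log sync policy is configurable), and the index and Bloom filter are left out.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy commit log + memtable + SSTable flush (hypothetical names)."""

    def __init__(self, directory):
        self.directory = directory
        self.commit_log_path = os.path.join(directory, "commitlog.txt")
        self.memtable = {}   # row key -> {column name: value}
        self.sstables = []   # paths of flushed files, never modified again

    def write(self, key, columns):
        # 1. Durability: sequential append to the commit log on disk.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, columns]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Speed: merge the new columns into the in-memory row.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable, sorted by row key, as a new immutable file.
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as sstable:
            json.dump(dict(sorted(self.memtable.items())), sstable)
        self.sstables.append(path)
        self.memtable = {}
        # Cleanup: the flushed writes no longer need the commit log.
        open(self.commit_log_path, "w", encoding="utf-8").close()
        return path


# The four writes from the slides, followed by a flush.
store = TinyStore(tempfile.mkdtemp())
store.write("k1", {"c1": "v"})
store.write("k1", {"c2": "v"})
store.write("k2", {"c1": "v", "c2": "v"})
store.write("k1", {"c1": "v", "c3": "v"})
sstable_path = store.flush()
```

Every write costs one sequential append plus an in-memory update, which is the reason the slides pair "durable" with "efficient".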

High availability
• 99.9999% availability on Cassandra
• (We’ll come back to this, too)

Core values
• Massive scalability
• High performance
• Ease of use
• Reliability/Availability

[Chart: VLDB benchmark (RWS). Throughput (ops/sec, axis 0 to 80,000) vs. number of nodes (axis 0 to 12) for Cassandra, HBase, Redis, and MySQL]

[Chart: Endpoint benchmark (RW). Throughput (ops/sec, axis 0 to 35,000) vs. number of nodes (1, 2, 4, 8, 16, 32) for Cassandra, HBase, and MongoDB]

Ease of use

    CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        state text,
        birth_date int
    );

    CREATE INDEX ON users(state);

    SELECT * FROM users
    WHERE state='Texas' AND birth_date > 1950;

Classic partitioning (SPOF)

[Diagram: client → router → partition 1, partition 2, partition 3, partition 4]

(Not a theoretical problem)

https://speakerdeck.com/mitsuhiko/a-year-of-mongodb

http://aphyr.com/posts/288-the-network-is-reliable

Fully distributed, no SPOF

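[Diagram: a client talking directly to a ring of nodes, each holding several partitions (labels p1, p3, p6 visible); there is no router]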

Primary key determines placement*

Partitioning

 Primary key | Columns
-------------+----------------------------------
 jim         | age: 36  car: camaro  gender: M
 carol       | age: 37  car: subaru  gender: F
 johnny      | age: 12  gender: M
 suzy        | age: 10  gender: F

 PK     | Murmur hash
--------+---------------
 jim    | 5e02739678...
 carol  | a9a0198010...
 johnny | f4eb27cea7...
 suzy   | 78b421309e...

Murmur* hash operation yields a 64-bit number for keys of any size.

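The “token ring”

[Diagram: four nodes, A, B, C, and D, arranged in a ring]

 Key    | Murmur hash
--------+---------------
 jim    | 5e02739678...
 carol  | a9a0198010...
 johnny | f4eb27cea7...
 suzy   | 78b421309e...

 Node | Start           | End
------+-----------------+-----------------
 A    | 0xc000000000..1 | 0x0000000000..0
 B    | 0x0000000000..1 | 0x4000000000..0
 C    | 0x4000000000..1 | 0x8000000000..0
 D    | 0x8000000000..1 | 0xc000000000..0

Comparing each hash with the ranges places jim and suzy on node C, carol on node D, and johnny on node A.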
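The lookup from key to node can be sketched as a binary search over the range ends. The ring below copies the four ranges from the slide (unsigned 64-bit tokens); the function names are hypothetical, and `token_for` uses a truncated MD5 from the standard library purely as a stand-in for Murmur, so its tokens will not match the hashes shown on the slide.

```python
import bisect
import hashlib

# (inclusive end of range, node), sorted by token, as on the slide.
# Node A owns the wrap-around range that ends at 0x0000...0.
RING = [
    (0x0000000000000000, "A"),
    (0x4000000000000000, "B"),
    (0x8000000000000000, "C"),
    (0xC000000000000000, "D"),
]
_RANGE_ENDS = [end for end, _ in RING]


def token_for(key: str) -> int:
    """Map a key of any size to a 64-bit token (stand-in hash, not Murmur)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of_token(token: int) -> str:
    """Return the node whose range (start, end] contains the token."""
    index = bisect.bisect_left(_RANGE_ENDS, token)
    # Tokens above the last range end wrap around to the first node.
    return RING[index % len(RING)][1]


def owner_of_key(key: str) -> str:
    return owner_of_token(token_for(key))
```

Any client or node that knows the ring can run this calculation locally, which is why no router is needed.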


[Diagram: token ring of nodes A, B, C, and D, with carol (a9a0198010...) placed on it]

Replication

[Animated diagram: the same ring, with carol (a9a0198010...) copied to additional nodes]
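One simple placement rule, the one Cassandra's SimpleStrategy follows, puts the first replica on the node that owns the token and the remaining replicas on the next nodes around the ring. A minimal sketch, reusing the four ranges from the token ring slide; the function name and the sample tokens in the usage note are made up for illustration.

```python
import bisect

# (inclusive end of range, node), sorted by token, as on the token ring slide.
RING = [
    (0x0000000000000000, "A"),
    (0x4000000000000000, "B"),
    (0x8000000000000000, "C"),
    (0xC000000000000000, "D"),
]


def replicas_for(token: int, replication_factor: int) -> list:
    """Owner of the token first, then the following nodes on the ring."""
    ends = [end for end, _ in RING]
    start = bisect.bisect_left(ends, token) % len(RING)
    count = min(replication_factor, len(RING))  # cannot exceed the node count
    return [RING[(start + step) % len(RING)][1] for step in range(count)]
```

For example, `replicas_for(0xA000000000000000, 3)` returns `["D", "A", "B"]`: the token falls in D's range, and the walk wraps past the end of the ring.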

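Virtual nodes

[Diagram: token ring without vnodes (one range each for Node A, Node B, Node C, Node D) next to a ring with vnodes (many small ranges per node: A, A’, A’’, B, B’, C, C’, C’’, D, D’)]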
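The difference between the two rings can be sketched by giving each node several tokens instead of one. This is an illustrative model only: the function names are hypothetical, tokens are drawn from a seeded pseudo-random generator, and a real cluster without vnodes would normally use evenly spaced tokens rather than random ones.

```python
import bisect
import random


def build_ring(nodes, tokens_per_node, seed=0):
    """Give every node `tokens_per_node` random 64-bit tokens."""
    rng = random.Random(seed)
    ring = [
        (rng.getrandbits(64), node)
        for node in nodes
        for _ in range(tokens_per_node)
    ]
    ring.sort()
    return ring


def owner(ring, token):
    """Node owning the range (previous token, this token] that holds `token`."""
    ends = [end for end, _ in ring]
    index = bisect.bisect_left(ends, token) % len(ring)
    return ring[index][1]


without_vnodes = build_ring(["A", "B", "C", "D"], tokens_per_node=1)
with_vnodes = build_ring(["A", "B", "C", "D"], tokens_per_node=8)
```

With many small ranges per node, a node that joins or leaves exchanges data with many peers at once instead of with one or two ring neighbours.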

A closer look at reads

[Animated diagram: a client sends a read to the coordinator, which chooses among three replicas that are 40%, 90%, and 30% busy]

Rapid read protection

[Animated diagram: client, coordinator, and three replicas that are 40%, 90%, and 30% busy. The 30%-busy replica fails (X) while serving the read, and the coordinator sends the request to another replica.]

[Chart: Rapid Read Protection compared with NONE]
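The idea behind rapid read protection can be sketched as a small, deterministic simulation: if the first replica has not answered within a threshold, the coordinator sends the same read to a second replica and uses whichever answer arrives first. Everything here is an assumption made for illustration: the replica names, the latencies, the threshold, and ordering replicas by a "busy" percentage are invented, and the sketch does not reproduce Cassandra's own replica selection.

```python
def read_with_speculation(latencies_ms, replicas, speculate_after_ms):
    """Return (replica that answered, elapsed ms) for one read.

    latencies_ms maps replica name -> response time in ms, or None if the
    replica never answers. `replicas` is ordered by preference.
    """
    first, backup = replicas[0], replicas[1]
    first_latency = latencies_ms[first]
    if first_latency is not None and first_latency <= speculate_after_ms:
        return first, first_latency  # fast path: no extra request needed

    # Threshold passed: a second request goes out at speculate_after_ms.
    answers = []
    if first_latency is not None:
        answers.append((first_latency, first))
    backup_latency = latencies_ms[backup]
    if backup_latency is not None:
        answers.append((speculate_after_ms + backup_latency, backup))
    if not answers:
        raise TimeoutError("no replica answered")
    elapsed, replica = min(answers)
    return replica, elapsed


# Replicas from the slides, least busy first (names are hypothetical).
busy = {"r40": 40, "r90": 90, "r30": 30}
preferred = sorted(busy, key=busy.get)
```

With `latencies_ms = {"r30": None, "r40": 5, "r90": 20}` and a 10 ms threshold, the read is answered by `r40` after 15 ms instead of failing with a timeout.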

Consistency levels

[Animated diagram: client, coordinator, and three replicas that are 40%, 90%, and 30% busy]

• ONE
• QUORUM
• LOCAL_QUORUM
• LOCAL_ONE
• TWO
• ALL
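Each level boils down to how many replicas must respond before the coordinator answers the client. A minimal sketch of that arithmetic follows; the function names and the datacenter names in the usage note are hypothetical. A quorum is a majority, floor(RF / 2) + 1, counted over all datacenters for QUORUM and over the coordinator's datacenter for LOCAL_QUORUM.

```python
def required_acks(level, rf_by_dc, local_dc):
    """Replica responses needed for one request at the given level.

    rf_by_dc maps datacenter name -> replication factor in that datacenter;
    local_dc is the coordinator's datacenter.
    """
    total_rf = sum(rf_by_dc.values())
    local_rf = rf_by_dc[local_dc]
    needed = {
        "ONE": 1,
        "LOCAL_ONE": 1,             # like ONE, restricted to the local DC
        "TWO": 2,
        "QUORUM": total_rf // 2 + 1,
        "LOCAL_QUORUM": local_rf // 2 + 1,
        "ALL": total_rf,
    }
    return needed[level]


def read_sees_latest_write(read_acks, write_acks, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor
```

With `rf_by_dc = {"dc1": 3, "dc2": 3}`, QUORUM needs 4 responses, LOCAL_QUORUM needs 2, and ALL needs 6. Writing and reading at QUORUM with RF = 3 gives 2 + 2 > 3, so a read always reaches at least one replica that holds the latest write.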

#CASSANDRAEU

Race condition

Two clients try to register the same username at almost the same time. Each first checks whether the name is taken, and each sees no rows:

    SELECT name
    FROM users
    WHERE username = 'pmcfadin';

    (0 rows)

Both then insert. Client 1:

    INSERT INTO users (username, name, email, password, created_date)
    VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00');

Client 2 (this one wins):

    INSERT INTO users (username, name, email, password, created_date)
    VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01');

Lightweight transactions

The first insert is applied:

    INSERT INTO users (username, name, email, password, created_date)
    VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00')
    IF NOT EXISTS;

     [applied]
    -----------
          True

The second insert is rejected, and the existing row is returned:

    INSERT INTO users (username, name, email, password, created_date)
    VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01')
    IF NOT EXISTS;

     [applied] | username | created_date   | name
    -----------+----------+----------------+-----------------
         False | pmcfadin | 2011-06-20 ... | Patrick McFadin
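The contract of IF NOT EXISTS is a compare-and-set: the existence check and the write happen as one atomic step, and the caller learns whether the write was applied. The sketch below shows only that contract, using an in-process lock on a single machine; the class and method names and the row values are hypothetical, and the lock is a stand-in for the coordination Cassandra performs across replicas.

```python
import threading


class Users:
    """Toy single-process table with an atomic insert-if-absent."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, username, row):
        """Return (applied, existing_row), mirroring the [applied] column."""
        with self._lock:  # check and write happen as one step
            existing = self._rows.get(username)
            if existing is not None:
                return False, existing
            self._rows[username] = dict(row)
            return True, None


users = Users()
first = users.insert_if_not_exists(
    "pmcfadin", {"name": "Patrick McFadin", "created_date": "2011-06-20 13:50:00"}
)
second = users.insert_if_not_exists(
    "pmcfadin", {"name": "Patrick McFadin", "created_date": "2011-06-20 13:50:01"}
)
```

Here `first` is `(True, None)`, and `second` is `(False, ...)` carrying the row written by the first call, so the later writer can no longer silently overwrite the earlier one.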

Paxos
• All operations are quorum-based
• Each replica sends information about unfinished operations to the leader during prepare
• Paxos Made Simple

Details
• 4 round trips vs 1 for normal updates
• Paxos state is durable
• Immediate consistency with no leader election or failover
• ConsistencyLevel.SERIAL
• http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

Cassandra 2.1

User defined types

    CREATE TYPE address (
        street text,
        city text,
        zip_code int,
        phones set<text>
    );

    CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        addresses map<text, address>
    );

    SELECT id, name, addresses.city, addresses.phones FROM users;

     id       | name    | addresses.city | addresses.phones
    ----------+---------+----------------+--------------------------
     63bf691f | jbellis | Austin         | {'512-4567', '512-9999'}

Collection indexing

    CREATE TABLE songs (
        id uuid PRIMARY KEY,
        artist text,
        album text,
        title text,
        data blob,
        tags set<text>
    );

    CREATE INDEX song_tags_idx ON songs(tags);

    SELECT * FROM songs WHERE 'blues' IN tags;

     id       | album         | artist            | tags                  | title
    ----------+---------------+-------------------+-----------------------+------------------
     5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind

More-efficient repair


2.1 roadmap
• Efficient handling of cold data
• Counters 2.0
• Only repair new-since-last-repair data
• January/February 2014

Questions?