C* Summit EU 2013: Being Closer to Cassandra at Ok.ru

Oleg Anastasyev, Lead Platform Developer, Odnoklassniki.ru: Being Closer to Cassandra

Description

Speaker: Oleg Anastasyev, Lead Platform Developer at Odnoklassniki.ru. Video: http://www.youtube.com/watch?v=1ThzjoWJsaQ&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e. Odnoklassniki uses Cassandra for business data that doesn't fit into RAM. This data is typically fast-growing, frequently accessed by our users, and must always be available, because it constitutes our primary business as a social network. The way we use Cassandra is somewhat unusual: we don't use Thrift or the Netty-based native protocol to communicate with Cassandra nodes remotely. Instead, we co-locate Cassandra nodes in the same JVM as the business service logic, exposing not generic data manipulation but a business-level interface remotely. This way, we avoid extra network roundtrips within a single business transaction and use internal calls to Cassandra classes to get information faster. It also lets us make many small hacks on Cassandra's internals, yielding huge gains in efficiency and ease of distributed server development.

Transcript of C* Summit EU 2013: Being Closer to Cassandra at Ok.ru

Page 1: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru

Oleg Anastasyev, lead platform developer, Odnoklassniki.ru

Being Closer to Cassandra

Page 2: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Top 10 of the world's social networks
40M DAU, 80M MAU, 7M peak

~ 300 000 www req/sec, 20 ms render latency

>240 Gbit out

> 5 800 iron servers in 5 DCs
99.9% Java

* Odnoklassniki means “classmates” in English

Page 3: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Cassandra @ Odnoklassniki

* Since 2010
- branched 0.6
- aiming at: full operation on DC failure, scalability, ease of operations

* Now
- 23 clusters
- 418 nodes in total
- 240 TB of stored data
- survived several DC failures

Page 4: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #1. The fast

Page 5: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


[Like! widget screenshot: “Like! 103 927” / “You and 103 927”]

Page 6: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Like! widget

* It's everywhere
- On every page, sometimes dozens per page
- On feeds (AKA timeline)
- On 3rd-party websites elsewhere on the internet

* It's on everything
- Pictures and albums
- Videos
- Posts and comments
- 3rd-party shared URLs

Page 7: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Like! widget

* High load
- 1 000 000 reads/sec, 3 000 writes/sec

* Hard load profile
- Read-mostly
- Long tail (40% of reads are random)
- Sensitive to latency variations
- 3 TB total dataset (9 TB with RF) and growing
- ~60 billion likes for ~6 billion entities

Page 8: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Classic solution: an SQL table

RefId:long | RefType:byte | UserId:long | Created
9999999999 | PICTURE(2)   | 11111111111 | 11:00

so, to render the "You and 4256" widget:

SELECT TOP 1 WHERE RefId,RefType,UserId=?,?,?   (98% are NONE)

SELECT COUNT(*) WHERE RefId,RefType=?,?   (80% are 0)

SELECT TOP N * WHERE RefId,RefType=?,? AND IsFriend(?,UserId)

[slide annotations on per-render costs: = N >= 1, = M > N, = N*140]

Page 9: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Cassandra solution

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId), userId )
)

LikeCount (
  refType byte,
  refId bigint,
  likers counter,
  PRIMARY KEY ( (refType, refId) )
)

so, to render the "You and 4256" widget (= N*20%):

SELECT likers FROM LikeCount WHERE refType,refId=?,?   (80% are 0)

SELECT * FROM LikeByRef WHERE refType,refId,userId=?,?,?   (98% are NONE)
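A hedged Java sketch of the render path these two queries imply; the likeCount / likeByRef client objects and the LikeSummary type are illustrative assumptions, not from the talk:

    // Counter first (80% are 0, so we can stop early), then the point
    // lookup answering "did this user like it?" (98% are NONE).
    LikeSummary renderLikes(byte refType, long refId, long userId) {
        long likers = likeCount.get(refType, refId);        // reads LikeCount
        if (likers == 0) {
            return LikeSummary.none();                      // nothing to render
        }
        boolean likedByMe = likeByRef.exists(refType, refId, userId); // reads LikeByRef
        return new LikeSummary(likers, likedByMe);
    }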

Page 10: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


* Quick workaround?

SELECT TOP N * WHERE refType,refId=?,? AND IsFriend(?,userId)   (> 11M iops)

LikeByRef (
  refType byte,
  refId bigint,
  userId bigint,
  PRIMARY KEY ( (refType, refId, userId) )
)

- Forces an order-preserving partitioner (random partitioner doesn't scale for this)
- Key range scans
- More network overhead
- Partition count > 10x, dataset size > 2x

Page 11: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


By-column bloom filter

* What it does
- Includes pairs of (PartKey, ColumnKey) in SSTable *-Filter.db

* The good
- Eliminates 98% of reads
- Fewer false positives

* The bad
- Filters become too large: GC promotion failures... but fixable (CASSANDRA-2466)
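As an illustration, a minimal sketch of the same idea using Guava's BloomFilter; this is not Odnoklassniki's patched Cassandra code, and the sizing numbers are assumptions:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class ByColumnFilter {
        // Filter keyed on (partitionKey, columnKey) pairs, so a negative
        // answer lets a read skip the SSTable even when the partition exists.
        private final BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                60_000_000,  // expected number of (partKey, colKey) pairs (assumed)
                0.01);       // target false-positive rate (assumed)

        private static String key(String partKey, String colKey) {
            return partKey + '\0' + colKey;  // composite key for the pair
        }

        public void onAppend(String partKey, String colKey) {
            filter.put(key(partKey, colKey));
        }

        // false = definitely absent: the read can skip this SSTable
        public boolean mightContain(String partKey, String colKey) {
            return filter.mightContain(key(partKey, colKey));
        }
    }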

Page 12: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Are we there yet?

- min 2 roundtrips per render (COUNT + RR)
- Thrift is slow, especially with lots of connections
- EXISTS() traffic alone is 200 Gbit/sec (140*8*1Mps*20%)

[diagram: > 400 application servers each make two remote calls to cassandra per render: 1. COUNT(), 2. EXISTS()]

Page 13: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Co-locate!

- one-nio remoting (faster than java nio)
- topology-aware clients

[diagram: the odnoklassniki-like server exposes a remote business interface, get() : LikeSummary, and hosts cassandra, the Counters Cache and the Social Graph Cache in the same JVM]

Page 14: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


co-location wins

* Fast TOP N friend likers query (see the sketch below)
1. Take friends from the graph cache
2. Check them against an in-memory bloom filter
3. Read some until N friends are found

* Custom caches
- Tuned for the application

* Custom data merge logic
- ... so you can detect and resolve conflicts
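A minimal sketch of that 3-step query; socialGraphCache, likeFilter and likeStore are hypothetical stand-ins for the co-located components (java.util imports assumed):

    // All three steps are in-JVM calls: no network roundtrip per friend.
    List<Long> topFriendLikers(byte refType, long refId, long userId, int n) {
        List<Long> likers = new ArrayList<>();
        for (long friendId : socialGraphCache.friendsOf(userId)) {   // step 1
            if (!likeFilter.mightContain(refType, refId, friendId)) {
                continue;                                            // step 2: cheap negative
            }
            if (likeStore.exists(refType, refId, friendId)) {        // step 3: confirm
                likers.add(friendId);
                if (likers.size() >= n) break;                       // stop at N
            }
        }
        return likers;
    }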

Page 15: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Listen for mutations

// Implement it
interface StoreApplyListener {
    boolean preapply(String key, ColumnFamily data);
}

* Register it between commit log replay and gossip

* RowMutation.apply()
- extends the original mutation
- covers replica writes, hints, ReadRepairs

// and register with CFS
store = Table.open(..)
    .getColumnFamilyStore(..);
store.setListener(myListener);
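For illustration, a listener that keeps an application-level cache in sync; the lambda body and countersCache are assumptions, while setListener() is the hook from their 0.6 fork shown above, not stock Cassandra:

    // Fires for every locally applied mutation, including replica writes,
    // hints and read repairs, per the slide above.
    StoreApplyListener myListener = (key, data) -> {
        countersCache.update(key, data);  // hypothetical cache-update call
        return true;                      // true = let the mutation apply
    };
    store.setListener(myListener);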

Page 16: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Like! optimized counters

LikeCount (
  refType byte,
  refId bigint,
  ip inet,
  counter int,
  PRIMARY KEY ( (refType, refId), ip )
)

* Counters cache (a simplified sketch follows)
- Off-heap (sun.misc.Unsafe)
- Compact (30M in 1 GB RAM)
- Reads hit the local node's cache only

* Replicated cache state
- solves the cold replica cache problem
- by making (NOP) mutations

fewer reads, long-tail aware
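A much-simplified sketch of an off-heap counter table in that spirit; fixed size, linear probing, no eviction or resize, and entirely an assumption rather than their implementation:

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    public class OffHeapCounters {
        private static final Unsafe U;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                U = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private static final int SLOT = 12;  // 8-byte key + 4-byte counter
        private final long base;             // raw off-heap address
        private final int slots;

        public OffHeapCounters(int slots) {
            this.slots = slots;
            base = U.allocateMemory((long) slots * SLOT);
            U.setMemory(base, (long) slots * SLOT, (byte) 0);  // key 0 = empty
        }

        private long slotAddress(long key) {
            int i = (int) Long.remainderUnsigned(key * 0x9E3779B97F4A7C15L, slots);
            while (true) {  // linear probing; caller keeps load factor sane
                long a = base + (long) i * SLOT;
                long k = U.getLong(a);
                if (k == key || k == 0) return a;
                i = (i + 1) % slots;
            }
        }

        public void add(long key, int delta) {  // key must be non-zero
            long a = slotAddress(key);
            U.putLong(a, key);
            U.putInt(a + 8, U.getInt(a + 8) + delta);
        }

        public int get(long key) {
            long a = slotAddress(key);
            return U.getLong(a) == key ? U.getInt(a + 8) : 0;
        }

        public void free() { U.freeMemory(base); }
    }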

Page 17: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Read latency variations

* C* read behavior
1. Choose 1 node for data and N for digests
2. Wait for data and digests
3. Compare and return (or read-repair)

* Nodes suddenly slow down
- SEDA hiccup, commit log rotation, sudden IO saturation, network hiccup or partition, page cache miss

* The bad
- You get latency spikes
- You have to wait (and time out)

Page 18: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Read latency leveling

* “Parallel” read handler (sketched below)
1. Ask all replicas for data in parallel
2. Wait for CL responses and return

* The good
- Minimal-latency response
- Constant load when a DC fails

* The (not so) bad
- “Additional” work and traffic
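A sketch of the idea with plain java.util.concurrent; the real handler lives inside their fork, and here each replica read is modeled as a Callable:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Ask every replica at once; return as soon as `cl` responses arrive,
    // so one slow node no longer defines the read's latency.
    static <T> List<T> parallelRead(List<Callable<T>> replicaReads, int cl)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(replicaReads.size());
        CompletionService<T> done = new ExecutorCompletionService<>(pool);
        replicaReads.forEach(done::submit);
        List<T> responses = new ArrayList<>(cl);
        try {
            for (int taken = 0; taken < replicaReads.size() && responses.size() < cl; taken++) {
                try {
                    responses.add(done.take().get());  // first CL successes win
                } catch (ExecutionException ignored) {
                    // a failed replica just means we wait for another one
                }
            }
        } finally {
            pool.shutdownNow();  // abandon the stragglers
        }
        if (responses.size() < cl) {
            throw new IllegalStateException("consistency level not reached");
        }
        return responses;
    }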

Page 19: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


More tiny tricks

* On SSD IO
- Deadline IO elevator
- 64k -> 4k read request size

* HintLog
- Commit log for hints
- Wait for all hints on startup

* Selective compaction
- Compacts the most-read CFs more often

Page 20: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #2. The fat

Page 21: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


* Messages in chats
- Last page is accessed on open
- long tail (80%) for the rest
- 150 billion messages, 100 TB in storage
- Read-mostly (120k reads/sec, 8k writes/sec)

Page 22: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Messages have structure

Message (chatId, msgId, created, type, userIndex, deletedBy, ... text)

MessageCF (
  chatId, msgId,
  data blob,
  PRIMARY KEY ( chatId, msgId )
)

- All of a chat's messages in a single partition
- A single blob per message, to reduce overhead

- The bad: conflicting modifications can happen (users, anti-spam, etc.)

Page 23: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


LW conflict resolution

Messages (
  chatId, msgId,
  version timestamp,
  data blob,
  PRIMARY KEY ( chatId, msgId, version )
)

[diagram: two clients both get (version:ts1, data:d1), then race with write( ts1, data2 ) and write( ts1, data3 ); each write turns into delete(version:ts1) + insert(version: ts2=now(), data2) and delete(version:ts1) + insert(version: ts3=now(), data3), leaving both (ts2, data2) and (ts3, data3) stored]

- merged on read
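A hedged sketch of that write path, where batch / delete / insert are hypothetical client helpers: each writer replaces the version it read, so concurrent writers leave several versions behind for the reader to merge:

    // Replace the version we read with a fresh one; if another writer did
    // the same concurrently, both inserts survive and the next read merges.
    void updateMessage(long chatId, long msgId, long readVersion, byte[] newData) {
        long newVersion = System.currentTimeMillis();  // version: ts = now()
        // both operations hit the same partition, applied as one mutation
        batch(
            delete(chatId, msgId, readVersion),   // drop the version we read
            insert(chatId, msgId, newVersion, newData)
        );
    }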

Page 24: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Specialized cache

* Again. Because we can
- Off-heap (Unsafe)
- Caches only the freshest chat page
- Saves its state to a local (AKA system) CF: keys AND values are read back sequentially, much faster startup
- In-memory compression: 2x more memory, almost for free

Page 25: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Disk mgmt

* 4U box, 24x HDD, up to 4 TB/node
- Size-tiered compaction = one 4 TB sstable file
- RAID10? LCS?

* Split each CF into 256 pieces (routing sketched below)

* The good
- Smaller, more frequent memtable flushes
- Same compaction work, in smaller sets
- Can distribute pieces across disks
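A trivial sketch of the routing this implies; the naming scheme is an assumption:

    // Route each partition key to one of 256 sub column families, so no
    // single CF's sstables grow to 4 TB and pieces can be spread over disks.
    static String subCfName(String baseCf, long partitionKey) {
        int piece = Long.hashCode(partitionKey) & 0xFF;  // stable value in 0..255
        return baseCf + "_" + piece;
    }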

Page 26: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Disk allocation policies

* Default is
- “Take the disk with the most free space”
- ... so some disks end up with too many read iops

* Generational policy
- Each disk holds the same # of same-generation files
- works better for HDD

Page 27: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Case #3. The ugly
feed my Frankenstein

Page 28: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


* Chats overview
- small dataset (230 GB)
- has a hot set, short tail (5%)
- list reorders often
- 130k reads/s, 21k writes/s

Page 29: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Conflicting updates

* List<Overview> is a single blob
- ... or you'll have a lot of tombstones

* Lots of conflicts
- updates of a single column

* Need conflict detection
* Has a merge algorithm

Page 30: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Vector clocks

* Voldemort
- byte[] key -> byte[] value + VC
- Coordination logic on clients
- Pluggable storage engines

* Plugged in
- C* 0.6 SSTables persistence
- Fronted by a specialized cache (we love caches)
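A generic vector clock comparison sketch, showing the standard technique rather than Voldemort's exact classes, to make “conflict detection” concrete:

    import java.util.HashMap;
    import java.util.Map;

    final class VectorClock {
        private final Map<String, Long> counters = new HashMap<>();  // nodeId -> counter

        void tick(String nodeId) {                 // called by the writing node
            counters.merge(nodeId, 1L, Long::sum);
        }

        boolean dominates(VectorClock other) {     // >= on every entry
            for (Map.Entry<String, Long> e : other.counters.entrySet()) {
                if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
            }
            return true;
        }

        // Neither side dominates: a true concurrent conflict, run the merge.
        boolean concurrentWith(VectorClock other) {
            return !dominates(other) && !other.dominates(this);
        }
    }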

Page 31: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Performance

* 3-node cluster, RF = 3
- Intel Xeon CPU E5506 2.13 GHz, 48 GB RAM, 1x HDD, 1x SSD

* 8-byte key -> 1 KB value

* Results
- 75k reads/sec, 15k writes/sec

Page 32: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


Why Cassandra?

* Reusable distributed DB components
- fast persistence, gossip, reliable async messaging, failure detectors, topology, seq scans, ...

* Has structure
- beyond byte[] key -> byte[] value

* Delivered on its promises
* Implemented in Java

Page 33: C* Summit EU 2013: Being Closer to Cassandra at Ok.ru


THANK YOU

one-nio: remoting faster than java nio, with fast and compact automagic Java serialization

shared-memory-cache: Java off-heap cache using shared memory

Oleg Anastasyev, [email protected]/oa, @m0nstermind

github.com/odnoklassniki