PRACTICE MAKES PERFECT: EXTREME CASSANDRA OPTIMIZATION
@AlTobey, Tech Lead, Compute and Data Services
#CASSANDRA
Thursday, August 8, 13
Outline
⁍ About me / Ooyala
⁍ How not to manage your Cassandra clusters
⁍ Make it suck less
⁍ How to be a heuristician
⁍ Tools of the trade
⁍ More settings
⁍ Show & tell
@AlTobey
⁍ Tech Lead, Compute and Data Services at Ooyala, Inc.
⁍ C&D team is #devops: 3 ops, 3 eng, me
⁍ C&D team is #bdaas: Big Data as a Service
⁍ ~100 Cassandra nodes, expanding quickly
⁍ Obligatory: we're hiring
Ooyala
⁍ Founded in 2007
⁍ 230+ employees globally
⁍ 200M unique users, 110+ countries
⁍ Over 1 billion videos played per month
⁍ Over 2 billion analytic events per day
Ooyala & Cassandra
Ooyala has been using Cassandra since v0.4. Use cases:
⁍ Analytics data (real-time and batch)
⁍ Highly available K/V store
⁍ Time series data
⁍ Play head tracking (cross-device resume)
⁍ Machine learning data
Ooyala: Legacy Platform
[Architecture diagram: players and loggers feed an API; data flows through S3 and a Hadoop cluster to the ABE Service and a Cassandra cluster, a read-modify-write pipeline]
Avoiding read-modify-write
cassandra13_drinks column family

memtable:
  Albert     Tuesday  6    Wednesday  0
  Evan       Tuesday  0    Wednesday  0
  Frank      Tuesday  3    Wednesday  3
  Kelvin     Tuesday  0    Wednesday  0
  Krzysztof  Tuesday  0    Wednesday  0
  Phillip    Tuesday 12    Wednesday  0
Avoiding read-modify-write
cassandra13_drinks column family

memtable:
  Albert     Tuesday  2    Wednesday  0
  Phillip    Tuesday  0    Wednesday  1

sstable:
  Albert     Tuesday  6    Wednesday  0
  Evan       Tuesday  0    Wednesday  0
  Frank      Tuesday  3    Wednesday  3
  Kelvin     Tuesday  0    Wednesday  0
  Krzysztof  Tuesday  0    Wednesday  0
  Phillip    Tuesday 12    Wednesday  0
Avoiding read-modify-write
cassandra13_drinks column family

memtable:
  Albert     Tuesday 22    Wednesday  0

sstable (newer):
  Albert     Tuesday  2    Wednesday  0
  Phillip    Tuesday  0    Wednesday  1

sstable (older):
  Albert     Tuesday  6    Wednesday  0
  Evan       Tuesday  0    Wednesday  0
  Frank      Tuesday  3    Wednesday  3
  Kelvin     Tuesday  0    Wednesday  0
  Krzysztof  Tuesday  0    Wednesday  0
  Phillip    Tuesday 12    Wednesday  0
Avoiding read-modify-write
cassandra13_drinks column family

sstable (after compaction):
  Albert     Tuesday 22    Wednesday  0
  Evan       Tuesday  0    Wednesday  0
  Frank      Tuesday  3    Wednesday  3
  Kelvin     Tuesday  0    Wednesday  0
  Krzysztof  Tuesday  0    Wednesday  0
  Phillip    Tuesday  0    Wednesday  1
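The sequence above can be sketched in Python: writes land in the memtable without any read, and when tables are merged the most recently written value for each cell wins. This is a simplified sketch, not Cassandra's implementation (real Cassandra resolves cells by per-cell timestamp); the dict-based tables are illustrative only.

```python
def merge_lww(tables):
    """Merge tables ordered oldest -> newest; for each (row, column)
    cell the newest write wins, so no read-before-write is needed."""
    merged = {}
    for table in tables:  # oldest first; later tables overwrite cells
        for row, cols in table.items():
            merged.setdefault(row, {}).update(cols)
    return merged

# Cell values from the slides: an older sstable, a newer flushed
# sstable, and the current memtable.
older = {"Albert": {"Tuesday": 6, "Wednesday": 0},
         "Phillip": {"Tuesday": 12, "Wednesday": 0}}
newer = {"Albert": {"Tuesday": 2, "Wednesday": 0},
         "Phillip": {"Tuesday": 0, "Wednesday": 1}}
memtable = {"Albert": {"Tuesday": 22, "Wednesday": 0}}

result = merge_lww([older, newer, memtable])
print(result["Albert"]["Tuesday"], result["Phillip"]["Tuesday"])  # 22 0
```

Note how Phillip's older Tuesday value of 12 disappears in the compacted result: the newest write for that cell was 0, and no reader or writer ever had to fetch the old value first.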
2011: 0.6 ➜ 0.8
⁍ Migration is still a largely unsolved problem
⁍ Wrote a tool in Scala to scrub data and write via Thrift
⁍ Rebuilt indexes; faster than copying
[Diagram: Scala Map/Reduce migration from the Hadoop + Cassandra cluster to the new Cassandra cluster via GlusterFS P2P and Thrift]
Changes: 0.6 ➜ 0.8
⁍ Cassandra 0.8
⁍ 24GiB heap
⁍ Sun Java 1.6 update
⁍ Linux 2.6.36
⁍ XFS on MD RAID5
⁍ Disabled swap, or at least vm.swappiness=1
2012: Capacity Increase
⁍ 18 nodes ➜ 36 nodes
⁍ DSE 3.0
⁍ Stale tombstones again!
⁍ No downtime!
[Diagram: Scala Map/Reduce migration from the Cassandra cluster to DSE 3.0 via GlusterFS P2P and Thrift]
System Changes: Apache 1.0 ➜ DSE 3.0
⁍ DSE 3.0 installed via apt packages
⁍ Unchanged: heap, distro
⁍ Ran much faster this time!
⁍ Mistake: moved to MD RAID 0. Fix: RAID10 or RAID5 on MD, ZFS, or btrfs
⁍ Mistake: running on Ubuntu Lucid. Fix: Ubuntu Precise
Config Changes: Apache 1.0 ➜ DSE 3.0
⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0
2013: Datacenter Move
⁍ 36 nodes ➜ lots more nodes
⁍ As usual, no downtime!
[Diagram: replication between two DSE 3.1 clusters]
Coming Soon for Cassandra at Ooyala
Upcoming use cases:
⁍ Store every event from our players at full resolution
⁍ Cache code for our Spark job server
⁍ AMPLab Tachyon backend?
Next Generation Architecture: Ooyala Event Store
[Diagram: players and loggers feed the API into Kafka; an ingest process writes to DSE 3.1; Spark and a job server sit on top, with Tachyon as a possible backend]
There's more to tuning than performance:
⁍ Security
⁍ Cost of goods sold
⁍ Operations / support
⁍ Developer happiness
⁍ Physical capacity (CPU/memory/network/disk)
⁍ Reliability / resilience
⁍ Compromise
I am not a scientist ... heuristician?
⁍ I'd love to be more scientific, but production comes first
⁍ Sometimes you have to make educated guesses
⁍ It's not as difficult as it's made out to be
⁍ Your brain is great at heuristics. Trust it.
⁍ Concentrate on bottlenecks
⁍ Make incremental changes
⁍ Read Malcolm Gladwell's "Blink"
The OODA Loop
Observe, Orient, Decide, Act:
⁍ Observe the system in production under load
⁍ Make small, safe changes
⁍ Observe
⁍ Commit or revert
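One pass of that loop can be written down as code. This is a toy sketch of the process, not a real tuning tool; `measure`, `apply_change`, and `revert` are hypothetical stand-ins for whatever metric and knob you are working with.

```python
def ooda_step(measure, apply_change, revert):
    """One pass of the tuning loop: observe a baseline, make one small
    safe change, observe again under load, then commit or revert."""
    baseline = measure()      # Observe / Orient
    apply_change()            # Decide / Act: one small change
    after = measure()         # Observe again
    if after < baseline:      # here, lower (e.g. latency) is better
        return "commit", after
    revert()
    return "revert", baseline

# Toy example: a fake latency metric that improves after the change.
state = {"latency_ms": 120.0}
decision, latency = ooda_step(
    measure=lambda: state["latency_ms"],
    apply_change=lambda: state.update(latency_ms=95.0),
    revert=lambda: state.update(latency_ms=120.0),
)
print(decision, latency)  # commit 95.0
```

The key property is that every change is small enough to revert cleanly, so a bad guess costs one loop iteration rather than an outage.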
Testing Shiny Things
⁍ Like kernels
⁍ And Linux distributions
⁍ And ZFS
⁍ And btrfs
⁍ And JVMs & parameters
⁍ Test them in production!
Testing Shiny Things: In Production
[Diagram: a ring of nodes mostly on ext4, with single nodes trying ZFS, btrfs, and a kernel upgrade]
Brendan Gregg's Tool Chart
http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
dstat -lrvn 10
cl-netstat.pl
https://github.com/tobert/perl-ssh-tools
iostat -x 1
htop
jconsole
opscenter
nodetool ring

10.10.10.10  Analytics  rack1  Up  Normal   47.73 MB   1.72%  101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up  Normal   63.94 MB   0.86%  102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up  Normal   85.73 MB   0.86%  104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up  Normal   47.87 MB   0.86%  105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up  Normal   39.73 MB   0.86%  107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up  Normal   40.74 MB   1.75%  110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up  Normal   40.08 MB   2.20%  113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up  Normal   56.19 MB   3.45%  119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up  Normal  214.88 MB  11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up  Normal  214.29 MB   2.45%  143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up  Normal  158.49 MB   1.76%  146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up  Normal   40.3 MB    0.92%  148140168357822348318107048925037023042
nodetool cfstats

Keyspace: gostress
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 0
        Write Latency: NaN ms.
        Pending Tasks: 0
                Column Family: stressful
                SSTable count: 1
                Space used (live): 32981239
                Space used (total): 32981239
                Number of Keys (estimate): 128
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 0
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 0
                Write Latency: NaN ms.
                Pending Tasks: 0
                Bloom Filter False Positives: 0
                Bloom Filter False Ratio: 0.00000
                Bloom Filter Space Used: 336
                Compacted row minimum size: 7007507
                Compacted row maximum size: 8409007
                Compacted row mean size: 8409007

Could be using a lot of heap
Controllable by sstable_size_in_mb
nodetool proxyhistograms

Offset    Read Latency    Write Latency    Range Latency
35                   0               20                0
42                   0               61                0
50                   0               82                0
60                   0              440                0
72                   0             3416                0
86                   0            17910                0
103                  0            48675                0
124                  1            97423                0
149                  0           153109                0
179                  2           186205                0
215                  5           139022                0
258                134            44058                0
310               2656            60660                0
372              34698           742684                0
446             469515          7359351                0
535            3920391         31030588                0
642            9852708         33070248                0
770            4487796          9719615                0
924             651959           984889                0
nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
  compaction type  keyspace  column family    bytes compacted  bytes total  progress
       Compaction    hastur  gauge_archive         9819749801  16922291634    58.03%
       Compaction    hastur  counter_archive      12141850720  16147440484    75.19%
       Compaction    hastur  mark_archive           647389841   1475432590    43.88%
Active compaction remaining time : n/a
al@node ~ $ nodetool compactionstats
pending tasks: 3
  compaction type  keyspace  column family    bytes compacted  bytes total  progress
       Compaction    hastur  gauge_archive        10239806890  16922291634    60.51%
       Compaction    hastur  counter_archive      12544404397  16147440484    77.69%
       Compaction    hastur  mark_archive          1107897093   1475432590    75.09%
Active compaction remaining time : n/a
Stress Testing Tools
⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown
/etc/sysctl.conf

kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1
/etc/rc.local

ra=$((2**14))  # 16KiB readahead
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size
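The arithmetic in the rc.local snippet, for reference: `blockdev --setra` takes a count of sectors, so the slide divides the target readahead in bytes by the sector size reported by `blockdev --getss`. A minimal sketch of that conversion (the 512-byte sector size is an assumption for the example):

```python
def setra_value(readahead_bytes: int, sector_size: int) -> int:
    """Convert a readahead size in bytes to the sector count that
    `blockdev --setra` expects."""
    return readahead_bytes // sector_size

# 2**14 bytes = 16 KiB; with 512-byte sectors that is 32 sectors.
print(setra_value(2**14, 512))  # 32
```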
JVM Args

-Xmx8G      leave it alone
-Xms8G      leave it alone
-Xmn1200M   100MiB * nCPU
-Xss180k    should be fine

-XX:+UseNUMA
numactl --interleave
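The `-Xmn` rule of thumb from the slide (100MiB of young generation per CPU) as a quick computation; the 12-core count below is an assumption chosen to match the slide's 1200M example.

```python
def xmn_mib(n_cpus: int, mib_per_cpu: int = 100) -> str:
    """Young-generation flag per the slide's 100MiB-per-CPU rule."""
    return f"-Xmn{n_cpus * mib_per_cpu}M"

print(xmn_mib(12))  # -Xmn1200M
```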
cgroups
Provides fine-grained control over Linux resources
⁍ Makes the Linux scheduler better
⁍ Lets you manage systems under extreme load
⁍ Useful on all Linux machines
⁍ Can choose between determinism and flexibility
cgroups

# quoted heredoc so $cpucg and $$ expand when the file is sourced,
# not when it is written; the cpu controller's weight file is cpu.shares
cat >> /etc/default/cassandra <<'EOF'
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir $cpucg
cat $cpucg/../cpuset.mems > $cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus > $cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF
Successful Experiment: btrfs

mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
mount -o compress=lzo /dev/sdc1 /data
Successful Experiment: ZFS on Linux

zpool create data raidz /dev/sd[c-h]
zfs create data/cassandra
zfs set compression=lzjb data/cassandra
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra
Conclusions
⁍ Tuning is multi-dimensional
⁍ Production load is your most important benchmark
⁍ Lean on Cassandra, experiment!
⁍ No one metric tells the whole story
Questions?
⁍ Twitter: @AlTobey
⁍ Github: https://github.com/tobert
⁍ Email: al@ooyala.com / tobert@gmail.com