Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cassandra Summit 2016


Transcript of Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cassandra Summit 2016

Page 1: Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cassandra Summit 2016

Brooke Jensen

VP Technical Operations & Customer Services

Instaclustr

Lessons learned from running over 2,000 clusters

Page 2:

Instaclustr

• Launched at 2014 Summit.

• Now 25+ staff in 4 countries.

• Engineering (dev & ops) from Canberra, AU

• Cassandra as a service (CaaS)

• AWS, Azure, SoftLayer; GCP in progress.

• Automated provisioning – running within minutes.

• 24/7 monitoring and response.

• Repairs, backups, migrations, etc.

• Expert Cassandra support.

• Spark and Zeppelin add-ons.

• Enterprise support

• For customers who cannot use a managed service or require a greater level of control of their cluster.

• Gain 24/7 access to our Engineers for “third level” Cassandra support

• Troubleshooting, advice, emergency response.

• Consulting solutions

• Data model design and/or review

• Cluster design, sizing, performance testing and tuning

• Training for developers and operational engineers

• Find out more or start a free trial

© DataStax, All Rights Reserved. 2

Page 3:

“Globally unique perspective of Cassandra.”

• Our customer base:

• Diverse. From early stage start-ups to large well-known global enterprises.

• Education, Retail, Marketing, Advertising, Finance, Insurance, Health, Social, Research.

• All use cases: Messaging, IoT, eCommerce, Analytics, Recommendations, Security.

• Small development clusters to large scale production deployments requiring 100% uptime.

• Nodes under management:

• 700+ active nodes under management.

• All versions from Cassandra 2.0.11 to Cassandra 3.7.

Page 4:

About Me

Brooke Jensen

VP Technical Operations, Customer Services / Cassandra MVP

• Previous: Senior Software Engineer, Instaclustr

• Education: Bachelor of Software Engineering

• Life before Instaclustr:

• 11+ years Software Engineering.

• Specialized in performance optimization of large enterprise systems (e.g. Australian Customs, Taxation Office, Department of Finance, Deutsche Bank)

• Extensive experience managing and resolving major system incidents and outages.

• Lives: Canberra, AU

Page 5:

Talk Overview

• Collection of common problems we see and manage on a daily basis.

• Examples and war stories from the field.

• HOWTOs, tips and tricks.

• Covering:

• Cluster Design

• Managing compactions

• Large partitions

• Disk usage and management

• Tombstones and Deletes

• Common sense advice

Page 6:

Cluster Design Basics – Racks & RF

• For production we recommend (minimum): 3 nodes in 3 racks with RF3.

Make racks a multiple of RF.

• Use logical racks and map to physical racks.

• Each rack will contain a full copy of the data.

• Can survive the loss of nodes without losing QUORUM (strong consistency)

• Use NetworkTopologyStrategy. It’s not just for multi-DC, but is also “rack aware”

ALTER KEYSPACE <keyspace> WITH replication = {'class': 'NetworkTopologyStrategy','DC1': '3'}
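A quick aside on the arithmetic (a sketch, not from the slides; `survives_rack_loss` is a hypothetical helper name): with racks a multiple of RF, each rack holds one replica of every partition, so losing a whole rack costs exactly one replica per partition and QUORUM still succeeds at RF3.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read/write."""
    return replication_factor // 2 + 1

def survives_rack_loss(rf: int, racks: int) -> bool:
    """True if losing one whole rack still leaves a QUORUM of replicas.
    Assumes racks is a multiple of rf, so each rack holds one replica
    per partition (the layout the slide recommends)."""
    if racks % rf != 0:
        return False  # replica placement per rack is no longer uniform
    replicas_lost = 1  # one replica per partition lives in the lost rack
    return rf - replicas_lost >= quorum(rf)

assert quorum(3) == 2
assert survives_rack_loss(rf=3, racks=3)      # RF3, 3 racks: rack loss is fine
assert not survives_rack_loss(rf=2, racks=2)  # RF2: QUORUM needs both replicas
```

The RF2 case is exactly the trap in the case study later in the talk: QUORUM at RF2 needs both replicas, so no node can be taken out.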


Getting this right upfront will make management of the cluster much easier in the future.

(Diagram: nine nodes spread across racks R1, R2 and R3; each rack holds a full copy of the data.)

Page 7:

The case for single racks

• DataStax docs suggest not to use racks?

• “It’s hard to set up”

• “Expanding is difficult” – not if using vnodes (default from 2.0.9)

• Spending the time to set up is WORTH IT!

• Minimizes downtime during upgrades and maintenance

• Can perform upgrades/restarts rack-by-rack

• Can (technically) lose a whole rack without downtime

• We go one step further and map racks to AWS AZs:

Page 8:

Setting it up

cassandra.yaml:

endpoint_snitch: GossipingPropertyFileSnitch

cassandra-rackdc.properties:

Executing 'cat /etc/cassandra/cassandra-rackdc.properties' on 52.37.XXX.XXX

Host 52.37.XXX.XXX response:

#Generated by Instaclustr

#Mon Mar 28 19:22:21 UTC 2016

dc=US_WEST_2

prefer_local=true

rack=us-west-2b
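As an illustration (not an Instaclustr tool), the properties file above is simple key=value text, so verifying that every node reports the expected dc/rack takes only a few lines:

```python
def parse_rackdc(text: str) -> dict:
    """Parse the key=value format used by GossipingPropertyFileSnitch's
    cassandra-rackdc.properties (comments start with '#')."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip comments and blanks
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """\
#Generated by Instaclustr
dc=US_WEST_2
prefer_local=true
rack=us-west-2b
"""
props = parse_rackdc(sample)
assert props == {"dc": "US_WEST_2", "prefer_local": "true", "rack": "us-west-2b"}
```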

Page 9:

Compactions – the basics


• Regular compactions are an integral part of any healthy Cassandra cluster.

• Occur periodically to purge tombstones, merge disparate row data into new SSTables to reclaim disk space and keep read operations optimized.

• Can have a significant disk, memory (GC), cpu, IO overhead.

• Are often the cause of “unexplained” latency or IO issues in the cluster

• Ideally, get the compaction strategy right at table creation time. You can change it later, but that may force a re-write of all of the data in that CF using the new compaction strategy.

• STCS – Insert heavy and general workloads

• LCS – Read heavy workloads, or more updates than inserts

• DTCS – Not where there are updates to old data or inserts that are out of order.
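The three bullets above can be condensed into a rough lookup; a sketch only — the workload labels are illustrative, not an official taxonomy:

```python
def suggest_compaction_strategy(workload: str) -> str:
    """Rough decision aid distilled from the slide's bullets."""
    recommendations = {
        "insert_heavy": "SizeTieredCompactionStrategy",   # STCS: inserts / general
        "general": "SizeTieredCompactionStrategy",
        "read_heavy": "LeveledCompactionStrategy",        # LCS: reads, update-heavy
        "update_heavy": "LeveledCompactionStrategy",
        # DTCS only where data arrives in order and old data is never updated
        "in_order_time_series": "DateTieredCompactionStrategy",
    }
    return recommendations[workload]

assert suggest_compaction_strategy("read_heavy") == "LeveledCompactionStrategy"
```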

Page 10:

Monitoring Compactions

$ nodetool compactionstats -H

pending tasks: 130

compaction type keyspace table completed total unit progress

Compaction instametrics events_raw 1.35 GB 1.6 GB bytes 84.77%

Compaction instametrics events_raw 1.28 GB 1.6 GB bytes 80.21%

Active compaction remaining time : 0h00m33s

• Not uncommon for large compactions to get “stuck” or fall behind.

• On 2.0 in particular. Significantly improved in 2.1, even better in 3

• A single node doing compactions can cause latency issues across the whole cluster, as it will become slow to respond to queries.

• Heap pressure will cause frequent flushing of Memtables to disk => many small SSTables => many compactions.
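The `compactionstats` output above is plain text, so alerting on pending tasks or stalled progress can start from a minimal parser; a sketch, assuming the 2.x output layout shown:

```python
import re

def parse_compactionstats(output: str):
    """Extract pending-task count and per-compaction progress percentages
    from `nodetool compactionstats` text output."""
    pending = None
    progress = []
    for line in output.splitlines():
        m = re.match(r"pending tasks:\s*(\d+)", line)
        if m:
            pending = int(m.group(1))
        m = re.search(r"([\d.]+)%\s*$", line)  # lines ending in a percentage
        if m:
            progress.append(float(m.group(1)))
    return pending, progress

sample = """\
pending tasks: 130
   compaction type   keyspace       table       completed  total   unit   progress
        Compaction   instametrics   events_raw  1.35 GB    1.6 GB  bytes  84.77%
        Compaction   instametrics   events_raw  1.28 GB    1.6 GB  bytes  80.21%
"""
pending, progress = parse_compactionstats(sample)
assert pending == 130 and progress == [84.77, 80.21]
```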

Page 11:

Compactions: other things to check

Page 12:

Managing Compactions

A few things you can do if compactions are causing issues (e.g. latency):

Throttle: nodetool setcompactionthroughput 16

• Set until C* is restarted. On 2.1 this applies to NEW compactions only; on 2.2.5+ it applies instantly.

Stop and disable: nodetool stop COMPACTION

• Case is important! Stops currently active compactions only.

Take the node out (and unthrottle):

nodetool disablebinary && nodetool disablegossip && nodetool disablethrift && nodetool setcompactionthroughput 0

• Other nodes will mark this node as down, so this needs to complete within the hinted handoff window (3h by default).

(Graph annotated with two events: "Compaction starts", "Node taken out".)

Page 13:

Large Partitions

• One of the biggest problems we deal with. Root cause of many other issues, and a PITA to manage.

• We recommend keeping them at 100MB or less.

Creates issues with:

Compactions

In 2.0, compactions of partitions > 64MB were considerably slower, with partitions > 2GB often getting stuck.

Improved in 2.1, and we observe fewer of these problems in upgraded clusters.

Adding, replacing nodes – streaming will often fail.

Querying large partitions is considerably slower. The whole partition is stored on every replica node, leading to hotspots.

Can be hard to get rid of.

Page 14:

Checking partition sizes

~ $ nodetool cfstats -H keyspace.columnfamily

Compacted partition minimum bytes: 125 bytes

Compacted partition maximum bytes: 11.51 GB

Compacted partition mean bytes: 844 bytes

$ nodetool cfhistograms keyspace columnfamily

Percentile SSTables Write Latency Read Latency Partition Size Cell Count

(micros) (micros) (bytes)

50% 1.00 14.00 124.00 372 2

75% 1.00 14.00 1916.00 372 2

95% 3.00 24.00 17084.00 1597 12

98% 4.00 35.00 17084.00 3311 24

99% 5.00 50.00 20501.00 4768 42

Min 0.00 4.00 51.00 125 0

Max 5.00 446.00 20501.00 12359319162 129557750


Huge delta between the 99th percentile and Max indicates most data (bytes) is in one partition.

Page 15:

Disk Usage

• As a guide, maintain nodes under 70% (50% for STCS).

• At 80% take action.

• Why so much headroom?

• Compactions will cause a temporary increase in disk usage while both sets of SSTables exist, but once complete will free up space that was occupied by old SSTables.

• FYI, repair requests a snapshot before execution.

• Recovering from a filled disk can be a pain, and you CAN LOSE DATA.

• C* won’t start, for a start.

• Nodes out of the cluster during recovery >3 hours will require repair.


Sep 08 05:38:15 cassandra[17118]: at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]

Sep 08 05:38:15 cassandra[17118]: Caused by: java.io.IOException: No configured data directory contains enough space to write 99 bytes

Sep 08 05:38:16 systemd[1]: cassandra.service: Main process exited, code=exited,

Sep 08 05:38:16 systemd[1]: cassandra.service: Unit entered failed state.
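The headroom guidance above reduces to a trivial check; a sketch, with the thresholds taken from the slide (warn at 70%, or 50% for STCS, and act at 80%):

```python
def disk_usage_status(used_pct: float, stcs: bool = False) -> str:
    """Classify a node's disk usage against the slide's thresholds."""
    warn_at = 50 if stcs else 70
    if used_pct >= 80:
        return "take action"
    if used_pct >= warn_at:
        return "warning"
    return "ok"

assert disk_usage_status(65) == "ok"
assert disk_usage_status(65, stcs=True) == "warning"   # STCS needs more headroom
assert disk_usage_status(85) == "take action"
```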

Page 16:

Try this first: stop writing data.

Page 17:

Can’t stop? Won’t stop?

Quick win: clearing snapshots.

nodetool cfstats or nodetool listsnapshots will show if you have any snapshots to clear:


nodetool clearsnapshot

Page 18:

Finding data to remove


I like to look at the data folders on disk – easier to identify than with cfstats.

Note also: might not just be your data. Space can commonly be consumed by snapshots or even system keyspaces.

• We’ve had nodes nearly fill up because of stored hints.

Page 19:

Tip: Removing data


DELETE - creates tombstones which will not be purged by compactions until after gc_grace_seconds

• Default is 10 days, but you can ALTER it and it is effective immediately.

• Make sure all nodes are UP before changing gc_grace.

TRUNCATE or DROP – only creates a snapshot as a backup before removing all the data.

• The disk space is released as soon as the snapshot is cleared

• Preferred where possible.
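The DELETE timing above boils down to simple arithmetic; a sketch (function name is illustrative, not a Cassandra API):

```python
DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 3600  # 10 days, the table default

def purgeable_at(deleted_at_epoch: int,
                 gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> int:
    """Earliest epoch second at which a compaction may drop the tombstone."""
    return deleted_at_epoch + gc_grace_seconds

# Lowering gc_grace (ALTER TABLE ... WITH gc_grace_seconds = 3600) takes
# effect immediately -- hence the warning to have all nodes UP first.
assert purgeable_at(0) == 864000
assert purgeable_at(0, gc_grace_seconds=3600) == 3600
```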

Page 20:

Disk Usage – Other Actions to try

• Add Nodes + run cleanups

• After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys that no longer belong to those nodes.

• If on AWS, add EBS (requires restart).

• Disable autocompactions (will negatively affect read latency, so not recommended)


Tip: JOINING (adding) nodes

• When you add nodes to a cluster, they will typically overstream data initially, using more disk space than you expect. Duplicates will be compacted away eventually.

• Disable compaction throttling while the node is JOINING.

• If streaming/joining fails and you have to restart it, the node will re-stream ALL SSTables again from the beginning, potentially filling up the disks. ‘rm’ the Cassandra data folder before restarting.

Page 21:

Compaction spikes

• Compactions, particularly large ones, will cause spikes in disk usage while both sets of SSTables exist.

• Ideally, you want the compaction(s) to complete and free up space, but how can you assess whether that is possible?

Unlikely.

Page 22:

Compaction spikes

1. Find the tmp SSTable associated with the current compaction. From this, together with % complete in compactionstats, you can get a feel for how much more space you need:

$ find /var/lib/cassandra/data/ -name "*tmp*Data.db" | xargs ls -lh

-rw-r--r-- 1 root root 4.5G Sep 1 14:56 keyspace1/posts/keyspace1-posts-tmp-ka-118955-Data.db

2. Keep a very close eye on the disk, the compaction and the size of the tmp file:

watch -n30 'df -h; ls -lh keyspace1-posts-tmp-ka-118955-Data.db; nodetool compactionstats -H'

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 746G 506M 100% /var/lib/cassandra
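Steps 1 and 2 amount to a linear extrapolation; a hedged sketch of the arithmetic (assumes the output file grows roughly in proportion to the reported progress, which real compactions only approximate):

```python
def compaction_will_fit(tmp_bytes: float, pct_complete: float,
                        free_bytes: float) -> bool:
    """Estimate whether the running compaction can finish in the free space.
    Projects the final output size from the tmp SSTable size and the
    %-complete figure reported by `nodetool compactionstats`."""
    projected_total = tmp_bytes / (pct_complete / 100.0)
    still_needed = projected_total - tmp_bytes
    return still_needed <= free_bytes

GB = 1024 ** 3
# 4.5 GB written at ~85% complete -> roughly 0.8 GB more needed,
# so 506M of free space (as in the df output above) is not enough.
assert not compaction_will_fit(4.5 * GB, 85.0, 0.5 * GB)
assert compaction_will_fit(4.5 * GB, 85.0, 1.0 * GB)
```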

Page 23:

Case study: Yesterday’s drama

Scene:

• 15-node production cluster, 12 * m4xl-1600 nodes + 3 * m4xl-800 nodes (i.e. 3 with half the storage)

• Keyspace is RF 2 and the application requires QUORUM

• (sum_of_replication_factors / 2) + 1 = 2 (i.e. both replicas)

• Therefore can’t take nodes out (or let them die) as it will cause application outage.

• Peak processing time is 8am-6pm.

• Need to keep the node up until the end of the day.

• Write heavy workload

Page 24:

09:33:

~ $ df -h

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 777G 10G 99% /var/lib/cassandra

11:03:

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 781G 5.9G 100% /var/lib/cassandra

12:37:

~ $ nodetool disableautocompaction

~ $ df -h

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 769G 18G 98% /var/lib/cassandra

Page 25:


13:40:

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 785G 2G 100% /var/lib/cassandra

Crap.

Page 26:

Solution was to move one CF to EBS in the background before the disk fills up.

~ $ du -hs /var/lib/cassandra/data/prod/*

89G /var/lib/cassandra/data/prod/cf-39153090119811e693793df4078eeb99

38G /var/lib/cassandra/data/prod/cf_one_min-e17256f091a011e5a5c327b05b4cd3f4

~ $ rsync -aOHh /var/lib/cassandra/data/prod/cf_one_min-e17256f091a011e5a5c327b05b4cd3f4 /mnt/ebs/

Meanwhile:

Filesystem Size Used Avail Use% Mounted on

/dev/md127 787G 746G 906M 100% /var/lib/cassandra

/dev/xvdp 79G 37G 39G 49% /mnt/ebs

Now just bind mount it, and restart Cassandra:

/dev/xvdp on /lib/cassandra/data/prod/cf_one_min-e17256f091a011e5a5c327b05b4cd3f4

Page 27:

Monitoring – how we detect problems

• Client read and write latency

• Local CF read and write latency

• Number of reads or writes deviating from average

• Outlier nodes

• Down nodes

• Disk usage

• Pending compactions

• Check for large partitions (data model issues)

• In the logs:

• Large batch warnings

• Tombstone warnings

• Excessive GC and/or long pauses

Page 28:

Case study: Don’t break your cluster.

WARNING! It is possible to get your cluster into a state from which you are unable to recover without significant downtime or data loss.

Page 29:

“This happened during normal operations at night, so I don't think any of us were doing anything abnormal. We've been doing some processing that creates pretty heavy load over the last few weeks...”

Orly?

Page 30:

Unthrottled data load

• Load average of 56, on 8 core machines.

• Nodes were saturated and exhausted heap space.

• Regular GC pauses of 12000ms - 17000ms

• Memtables were frequently flushed to disk.

• This resulted in over 120,000 small SSTables being created on some nodes.

• Data was spread across thousands of SSTables, so read latency skyrocketed.

• Was using paxos writes (LWT), which require a read before every write. This caused writes to fail because reads were timing out.

• Compactions could not keep up, and added additional load to the already overloaded nodes.

• C* eventually crashed on most nodes, leaving some corrupt SSTables.

Page 31:

17 second GC pauses. Nice.

Aug 16 15:51:58 INFO o.a.cassandra.service.GCInspector ConcurrentMarkSweep GC in 12416ms. CMS Old Gen: 6442450872 -> 6442450912; Par Eden Space: 1718091776 -> 297543768; Par Survivor Space: 214695856 -> 0

Aug 16 15:52:20 INFO o.a.cassandra.service.GCInspector ConcurrentMarkSweep GC in 17732ms. CMS Old Gen: 6442450912 -> 6442450864; Par Eden Space: 1718091776 -> 416111040; Par Survivor Space: 214671752 -> 0

Heap pressure causes C* to flush Memtables to disk. This created >120,000 SSTables on some nodes.

3+ days just to catch up on compactions, which were continually failing because of:

Aug 18 22:11:43 java.io.FileNotFoundException: /var/lib/cassandra/data/keyspace/cf-f4683d90f88111e586b7e962b0d85be3/keyspace-cf-ka-1243722-Data.db (Too many open files)

java.lang.RuntimeException: java.io.FileNotFoundException: /var/lib/cassandra/data/keyspace/cf-f4683d90f88111e586b7e962b0d85be3/keyspace-cf-ka-1106806-Data.db (No such file or directory)

Page 32:

1. Once we got C* stable and caught up on compactions, there were still corrupt SSTables present and nodes were in an inconsistent state.

2. Couldn’t fix with repairs:

ERROR o.apache.cassandra.repair.Validator Failed creating a merkle tree for [repair #21be1ac0-6809-11e6-a098-b377cb035d78 on keyspace/cf, (-227556542627198517,-225096881583623998]], /52.XXX.XXX.XXX (see log for details)

ERROR o.a.c.service.CassandraDaemon Exception in thread Thread[ValidationExecutor:708,1,main]

java.lang.NullPointerException: null

3. Have deleted corrupt SSTables on some nodes. This is OK provided there are other copies of the data in the cluster; we’ll have to repair later.

4. Run online scrubs on each node to identify corrupt SSTables, and fix (rewrite) where possible.

5. For nodes where online scrub does not complete, take the node offline and attempt an offline scrub of identified corrupt SSTables.

6. If the offline scrub fails to rewrite any SSTables on a node, delete those remaining corrupt SSTables.

7. Run a repair across the cluster to make data consistent across all nodes.

As at 8 September, 3 weeks after the initial data load, the cluster is STILL in an inconsistent state, with corrupt SSTables and queries occasionally failing.


Long road to recovery

Page 33:

Some final tips

• When making major changes to the cluster (expanding, migrating, decommissioning), GO SLOW.

• It takes longer to recover from errors than just doing it right the first time.

• Things I’ve seen customers do:

• Rebuild 16 nodes in a new DC concurrently

• Decommission multiple nodes at once

• Unthrottled data loads

• Keep C* up to date, but not too up to date.

• 2.0 has troubles with large compactions

• Currently investigating segfaults with MV in 3.7

• Read the source code.

• It is the most thorough and up to date documentation.
