Cassandra: From tarball to production
Why talk about this?
You are about to deploy Cassandra
You are looking for “best practices”
You don’t want:
... to scour the documentation
... to do something known not to work well
... to forget an important step
What we won’t cover
● Cassandra: how does it work?
● How do I design my schema?
● What’s new in Cassandra X.Y?
So many things to do
Monitoring · Snitch · DC/Rack settings · Time sync
Seeds/Autoscaling · Full/incremental backups
AWS instance selection · AWS AMI (image) selection
Disk - SSD? · Disk space - 2x?
Periodic repairs · Replication strategy · Compaction strategy
SSL/VPC/VPN · Authorization + authentication
OS conf - users · limits · perms · FS type · logs · path
C* start/stop · Use case evaluation
Chef to the rescue?
Chef community cookbook available:
https://github.com/michaelklishin/cassandra-chef-cookbook
● Installs Java
● Creates a “cassandra” user/group
● Downloads/extracts the tarball
● Fixes up ownership
● Builds the C* configuration files
● Sets the ulimits for file handles, processes, memory locking
● Sets up an init script
● Sets up data directories
Chef Cookbook Coverage
Monitoring · Snitch · DC/Rack settings · Time sync
Seeds/Autoscaling · Full/incremental backups
Disk - SSD? · Disk - how much?
AWS instance type · AWS AMI (image) selection
Periodic repairs · Replication strategy · Compaction strategy
SSL/VPC/VPN · Authorization + authentication
OS conf - users · limits · perms · FS type · logs · path
C* start/stop · Use case evaluation
Monitoring
Is every node answering queries?
Are nodes talking to each other?
Are any nodes running slowly?
Push UDP! (statsd)
http://hackers.lookout.com/2015/01/cassandra-monitoring/
https://github.com/lookout/cassandra-statsd-agent
Monitoring - Synthetic
Health checks, bad and good
● ‘nodetool status’ exit code
○ Might return 0 if the node is not accepting requests
○ Slow, cross-node reads
● cqlsh -u sysmon -p password < /dev/null
○ Verifies this node can read the auth table
○ https://github.com/lookout/cassandra-health-check
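The good health check can be wired into any monitoring system as a one-liner; a sketch, where the `sysmon` user and its password are placeholders for a low-privilege account you create yourself:

```
$ cqlsh -u sysmon -p password < /dev/null ; echo $?
```

An exit code of 0 means this node authenticated and served a read from the auth table; anything else should page someone.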
What about OpsCenter?
We chose not to use it
We want a consistent interface for all monitoring
GUI vs. command-line argument
Didn’t see good auditing capabilities
Didn’t interface well with our Chef solution
Snitch
Use the right snitch!
● AWS? Ec2MultiRegionSnitch
● Google? GoogleCloudSnitch
● Elsewhere? GossipingPropertyFileSnitch
NOT
● SimpleSnitch (the default)
Community cookbook: set it!
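Setting the snitch is a single line in cassandra.yaml; a sketch for the AWS multi-region case:

```
# cassandra.yaml -- pick the snitch that matches where the cluster runs
endpoint_snitch: Ec2MultiRegionSnitch
```

Changing the snitch on a live cluster changes replica placement, so set it correctly before loading data.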
What is RF?
Replication Factor: how many copies of the data
The value is hashed to determine the primary host
Additional copies always go to the next nodes on the ring
[Diagram: hash ring of nodes -- “Hash here” marks where the value lands to pick the primary replica]
What is CL?
Consistency Level -- it’s not RF!
Describes how many nodes must respond before the operation is considered COMPLETE
CL_ONE - only one node responds
CL_QUORUM - (RF/2)+1 nodes (round down)
CL_ALL - all RF nodes respond
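The quorum formula can be checked directly in the shell, since shell integer division does the “round down” for us:

```shell
# CL_QUORUM size = floor(RF/2) + 1
rf=3
echo "RF=$rf -> quorum=$(( rf / 2 + 1 ))"   # RF=3 -> quorum=2
```

So with RF=3, two replicas must answer; with RF=5, three must.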
DC/Rack Settings
You might need to set these
Maybe you’re not in Amazon
Rack == Availability Zone?
Hard: renaming a DC or adding racks
Renaming DCs
Clients “remember” which DC they talk to
Renaming a single DC causes all clients to fail
Better to spin up a new DC than to rename the old one
Adding a rack
Start with a 6-node cluster, all in rack R1
Replication factor 3
Add 1 node in R2 and rebalance
ALL the data lands on the R2 node? (replica placement prefers distinct racks)
Good idea to keep racks balanced
I don’t have time for this
Clusters must have synchronized time
You will get lots of drift with: [0-3].amazon.pool.ntp.org
Community cookbook doesn’t cover anything here
Better make time for this
C* serializes write operations by timestamp
Clocks on virtual machines drift!
It’s the relative difference among clocks that matters
C* nodes should synchronize with each other
Solution: use a pair of peered NTP servers (stratum 2 or 3) syncing to a small set of known upstream providers
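One way to realize the peered pair; a sketch of /etc/ntp.conf on the first internal NTP server (the hostnames are hypothetical, and the second server mirrors this with the peer line pointing back):

```
# /etc/ntp.conf on ntp1 (ntp2 is identical but peers with ntp1)
server 0.pool.ntp.org iburst       # small, fixed set of upstreams
server 1.pool.ntp.org iburst
peer   ntp2.internal.example.com   # keep the pair agreeing with each other
```

The Cassandra nodes then list only ntp1 and ntp2 as their time servers, so the whole cluster drifts together instead of apart.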
From a small seed…
Seeds are how new nodes find the cluster
Every node should use the same seed list
Seed nodes learn of topology changes faster
Each seed node must be listed in the config file
Multiple seeds per datacenter are recommended
Tricky to configure on AWS
Backups - Full + Incremental
Nothing in the cookbooks for this
C* makes it “easy”: snapshot, then copy
Snapshots might require a lot more space
Remove the snapshot after copying it
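The snapshot-then-copy flow can be sketched with nodetool; the keyspace name, snapshot tag, and S3 bucket below are placeholders:

```
$ nodetool snapshot -t nightly mykeyspace        # hard-links the live SSTables
$ tar czf - /var/lib/cassandra/data/mykeyspace/*/snapshots/nightly \
    | aws s3 cp - s3://my-backups/$(hostname)-nightly.tar.gz
$ nodetool clearsnapshot -t nightly              # reclaim the space afterwards
```

The snapshot itself is cheap (hard links), but it pins old SSTables on disk until you clear it, which is why the last step matters.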
Disk selection
● SSD (ephemeral): low latency, great random r/w perf, no network use for disk; recommended, but not cheap
● Rotational (ephemeral): any size instance, good write performance, no network use for disk
● EBS: any size instance, less expensive, no node rebuilds (data survives the instance)
AWS Instance Selection
We moved to EC2:
● c3.2xlarge (15GiB mem, 160GB disk)?
● i2.xlarge (30GiB mem, 800GB disk)
Max recommended storage per node is 1TB
Use instance types that support HVM
Per AWS: some previous generation instance types (T1, C1, M1, M2) do not support Linux HVM AMIs; some current generation types (T2, I2, R3, G2, C4) do not support PV AMIs.
How much can I use?
Snapshots take space (eventually: they are hard links until compaction removes the originals)
Best practice: keep disks half full!
An 800GB disk becomes 400GB of usable space
Snapshots during repairs?
Lots of uses for snapshots!
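The half-full rule as arithmetic -- compaction and lingering snapshots need headroom roughly equal to the live data:

```shell
# Plan capacity at 50% of the raw disk
disk_gb=800
echo "plan for $(( disk_gb / 2 ))GB of live data"   # plan for 400GB of live data
```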
Periodic Repairs
Buried in the docs:
“As a best practice, you should schedule repairs weekly”
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
● “-pr” (yes: repair only this node’s primary range)
● “-par” (maybe: parallel repair is faster but heavier)
● “--in-local-dc” (no: skips replicas in other DCs)
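A weekly, one-node-at-a-time schedule can be as simple as a staggered cron entry; a sketch of /etc/cron.d on node 1, where each subsequent node uses a different day of the week:

```
# /etc/cron.d/cassandra-repair -- node 1 runs Mondays at 02:00
0 2 * * 1  cassandra  nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1
```

Staggering by day keeps only one node repairing at a time, per the tips below.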
Repair Tips
Raise gc_grace_seconds if tombstones are an issue
Run on one node at a time
Schedule for low-usage hours
Use “-par” if you have quiet time (it’s faster)
Tune with: nodetool setcompactionthroughput
I thought I deleted that
Compaction removes “old” tombstones
10-day default grace period (gc_grace_seconds)
After that, deletes will not be propagated!
Run ‘nodetool repair’ at least every 10 days
Once a week is perfect (3 days of margin)
Node down more than 7 days? ‘nodetool removenode’ it!
Changing RF within a DC?
Easy to decrease RF
Increasing RF (usually) can’t be done without repairs
Until repair completes, reads at CL_ONE might miss data!
[Diagram: hash ring showing replica placement after the RF change]
Replication Strategy
How many replicas should we have?
What happens if some data is lost?
Are you write-heavy or read-heavy?
Quorum considerations: odd is better!
RF=1? RF=3? RF=5?
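Why odd is better: going from RF=3 to RF=4 raises the quorum size but not the number of node failures you can tolerate. A quick check:

```shell
# tolerated failures at CL_QUORUM = RF - quorum = RF - (floor(RF/2)+1)
for rf in 2 3 4 5; do
  q=$(( rf / 2 + 1 ))
  echo "RF=$rf quorum=$q tolerates=$(( rf - q ))"
done
```

RF=3 and RF=4 both tolerate exactly one node down, so the fourth replica buys storage cost without extra availability.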
Magic JMX setting: reduce traffic to a node
Great when a node is “behind” the 4-hour window
Used by the gossiper to divert traffic during repairs
Writes: ok; read repair: ok; nodetool repair: ok
$ java -jar jmxterm.jar -l localhost:7199
$> set -b org.apache.cassandra.db:type=DynamicEndpointSnitch Severity 10000
Don’t be too severe!
Compaction Strategy
Mostly solved by a good C* design
SizeTiered or Leveled?
● Leveled has better guarantees for read times
● SizeTiered may require 10 (or more) reads!
● Leveled uses less disk space
● Leveled tombstone collection is slower
Auth*
Cookbooks default to OFF
Turn authenticator and authorizer on
The ‘cassandra’ user is super special:
● its sign-on requires QUORUM (cross-DC!)
● all other users sign on at LOCAL_ONE
Users
OS users vs. Cassandra users: 1 to 1?
Shared credentials for apps?
Nothing logs which user took the action!
The ‘cassandra’ user is created by the cookbook
All processes run as ‘cassandra’
Limits
Chef helps here! At startup:
ulimit -l unlimited  # memory lock
ulimit -n 48000      # file descriptors
/etc/security/limits.d:
cassandra - nofile 48000
cassandra - nproc unlimited
cassandra - memlock unlimited
Filesystem Type
Officially supported: ext4 or XFS
XFS is slightly faster
Interesting options:
● ext4 without a journal
● ext2
● zfs
Logs
To consolidate or not to consolidate?
Push or pull? Usually push!
FOSS: syslogd, syslog-ng, logstash/kibana, heka, banana
Commercial: Splunk, SumoLogic, Loggly, Stackify
Shutdown
Nice init script with the cookbook; the steps are:
● nodetool disablethrift (no more clients)
● nodetool disablegossip (stop talking to the cluster)
● nodetool drain (flush all memtables)
● kill the JVM
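The shutdown steps as a transcript; the pid lookup at the end is a sketch, since your init script may track the pid differently:

```
$ nodetool disablethrift    # stop accepting client connections
$ nodetool disablegossip    # leave the ring quietly
$ nodetool drain            # flush memtables; node stops accepting writes
$ kill $(pgrep -f CassandraDaemon)
```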
Quick performance wins
● Disable assertions - cookbook property
● No swap space (or vm.swappiness=1)
● Tune concurrent_reads
● Tune concurrent_writes
Thank You!