Transcript of: Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra (Sam Bisbee, Threat Stack) | C* Summit 2016
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra
Sam Bisbee, Threat Stack CTO
Typical [time series] problems on C*
● Disk utilization creates a scaling pattern of lighting money on fire
– Only works for a month or two, even with 90% disk utilization
● Every write-up we found focused on schema design for tracking integers across time
– There are days we wish we only tracked integers
● Data drastically loses value over time, but C*'s design doesn't acknowledge this
– TTLs only address zero-value states, not partial value
– Ex., 99% of reads are for data in its first day
● Not all sensors are equal
Categories of Time Series Data

[Chart: size of Tx's vs. volume of Tx's, placing CRUD/Web 2.0, system monitoring (CPU, etc.), traditional object stores, and Threat Stack]
Categories of Time Series Data

[Same chart, annotated: system monitoring is the traditional time series on C*, what everyone writes about; Threat Stack's corner is labeled “We're going to need a bigger boat. Or disks.”]
We care about this thing called margins
(see: we're in Boston, not the Valley)
Data at Threat Stack
● 5 to 10TBs per day of raw data
– Crossed several TB per day in first few months of production with ~4 people
● 80,000 to 150,000 Tx per second, analyzed in real time
– Internal goal of analyzing, persisting, and firing alerts in <1s
● 90% write to 10% read tx
● Pre-compute query results for 70% of queries for UI
– Optimized lookup tables & complex data structures, not just “query & cache”
● 100% AWS, distrust of remote storage in our DNA
– This is not just EBS bashing. This applies to all databases on all platforms, even a cage in a data center.
● By the way, we're on DSE 4.8.4 (C* 2.1)
Generic data model
● Entire platform assumes that events form a partially ordered, eventually consistent, write ahead log
– A wonderful C* use case, so long as you only INSERT
● UPDATE is a dirty word and C* counters are “banned”
– We do our big counts elsewhere (“right tool for the right job”)
● No DELETEs, too many key permutations and don't want tombstones
● Duplicate writes will happen
– Legitimate: fully or partially failed batches of writes
– Legitimate: sensor resends data because it doesn't see platform's acknowledgement of data
– How-do-you-even-computer: people cannot configure NTP, so have fun constantly receiving data from 1970
● TTL on insert time, store and query on event time
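The insert-only, duplicate-tolerant model above leans on C*'s upsert semantics; a minimal sketch with an in-memory dict standing in for a table (names hypothetical):

```python
# Stand-in for a C* table: rows keyed on (sensor_id, event_day, event_id).
# In C*, an INSERT with the same primary key overwrites the row, so the
# legitimate duplicate writes above are harmless no-ops, not extra rows.
table = {}

def insert_event(sensor_id, event_day, event_id, payload):
    table[(sensor_id, event_day, event_id)] = payload

insert_event("s1", "2016-09-08", "e1", {"syscall": "execve"})
insert_event("s1", "2016-09-08", "e1", {"syscall": "execve"})  # sensor resend
```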
We need to show individual events or slices,
cannot use time granularity rows
(1min, 15min, 30min, 1hr, etc.)
Creating and updating tables' schema
● ALTER TABLE isn't fun, so we support dual writes instead
– Create new schema, performing dual reads for new & old
– Cut writes over to new schema
– After TTL time, DROP TABLE old
● Each step is verifiable with unit tests and metrics
● Maintains insert only data model for temporary disk util cost
● Allows trivial testing of analysis and A/B'ing of schema
– Just toss a new schema in, gather some insights, and then feel free to drop it
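A sketch of the dual-write/dual-read migration above, with dicts standing in for the old and new tables (all names hypothetical):

```python
class SchemaMigration:
    """Step 1: dual write + dual read. Step 2: flip writes to new only.
    Step 3 (after the TTL window): drop the old table entirely."""

    def __init__(self, old_table, new_table):
        self.old, self.new = old_table, new_table
        self.write_old = True  # set False at cutover

    def write(self, key, row):
        self.new[key] = row
        if self.write_old:
            self.old[key] = row

    def read(self, key):
        # Prefer the new schema; fall back to the old until its TTL drains.
        return self.new[key] if key in self.new else self.old.get(key)
```

Each step is observable: when metrics show old-table fallback reads hitting zero, the TTL window has drained and DROP TABLE is safe.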
AWS Instance Types & EBS
● EBS is generally banned on our platform
– Too many of us lived through the great outage
– Too many of us cannot live with unpredictable I/O patterns
– Biggest reason: you cannot RI EBS
● Originally used i2.2xlarge's in 2014/2015
– Considering the amount of “learning” we did, we were very grateful for SSDs given the amount of streaming we had to do
● Moved to d2.xlarge's and d2.2xlarge's in 2015
– RAID 0 the spindles with xfs
– We like the CPU and RAM to disk ratio, especially since compaction stops after a few hours
$/TB on AWS
| | i2.2xlarge | d2.2xlarge | c3.2xlarge + 6 x 2TB io1 EBS |
| --- | --- | --- | --- |
| No Prepay | $619.04 / 1.6TB = $386 / TB / month | $586.92 / 12TB = $49.91 / TB / month | $1,713.16 / 12TB = $142.77 / TB / month |
| Partial Prepay | $530.37 / 1.6TB = $331.48 / TB / month | $502.12 / 12TB = $41.85 / TB / month | $1,684.59 / 12TB = $140.39 / TB / month |
| Full Prepay | $519.17 / 1.6TB = $324.85 / TB / month | $492 / 12TB = $41 / TB / month | $1,680.84 / 12TB = $140.07 / TB / month |
● Amortizes one-time RI across 1yr, focusing on cost instead of cash out of pocket
● Does not account for N=3 in cluster, so x3 for each record, then x2 for worst case compaction headroom (realistically need MUCH LESS)
● c3 column assumes d2 comparison on disk size, not fair versus i2
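The table's figures are simple division; a sketch using the i2.2xlarge no-prepay row, plus the worst-case multipliers from the notes:

```python
def cost_per_tb_month(monthly_instance_cost, usable_tb):
    return monthly_instance_cost / usable_tb

# i2.2xlarge, no prepay: $619.04/month over 1.6TB of local SSD
raw = cost_per_tb_month(619.04, 1.6)  # ~$386/TB/month

# Worst case per the note: x3 for three replicas, x2 compaction headroom
effective = raw * 3 * 2
```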
We only store some raw data in C*
● Deleting data proved too difficult in the early days, even with DTCS (slides coming on how we solved this)
● Re-streaming due to regular maintenance could take a week or more
– Dropping instance size doesn't solve throughput problem since all resources are cut, not just disk size
– Another reason not to use EBS since you'll “never” get close to 100% disk utilization
● Due to aforementioned C* durability design, cost of data for day 2..N is too high even if you drop replica count
Tying C* to raw data
● Every query must constrain a minimum of:
– Sensor ID
– Event Day
● Every query result must include a minimum of:
– Sensor ID
– Event Day
– Event ID
● Batches of (sensor_id, event_day, event_id) triples are then used to look up the raw events from raw data storage
– This isn't always necessary (aggregates, correlations, etc.)
– Even with additional hops, full reads are still <1s
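The lookup hop can be sketched as follows (names hypothetical): C* answers the query with triples, and the full events come from raw storage keyed on the same triple.

```python
# raw_store stands in for the external raw event storage.
def fetch_raw_events(triples, raw_store):
    return [raw_store[t] for t in triples if t in raw_store]

raw_store = {
    ("s1", "2016-09-08", "e1"): {"type": "exec", "exe": "/bin/ls"},
    ("s1", "2016-09-08", "e2"): {"type": "connect", "port": 443},
}
hits = fetch_raw_events([("s1", "2016-09-08", "e2")], raw_store)
```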
Using triples to batch writes
● Partition key starts with sensor id and event day
– Bonus: you get a fresh ring location every day! Helps average out your schema mistakes over the TTL
● Event batches off of RabbitMQ are already constrained to a single sensor id and event day
– Allows mapping a single AMQP read to a single C* write (RabbitMQ is podded, not clustered)
– Flow state of pipeline becomes trivial to understand
● Batch C* writes on partition key, then data size (soft cap at 5120 bytes, C*'s internal batch size warn threshold)
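A sketch of the size-capped batching above: events already share a partition key, so they only need splitting when a batch would cross the warn threshold (the 5120-byte cap is from the slide; size_of is a hypothetical serialized-size helper):

```python
WARN_BYTES = 5120  # C* batch size warn threshold

def chunk_by_size(events, size_of):
    """Split one partition's events into write batches that stay
    under the warn threshold."""
    batches, current, current_size = [], [], 0
    for event in events:
        size = size_of(event)
        if current and current_size + size > WARN_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        batches.append(current)
    return batches
```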
Compaction woes, STCS & DTCS
● Used STCS in 2014/2015, expired data would get stuck ∞
– “We could rotate tables” → eh, no
– “We could rotate clusters” → oh c'mon, hell no
– “We could generate every historic permutation of keys within that time bucket with Spark and run DELETEs” →...............
● Used DTCS in 2015, but expired data still got stuck ∞
– When deciding whether an SSTable is too old to compact, compares “now” versus max timestamp (most recent write)
– If you write constantly (time series), then SSTables will rarely or never stop compacting
– This means that you never realize the true value of DTCS for time series, the ability to unlink whole SSTables from disk
Cluster disk states assuming constant sensor count

[Chart: disk utilization over time. After the initial build-up to the retention period, “what you want” plateaus; “what you get” keeps climbing.]
MTCS, fixing DTCS
https://github.com/threatstack/mtcs
Now compares with the min timestamp (oldest write)
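The difference can be sketched as a one-line change in the age check (timestamps in seconds, names hypothetical):

```python
MAX_SSTABLE_AGE = 24 * 60 * 60  # max_sstable_age_days = 1

def dtcs_is_settled(now, newest_write_ts):
    # DTCS: age = now - max timestamp. With constant time-series writes,
    # every freshly compacted SSTable looks young, so nothing ever settles.
    return now - newest_write_ts > MAX_SSTABLE_AGE

def mtcs_is_settled(now, oldest_write_ts):
    # MTCS: age = now - min timestamp, which only grows, so hour-bucketed
    # SSTables settle on schedule and can later be unlinked whole.
    return now - oldest_write_ts > MAX_SSTABLE_AGE
```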
MTCS settings
● Never run repairs (never worked on STCS or DTCS anyway) and hinted handoff is off (great way to kill a cluster anyway)
● max_sstable_age_days = 1
● base_time_seconds = 1 hour
● Results in roughly hour bucket sequential SSTables
– Reads are happy due to day or hour resolution, which you have to provide in the partition key anyway
● Rest of DTCS sub-properties are default
● Not worried about really old and small SSTables since those are simply unlinked “soon”
MTCS + sstablejanitor.sh
● Even with MTCS, SSTables were still not getting unlinked
● So enters sstablejanitor.sh
– Cron job fires it once per hour
– Iterates over each SSTable on disk for MTCS tables (chef/cron feeds it a list of tables and their TTLs)
– Uses sstablemetadata to determine max timestamp
– If past TTL, then uses JMX to invoke CompactionManager's forceUserDefinedCompaction on the table
● Hack? Yes, cron + sed + awk + JMX qualifies as a hack, but it works like a charm and we don't carry expired data
● Bonus: don't need to reserve half your disks for compaction
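The janitor's pass can be sketched as follows; the real script reads max timestamps with sstablemetadata and fires forceUserDefinedCompaction over JMX, so a callback stands in here (all names hypothetical):

```python
def janitor_pass(sstables, ttl_seconds, now, force_compaction):
    """sstables: [{'path': ..., 'max_ts': ...}] as parsed from
    sstablemetadata. If even the newest write in an SSTable is past the
    TTL, every row in it is expired and compacting it unlinks the file."""
    expired = [s for s in sstables if now - s["max_ts"] > ttl_seconds]
    for sstable in expired:
        force_compaction(sstable["path"])  # JMX forceUserDefinedCompaction
    return [s["path"] for s in expired]
```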
Discussion
@threatstack / @sbisbee