© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CMP213
Case Study: Librato’s Experience Running Cassandra Using Amazon EBS, ENIs, and Amazon VPC
Mike Heffner, Data, Librato
Peter Norton, Operations, Librato
December 1, 2016
Librato circa 2015
● Cassandra 2.0.11 + patches
● i2.2xlarge
• The instance type preferred by DataStax and Netflix
● 160 instances
● Never Amazon EBS – only instance store
● RAID 0 over instance store (1.5 TB)
Operational challenges
● CPU/cost ratio low on i2.2xlarge
• Kept rings hot to maximize efficiency
● Persistent data tied to instance
• Long MTTR to stream large data sets
● Maximum data volume size
• Had to scale rings for data capacity
Enter Amazon VPC migration – 3Q 2015
● Librato moving Classic → Amazon VPC
● Code all the things: SaltStack / Terraform / Flask
● Opportunity to overhaul Cassandra
● Emboldened by CrowdStrike’s Amazon EBS talk at re:Invent 2015
• We can do this!
• Anticipating a big win
Initial target spec
● Cassandra 2.2.x
● c4.4xlarge
● EBS GP2
• 4 TB data
• 200 GB commitlog
● XFS
● Ubuntu 14.04 + enhanced networking driver (ixgbevf)
● sysctl, IO, NUMA tuning suggestions from Al Tobey’s guide
● Kernel 3.13
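For reference, here is a minimal boto3 sketch of provisioning the two EBS volumes in this spec. The availability zone, tag values, and device handling are hypothetical placeholders, not Librato’s actual tooling.

    # Sketch: create the GP2 data and commitlog volumes from the target spec.
    # Assumes boto3 credentials/region are already configured; the AZ and tags are made up.
    import boto3

    ec2 = boto3.client("ec2")

    data_volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",   # hypothetical AZ
        Size=4096,                       # ~4 TB data volume (size is in GiB)
        VolumeType="gp2",
        TagSpecifications=[{
            "ResourceType": "volume",
            "Tags": [{"Key": "role", "Value": "cassandra-data"}],
        }],
    )

    commitlog_volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=200,                        # 200 GB commitlog volume
        VolumeType="gp2",
    )

    print(data_volume["VolumeId"], commitlog_volume["VolumeId"])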
Write timeouts – When did this start?
[Timeline: known good tested version → write timeouts begin (Feb 2016) → bisected by March]
Write timeouts – Found it!
● No timeouts in C* 2.1.4
● Appear at 2.1.5
EBS metrics
● Spent a lot of time second-guessing EBS performance
● No GP2 burst-credit metrics
● EBS CloudWatch metrics only at 5-minute resolution
● Started with a 200 GB GP2 commitlog volume
● 600 IOPS baseline (3 IOPS/GB)
● Hit bottlenecks during tests (15–30+ min)
● Workaround
• Bumped to a 1 TB commitlog volume (3,000 IOPS); see the arithmetic sketch after this list
• Tested sharing the commitlog on the data disk
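The arithmetic behind those numbers, as a small helper: gp2 baseline throughput is 3 IOPS per GB with a 100 IOPS floor, so a 200 GB volume sustains only 600 IOPS once burst credits are spent, while a 1 TB volume sustains 3,000.

    # gp2 baseline IOPS: 3 IOPS per GB, with a floor of 100.
    # Burst credits allow short spikes above this; sustained load is what bit us.
    def gp2_baseline_iops(size_gb: int) -> int:
        return max(100, 3 * size_gb)

    assert gp2_baseline_iops(200) == 600     # the original commitlog volume
    assert gp2_baseline_iops(1000) == 3000   # the bumped commitlog volume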
Commitlog scaling – April 2016
● Throughput Optimized HDD (st1)
● Use a 600 GB st1 partition
● Costs <50% of GP2 (see the cost sketch after this list)
● Commitlog separate from data
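A rough sketch of the cost math behind the "<50% of GP2" point, assuming list prices of roughly $0.10/GB-month for gp2 and $0.045/GB-month for st1 in us-east-1; these prices are assumptions, so check current pricing before relying on them.

    # Hypothetical per-GB-month list prices (us-east-1); verify before relying on them.
    GP2_PER_GB_MONTH = 0.10
    ST1_PER_GB_MONTH = 0.045

    size_gb = 600  # commitlog partition
    gp2_cost = size_gb * GP2_PER_GB_MONTH   # ~$60/month
    st1_cost = size_gb * ST1_PER_GB_MONTH   # ~$27/month
    print(f"st1 is {st1_cost / gp2_cost:.0%} of the gp2 cost")  # ~45%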
New commitlog config
[Timeline: st1 GA April 9, 2016; st1 commitlog testing through May]
Ring connection timeouts – June 2016
● Added our read load
● Small message drops
● Started slowly, then grew until the ring collapsed
● Rolling restarts fixed it for a day
Ring connection timeouts
● Called The Last Pickle
● Fix: set otc_coalescing_strategy: DISABLED in cassandra.yaml
Production reached! – July 2016
We’ll live with more network traffic for now
Librato today – split ring configurations
● Real Time ring: one-week retention, c4.4xlarge, EBS 2 TB GP2 data partition, EBS 600 GB st1 commitlog
● Long Retention ring: retention over a year, m4.2xlarge, EBS 4 TB GP2 data partition, EBS 600 GB st1 commitlog
Real-time rings: before and after
Before
● 120 × i2.2xlarge
● Instance cost: $62k monthly*
After
● 66 × c4.4xlarge
● 2 TB GP2 + 600 GB st1
● Instance cost: $25k monthly*
● EBS cost: $15k monthly
● Total: $40k monthly
Total savings: 35%
(*) 1-year up-front pricing
Long retention rings: before and after
Before
● 36 × i2.2xlarge
● Instance cost: $19k monthly*
After
● 30 × m4.2xlarge
● 4 TB GP2 + 600 GB st1
● Instance cost: $6k monthly*
● EBS cost: $13k monthly
● Total: $19k monthly
Even cost, with 2x+ more disk capacity
(*) 1-year up-front pricing
Reducing MTTR
● Two critical pieces of state for Cassandra:
• Data files (commitlog and sstables)
• Network interface
● Data now on EBS
● ENI provides a detachable IP address (Amazon VPC only)
● This mobility gives us a lot of flexibility
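A minimal boto3 sketch of the recovery path this enables: re-attach a failed node’s ENI and data volume to a freshly launched replacement, so it comes back with the same IP and the same sstables instead of re-streaming. All IDs and the device name are hypothetical placeholders.

    # Sketch: move a node's identity (ENI) and state (EBS) onto a replacement instance.
    import boto3

    ec2 = boto3.client("ec2")

    REPLACEMENT_INSTANCE = "i-0123456789abcdef0"   # placeholder IDs
    DATA_VOLUME = "vol-0123456789abcdef0"
    NODE_ENI = "eni-0123456789abcdef0"

    # The ENI keeps the node's private IP, so the ring sees the "same" node return.
    ec2.attach_network_interface(
        NetworkInterfaceId=NODE_ENI,
        InstanceId=REPLACEMENT_INSTANCE,
        DeviceIndex=1,
    )

    # The data volume brings the sstables and commitlog along, avoiding a long re-stream.
    ec2.attach_volume(
        VolumeId=DATA_VOLUME,
        InstanceId=REPLACEMENT_INSTANCE,
        Device="/dev/xvdf",
    )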
Bring them up, bring them down
● Now new rings are brought up fast
● Easier to automate
When not in use
● Shut down nodes
● Park the disks
… Or just destroy them
We’ve grown up: managing resources with Terraform
● Query Terraform for state (see the sketch after this list)
● Create EBS volumes
● Create ENIs
● Create security groups
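As glue between Terraform and the rest of the tooling, something like the following can read resource IDs back out of Terraform state; the output names and directory layout here are hypothetical, not Librato’s actual code.

    # Sketch: read `terraform output -json` so later automation can find the
    # ENIs/volumes Terraform created. Output names are made up for illustration.
    import json
    import subprocess

    def terraform_outputs(workdir: str) -> dict:
        raw = subprocess.check_output(["terraform", "output", "-json"], cwd=workdir)
        return {name: out["value"] for name, out in json.loads(raw).items()}

    outputs = terraform_outputs("rings/realtime")
    print(outputs.get("cassandra_eni_ids"), outputs.get("cassandra_ebs_ids"))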
Organized to let us remove resources
Keeps us from being cloud hoarders. When we’re done with a ring, we
can remove resources easily.
● Remove EBS volumes
● Remove ENIs
● Remove security groups
Snapshots are still available in case we need them.
We launch rings with SaltStack
● Launch instances
● Attach EBS and ENI
● Configure rings
● Augment with the Salt API (see the sketch after this list)
● Clear guardrails built into the process
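A sketch of driving that configuration step from Python; this uses Salt’s in-process LocalClient rather than the HTTP Salt API mentioned above, and the target glob and state name are hypothetical.

    # Sketch: apply the Cassandra state to a new ring's minions.
    # (Runs on the Salt master; target pattern and state name are made up.)
    import salt.client

    local = salt.client.LocalClient()
    result = local.cmd("ring-realtime-*", "state.apply", ["cassandra"])
    for minion, out in result.items():
        print(minion, "ok" if out else "failed")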
Disaster recovery
● Previously
• Tablesnap to Amazon S3
• Required constant pruning (tablechop)
• High Amazon S3 bill
● Now
• EBS snapshots
• Cron job to snapshot EBS via Ops API
• Cron job to clean old snapshots via Ops API
• Incremental block-level diffs: no pruning needed (see the sketch after this list)
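A hedged sketch of what those cron jobs could look like with boto3; the tag filter, snapshot description, and 14-day retention window are assumptions, and the real jobs go through Librato’s internal Ops API.

    # Sketch: nightly snapshot + cleanup for Cassandra data volumes.
    import datetime
    import boto3

    ec2 = boto3.client("ec2")
    RETENTION = datetime.timedelta(days=14)      # hypothetical retention window
    DESCRIPTION = "cassandra nightly snapshot"

    # Snapshot every volume tagged as Cassandra data (tag key is an assumption).
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "tag:role", "Values": ["cassandra-data"]}]
    )["Volumes"]
    for vol in volumes:
        ec2.create_snapshot(VolumeId=vol["VolumeId"], Description=DESCRIPTION)

    # Delete our snapshots that have aged past the retention window.
    cutoff = datetime.datetime.now(datetime.timezone.utc) - RETENTION
    snapshots = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "description", "Values": [DESCRIPTION]}],
    )["Snapshots"]
    for snap in snapshots:
        if snap["StartTime"] < cutoff:
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])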
In-place ring scale up
Now we have a button to push
● Sudden load change
● Rolling operation to scale up instances
● Example: scale from c4.4xl → c4.8xl
Once we’re comfortable with the capacity, we can still scale the ring out with bootstrap
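One step of that rolling operation, sketched with boto3: stop the node, change its instance type, and start it again. The instance ID is a placeholder, and the real procedure also drains Cassandra and waits for the ring to settle between nodes.

    # Sketch: resize a single node in place (c4.4xlarge -> c4.8xlarge).
    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"   # placeholder

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": "c4.8xlarge"},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])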
Disk access mode
● mmap (4 KB page faults) vs. standard (read/write syscalls)
● mmap works well for small, random-access row reads
● Readahead kept small so small reads stay fast
● Large compaction operations are sequential I/O
● What does this mean?
This impacts cost
● We must provision for a high baseline IOPS load
● Disk sized much larger than used capacity (GP2 IOPS scale with volume size)
● GP2 measures I/O in operations of up to 256 KB
● What can be done?
Hybrid disk access mode
● MMap reads for row queries
● Standard mode (read/write) during compaction
● Ensure reads are chunked
● Chunk size configurable per disk (e.g., 256 KB for GP2)
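To illustrate the two access patterns being contrasted (this is generic Python, not Cassandra code): an mmap read faults pages in on demand and suits small random row reads, while an explicit chunked read pulls large sequential blocks that line up with gp2’s 256 KB I/O accounting.

    import mmap

    def read_row_mmap(path: str, offset: int, length: int) -> bytes:
        # Random-access row read: the kernel faults in 4 KB pages on demand.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[offset:offset + length]

    def scan_standard(path: str, chunk_size: int = 256 * 1024):
        # Sequential compaction-style scan: explicit reads in 256 KB chunks.
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk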
Wrapup
● Make it easy to test production traffic
● Instance flexibility with EBS
● Operational simplicity and reduced MTTR
● Reduced cost and increased headroom
Future
● Debug network coalescing
● Cassandra 3.0
● More testing of hybrid disk access models