Watching Your Cassandra Cluster Melt

10/20/14

Description

PagerDuty's very own Owen Kim had the misfortune of watching its abused, under-provisioned Cassandra cluster collapse. This presentation covers the lessons learned from that experience, including:

• Which of the many, many metrics we learned to watch for

• What mistakes we made that led to this catastrophe

• How we have changed our usage to make our Cassandra cluster more stable

Transcript of Watching Your Cassandra Cluster Melt

Page 1: Watching Your Cassandra Cluster Melt

Page 2: Watching Your Cassandra Cluster Melt

What is PagerDuty?

Page 3: Watching Your Cassandra Cluster Melt

Cassandra at PagerDuty

• Used to provide durable, consistent read/writes in a critical pipeline of service applications

• Scala, Cassandra, Zookeeper.

• Receives ~25 requests a sec

• Each request becomes a handful of operations that are then processed asynchronously

• Never lose an event. Never lose a message.

• This has HUGE implications around our design and architecture.

Page 4: Watching Your Cassandra Cluster Melt

Cassandra at PagerDuty

• Cassandra 1.2

• Thrift API

• Using Hector/Cassie/Astyanax

• Assigned tokens

• Putting off migrating to vnodes

• It is not big data

• Clusters ~10s of GB

• Data in the pipe is considered ephemeral
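A quick way to confirm the token setup described above, as a hedged sketch (the cassandra.yaml path is an assumption about a default packaged install, not something from the talk):

# With manually assigned tokens, `nodetool ring` lists exactly one token per
# node; a vnode cluster (num_tokens: 256) lists hundreds per node.
nodetool ring | head -20
# The same choice is visible in cassandra.yaml (path assumed; adjust as needed).
grep -E '^(num_tokens|initial_token)' /etc/cassandra/cassandra.yaml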

Page 5: Watching Your Cassandra Cluster Melt

Cassandra at PagerDuty

[Diagram: three datacenters (DC-A, DC-B, DC-C) connected by WAN links with ~5 ms and ~20 ms inter-DC latencies]

• Five (or ten) nodes in three regions

• Quorum CL

• RF = 5
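As a quick sanity check on those numbers (my arithmetic, not a slide): quorum for RF = 5 is floor(5/2) + 1 = 3, so every read and write must be acknowledged by 3 of the 5 replicas.

# Quorum size for a given replication factor: floor(RF / 2) + 1.
RF=5
QUORUM=$(( RF / 2 + 1 ))
echo "RF=$RF -> $QUORUM replicas must ack each quorum read/write"
# Prints: RF=5 -> 3 replicas must ack each quorum read/write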

Page 6: Watching Your Cassandra Cluster Melt

Cassandra at PagerDuty

• Operations cross the WAN and take an inter-DC latency hit.

• Since we use it as our pipeline without much of a user-facing front, we're not latency-sensitive, but throughput-sensitive.

• We get consistent read/write operations.

• Events aren’t lost. Messages aren’t repeated.

• We get availability in the face of the loss of an entire DC region.
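To make the availability claim concrete (my arithmetic, assuming no region holds more than 2 of the 5 replicas, e.g. a 2+2+1 layout): losing an entire region still leaves at least 3 replicas, which is exactly a quorum, so reads and writes keep succeeding.

# Worst case: the lost region held 2 of the 5 replicas (assumed 2+2+1 layout).
RF=5
LOST=2
QUORUM=$(( RF / 2 + 1 ))
REMAINING=$(( RF - LOST ))
echo "replicas left: $REMAINING, quorum needed: $QUORUM"
# 3 >= 3, so quorum operations survive the loss of a whole DC region;
# the trade-off is that every operation pays the inter-DC latency.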

Page 7: Watching Your Cassandra Cluster Melt

What Happened?

• Everything fell apart and our critical pipeline began refusing new events and halted progress on existing ones.

• Created degraded performance and a three-hour outage in PagerDuty

• Unprecedented flush of in-flight data

• Gory details on the impact found on the PD blog: https://blog.pagerduty.com/2014/06/outage-post-mortem-june-3rd-4th-2014/

Page 8: Watching Your Cassandra Cluster Melt

What Happened…

• It was just a semi-regular day…

• …no particular changes in traffic

• …no particular changes in volume

• We had an incident the day before

• Repairs and compactions had been taking longer and longer. They were starting to overlap on machines.

• We used 'nodetool disablethrift' to mitigate load on nodes that couldn't handle being coordinators (see the sketch after this list).

• We even disabled nodes and found odd improvements with a smaller 3/5 cluster (any 3/5).

• The next day, we started a repair that had been put off…
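For reference, the Thrift toggle mentioned in the list above is a stock nodetool operation. A minimal sketch of taking a struggling node out of coordinator duty and putting it back later (the host name is a placeholder):

# Stop the node from accepting Thrift client connections so it no longer
# coordinates requests; it still serves as a replica for other coordinators.
nodetool -h cass-node-3 disablethrift
# Re-enable the Thrift interface once the node has recovered.
nodetool -h cass-node-3 enablethrift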

Page 9: Watching Your Cassandra Cluster Melt

What happened…

[Graph: 1-minute system load]

Page 10: Watching Your Cassandra Cluster Melt

What we did…

• Tried a few things to mitigate the damage

• Stopped less critical tenants.

• Disabled thrift interfaces

• Disabled nodes

• No discernible effect.

• Left with no choice, we blew away all data and restarted Cassandra fresh

• This only took 10 minutes once we committed to doing it.

# Wipe all of Cassandra's local state: commit logs, saved caches, and data files.
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*

• Then everything was fine and dandy, like sour candy.
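For completeness, the wipe above only amounts to a clean reset if Cassandra is stopped first and started again afterwards. A rough sketch of the whole per-node sequence, assuming a default packaged install (the service name and paths are assumptions):

# Stop the node, throw away all of its local state, then bring it back empty.
# Only sane here because the data in the pipe was considered ephemeral.
sudo service cassandra stop
sudo rm -r /var/lib/cassandra/commitlog/*
sudo rm -r /var/lib/cassandra/saved_caches/*
sudo rm -r /var/lib/cassandra/data/*
sudo service cassandra start
# Confirm the node has rejoined the (now empty) cluster.
nodetool status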

Page 11: Watching Your Cassandra Cluster Melt

So, what happened…?

WHAT WENT HORRIBLY WRONG?

• Multi-tenancy in the Cassandra cluster.

• The operational ease isn't worth the loss of transparency.

• Underprovisioning

• AWS m1.larges

• 2 cores

• 8 GB RAM (definitely not enough)

• Poor monitoring and high-water marks

• A twisted desire to get everything out of our little cluster

Page 12: Watching Your Cassandra Cluster Melt

Why we didn’t see it coming…

OR, HOW I LIKE TO MAKE MYSELF FEEL BETTER.

• Everything was fine 99% of the time.

• Read/write latencies close to the inter-DC latencies.

• Despite load being relatively high sometimes.

• Cassandra seems to have two modes: fine and catastrophe

• We thought, “we don’t have much data, it should be able to handle this.”

• We thought we must have misconfigured something, not that we needed to scale up…

Page 13: Watching Your Cassandra Cluster Melt

What we should have seen…

CONSTANT MEMORY PRESSURE

[Graphs of memory usage over time: one annotated "This is bad", one annotated "This is good"]

Page 14: Watching Your Cassandra Cluster Melt

What we should have seen…

• Consistent memtable flushing

• “Flushing CFS(…) to relieve memory pressure”

• Slower repair/compaction times

• Likely related to the memory pressure

• Widening disparity between median and p95 read/write latencies
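None of these signals need special tooling to spot. A hedged sketch of checks that would have surfaced them (the log path assumes a default packaged install):

# Memtables flushed under pressure show up verbatim in the system log.
grep "to relieve memory pressure" /var/log/cassandra/system.log | tail
# Compactions that keep running longer and start to stack up.
nodetool compactionstats
# Per-column-family latency histograms; watch the gap between the median
# and the 95th percentile widen over time.
nodetool cfhistograms <keyspace> <column_family>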

Page 15: Watching Your Cassandra Cluster Melt

What we changed…

THE AFTERMATH WAS ROUGH…

• Immediately replaced all nodes with m2.2xlarges

• 4 cores

• 32 GB RAM

• No more multi-tenancy.

• Required nasty service migrations

• Began watching a lot of pending task metrics.

• Blocked flush writers

• Dropped messages
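All of those show up in nodetool tpstats, so the monitoring itself is a one-liner per node:

# Thread-pool stats: a non-zero "Pending" or "All time blocked" count on the
# FlushWriter pool, or anything in the dropped-messages section at the bottom,
# is the kind of early warning we missed the first time around.
nodetool tpstats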

Page 16: Watching Your Cassandra Cluster Melt

Lessons Learned

• Cassandra has a steep performance degradation curve.

• Stay ahead of the scaling curve.

• Jump on any warning signs

• Practice scaling. Be able to do it on short notice.

• Cassandra performance deteriorates with changes in the data set and with its asynchronous, eventually consistent background work.

• Just because your latencies were one way doesn't mean they're supposed to be that way.

• Don't build for multi-tenancy in your cluster.

Page 17: Watching Your Cassandra Cluster Melt


P.S. We're hiring Cassandra people (enthusiast to expert) on our Realtime or Persistence teams.

Thank you.

http://www.pagerduty.com/company/work-with-us/

http://bit.ly/1ym8j9g