Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015

Post on 26-Jul-2015


Transcript of Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015

Benjamin (Jerry) Franz, Sr. Site Reliability Engineer

Scalability and Performance

Lessons learned taking the second-highest-QPS Couchbase server at LinkedIn from zero to awesome

Couchbase

We have a Couchbase cluster?

Day 1

Meet the Couchbase Cluster
• Three parallel clusters of 16 machines
• 64 Gbytes of RAM per machine
• 1 TB of RAID1 (spinning drives) per machine
• 6 buckets in each cluster
• Massively under-resourced
• Memory completely full
• 1 node failed out in each parallel cluster
• Disk I/O utilization: 100%, all the time
• No alerting

The immediate problems
• Unable to store new data because the memory was full and there wasn’t enough I/O capacity available to flush it to disk.
• Aggravated by nodes failing out of the cluster, reducing both available memory and disk IOPS even further.
• There was no visibility into cluster health because:
1. We didn’t know what healthy metrics should look like for Couchbase. We didn’t even know which metrics were most important.
2. Alerts were not being sent even when the cluster was in deep trouble.

The First Aid
• Configured alerting.
• Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted.
• Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS.
• Increased the disk fragmentation threshold for compaction from 30% to 65% to reduce disk I/O.
• Reduced metadata expiration time from 3 days to 1 day to free memory.

Node failouts - Solved
The node failouts had two interacting causes:
1. Linux Transparent HugePages were active on many nodes, causing semi-random slowdowns lasting up to several minutes whenever memory was defragmented and making those nodes fail out of the cluster. Fixed by correcting the kernel settings and restarting the nodes.
2. ‘Pre-failure’ drives were going into data recovery mode and causing failouts on the affected nodes during the nightly access log scan at 10:00 UTC (02:00 PST/03:00 PDT).
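One quick way to audit nodes for the HugePages issue is to read the active mode from the kernel’s sysfs file, where the bracketed word is the setting currently in force. A minimal sketch (the helper names are ours, not from the talk):

```python
# Check whether Transparent HugePages is disabled on a node.
# The kernel reports the active mode in brackets,
# e.g. "always madvise [never]".

def parse_thp_mode(contents: str) -> str:
    """Return the bracketed (active) THP mode from the sysfs file contents."""
    for word in contents.split():
        if word.startswith("[") and word.endswith("]"):
            return word.strip("[]")
    raise ValueError("no active THP mode found")

def thp_disabled(path: str = "/sys/kernel/mm/transparent_hugepage/enabled") -> bool:
    """True if THP is fully off on this node."""
    with open(path) as f:
        return parse_thp_mode(f.read()) == "never"
```

For Couchbase-style latency-sensitive workloads the desired state is `[never]`; anything else is a candidate for the kernel-setting fix described above.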

Disk Persistence – Not solved
• Despite more than doubling the available system resources, tuning filesystem options for performance, and slashing the amount of data being fed to it by the application, disk utilization remained stubbornly close to 100% and the disk queues were still growing.
• The cluster had a huge amount of ‘hidden I/O demand’. Because a large fraction of the data had very short TTLs but was taking up to a day to persist to disk, it was expiring in the queue before it could be persisted. This had actually done quite a lot to keep the cluster from falling over completely, since it throttled disk demand as the cluster became overloaded. We were now persisting twice as much data as before.
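The ‘hidden demand’ effect can be illustrated with back-of-the-envelope arithmetic: any item whose TTL is shorter than the current persistence delay expires in the queue and never costs a disk write. A sketch with made-up numbers (none of these figures are from the talk):

```python
# Estimate how much write demand is hidden by items expiring in the
# disk-write queue: any item with ttl < persistence delay never hits disk.

def hidden_write_fraction(items, persistence_delay_s):
    """items: list of (count, ttl_seconds) pairs. Returns the fraction of
    writes that expire in the queue instead of being persisted."""
    total = sum(count for count, _ in items)
    expired = sum(count for count, ttl in items if ttl < persistence_delay_s)
    return expired / total

# Hypothetical workload: 60% of items with a 1-hour TTL, 40% long-lived.
workload = [(60, 3600), (40, 30 * 24 * 3600)]

# With a one-day persistence delay, all short-TTL items expire in queue...
print(hidden_write_fraction(workload, 24 * 3600))   # 0.6
# ...but once persistence catches up to ~1 second, none do, and the
# full write load lands on the disks.
print(hidden_write_fraction(workload, 1))           # 0.0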

Cluster Health Visibility – Solved
• Alerts were being sent to the appropriate people - the cluster was no longer suffering outages without notice.
• Critical cluster metrics were identified and were being used for health monitoring and to measure performance tuning improvements.

Cluster Health Visibility – Solved
The most important performance metrics:
• ep_diskqueue_items – The number of items waiting to be persisted. This should be a reasonably stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements.
• ep_storage_age – The age of the most recently persisted item. This has been a critical metric for quantifying the effects of configuration changes. A healthy cluster should keep this number close to or below 1 second on average. We started with values approaching days.
• vb_active_perc_mem_resident – The percentage of items in the RAM cache. For most clusters at LinkedIn it should be 100%. If it falls below that, the cluster is probably underprovisioned and taking a big performance hit.
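A health check over these three metrics might look like the sketch below. The thresholds are the ones just described; the function and the way stats are passed in are ours (in practice the values would come from `cbstats` or the Couchbase REST stats endpoint):

```python
# Evaluate the three key Couchbase metrics against the thresholds above.
# Stats would normally come from `cbstats` or the REST API; here they are
# passed in as a plain dict for illustration.

def cluster_health_issues(stats, prev_diskqueue_items=None):
    """Return a list of health warnings for one node's stats snapshot."""
    issues = []
    # A persistently growing disk queue means persistence is falling behind
    # (a real check would compare a trend, not a single previous sample).
    if prev_diskqueue_items is not None and stats["ep_diskqueue_items"] > prev_diskqueue_items:
        issues.append("disk queue growing: cluster cannot keep up with disk I/O")
    if stats["ep_storage_age"] > 1.0:   # seconds; healthy is ~<= 1s on average
        issues.append("storage age above 1s: items are persisting slowly")
    if stats["vb_active_perc_mem_resident"] < 100:
        issues.append("not fully memory-resident: likely underprovisioned")
    return issues

healthy = {"ep_diskqueue_items": 1000, "ep_storage_age": 0.4,
           "vb_active_perc_mem_resident": 100}
print(cluster_health_issues(healthy, prev_diskqueue_items=1200))  # []
```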

Overall Status Update

Servers. Lots of servers.
My best estimate was that to meet our I/O requirements we would have to at least double our total node count again, to 192 servers total (3 x 64).

This was getting expensive.

It was time to change up my strategy

SSDs

Initial Integration Testing
• 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters
• Reduced the cluster to 16 nodes to test under heavier load
• Write I/O on the SSDs shot up to multiple times the rate of the HDDs
• Performance scaling indicated that in the final configuration we would burn through the SSD lifetime write capacity in less than one year

SSDs

SSD Strategy Tuning

• Switched to 2200 GB Virident SSDs to extend service life

• Reduced cluster size to 8 nodes per sub-cluster (24 nodes total)
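The endurance reasoning behind switching drives can be sketched with simple arithmetic: service life in years is the drive’s rated total write capacity divided by the daily write volume. The ratings and write load below are hypothetical (chosen only so the small drive lands under one year, as the testing predicted); the real point is that a drive with 4x the capacity in the same family typically carries roughly 4x the rated writes:

```python
# Rough SSD endurance estimate:
#   lifetime (years) = rated total write capacity / daily write volume / 365

def ssd_life_years(rated_write_capacity_tb, daily_writes_tb):
    """Years of service before the rated write capacity is exhausted."""
    return rated_write_capacity_tb / daily_writes_tb / 365

# Hypothetical ratings, proportional to drive capacity (not vendor specs):
small_drive_life = ssd_life_years(1375, 5.0)   # 550 GB-class drive
large_drive_life = ssd_life_years(5500, 5.0)   # 2200 GB-class drive
print(small_drive_life)   # under a year, matching what testing predicted
print(large_drive_life)   # roughly 4x the service life
```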

Full Scale SSD Impact

For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk.

So we converted the other two clusters as well.

Done?

Not yet.

Turning it up to 11
While we were no longer completely on fire, we weren’t yet awesome.

We were still taking up to 40 minutes to persist new data.

It wasn’t the drives at this point – it was the application. It wasn’t keeping up with the drives.

Preparing Couchbase for Ludicrous Speed
• Increased the number of reader/writer threads to 8
• Consolidated the buckets (4 high-QPS buckets -> 2 high-QPS buckets)
• Increased the frequency of disk cleanup (exp_pager_stime) to every 10 minutes

And buckle your seatbelt
• 75% writes (sets + incr) / 25% reads – 13-byte values, 25-byte keys on average
• 2.5 billion items (+ 1 replica)
• 600 Gbytes of RAM / 3 Tbytes of disk in use on average
• Average store latency ~ 0.4 milliseconds
• 99th percentile store latency ~ 2.5 milliseconds
• Average get latency ~ 0.8 milliseconds
• 99th percentile get latency ~ 8 milliseconds
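Average and 99th-percentile figures like those above are computed from raw latency samples; a minimal sketch of the computation using the standard library (the sample data is synthetic):

```python
import statistics

# Compute mean and p99 latency from raw samples (milliseconds).
# The sample list is made up, just to demonstrate the computation.
samples = [0.3] * 90 + [0.8] * 9 + [9.0]   # mostly fast, one slow outlier

mean_ms = statistics.mean(samples)
# quantiles(n=100) returns the 99 cut points between percentiles 1..99;
# index 98 is the 99th percentile.
p99_ms = statistics.quantiles(samples, n=100)[98]

print(mean_ms)
print(p99_ms)
```

Note how one slow request in a hundred dominates the p99 while barely moving the mean, which is why both numbers are reported on the slide.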


The End