Michael Kehoe, Staff Site Reliability Engineer
Going all in: From single use-case to many
Overview
• The LinkedIn Story
• Couchbase Use-Cases
• Development & Operations
• Conclusions
• Questions
$ whoami
Michael Kehoe
• Staff Site Reliability Engineer (SRE)
• Production-SRE team
• Funny accent = Australian
• Contact
  • linkedin.com/in/michaelkkehoe
  • @matrixtek
$ whatis SRE
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
  • Architecture design
  • Capacity planning
  • Operations
  • Tooling
$ whatis CBVT
Michael Kehoe
• Couchbase Virtual Team
  • ~10 SREs
  • 2 Software Engineers
  • Sponsored by SRE Director
  • Members spend 5-90% of their time supporting Couchbase
  • Encourage as many people to contribute as possible
• What do we do?
  • Operational work on Couchbase clusters
  • Evangelize the use of Couchbase within LinkedIn
  • Develop tools for the Couchbase Ecosystem
The LinkedIn Story
• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
• 30 offices in 24 countries, available in 24 languages
• More than 450 million members worldwide
The LinkedIn Story
• Growth in Products
  • Profiles
  • Groups
  • Recruiter
  • Sales Navigator
• Growth in Internet Traffic
  • Billions of page-hits per day
  • 100k+ QPS to production services
In-Memory Storage Needs
The LinkedIn Story
• LinkedIn started as an Oracle shop
• Hyper-growth = Scaling challenges
  • Read-scaling becomes important
• Applicable use-cases
  • Simple cache store
    • Pre-warmed
    • Read-through
  • Potential for Source of Truth (SoT) store
Enter Couchbase
The LinkedIn Story
• Until 2012, we were only using Memcached as a non-SoT in-memory store
• Drawbacks
  • Difficult to pre-warm
  • No partitioning/sharding (had to write our own)
  • Cold-cache restarts
  • Difficult to move data across hosts, clusters, and data centers
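To illustrate why hand-written sharding makes data movement painful, here is a minimal sketch of naive client-side hash sharding. It is not LinkedIn's implementation: the host names, the key format, and the helpers `shard_for` and `fraction_moved` are assumptions made up for this example.

```python
import hashlib
from typing import List


def shard_for(key: str, hosts: List[str]) -> str:
    """Pick a host by hashing the key and taking it modulo the host count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def fraction_moved(keys: List[str], old_hosts: List[str], new_hosts: List[str]) -> float:
    """Fraction of keys whose owning host changes between two host lists."""
    moved = sum(1 for k in keys if shard_for(k, old_hosts) != shard_for(k, new_hosts))
    return moved / len(keys)


# Hypothetical hosts and keys, for illustration only.
KEYS = [f"member:{i}" for i in range(1000)]
FOUR_HOSTS = [f"cache{i}.example.com" for i in range(4)]
FIVE_HOSTS = FOUR_HOSTS + ["cache4.example.com"]

if __name__ == "__main__":
    print(f"keys remapped after adding one host: {fraction_moved(KEYS, FOUR_HOSTS, FIVE_HOSTS):.0%}")
```

With plain modulo hashing, adding a single host remaps most keys, so the cache goes largely cold on every resize. That is the kind of problem a store with automatic partitioning is meant to remove.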
Enter Couchbase
The LinkedIn Story
• Evaluated replacement systems for Memcached: Mongo, Redis, and others
• Couchbase had distinct advantages:
  • Simple replacement for Memcached
  • Built-in replication and cluster expansion
  • Automatic partitioning
  • Low latency
  • Async writes to disk
  • Building tooling is simple
Enter Couchbase
The LinkedIn Story
• Today we run Couchbase in our Corporate, Staging and Production environments
• Production/Staging statistics:
  • 148 buckets
  • 2821 hosts
  • 10M+ QPS
• Largest Clusters:
  • By Hosts: 72 Hosts
  • By Documents: 1.4B Documents
  • By QPS: 2.5M QPS
Summary
Use-Cases
Today’s use-cases:
• Simple read-through cache
• Ephemeral Counter Store
• Temporary de-duping store
• SoT data-store for internal tooling
Simple read-through cache
Use-Cases
• Drop-in replacement for Memcached
• Read-scaling
• Protecting backend database from large amounts of traffic
• E.g. 3rd party ingestion credential cache
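The read-through pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: a plain dict stands in for the Couchbase bucket, and the class name `ReadThroughCache`, the `load_from_backend` callback, and the sample credential are all hypothetical.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Read-through cache sketch; a dict stands in for the Couchbase bucket."""

    def __init__(self, load_from_backend: Callable[[str], Optional[str]]) -> None:
        self._store: Dict[str, str] = {}
        self._load = load_from_backend
        self.backend_reads = 0  # counts how often the backend was actually hit

    def prewarm(self, items: Dict[str, str]) -> None:
        """Load known-hot entries before taking traffic."""
        self._store.update(items)

    def get(self, key: str) -> Optional[str]:
        """Serve from the cache; on a miss, read the backend and populate."""
        if key in self._store:
            return self._store[key]
        self.backend_reads += 1
        value = self._load(key)
        if value is not None:
            self._store[key] = value
        return value


# Hypothetical backend: a dict standing in for the source-of-truth database.
BACKEND = {"partner-a": "token-123"}

if __name__ == "__main__":
    cache = ReadThroughCache(BACKEND.get)
    cache.get("partner-a")  # miss: goes to the backend
    cache.get("partner-a")  # hit: served from the cache
    print(cache.backend_reads)
```

Only the first read of a key reaches the backend, which is how the cache shields the database from repeated traffic.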
Counter Store
Use-Cases
• In certain places, we simply need to increment counters from multiple systems and store them
• E.g. Anti-abuse/Anti-scraping systems (Fuse)
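A shared counter store of this kind can be sketched with an in-memory stand-in. This is an assumption-laden illustration, not how Fuse works: the class `CounterStore`, the `over_limit` check, and the key format are hypothetical, and a lock plays the role of the atomic increment that the real store would provide.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class CounterStore:
    """In-memory stand-in for a shared counter store with atomic increments."""

    def __init__(self) -> None:
        self._counts: DefaultDict[str, int] = defaultdict(int)
        self._lock = threading.Lock()

    def increment(self, key: str, delta: int = 1) -> int:
        """Atomically add delta to the counter and return the new value."""
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]

    def get(self, key: str) -> int:
        """Return the current value, or 0 for a counter never incremented."""
        with self._lock:
            return self._counts.get(key, 0)

    def over_limit(self, key: str, limit: int) -> bool:
        """Hypothetical anti-abuse check: has this key exceeded its limit?"""
        return self.get(key) > limit


if __name__ == "__main__":
    store = CounterStore()
    for _ in range(3):
        store.increment("ip:203.0.113.7")
    print(store.get("ip:203.0.113.7"), store.over_limit("ip:203.0.113.7", 2))
```

Because every writer goes through `increment`, concurrent callers never lose an update, which is the property the multiple writing systems depend on.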
Temporary De-duping store
Use-Cases
• Need to de-dup data over a large application cluster
• E.g. Email systems – Ensure we don’t send the same email twice
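The de-duping idea amounts to "record the key only if it is not already present, and let it expire later". Below is a minimal sketch under that assumption; the class `DedupStore`, its `first_time` method, the TTL value, and the key format are hypothetical, and a dict with expiry timestamps stands in for the shared store.

```python
import time
from typing import Callable, Dict


class DedupStore:
    """Temporary de-dup store sketch: keys are remembered for ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def first_time(self, key: str) -> bool:
        """Return True and record the key if it is new or has expired."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # seen recently: the caller should skip the work
        self._expiry[key] = now + self._ttl
        return True


def send_once(store: DedupStore, email_id: str, sent: list) -> None:
    """Send (here: append to a list) only if this email was not sent recently."""
    if store.first_time(email_id):
        sent.append(email_id)


if __name__ == "__main__":
    outbox: list = []
    dedup = DedupStore(ttl_seconds=3600)
    send_once(dedup, "digest:member-42", outbox)
    send_once(dedup, "digest:member-42", outbox)  # duplicate is dropped
    print(outbox)
```

In a real cluster the check-and-record step has to be a single atomic operation on the shared store; the dict here is only safe within one process.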
SoT Store for Internal Tools
Use-Cases
• For non-member-facing tools, we use Couchbase as a SoT store.
• Benefits:
  • Schema-less
  • Short setup time
  • Couchbase Python Client works easily in our environment
  • Use views for simple map-reduce
• Example Uses:
  • Nurse – Auto-remediation system
  • TrafficshiftIn – Global traffic automation system
  • Availability – Storing and tracking LinkedIn availability data
Couchbase Ecosystem
The LinkedIn Story
Developing around Couchbase
• Java – li-couchbase-client
  • Wrapper around standard Java Couchbase Client
  • Custom metrics emission
  • Using Spring interface
  • Storing data as Java serialized objects
• Python – couchbase-python-client
Operational Tooling
To use Couchbase efficiently as SREs, we need the following:
• Provisioning
• Installation
• Monitoring & Alerting
• Infrastructure Visibility
Provisioning
Operational Tooling
• Provisioning Flow
  • Seek estimated usage statistics for cluster
    • Size of data to be stored
    • QPS
    • Redundancy needs
  • Calculate cluster sizing
    • Currently done with a template
    • Couchbase has a simple calculator available online: http://docs.couchbase.com/prebuilt/calculators/sizing-calc.html
  • Request hardware for cluster(s)
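As a rough illustration of how the three inputs (data size, QPS, redundancy) can feed a sizing estimate, here is a back-of-the-envelope sketch. It is neither LinkedIn's template nor the formula behind the Couchbase calculator: the function `estimate_node_count`, the `headroom` factor, and the per-node QPS figure are all assumptions for the example.

```python
import math


def estimate_node_count(
    data_gb: float,
    replicas: int,
    ram_per_node_gb: float,
    peak_qps: float,
    qps_per_node: float,
    headroom: float = 0.75,
) -> int:
    """Estimate cluster size from memory and throughput needs (assumed model).

    Memory: every replica holds a full copy, and only `headroom` of each
    node's RAM is assumed usable for data. Throughput: nodes are assumed to
    serve `qps_per_node` each. The larger requirement wins, and the cluster
    needs at least replicas + 1 nodes to place every copy on a distinct host.
    """
    total_data_gb = data_gb * (1 + replicas)
    usable_ram_gb = ram_per_node_gb * headroom
    nodes_for_memory = math.ceil(total_data_gb / usable_ram_gb)
    nodes_for_qps = math.ceil(peak_qps / qps_per_node)
    return max(nodes_for_memory, nodes_for_qps, replicas + 1)


if __name__ == "__main__":
    # Hypothetical request: 200 GB of data, 1 replica, 64 GB nodes, 100k QPS.
    print(estimate_node_count(200, 1, 64, 100_000, 50_000))
```

In this hypothetical request the memory requirement dominates; a small, hot bucket would instead be sized by QPS.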
Installation
Operational Tooling
• Process
  • Enter cluster metadata into our management system (Range)
  • Use Salt States to install and configure cluster
  • See Issa Fattah’s post for more information:
    • https://engineering.linkedin.com/blog/2016/04/leveraging-saltstack-to-scale-couchbase
• Benefits
  • Ability to perform ‘state enforcement’
  • Using Salt Pillars to encrypt cluster/bucket passwords end-to-end
Monitoring & Alerting
Operational Tooling
• We run a daemon on each Couchbase server that collects metrics every minute via Couchbase APIs
• Use cluster metadata from Range to build dashboards with our own system, InGraphs
• See: ‘Monitoring production deployments’: 4pm - Great America 1
Monitoring & Alerting
Operational Tooling
Management
Operational Tooling
• We want to see a world-view of all the clusters we run
• Having bucket/cluster/server-level statistics is useful
• Having a global view of who owns and operates each cluster/bucket is useful
Management
Operational Tooling
Conclusions
• Couchbase was a natural fit into our existing infrastructure
• Building an ecosystem around Couchbase was important to us and has helped Couchbase be successful at LinkedIn
• Expanding use of Couchbase
  • In the past year we’ve grown the number of buckets by over 50%
  • Starting to use Views in production
  • Moving Couchbase into LinkedIn’s standard deployment infrastructure
Thank You
Questions?
©2014 LinkedIn Corporation. All Rights Reserved.