Couchbase at LinkedIn: Couchbase Connect 2015


Michael Kehoe, Brian Cory Sherwin

LinkedIn

Couchbase at LinkedIn, 2015


Overview

• The LinkedIn Story
• Development & Operations
• Operational Tooling
• LinkedIn’s Couchbase as a Database
• Questions


Michael Kehoe

• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia


The LinkedIn Story

• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
• Offices in 24 countries, available in 23 languages
• Over 360M members
• Revenue of $638M in Q1 2015


The LinkedIn Story

In-Memory storage needs

• At our scale, it becomes challenging to scale data systems
• Read-scaling becomes important
• Applicable use-cases:
  • Simple cache store
    • Pre-warmed
    • Read through
  • Temporary data storage for de-duping
  • Potential for Source of Truth (SoT) store
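
The pre-warmed and read-through cache use-cases can be sketched as follows. This is a minimal illustration and not LinkedIn's implementation: `ReadThroughCache`, the dict that stands in for the in-memory store, and the `load_from_source` callback are all hypothetical.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Serve reads from memory; on a miss, load from the source and cache it."""

    def __init__(self, load_from_source: Callable[[str], Optional[str]]) -> None:
        self._load = load_from_source    # hypothetical loader for the backing store
        self._data: Dict[str, str] = {}  # stands in for the in-memory store
        self.misses = 0

    def warm(self, entries: Dict[str, str]) -> None:
        # Pre-warming: bulk-load known entries before serving traffic.
        self._data.update(entries)

    def get(self, key: str) -> Optional[str]:
        if key in self._data:
            return self._data[key]
        # Read-through: fetch from the source on a miss, then remember it.
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._data[key] = value
        return value


source = {"member:1": "Alice", "member:2": "Bob"}
cache = ReadThroughCache(source.get)
cache.warm({"member:1": "Alice"})
first = cache.get("member:1")   # served from the warmed cache
second = cache.get("member:2")  # miss: read through to the source
third = cache.get("member:2")   # now served from the cache
```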


The LinkedIn Story

Enter Couchbase

• Until 2012, we were only using Memcached as a non-SoT in-memory store
• However, it had some drawbacks:
  • Long cache warmup times
  • No partitioning/sharding; we had to write our own
  • Cold-cache restarts
  • Difficult to move data across hosts/clusters/datacentres


The LinkedIn Story

Enter Couchbase

• Evaluated systems to replace Memcached: Mongo, Redis, and others
• Couchbase had advantages:
  • Drop-in replacement for Memcached
  • Built-in replication and cluster expansion
  • Memory latency for operations
  • Asynchronous writes to disk
  • Utilize some of the development infrastructure we’ve built


Development & Operations

Coding

• Memcached configured with Spring and implements a caching Java interface
• Implemented with Couchbase Native Client
• Developer just replaces the Spring
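
The idea of hiding the store behind a caching interface, so that the backend can be swapped through configuration, can be sketched as below. The slides describe a Java/Spring setup; this is a Python analogue under stated assumptions. The class names, the `BACKENDS` registry, and `build_cache` are hypothetical, and both backends use a dict in place of a real client.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class Cache(ABC):
    """The caching interface that application code depends on."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...


class _DictBackedCache(Cache):
    # A dict stands in for the real client connection in this sketch.
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class MemcachedCache(_DictBackedCache):
    """Placeholder for a Memcached-backed implementation."""


class CouchbaseCache(_DictBackedCache):
    """Placeholder for a Couchbase-backed implementation."""


# Hypothetical registry playing the role of the wiring configuration.
BACKENDS = {"memcached": MemcachedCache, "couchbase": CouchbaseCache}


def build_cache(backend: str) -> Cache:
    return BACKENDS[backend]()


def profile_headline(cache: Cache, member_id: str) -> Optional[bytes]:
    # Application code only sees the Cache interface.
    return cache.get(f"headline:{member_id}")


cache = build_cache("couchbase")  # changing this string swaps the backend
cache.set("headline:42", b"SRE")
result = profile_headline(cache, "42")
```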


Development & Operations

Operations

• Hadoop jobs build warm cache data
• Tools to partition the data and load into Couchbase offline
• Apply deltas when brought on-line
• Clean, warm caches ready when needed
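
A rough sketch of the partition-then-apply-deltas flow follows. It is illustrative only: the CRC32-modulo partitioning is a generic stand-in and not Couchbase's actual key-to-vBucket mapping, and `build_partitions`, `apply_deltas`, and the delta format are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Optional, Tuple

# (key, new value), or (key, None) to signal a delete.
Delta = Tuple[str, Optional[str]]


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partitions(records: Dict[str, str], num_partitions: int) -> List[Dict[str, str]]:
    # Offline step: split a full snapshot into per-partition batches.
    partitions: List[Dict[str, str]] = [{} for _ in range(num_partitions)]
    for key, value in records.items():
        partitions[partition_for(key, num_partitions)][key] = value
    return partitions


def apply_deltas(partitions: List[Dict[str, str]], deltas: Iterable[Delta]) -> None:
    # Online step: replay changes that happened after the snapshot was taken.
    count = len(partitions)
    for key, value in deltas:
        target = partitions[partition_for(key, count)]
        if value is None:
            target.pop(key, None)
        else:
            target[key] = value


snapshot = {f"member:{i}": f"profile-{i}" for i in range(100)}
partitions = build_partitions(snapshot, num_partitions=4)
apply_deltas(
    partitions,
    [("member:7", "profile-7-v2"), ("member:8", None), ("member:100", "profile-100")],
)
```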


Operational Tooling

• To use Couchbase efficiently as SREs, we need the following:
  • Provisioning
  • Installation
  • Monitoring & Alerting
  • Infrastructure Visibility


Operational Tooling

Provisioning

• Provisioning flow
  • Seek estimated usage statistics on cluster
    • Size of data to be stored
    • QPS
    • Redundancy needs
  • Calculate cluster sizing
    • Currently done via a spreadsheet with a template
    • Moving into an in-house application
  • Request hardware for cluster(s)
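
A sizing calculation driven by the three inputs above (data size, QPS, redundancy) might look like the sketch below. The formula and every default (RAM per node, QPS per node, headroom) are illustrative assumptions; they are not LinkedIn's spreadsheet and not Couchbase's official sizing guidance.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterEstimate:
    data_gb: float  # size of data to be stored
    peak_qps: int   # expected queries per second
    replicas: int   # extra copies kept for redundancy


def nodes_needed(
    est: ClusterEstimate,
    ram_per_node_gb: float = 64.0,  # assumed node size
    qps_per_node: int = 50_000,     # assumed per-node throughput
    headroom: float = 0.25,         # fraction of RAM kept free
) -> int:
    # Memory bound: active data plus replica copies must fit in usable RAM.
    total_gb = est.data_gb * (1 + est.replicas)
    usable_gb = ram_per_node_gb * (1 - headroom)
    by_memory = math.ceil(total_gb / usable_gb)
    # Throughput bound: spread peak traffic across nodes.
    by_qps = math.ceil(est.peak_qps / qps_per_node)
    # Each replica needs its own node, so never go below replicas + 1.
    return max(by_memory, by_qps, est.replicas + 1)


estimate = ClusterEstimate(data_gb=200, peak_qps=120_000, replicas=1)
nodes = nodes_needed(estimate)
```

With these assumed defaults the example is memory-bound: 400 GB of active plus replica data over 48 GB usable per node.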


Operational Tooling

Installation

• Current system
  • Enter cluster metadata into our management system (Yahoo range)
  • Use a Salt module to install & configure the cluster
• Future system
  • Use the same metadata system
  • Use Salt states to install and configure the cluster
• Benefits of the new system
  • ‘State enforcement’ becomes possible
  • Use Salt pillars to encrypt cluster/bucket passwords


Operational Tooling

Installation

CLUSTER:
  - ela4.couchbase.30
  - prod-lva1.couchbase.30
  - prod-ltx1.couchbase.30
NAME: follow-blue
PORT: 11211
INSTANCE: 30
ALERT_ADDRESSES:
  - q([email protected])
SRE_GROUPS:
  - sre-team-name
CLIENT_CONTAINERS:
  - following-services
EMAIL_ALERTS:
  - HIGHWATER_PERCENT_FULL
  - MEMORY_PERCENT_FULL
  - NOT_MY_VBUCKET
  - PERCENT_IN_MEMORY
  - KEY_USAGE
  - AUTOFAILOVER


Operational Tooling

Monitoring & Alerting

• We run a daemon on each Couchbase server that collects metrics every minute via a Couchbase library API
• Use cluster metadata from range to build a dashboard definition file via a Jinja template & Python
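
Generating a dashboard definition from cluster metadata can be sketched as below. The talk uses a Jinja template; to stay dependency-free this sketch swaps in the standard library's `string.Template` instead. The template text, its indentation, and `render_dashboard` are assumptions for illustration, with field names borrowed from the cluster metadata (`NAME`, `INSTANCE`).

```python
from string import Template

# Hypothetical template; the real one is a Jinja file that the slides do not show.
DASHBOARD_TEMPLATE = Template(
    "- title: couchbase.$name $title\n"
    "  defs:\n"
    '    - range: "%{FABRIC}.couchbase.$instance"\n'
    '      label: "$metric"\n'
    "      rrd: couchbase.$name/$metric.rrd\n"
)


def render_dashboard(metadata: dict, metric: str, title: str) -> str:
    # Fill the template from the per-cluster metadata record.
    return DASHBOARD_TEMPLATE.substitute(
        name=metadata["NAME"],
        instance=metadata["INSTANCE"],
        metric=metric,
        title=title,
    )


metadata = {"NAME": "follow-blue", "INSTANCE": 30}
definition = render_dashboard(metadata, "autofailover_enabled", "AutoFailover Enabled")
```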


Operational Tooling

Monitoring & Alerting

$ ./couchbase.py -I 30
[INFO] Generating dashboard file: common-templates/couchbase.follow-blue


Operational Tooling

Monitoring & Alerting

- title: couchbase.follow-blue AutoFailover Enabled
  defs:
    - range: "%{FABRIC}.couchbase.30"
      label: "autofailover_enabled"
      rrd: couchbase.follow-blue/autofailover_enabled.rrd
  params:
    vlabel: 'enabled_boolean'
  autoalerts:
    zones: ['COUCHBASE-SLA2']
    enabled-fabrics: ['ela4', 'prod-lva1', 'prod-ltx1']
    processor: 'ingraphs'
    filter-type: 'ingraphs_filter'
    contacts: ['[email protected]']
    state-check: threshold
    state-check-args:
      min: 1.0
    consecutive-events: 10
    alert-plugin: emailer
    alert-plugin-args:
      recipients: ['[email protected]']
      interval: 3600
      include-definition: True
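
One plausible reading of the `threshold` state check with `min: 1.0` and `consecutive-events: 10` is "alert once the metric has been below the minimum for ten samples in a row". The sketch below implements that reading; the exact semantics of the alerting system are not given in the slides, so `should_alert` and its behaviour are assumptions.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], minimum: float, consecutive_events: int) -> bool:
    """Return True once `consecutive_events` samples in a row fall below `minimum`."""
    streak = 0
    for value in samples:
        # Any healthy sample resets the streak, which suppresses flapping.
        streak = streak + 1 if value < minimum else 0
        if streak >= consecutive_events:
            return True
    return False


healthy = [1.0] * 20                          # autofailover enabled throughout
flapping = [0.0] * 9 + [1.0] + [0.0] * 9      # never ten bad samples in a row
disabled = [1.0] * 5 + [0.0] * 10             # ten consecutive bad samples
```

Requiring consecutive breaches trades detection latency (ten one-minute samples here) for fewer false alarms.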


Operational Tooling

Management

• We want to see a world-view of all the clusters that we run
• Having bucket-, cluster-, and server-level statistics is useful
• Having a view of who owns each cluster/bucket is useful


Conclusions

• Couchbase fits into our existing infrastructure
• We have good management and monitoring of the clusters
• Rich set of tooling we extended for our environment
• Starting to expand our use from a cache to a store for internal tooling


LinkedIn’s Couchbase as a Database

Brian Cory Sherwin, Site Reliability Engineer, LinkedIn


The Agenda

• Our use case and requirements
• Why we chose Couchbase vs MySQL
• Pitfalls encountered


Couchbase @ LinkedIn

Memcached replacement

• Data resiliency
• Maintenance friendly


Using Couchbase as a Workflow Backend

AutoRemediation!

A job execution platform to remediate operations issues

• Database backend for state tracking of a workflow engine


Our Requirements

• Easy JSON documents
• Rapid iteration
• Horizontally scalable


Why Couchbase?

Couchbase as a database

• Document store
• Views for indexing
• Data resiliency
• Replication
• Simplicity


Why not MySQL?

• Upfront cost in creating the schema
• Rapidly changing documents
  • Number of columns
• Consistent incremental updates


Pitfalls using Couchbase

• ACID implications
  • Durability and Consistency
• Concurrency
• Different and new tech
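
The slides name concurrency as a pitfall without saying how it was handled. One common approach for document stores is optimistic locking with a compare-and-swap (CAS) value, which Couchbase documents carry. The sketch below illustrates the idea for workflow state tracking with a toy in-memory store; `InMemoryStore`, `advance_state`, and the document layout are hypothetical, and this is not the Couchbase SDK API.

```python
import copy
from typing import Dict, Tuple


class CasMismatch(Exception):
    """Raised when a document changed since it was read."""


class InMemoryStore:
    """Toy document store with compare-and-swap; stands in for a real database."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def get(self, key: str) -> Tuple[int, dict]:
        cas, doc = self._docs.get(key, (0, {}))
        return cas, copy.deepcopy(doc)

    def replace(self, key: str, doc: dict, cas: int) -> int:
        current_cas, _ = self._docs.get(key, (0, {}))
        if cas != current_cas:
            raise CasMismatch(key)
        self._docs[key] = (current_cas + 1, copy.deepcopy(doc))
        return current_cas + 1


def advance_state(store: InMemoryStore, key: str, new_state: str, max_retries: int = 3) -> dict:
    # Read, modify, and write back; retry if another writer got there first.
    for _ in range(max_retries):
        cas, doc = store.get(key)
        doc["state"] = new_state
        doc.setdefault("history", []).append(new_state)
        try:
            store.replace(key, doc, cas)
            return doc
        except CasMismatch:
            continue
    raise RuntimeError(f"could not update {key} after {max_retries} attempts")


store = InMemoryStore()
advance_state(store, "job:1", "queued")
stale_cas, stale_doc = store.get("job:1")  # a second worker reads the document
advance_state(store, "job:1", "running")   # the first worker updates it
try:
    store.replace("job:1", stale_doc, stale_cas)  # the stale write is rejected
    conflict = False
except CasMismatch:
    conflict = True
final_cas, final_doc = store.get("job:1")
```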