Couchbase at LinkedIn: Couchbase Connect 2015

Post on 15-Aug-2015


Michael Kehoe, Brian Cory Sherwin

LinkedIn

Couchbase at LinkedIn 2015

3

Overview

• The LinkedIn Story
• Development & Operations
• Operational Tooling
• LinkedIn’s Couchbase as a Database
• Questions

4

• Site Reliability Engineer (SRE) at LinkedIn

• SRE for Profile & Higher-Education

• Member of CBVT

• B.E. (Electrical Engineering) from the University of Queensland, Australia


Michael Kehoe

5

The LinkedIn Story

• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network

• Offices in 24 countries, available in 23 languages
• Over 360M members
• Revenue of $638M in Q1 2015

6

 In-Memory storage needs

The LinkedIn Story

• At our scale, it becomes challenging to scale data systems
• Read-scaling becomes important
• Applicable use-cases:
  • Simple cache store
    • Pre-warmed
    • Read through
  • Temporary data storage for de-duping
  • Potential for Source of Truth (SoT) store
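The two cache modes named above (pre-warmed and read through) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReadThroughCache` class, its method names, and the dictionary standing in for the source of truth are hypothetical and are not LinkedIn's or Couchbase's API.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """In-memory stand-in for a cache tier in front of a source of truth."""

    def __init__(self, load_from_source: Callable[[str], Optional[str]]) -> None:
        self._load = load_from_source
        self._store: Dict[str, str] = {}
        self.misses = 0

    def prewarm(self, items: Dict[str, str]) -> None:
        # Pre-warmed mode: bulk-load entries before serving traffic.
        self._store.update(items)

    def get(self, key: str) -> Optional[str]:
        # Read-through mode: on a miss, fetch from the source and keep a copy.
        if key in self._store:
            return self._store[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._store[key] = value
        return value


# Hypothetical data: a dict plays the role of the source-of-truth database.
source_of_truth = {"member:1": "Alice", "member:2": "Bob"}
cache = ReadThroughCache(source_of_truth.get)
cache.prewarm({"member:1": "Alice"})
```

With this setup, a read of `member:1` is served from the pre-warmed data, while the first read of `member:2` falls through to the source once and is cached afterwards.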

7

 Enter Couchbase

The LinkedIn Story

• Until 2012, we were only using Memcached as a non-SoT in-memory store

• However, it had some drawbacks:
  • Long cache warmup times
  • No partitioning/sharding (we had to write our own)
  • Cold-cache restarts
  • Difficult to move data across hosts/clusters/datacentres

8

 Enter Couchbase

The LinkedIn Story

• Evaluated systems to replace Memcached: Mongo, Redis, and others
• Couchbase had advantages
  • Drop-in replacement for Memcached
  • Built-in replication and cluster expansion
  • Memory latency for operations
  • Asynchronous writes to disk
  • Utilize some of the development infrastructure we’ve built

9

 Coding

Development & Operations

• Memcached configured with Spring and implements a caching Java interface
• Implemented with Couchbase Native Client
• Developer just replaces the Spring

10

 Operations

Development & Operations

• Hadoop jobs build warm cache data
• Tools to partition the data and load into Couchbase offline
• Apply deltas when brought on-line
• Clean, warm caches ready when needed
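The offline-build-then-delta flow above might look roughly like the following sketch. It is only an illustration: the CRC32-based partitioning, the function names, and the delta format (a key paired with a new value, or `None` for a delete) are assumptions, not the tooling described in the talk.

```python
import zlib
from typing import Dict, Iterable, List, Optional, Tuple

Delta = Tuple[str, Optional[str]]  # (key, new value); None means delete


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partitions(snapshot: Dict[str, str], num_partitions: int) -> List[Dict[str, str]]:
    # Offline step: split the batch-built snapshot into per-partition loads.
    partitions: List[Dict[str, str]] = [{} for _ in range(num_partitions)]
    for key, value in snapshot.items():
        partitions[partition_for(key, num_partitions)][key] = value
    return partitions


def apply_deltas(partitions: List[Dict[str, str]], deltas: Iterable[Delta]) -> None:
    # Online step: replay changes made after the snapshot was taken.
    count = len(partitions)
    for key, value in deltas:
        target = partitions[partition_for(key, count)]
        if value is None:
            target.pop(key, None)
        else:
            target[key] = value


# Hypothetical snapshot from a batch job, followed by later changes.
snapshot = {"a": "1", "b": "2", "c": "3"}
partitions = build_partitions(snapshot, 4)
apply_deltas(partitions, [("b", "20"), ("c", None), ("d", "4")])
merged = {k: v for part in partitions for k, v in part.items()}
```

After the deltas are replayed, `merged` reflects the snapshot plus every later change, which is the "clean, warm cache" state the slide describes.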

11

Operational Tooling

• To use Couchbase efficiently as SREs, we need the following:
  • Provisioning
  • Installation
  • Monitoring & Alerting
  • Infrastructure Visibility

12

 Provisioning

Operational Tooling

• Provisioning Flow
  • Seek estimated usage statistics on cluster
    • Size of data to be stored
    • QPS
    • Redundancy Needs
  • Calculate cluster sizing
    • Currently done via a spreadsheet with a template
    • Moving into an in-house application
  • Request hardware for cluster(s)
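The talk does not give the sizing formula in the spreadsheet, so the following is only a hypothetical sketch of how the three inputs (data size, QPS, redundancy) could be combined. The per-node RAM, per-node QPS, and headroom defaults are invented for illustration and are not LinkedIn's figures.

```python
import math
from dataclasses import dataclass


@dataclass
class SizingInput:
    data_gb: float                  # size of data to be stored
    peak_qps: int                   # expected queries per second
    replicas: int                   # redundancy: extra copies of each item
    ram_per_node_gb: float = 64.0   # hypothetical usable RAM per node
    qps_per_node: int = 50_000      # hypothetical per-node throughput
    headroom: float = 0.3           # hypothetical spare capacity fraction


def estimate_nodes(s: SizingInput) -> int:
    copies = 1 + s.replicas
    # Memory bound: every copy of the data, plus headroom, must fit in RAM.
    memory_needed_gb = s.data_gb * copies * (1 + s.headroom)
    by_memory = math.ceil(memory_needed_gb / s.ram_per_node_gb)
    # Throughput bound: enough nodes to absorb peak traffic.
    by_qps = math.ceil(s.peak_qps / s.qps_per_node)
    # Each copy must live on a different node, so never go below `copies`.
    return max(by_memory, by_qps, copies)


nodes = estimate_nodes(SizingInput(data_gb=200, peak_qps=120_000, replicas=1))
```

The estimate takes the largest of the memory bound, the throughput bound, and the minimum node count needed to hold the replicas on separate hosts.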

13

 Installation

Operational Tooling

• Current System
  • Enter cluster metadata into our management system (Yahoo range)
  • Use SALT module to install & configure cluster

• Future System
  • Use same metadata system
  • Use SALT States to install and configure cluster

• Benefits of the new system
  • It’s possible to have ‘state enforcement’
  • Use SALT Pillars to encrypt cluster/bucket passwords

14

 Installation

Operational Tooling

CLUSTER:
  - ela4.couchbase.30
  - prod-lva1.couchbase.30
  - prod-ltx1.couchbase.30
NAME: follow-blue
PORT: 11211
INSTANCE: 30
ALERT_ADDRESSES:
  - q(some-sre-team@linkedin.com)
SRE_GROUPS:
  - sre-team-name
CLIENT_CONTAINERS:
  - following-services
EMAIL_ALERTS:
  - HIGHWATER_PERCENT_FULL
  - MEMORY_PERCENT_FULL
  - NOT_MY_VBUCKET
  - PERCENT_IN_MEMORY
  - KEY_USAGE
  - AUTOFAILOVER

15

 Monitoring & Alerting

Operational Tooling

• We run a daemon on each Couchbase Server that collects metrics every minute via a Couchbase Library API

• Use cluster metadata from range to build dashboard definition file via Jinja template & Python
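As a rough illustration of building a dashboard definition from cluster metadata, the sketch below renders one section per metric. It deliberately uses the standard library's `string.Template` as a stand-in for Jinja so that it runs without extra packages. The function names, the template text, and the metric-to-title mapping are assumptions; only the `NAME` and `INSTANCE` keys and the `common-templates/couchbase.<name>` path shape come from the slides.

```python
from string import Template
from typing import Dict

# Stand-in for a Jinja template; "%{FABRIC}" is passed through untouched.
DASHBOARD_TEMPLATE = Template(
    "- title: couchbase.$name $title\n"
    "  defs:\n"
    '    - range: "%{FABRIC}.couchbase.$instance"\n'
    '      label: "$metric"\n'
    "      rrd: couchbase.$name/$metric.rrd\n"
)


def render_dashboard(metadata: Dict[str, object], metrics: Dict[str, str]) -> str:
    # One section per metric, keyed by metric name with a display title.
    sections = [
        DASHBOARD_TEMPLATE.substitute(
            name=metadata["NAME"],
            instance=metadata["INSTANCE"],
            metric=metric,
            title=title,
        )
        for metric, title in metrics.items()
    ]
    return "\n".join(sections)


def dashboard_path(metadata: Dict[str, object]) -> str:
    # Mirrors the output path shape shown in the tool's log line.
    return "common-templates/couchbase." + str(metadata["NAME"])


metadata = {"NAME": "follow-blue", "INSTANCE": 30}
text = render_dashboard(metadata, {"autofailover_enabled": "AutoFailover Enabled"})
```

Keeping the template separate from the metadata means a new cluster only needs a metadata entry; the dashboard text is regenerated rather than hand-edited.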

16

 Monitoring & Alerting

Operational Tooling

$ ./couchbase.py -I 30
[INFO] Generating dashboard file: common-templates/couchbase.follow-blue

17

 Monitoring & Alerting

Operational Tooling

- title: couchbase.follow-blue AutoFailover Enabled
  defs:
    - range: "%{FABRIC}.couchbase.30"
      label: "autofailover_enabled"
      rrd: couchbase.follow-blue/autofailover_enabled.rrd
  params:
    vlabel: 'enabled_boolean'
  autoalerts:
    zones: ['COUCHBASE-SLA2']
    enabled-fabrics: ['ela4', 'prod-lva1', 'prod-ltx1']
    processor: 'ingraphs'
    filter-type: 'ingraphs_filter'
    contacts: ['couchbase-team@linkedin.com']
    state-check: threshold
    state-check-args:
      min: 1.0
      consecutive-events: 10
    alert-plugin: emailer
    alert-plugin-args:
      recipients: ['some-sre-team@linkedin.com']
      interval: 3600
      include-definition: True
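The exact semantics of the `threshold` state check are not spelled out in the slides. One plausible reading, sketched below, is that an alert fires once the metric has stayed below `min` for `consecutive-events` samples in a row. The function name and that interpretation are assumptions, not the documented behaviour of the alerting system.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], minimum: float, consecutive_events: int) -> bool:
    """Return True once `consecutive_events` samples in a row fall below `minimum`."""
    run = 0
    for value in samples:
        # A healthy sample resets the run; a bad one extends it.
        run = run + 1 if value < minimum else 0
        if run >= consecutive_events:
            return True
    return False


# Hypothetical per-minute readings of autofailover_enabled (1.0 = enabled).
healthy_then_disabled = [1.0] * 5 + [0.0] * 10
flapping = [0.0] * 9 + [1.0] + [0.0] * 9
```

Under this reading, requiring ten consecutive bad samples with one-minute collection means a brief blip does not page anyone, while a setting left disabled for ten minutes does.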

19

 Management

Operational Tooling

• We want to see a world-view of all the clusters that we run

• Bucket-, cluster- and server-level statistics are useful
• A view of who owns each cluster/bucket is useful

26

Conclusions

• Couchbase fits into our existing infrastructure
• We have good management and monitoring of the clusters
• Rich set of tooling we extended for our environment
• Starting to expand our use from a cache to a store for internal tooling

Brian Cory Sherwin Site Reliability Engineer

LinkedIn

LinkedIn’s Couchbase as a Database

28

• Our use case and requirements

• Why we chose Couchbase vs MySQL

• Pitfalls encountered

The Agenda

29

Memcached replacement

• Data resiliency

• Maintenance friendly

Couchbase @ LinkedIn

30

AutoRemediation!

A job execution platform to remediate operations issues

• Database backend for state tracking of a workflow engine

Using Couchbase as a Workflow Backend

31

• Easy JSON documents

• Rapid iteration

• Horizontally scalable

Our Requirements

32

Couchbase as a database

• Document store

• Views for indexing

• Data resiliency

• Replication

• Simplicity

Why Couchbase?

33

• Upfront cost in creating the schema

• Rapidly changing documents
  • Number of columns

• Consistent incremental updates

Why not MySQL?

34

• ACID implications
  • Durability and Consistency

• Concurrency

• Different and new tech

Pitfalls using Couchbase
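The slides do not say how the concurrency pitfall was handled. Couchbase exposes a compare-and-swap (CAS) value with each document, and a common pattern for state-tracking documents is an optimistic read-modify-write loop built on it. The sketch below shows that pattern against a hypothetical in-memory store. `InMemoryStore`, `CasMismatch`, and `update_with_retry` are illustrative names, not the Couchbase SDK.

```python
from typing import Callable, Dict, Tuple


class CasMismatch(Exception):
    """Raised when the document changed since it was read."""


class InMemoryStore:
    # Hypothetical stand-in for a document store with compare-and-swap.
    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[dict, int]] = {}

    def get(self, key: str) -> Tuple[dict, int]:
        doc, cas = self._docs.get(key, ({}, 0))
        return dict(doc), cas

    def replace(self, key: str, doc: dict, cas: int) -> int:
        _, current = self._docs.get(key, ({}, 0))
        if cas != current:
            raise CasMismatch(key)
        self._docs[key] = (dict(doc), current + 1)
        return current + 1


def update_with_retry(store: InMemoryStore, key: str,
                      mutate: Callable[[dict], None], attempts: int = 5) -> dict:
    for _ in range(attempts):
        doc, cas = store.get(key)
        mutate(doc)
        try:
            store.replace(key, doc, cas)
            return doc
        except CasMismatch:
            continue  # another writer got there first; re-read and retry
    raise RuntimeError("gave up updating " + key)


store = InMemoryStore()
update_with_retry(store, "job:1", lambda d: d.update(state="running"))
```

A writer holding a stale CAS value is rejected instead of silently overwriting a newer workflow state, which is the failure mode that concurrent updates to a shared document would otherwise risk.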

35

Questions?

bsherwin@linkedin.com

If you want to learn more about AutoRemediation

http://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/
