Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team

Post on 08-Sep-2014

description

A presentation by Redis Labs' CTO, Yiftach Shoolman, given at the July 2nd meetup hosted by I am OnDemand and IGT Cloud at the Microsoft ILDC Auditorium. See the video at: https://www.youtube.com/watch?v=eymqHZaUOH4 In this session Yiftach shares tips on how the company manages 50,000+ scalable and highly available Redis databases over the 4 largest public clouds, 8 leading Platforms-as-a-Service, and across 10 geographical regions. He explains the service's back-end architecture, the open-source projects it uses, and which tools the company builds in-house. Shoolman also shares what Redis Labs' small DevOps team does automatically, and what it still does manually. Finally, he offers advice on how to build a strong R&D team that lives and breathes DevOps. Since the company launched its Redis Cloud service, it has dealt with 150+ node failure events and a half-dozen complete data-center outages. In addition, its team has experienced many interesting scenarios, such as hard-to-believe scaling patterns like 0 to a few hundred gigabytes of in-memory data in just a few minutes, and 0 to 300K+ ops/sec in just a few seconds.

Transcript of Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team

1

powering lightning fast apps

2

The newest NoSQL

The fastest data store available today (served entirely from RAM)

Among the top 3 databases chosen by developers

Much more than a simple key/value - Strings, Hashes, Lists, Sets, Sorted Sets, Lua, transactions, bit operations

Strong use cases, dynamic community, large ecosystem

Redis

3

Leading the commercial Redis market

Founded in 2011; GA in 02/2013

2,400+ paying customers; 52,000+ DBs; 100+ new DBs/day

2nd largest contributor to open source Redis

Raised $13M - Bain/Carmel/Strategic/Angels

Offices in Santa Clara and Tel-Aviv

Redis Labs

4

Redis Cloud Memcached Cloud

Our offering

Fully-managed cloud services.

On-prem server license - soon.

5

Fast apps requirements

100msec = max E2E response time, under any load

50msec = average Internet latency

50msec = required app response time (includes processing & multi DB accesses)

1msec = required DB response time

The only database to meet requirement

=

6

DB performance comparison

[Figure: databases, shown as logos whose names did not survive extraction, grouped by response time: <1msec (three), <20msec (one), <10-50msec (two), <100msec (two), >100msec (one)]

7

Why is Redis efficient ?

Many data-structures

Many cool commands (atomicity maintained)

Complexity aware

8

Real world use case:

•500+GB

•400K writes/sec

•1500 reads/sec

•37.5KB average object size

Efficiency

No extra work at app level

1.5Gbps 120Gbps

Tons of work at app level

NoSQL

6 Nodes cluster

150+ Nodes cluster
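The 120Gbps figure on this slide is consistent with sending a whole average-size object on every write. A quick check, assuming decimal units (1KB = 1,000 bytes) and that each write ships the full 37.5KB object; the slide does not say how the 1.5Gbps figure was obtained, so it is not derived here:

```python
# Back-of-the-envelope check of the slide's 120Gbps figure.
# Assumptions (not stated on the slide): decimal units, and every write
# transfers one full average-size object over the network.
WRITES_PER_SEC = 400_000
AVG_OBJECT_BYTES = 37_500  # 37.5KB


def full_object_write_gbps(writes_per_sec: int, object_bytes: int) -> float:
    """Network throughput in Gbps if every write sends the whole object."""
    bits_per_sec = writes_per_sec * object_bytes * 8
    return bits_per_sec / 1e9


print(full_object_write_gbps(WRITES_PER_SEC, AVG_OBJECT_BYTES))  # 120.0
```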

9

Timeline

Followers

Caching

Messaging

Geo search

Leaderboards

Job management

RT analytics

Verticals & main use cases

Online advertising

Social Gaming

Financial Services

10

• Multi-TB in memory

• ~ 300,000 reads/sec

• ~ 5,000*N writes/sec

N - # of followers

Twitter

Every timeline (800 tweets per user) is on Redis
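The "5,000*N writes/sec" figure suggests a fan-out-on-write design: one tweet becomes one write per follower. The slide does not show Twitter's code or say which Redis structure is used, so the following is only an in-memory sketch of the idea; the class and method names are hypothetical. In Redis itself the same shape is commonly built with a list push followed by a trim (LPUSH plus LTRIM).

```python
from collections import defaultdict, deque

TIMELINE_LEN = 800  # per-user cap quoted on the slide


class TimelineStore:
    """In-memory stand-in for per-user timelines (illustrative only)."""

    def __init__(self) -> None:
        self.followers = defaultdict(set)  # author -> set of follower ids
        self.timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LEN))

    def follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)

    def post(self, author: str, tweet_id: int) -> int:
        """Write the tweet to every follower's timeline; return the write count N."""
        for follower in self.followers[author]:
            # Newest first; the deque silently drops the oldest entry past 800.
            self.timelines[follower].appendleft(tweet_id)
        return len(self.followers[author])


store = TimelineStore()
store.follow("bob", "alice")
store.follow("carol", "alice")
writes = store.post("alice", 1)  # one tweet, two timeline writes
```

Reads then become a single lookup of a precomputed timeline, which matches the read-heavy numbers on the slide.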

11

• 20TB+ in memory

• ~ 6,000,000 reads/sec

• ~ 600,000 writes/sec

Weibo (Chinese Twitter)

• Counting

• Reverse cache

• Top 10 lists

• Last Index

• Relational list/Message Queue

• Fast transactions w/ Lua

12

Pinterest

Object graph:

• Per user (Sorted Set w/ timestamp as score)

store the users followed (explicit + implicit)

store the user’s followers (explicit + implicit)

• Per board

Redis Hash for storing explicit followers

Redis Set for storing explicit unfollowers

13

Stack Overflow

Three levels of cache:

• Local cache (no persistence)

sessions, and pending view count updates

• Site cache

hot question id lists, users acceptance rates...

• Global cache

Inboxes, API usage quotas, …

14

GitHub

• Redis is used for routing info

• Matching user repositories to server names

15

HipChat

• Which users are in which room

• Who is online

• XMPP server balancing

16

YouPorn

Most data is found in Hashes with ordered Sets used to know what data to show

(1) ZINTERSTORE on:

{videos:filters:release} {videos:filters:orientation:straight}

{videos:filters:categories(id)} {videos:ordering:rating}

(2) Perform a ZRANGE to get the pages we want and get the list of video_ids back

(3) Start pipelining to get all the videos from Hashes
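The three steps above can be sketched without a Redis server by letting plain dictionaries stand in for sorted sets and hashes. The helper functions only mimic ZINTERSTORE (with its default SUM aggregate) and ZRANGE; the sample ids, scores, and titles are made up, and giving the filter sets a score of 0 so that the summed score equals the rating is an assumption, not something the slide states.

```python
from typing import Dict, List

SortedSet = Dict[str, float]  # member -> score


def zinterstore(sets: List[SortedSet]) -> SortedSet:
    """Intersect sorted sets, summing scores like ZINTERSTORE's default aggregate."""
    common = set(sets[0]).intersection(*sets[1:])
    return {member: sum(s[member] for s in sets) for member in common}


def zrange(zset: SortedSet, start: int, stop: int) -> List[str]:
    """Members by ascending score (ties by member); stop is inclusive, indexes non-negative."""
    ordered = sorted(zset, key=lambda member: (zset[member], member))
    return ordered[start:stop + 1]


# Hypothetical data: two filter sets (score 0) and one ordering set (score = rating).
release_filter = {"v1": 0.0, "v2": 0.0, "v3": 0.0}
category_filter = {"v2": 0.0, "v3": 0.0, "v4": 0.0}
rating_order = {"v1": 4.5, "v2": 3.0, "v3": 4.8, "v4": 2.0}
video_hashes = {
    "v1": {"title": "one"},
    "v2": {"title": "two"},
    "v3": {"title": "three"},
    "v4": {"title": "four"},
}

matches = zinterstore([release_filter, category_filter, rating_order])  # step 1
page = zrange(matches, 0, 1)                                            # step 2
videos = [video_hashes[video_id] for video_id in page]                  # step 3
```

Against a real server, step 3 is where pipelining pays off: the hash reads for a page are sent in one round trip instead of one per video.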

17

Snapchat

• 500+ instances

• 15-50TB

• Running on GCE

400M messages/day

18

Why Redis Labs ?

19

Infinite seamless scalability

True high-availability

Stable top performance

Zero management

Users choose us because..

Dynamic Clustering Technology

Zero-latency proxy

Cluster manager

In-Memory Node

Cross-shard processor

In-Memory Cluster

+

21

Challenge #1

How to serve users from the same data-center ?

4 clouds / 10 regions

18 data-centers / 30 clusters

24

AWS zones mapping dilemma

Redis Labs    User

us-east-1a    us-east-1c

us-east-1b

us-east-1c    us-east-1e

us-east-1d    us-east-1a

us-east-1e    us-east-1b

25

Eric Hammond’s post on: Matching EC2 Availability

Zones Across AWS Accounts

How did we solve it

26

How did we solve it

Redis Labs

User
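The transcript does not spell out the solution. One way to reason about the zone-mapping dilemma is to assume each account can obtain some identifier that is the same for a given physical zone no matter which account asks. The sketch below takes such fingerprints as given; the fingerprint values and the function name are hypothetical.

```python
from typing import Dict


def match_zones(ours: Dict[str, str], theirs: Dict[str, str]) -> Dict[str, str]:
    """Map our zone names to the other account's names via shared fingerprints.

    Both arguments map zone name -> fingerprint, where a fingerprint is an
    assumed account-independent identifier of the physical zone.
    """
    by_fingerprint = {fp: zone for zone, fp in theirs.items()}
    return {
        zone: by_fingerprint[fp]
        for zone, fp in ours.items()
        if fp in by_fingerprint  # skip zones the other account cannot see
    }


# Made-up fingerprints, arranged to reproduce two rows of the mapping slide.
provider_view = {"us-east-1a": "fp-x", "us-east-1c": "fp-y"}
user_view = {"us-east-1c": "fp-x", "us-east-1e": "fp-y"}
mapping = match_zones(provider_view, user_view)
```

With a mapping like this, a database can be placed in the provider zone that is physically the same as the zone the user's application runs in.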

27

Challenge #2

Which instance type shall we use for our cluster?

28

Various instance types in the same cluster

• High load scenarios

• High memory usage scenarios

• New generation of instances

Dedicated instances

As cheap as possible

Cluster’s node requirements

29

Adrian Cockcroft's Blog - Understanding and using Amazon EBS - Elastic Block Store

• use large instances and get dedicated instances for free

The tip

30

What we use today

C3 & R3

A4/5/6/7

n1-standard, n1-highmem, n1-highcpu

BM+VM

31

Challenge #3

How to manage data-persistence with high volumes of ‘writes’ and slow cloud storage ?

32

Ephemeral vs. Persistent storage

Ephemeral: direct attached, “fast”

EBS/Cloud Drive/Persistent Disk/SAN: network attached, persistent, slow

33

Adrian’s Blog: use the larger EBSes if you want speed

Google (GCP): “Larger volumes can achieve higher I/O levels than smaller volumes”

The tips

34

We use large volumes (1TB+)

We use both ephemeral and persistent storage

We improved/tuned/optimized the Redis persistent storage interface

If replication is enabled, slave writes to disk

We don’t use PIOPS

What we do
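The point "if replication is enabled, slave writes to disk" can be illustrated with a small helper that picks redis.conf persistence directives per role. This is not Redis Labs' configuration: the choice of the append-only file with `appendfsync everysec` is an assumption (the slide does not say whether AOF or RDB snapshots are used), and the function name is hypothetical. The directives themselves are standard redis.conf settings.

```python
from typing import List


def persistence_directives(role: str, replication_enabled: bool) -> List[str]:
    """Return redis.conf lines so that only one node of a pair pays the disk cost."""
    if role not in ("master", "slave"):
        raise ValueError("role must be 'master' or 'slave'")
    # With replication, the slave persists; without it, the master must.
    if replication_enabled:
        writes_to_disk = role == "slave"
    else:
        writes_to_disk = role == "master"
    if writes_to_disk:
        return ["appendonly yes", "appendfsync everysec"]
    # Disable both the append-only file and RDB snapshots on this node.
    return ["appendonly no", 'save ""']


master_conf = persistence_directives("master", replication_enabled=True)
slave_conf = persistence_directives("slave", replication_enabled=True)
```

Keeping disk writes off the master means slow network-attached storage does not sit in the path of client writes.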

35

Why not PIOPS

36

Challenge #4

How to monitor 50K+ databases, 30+ clusters and hundreds of nodes ?

37

Zabbix (not Nagios) - per node metrics

Limbic (home made) - databases’ metrics

• 50K (databases) x 100+ (metrics) x 10K+ (time resolutions)

• Based on Python, RRD, Redis

Redis adminUI – cluster configuration

Monitoring
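Limbic's internals are not shown in the deck. Since it is described as based on RRD, the following is a minimal sketch of the round-robin idea only: a fixed number of slots per metric, each slot holding the average of a few raw samples, with the oldest slot overwritten when the archive is full. The class name and parameters are hypothetical.

```python
from collections import deque
from typing import Deque, List


class RoundRobinArchive:
    """Fixed-size archive that averages every `step` raw samples into one slot."""

    def __init__(self, slots: int, step: int) -> None:
        self.step = step
        self.pending: List[float] = []
        self.slots: Deque[float] = deque(maxlen=slots)  # oldest slot is dropped when full

    def add(self, value: float) -> None:
        self.pending.append(value)
        if len(self.pending) == self.step:
            self.slots.append(sum(self.pending) / self.step)
            self.pending.clear()


archive = RoundRobinArchive(slots=3, step=2)
for sample in [1, 3, 5, 7, 9, 11, 13, 15]:
    archive.add(sample)
# archive.slots now holds the three most recent averages
```

Storage per metric stays constant regardless of how long the database has been monitored, which matters when the metric count is in the millions.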

38

Team/Method/Spirit

39

Team/Method/Spirit

Tiny devops team

Core dev. team knows ops (very well)

Baby steps, especially in production

The practical approach always wins

Review your plans every 3 months

40

We are hiring !

41

Thank You

42

Why is Redis efficient ?

Many data-structures

Many cool commands (atomicity maintained)

Complexity aware

43

Think data-structure

• Strings

• Hashes

• Lists

• Sets

• Sorted Sets

• HyperLogLogs

44

Cool commands

• SET if it doesn’t exist – O(1)

• Blocking POP (with timeout) – O(1)

• (blocking) POP from one list, PUSH to another – O(1)

• Get/Set string ranges (and bit operation) – O(N)

• Union/Intersect/Ranges of SETs – O(N)+O(Mxlog(M)) 

• Pub/Sub – O(1)/O(M)/O(M+N)

• Lua / Transactions / Pipelining
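To make two of the commands above concrete, here is a toy in-memory model of "SET if it doesn't exist" (SETNX) and "POP from one list, PUSH to another" (RPOPLPUSH). It only mirrors the semantics and is not a Redis client; the class, key names, and values are hypothetical. Both operations touch a constant number of elements, which is where the O(1) on the slide comes from.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ToyRedis:
    """Tiny model of two commands' semantics; single-threaded, no server."""

    def __init__(self) -> None:
        self.strings: Dict[str, str] = {}
        self.lists: Dict[str, Deque[str]] = {}

    def setnx(self, key: str, value: str) -> bool:
        """Store the value only if the key is absent; True when it was stored."""
        if key in self.strings:
            return False
        self.strings[key] = value
        return True

    def rpush(self, key: str, value: str) -> None:
        self.lists.setdefault(key, deque()).append(value)

    def rpoplpush(self, src: str, dst: str) -> Optional[str]:
        """Pop the tail of src and push it onto the head of dst as one step."""
        source = self.lists.get(src)
        if not source:
            return None
        item = source.pop()
        self.lists.setdefault(dst, deque()).appendleft(item)
        return item


r = ToyRedis()
first = r.setnx("lock:job", "worker-1")   # stored
second = r.setnx("lock:job", "worker-2")  # rejected, key already exists
r.rpush("todo", "a")
r.rpush("todo", "b")
moved = r.rpoplpush("todo", "in-progress")
```

The pop-and-push pair is the basis of a common reliable-queue pattern: an item being worked on is never absent from both lists, so a crashed worker's item can be found in the in-progress list.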