(DAT407) Amazon ElastiCache: Deep Dive

Post on 10-Feb-2017


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Nate Wiger, Principal Solutions Architect, AWS

Tom Kerr, Software Engineer, Riot Games

October 8, 2015

Amazon ElastiCache Deep Dive: Scaling Your Data in a Real-Time World

DAT407

Amazon ElastiCache

• Managed in-memory service

• Memcached or Redis

• Cluster of nodes

• Read replicas

• Monitoring + alerts

[Diagram: ELB, App, External APIs]

Modern Web / Mobile App

Memcached vs Redis

Memcached

• Flat string cache
• Multithreaded
• No persistence
• Low maintenance
• Easy to scale horizontally

Redis

• Single-threaded
• Persistence
• Atomic operations
• Advanced data types - http://redis.io/topics/data-types
• Pub/sub messaging
• Read replicas / failover

Storing JSON – Memcached vs Redis

# Memcached: Serialize string

str_json = Encode({"name": "Nate Wiger", "gender": "M"})

SET user:nateware str_json

GET user:nateware

json = Decode(str_json)

# Redis: Use a hash!

HMSET user:nateware name "Nate Wiger" gender M

HGET user:nateware name

>> Nate Wiger

HMGET user:nateware name gender

>> Nate Wiger

>> M

ElastiCache with Memcached

ElastiCache with Memcached – Development

[Diagram: Region; Availability Zone A and Availability Zone B; Auto Scaling group; ElastiCache cluster]

[Same diagram repeated, captioned "Nope"]

Add Nodes to Memcached Cluster


aws elasticache modify-cache-cluster \
    --cache-cluster-id my-cache-cluster \
    --num-cache-nodes 4 \
    --apply-immediately

# response

"CacheClusterStatus": "modifying",

"PendingModifiedValues": {

"NumCacheNodes": 4

},

ElastiCache with Memcached – High Availability

[Diagram: Region; Availability Zone A and Availability Zone B; Auto Scaling group; ElastiCache cluster]

ElastiCache with Memcached – Scale Out

[Diagram: Region; Availability Zone A and Availability Zone B; Auto Scaling group; ElastiCache cluster]

Sharding

Consistent Hashing

Client pre-calculates a hash ring for best key distribution

http://berb.github.io/diploma-thesis/original/062_internals.html
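The hash ring idea can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm any particular Memcached client uses: the class name `HashRing`, the node names, the MD5 hash, and the 100 virtual points per node are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas      # virtual points per node (assumed value)
        self._ring = {}               # ring position -> node name
        self._positions = []          # sorted ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            self._ring[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, key):
        # Walk clockwise to the first virtual point at or after the key's hash.
        idx = bisect.bisect_left(self._positions, self._hash(key))
        if idx == len(self._positions):
            idx = 0                   # wrap around the ring
        return self._ring[self._positions[idx]]


keys = [f"user:{i}" for i in range(1000)]
ring3 = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring3.get_node(k) for k in keys}
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])
after = {k: ring4.get_node(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved after adding a fourth node")
```

The point of the ring is that adding a node only remaps the keys that now belong to the new node; every other key stays where it was, so most of the cache stays warm.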

It’s All Been Done Before

• Ruby
  • Dalli https://github.com/mperham/dalli
  • Plus ElastiCache https://github.com/ktheory/dalli-elasticache
• Python
  • HashRing / MemcacheRing https://pypi.python.org/pypi/hash_ring/
  • Django w/ Auto-Discovery https://github.com/gusdan/django-elasticache
• Node.js
  • node-memcached https://github.com/3rd-Eden/node-memcached
  • Auto-Discovery example http://stackoverflow.com/questions/17046661
• Java
  • SpyMemcached https://github.com/dustin/java-memcached-client
  • ElastiCache Client https://github.com/amazonwebservices/aws-elasticache-cluster-client-memcached-for-java
• PHP
  • ElastiCache Client https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-php
• .NET
  • ElastiCache Client https://github.com/awslabs/elasticache-cluster-config-net

Auto-Discovery Endpoint

# PHP

$server_endpoint = "mycache.z2vq55.cfg.usw2.cache.amazonaws.com";

$cache = new Memcached();

$cache->setOption(
    Memcached::OPT_CLIENT_MODE, Memcached::DYNAMIC_CLIENT_MODE);

# Set config endpoint as only server

$cache->addServer($server_endpoint, 11211);

DIY: http://bit.ly/elasticache-autodisc

Memcached Node Auto-Discovery
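For the do-it-yourself route, the client sends `config get cluster` to the configuration endpoint and parses the node list from the reply. The sketch below covers only the parsing step, and it assumes the reply layout of a `CONFIG cluster` header line, a version line, and one line of space-separated `hostname|ip|port` entries. The function name `parse_cluster_config` and the sample reply (including its IP addresses and byte count) are made up for illustration.

```python
def parse_cluster_config(response):
    """Parse the text reply to `config get cluster`.

    Returns (config_version, [(hostname, ip, port), ...]).
    The byte count in the header line is not validated here.
    """
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if len(lines) < 3 or not lines[0].startswith("CONFIG cluster"):
        raise ValueError("unexpected auto-discovery response")
    version = int(lines[1])
    nodes = []
    for entry in lines[2].split():
        hostname, ip, port = entry.split("|")
        nodes.append((hostname, ip, int(port)))
    return version, nodes


# Hypothetical reply for a two-node cluster
sample = (
    "CONFIG cluster 0 147\r\n"
    "12\n"
    "mycache.z2vq55.0001.usw2.cache.amazonaws.com|10.0.0.1|11211 "
    "mycache.z2vq55.0002.usw2.cache.amazonaws.com|10.0.0.2|11211\n"
    "\r\n"
    "END\r\n"
)
version, nodes = parse_cluster_config(sample)
print(version, [host for host, _, _ in nodes])
```

A client would poll the endpoint periodically and rebuild its server list whenever the version number changes.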

App Caching Patterns

Be Lazy

# Python

def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        # Run a DB query
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record)
    return record

# App code
user = get_user(17)

Write On Through

# Python

def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record)
    return record

# App code
user = save_user(17, {"name": "Nate Dogg"})

Combo Move!

def save_user(user_id, values):
    record = db.query("update users ... where id = ?", user_id, values)
    cache.set(user_id, record, 300)  # TTL
    return record

def get_user(user_id):
    record = cache.get(user_id)
    if record is None:
        record = db.query("select * from users where id = ?", user_id)
        cache.set(user_id, record, 300)  # TTL
    return record

# App code
save_user(17, {"name": "Nate Diddy"})
user = get_user(17)
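To try the combined pattern without a cache cluster or a database, the snippet below swaps both for in-process stand-ins. `TTLCache`, the `db` dictionary, the `db_reads` counter, and the `user:` key prefix are all hypothetical helpers for this sketch; a real application would use a Memcached client and SQL queries as in the slides.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Memcached client (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]       # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


db = {17: {"name": "Nate Wiger"}}     # stand-in for the users table
db_reads = 0
cache = TTLCache()


def get_user(user_id):
    """Lazy loading: only hit the database on a cache miss."""
    global db_reads
    key = f"user:{user_id}"
    record = cache.get(key)
    if record is None:
        db_reads += 1
        record = db.get(user_id)
        if record is not None:
            cache.set(key, record, 300)   # TTL in seconds
    return record


def save_user(user_id, values):
    """Write-through: update the database, then refresh the cache."""
    record = {**db.get(user_id, {}), **values}
    db[user_id] = record
    cache.set(f"user:{user_id}", record, 300)
    return record


get_user(17)                           # miss: reads the database
get_user(17)                           # hit: served from the cache
save_user(17, {"name": "Nate Diddy"})
user = get_user(17)                    # hit: sees the written-through value
print(user, "database reads:", db_reads)
```

Lazy loading keeps the cache filled with only the data that is actually requested, write-through keeps it fresh, and the TTL bounds how long a stale or orphaned entry can survive.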

Web Cache with Memcached

# Gemfile
gem 'dalli-elasticache'

# config/environments/production.rb
endpoint = "mycluster.abc123.cfg.use1.cache.amazonaws.com:11211"
elasticache = Dalli::ElastiCache.new(endpoint)
config.cache_store = :dalli_store, elasticache.servers,
  { expires_in: 1.day, compress: true }

# if you change ElastiCache cluster nodes
elasticache.refresh.client

Ruby on Rails Example

Thundering Herd

Causes

• Cold cache – app startup

• Adding / removing nodes

• Cache key expiration (TTL)

• Out of cache memory

Large # of cache misses

Spike in database load

Mitigations

• Script to populate cache

• Gradually scale nodes

• Randomize TTL values

• Monitor cache utilization
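Randomizing TTL values can be as simple as adding jitter around a base TTL so that keys written together do not all expire together. A minimal sketch, assuming a 300-second base and a 10% spread (both values, and the name `jittered_ttl`, are illustrative choices):

```python
import random


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl adjusted by a random factor within +/- spread."""
    low = base_ttl * (1 - spread)
    high = base_ttl * (1 + spread)
    return int(rng.uniform(low, high))


# Five keys cached at the same moment get slightly different lifetimes
ttls = [jittered_ttl(300) for _ in range(5)]
print(ttls)
```

The result would be passed as the TTL argument to `cache.set`, spreading the resulting cache misses over roughly a minute instead of a single instant.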

ElastiCache with Redis

[Image captions: "Not if I destroy it first!" / "It’s mine!"]

Need uniqueness + ordering

Easy with Redis Sorted Sets

ZADD "leaderboard" 1201 "Gollum"
ZADD "leaderboard" 963 "Sauron"
ZADD "leaderboard" 1092 "Bilbo"
ZADD "leaderboard" 1383 "Frodo"

ZREVRANGE "leaderboard" 0 -1
1) "Frodo"
2) "Gollum"
3) "Bilbo"
4) "Sauron"

ZREVRANK "leaderboard" "Sauron"
(integer) 3

Real-time Leaderboard!
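The semantics of the commands above (unique members, ordering by score, zero-based ranks) can be explored without a Redis server. The class below is a hypothetical in-memory stand-in, not a Redis client; it mimics only the three commands shown and re-sorts on every read, which a real sorted set does not need to do.

```python
class Leaderboard:
    """In-memory stand-in that mimics a few Redis sorted-set commands."""

    def __init__(self):
        self._scores = {}             # member -> score; members are unique

    def zadd(self, score, member):
        self._scores[member] = score  # re-adding a member updates its score

    def _ordered(self):
        # Highest score first; ties broken by member name, descending
        return sorted(self._scores,
                      key=lambda m: (self._scores[m], m),
                      reverse=True)

    def zrevrange(self, start, stop):
        members = self._ordered()
        if start < 0:
            start += len(members)
        if stop < 0:
            stop += len(members)      # -1 means "through the last member"
        return members[max(start, 0):stop + 1]

    def zrevrank(self, member):
        if member not in self._scores:
            return None
        return self._ordered().index(member)


board = Leaderboard()
board.zadd(1201, "Gollum")
board.zadd(963, "Sauron")
board.zadd(1092, "Bilbo")
board.zadd(1383, "Frodo")
print(board.zrevrange(0, -1))
print(board.zrevrank("Sauron"))
```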

Ex: Throttling requests to an API

Leverages Redis Counters

[Diagram: ELB; Externally Facing API]

Reference: http://redis.io/commands/INCR

FUNCTION LIMIT_API_CALL(APIaccesskey)
    limit = HGET(APIaccesskey, "limit")
    time = CURRENT_UNIX_TIME()
    keyname = APIaccesskey + ":" + time
    count = GET(keyname)
    IF count != NULL && count > limit THEN
        ERROR "API request limit exceeded"
    ELSE
        MULTI
            INCR(keyname)
            EXPIRE(keyname, 10)
        EXEC
        PERFORM_API_CALL()
    END

Rate Limiting
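The same per-second counter scheme can be exercised in plain Python. This is a sketch with a dictionary standing in for Redis; the `RateLimiter` class and the injectable clock are assumptions made so the example runs on its own. It rejects a call once the count reaches the limit (`>=`), which is slightly stricter than the `>` comparison in the pseudocode.

```python
import time


class RateLimiter:
    """Fixed-window counter: one counter per access key per second."""

    def __init__(self, limits, clock=time.time):
        self._limits = limits         # access key -> allowed calls per second
        self._counters = {}           # "key:unix_second" -> count
        self._clock = clock

    def allow(self, access_key):
        now = int(self._clock())
        keyname = f"{access_key}:{now}"
        count = self._counters.get(keyname, 0)
        if count >= self._limits[access_key]:
            return False              # API request limit exceeded
        self._counters[keyname] = count + 1
        # Stand-in for EXPIRE: drop counters older than 10 seconds
        stale = [k for k in self._counters
                 if int(k.rsplit(":", 1)[1]) < now - 10]
        for k in stale:
            del self._counters[k]
        return True


fake_now = [100.0]                    # controllable clock for the demo
limiter = RateLimiter({"abc": 3}, clock=lambda: fake_now[0])
first_second = [limiter.allow("abc") for _ in range(5)]
fake_now[0] = 101.0                   # a new second starts a new window
next_second = limiter.allow("abc")
print(first_second, next_second)
```

In Redis the increment and the expiry are wrapped in MULTI/EXEC so they apply atomically; the dictionary version has no such concern because it runs in a single thread.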

• Redis counters – increment likes/dislikes

• Redis hashes – list of everyone’s ratings

• Process with an algorithm like Slope One or Jaccard similarity

• Ruby example - https://github.com/davidcelis/recommendable

Recommendation Engines

INCR item:38927:likes
HSET item:38927:ratings "Susan" 1

INCR item:38927:dislikes
HSET item:38927:ratings "Tommy" -1
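Jaccard similarity compares two users by the overlap of the items they liked: the size of the intersection divided by the size of the union. A minimal sketch follows; the item IDs other than 38927 and the sets of likes are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0                    # convention chosen for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical liked-item sets per user
likes = {
    "Susan": {38927, 41002, 5150},
    "Tommy": {41002, 5150, 777},
    "Alex": {9001},
}
print(jaccard(likes["Susan"], likes["Tommy"]))
print(jaccard(likes["Susan"], likes["Alex"]))
```

Users with a high score are candidates for "people like you also liked" suggestions drawn from the items only one of them has rated.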

Chat and Messaging

• PUBLISH and SUBSCRIBE Redis commands

• Game or Mobile chat

• Server intercommunication

SUBSCRIBE chat_channel:114
PUBLISH chat_channel:114 "Hello all"
["message", "chat_channel:114", "Hello all"]
UNSUBSCRIBE chat_channel:114

ElastiCache with Redis – Development

[Diagram: Region; Availability Zone A and Availability Zone B; Auto Scaling group; ElastiCache cluster]

Use Primary Endpoint

Use Read Replicas

Auto-Failover

Chooses replica with lowest replication lag

DNS endpoint is same

Redis Multi-AZ

ElastiCache with Redis Multi-AZ

[Diagram, repeated across four slides: Region; Availability Zone A and Availability Zone B; Auto Scaling group; ElastiCache cluster]

Redis Multi-AZ – Reads and Writes

[Diagram: ELB, App, External APIs; Replication Group; Reads, Writes]

Redis – Read/Write Connections

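# Ruby example
require 'redis'
require 'redis/distributed'

# Writes go to the primary endpoint
redis_write = Redis.new(
  host: 'mygame-dev.z2vq55.ng.0001.usw2.cache.amazonaws.com')

# Reads are spread across the read replica endpoints
redis_read = Redis::Distributed.new([
  'redis://mygame-dev-002.z2vq55.ng.0001.usw2.cache.amazonaws.com:6379',
  'redis://mygame-dev-003.z2vq55.ng.0001.usw2.cache.amazonaws.com:6379'
])

# zadd takes the score before the member
redis_write.zadd("leaderboard", 1976, "nateware")

# Ranks are zero-based and inclusive, so 0..9 is the top 10
top_10 = redis_read.zrevrange("leaderboard", 0, 9)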

Recap – Endpoint Autodetection

• Cluster endpoints:

aws elasticache describe-cache-clusters \
    --cache-cluster-id mycluster \
    --show-cache-node-info

• Redis read replica endpoints:

aws elasticache describe-replication-groups \
    --replication-group-id myredisgroup

• Can listen for SNS events: http://bit.ly/elasticache-sns

http://bit.ly/elasticache-whitepaper
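Once the `describe-replication-groups` output is in hand, the application needs the primary endpoint for writes and the replica endpoints for reads. The sketch below pulls them out of the JSON; it assumes the output nests endpoints under `ReplicationGroups`, `NodeGroups`, and `NodeGroupMembers`, and the sample document is hand-written and trimmed for illustration rather than captured from a real cluster.

```python
import json


def redis_endpoints(describe_output):
    """Return (primaries, replicas) as lists of (address, port) tuples."""
    data = json.loads(describe_output)
    primaries, replicas = [], []
    for group in data["ReplicationGroups"]:
        for node_group in group.get("NodeGroups", []):
            endpoint = node_group.get("PrimaryEndpoint")
            if endpoint:
                primaries.append((endpoint["Address"], endpoint["Port"]))
            for member in node_group.get("NodeGroupMembers", []):
                if member.get("CurrentRole") == "replica":
                    read = member["ReadEndpoint"]
                    replicas.append((read["Address"], read["Port"]))
    return primaries, replicas


# Hand-written sample, trimmed to the fields used above
sample = json.dumps({
    "ReplicationGroups": [{
        "ReplicationGroupId": "myredisgroup",
        "NodeGroups": [{
            "PrimaryEndpoint": {
                "Address": "mygame-dev.z2vq55.ng.0001.usw2.cache.amazonaws.com",
                "Port": 6379},
            "NodeGroupMembers": [
                {"CurrentRole": "primary", "ReadEndpoint": {
                    "Address": "mygame-dev-001.z2vq55.ng.0001.usw2.cache.amazonaws.com",
                    "Port": 6379}},
                {"CurrentRole": "replica", "ReadEndpoint": {
                    "Address": "mygame-dev-002.z2vq55.ng.0001.usw2.cache.amazonaws.com",
                    "Port": 6379}},
            ],
        }],
    }],
})
primaries, replicas = redis_endpoints(sample)
print(primaries, replicas)
```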

Splitting Redis By Purpose

[Diagram: ELB, App, External APIs; Replication Group: Leaderboards; Replication Group: User Profiles; Reads, Writes]

Don’t Plan Ahead!!

1. Start with one Redis Multi-AZ cluster

2. Split as needed

3. Scale read load via replicas

4. Rinse, repeat

Tune It Up!

Alarms

Monitoring with CloudWatch

• CPU

• Evictions

• Memory

• Swap Usage

• Network In/Out

Key ElastiCache CloudWatch Metrics

• CPUUtilization
  • Memcached – up to 90% ok
  • Redis – divide by cores (ex: 90% / 4 = 22.5%)
• SwapUsage low
• CacheMisses / CacheHits Ratio low / stable
• Evictions near zero
  • Exception: Russian doll caching
• CurrConnections stable
• Whitepaper: http://bit.ly/elasticache-whitepaper
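The Redis CPU rule follows from Redis being single-threaded: it can saturate only one core, so the node-level CPUUtilization figure has to be scaled by the core count. A small worked version of the slide's arithmetic (the function name is illustrative):

```python
def redis_cpu_alarm_threshold(target_percent, cores):
    """Node-level CPUUtilization at which one core is at target_percent."""
    return target_percent / cores


# The slide's example: 90% of one core on a 4-core node
print(redis_cpu_alarm_threshold(90, 4))
```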

Scaling Up Redis

1. Snapshot existing cluster to Amazon S3

http://bit.ly/redis-snapshot

2. Spin up new Redis cluster from snapshot

http://bit.ly/redis-seeding

3. Profit!

4. Also good for debugging copy of production data

Common Issues

DNS Caching – Redis Failover

• Failover requires updating a DNS CNAME

• Can take up to two minutes

• Watch out for app DNS caching – esp. Java! http://bit.ly/jvm-dns

• No API for triggering Redis failover
  • Turn off Multi-AZ temporarily
  • Promote replica to primary
  • Turn on Multi-AZ

1. Forks main Redis process

2. Writes data to disk from child process

3. Continues to accept traffic on main process

4. Any key update causes a copy-on-write

5. Potentially DOUBLES memory usage by Redis

Swapping During Redis Backup (BGSAVE)

Reduce memory allocated to Redis

• Set reserved-memory field in parameter groups

• Evicts more data from memory

Use larger cache node type

• More expensive

• But no data eviction

Write-heavy apps need extra Redis memory

Swapping During Redis Backup – Solutions

Redis reserved-memory Parameter

Redis Engine Enhancements

• Only Available in Amazon ElastiCache

• Forkless backups = Lower memory usage

• If enough memory, will still fork (faster)

• Improved replica sync under heavy write loads

• Smoother failovers (PSYNC)

• Two new CloudWatch metrics
  • ReplicationBytes: Number of bytes sent from primary node
  • SaveInProgress: 1/0 value that indicates if save is running
• Try it today! Redis 2.8.22 or later.

Riot Games: ElastiCache in the Wild

Tom Kerr

LEAGUE OF LEGENDS

APOLLO

APOLLO: COMMENTS ANYWHERE


APOLLO: ARCHITECTURE

Replication with automatic failover

Replication across availability zones

More snapshots, more often

LESS GOOD

Fun Stuff Deploy Stuff

GOOD

Fun Stuff Deploy Stuff

APOLLO

LEADERBOARDS

LEADERBOARDS: ARCHITECTURE

LEADERBOARDS: DATA STORE

US-WEST2:NA:3848433 37

US-WEST2:NA:3848 37433

http://redis.io/topics/memory-optimization
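The linked memory-optimization page describes storing many small values as fields of a hash instead of as separate top-level keys. One plausible reading of the two lines above is that the flat key `US-WEST2:NA:3848433` with value 37 becomes hash key `US-WEST2:NA:3848`, field `433`, value 37. That reading, the three-character field length, and the `HashStore` stand-in below are assumptions for this sketch, not details confirmed by the slides.

```python
def split_key(key, field_len=3):
    """Split a flat key into (hash_key, field) at the last field_len characters."""
    return key[:-field_len], key[-field_len:]


class HashStore:
    """In-memory stand-in for Redis HSET/HGET (hypothetical helper)."""

    def __init__(self):
        self._hashes = {}

    def hset(self, key, field, value):
        self._hashes.setdefault(key, {})[field] = value

    def hget(self, key, field):
        return self._hashes.get(key, {}).get(field)


store = HashStore()
hash_key, field = split_key("US-WEST2:NA:3848433")
store.hset(hash_key, field, 37)
print(hash_key, field, store.hget(hash_key, field))
```

With a three-character suffix, each hash holds at most 1,000 fields when the suffix is numeric, which keeps hashes small enough for Redis to store them in its compact encoding.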

LEADERBOARDS

BEST PRACTICES

Replicas with automatic failover

Manually snapshot more often

Monitor your replication metrics

Redis hash key trick

Thank you!

Nate Wiger, Principal Solutions Architect, AWS

Tom Kerr, Software Engineer, Riot Games

Remember to complete your evaluations!