Riak a successful failure

In Production: Portrait of a Successful Failure Sean Cribbs @seancribbs [email protected]

Transcript of Riak a successful failure

Page 1: Riak   a successful failure

In Production: Portrait of a Successful Failure

Sean Cribbs @seancribbs [email protected]

Page 2: Riak   a successful failure

Riak is...

a scalable,

highly-available,

networked

key/value store.

Page 3: Riak   a successful failure

Riak Data Model

Riak stores values against keys

Encode your data how you like it

Keys are grouped into buckets

Page 4: Riak   a successful failure

Basic Operations

GET /buckets/B/keys/K

PUT /buckets/B/keys/K

DELETE /buckets/B/keys/K
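As a rough illustration of the three operations above, the following sketch builds (but does not send) the corresponding HTTP requests with Python's standard library. The base URL, the bucket name `drinks`, the key `ipa`, and the helper name `riak_request` are assumptions made for the example; only the `/buckets/B/keys/K` path shape comes from the slide.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a Riak node's HTTP interface is reachable at this address.
BASE_URL = "http://localhost:8098"


def riak_request(method, bucket, key, body=None, content_type="application/json"):
    """Build (but do not send) a request for /buckets/B/keys/K.

    Hypothetical helper for illustration; it is not part of any Riak client.
    """
    if method not in ("GET", "PUT", "DELETE"):
        raise ValueError(f"unsupported method: {method}")
    # Percent-encode bucket and key so characters such as "/" stay inside
    # their own path segment.
    url = f"{BASE_URL}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"
    headers = {"Content-Type": content_type} if body is not None else {}
    return Request(url, data=body, headers=headers, method=method)


# Hypothetical bucket, key and value.
store = riak_request("PUT", "drinks", "ipa", body=b'{"style": "ale"}')
fetch = riak_request("GET", "drinks", "ipa")
remove = riak_request("DELETE", "drinks", "ipa")

for req in (store, fetch, remove):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would perform the call against a running node; the sketch stops short of that so it can run anywhere.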

Page 5: Riak   a successful failure

Extras

MapReduce, Link-walking

Value Metadata

Secondary Indexes

Full-text Search

Configurable Storage Engines

Admin GUI

Page 6: Riak   a successful failure

When things go wrong

A Real Customer Story

Page 7: Riak   a successful failure

Situation

You have a cluster

Things are great

It’s time to add capacity

Page 8: Riak   a successful failure

Solution

Add a new node

Page 9: Riak   a successful failure

Hostnames

This customer named nodes after drinks:

Aston

IPA

Highball

Gin

Framboise

ESB

Page 10: Riak   a successful failure

riak-admin join

•With Riak, it’s easy to add a new node.

on aston:
$ riak-admin join [email protected]

•Then you leave for a quick lunch.

Page 11: Riak   a successful failure

This can’t be good...

Page 12: Riak   a successful failure

Quick, what do you do?

1.Add another system!

2.Shut down the entire site!

3.Alert Basho Support via an URGENT ticket

Page 13: Riak   a successful failure

Control the situation

Stop the handoff between nodes. On every node we ran:

riak attach
application:set_env(riak_core, handoff_concurrency, 0).

Page 14: Riak   a successful failure

Monitor

Page 15: Riak   a successful failure

...for signs of...

Page 16: Riak   a successful failure

Stabilization

Page 17: Riak   a successful failure

Now what?

•What happened?

•Why did it happen?

•Can we fix this situation?

Page 18: Riak   a successful failure

But first

•Are you still operational?

  •Yes.

•Any noticeable changes in service latency?

  •No.

•Have any nodes failed?

  •No, the cluster is still servicing requests.

Page 19: Riak   a successful failure

So what happened?!

1.New node added

2.Ring must rebalance

3.Nodes claim partitions

4.Handoff of data begins

5.Disks fill up
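To make steps 2 to 4 more concrete, here is a minimal sketch of how a ring divides its 160-bit hash space into equal partitions, each identified by its starting index. The ring size of 256 is an assumption (the slides never state it), and the function names are hypothetical; the long partition id is the one that appears in the handoff console output later in the deck.

```python
RING_TOP = 2 ** 160  # size of the 160-bit hash space covered by the ring


def partition_indexes(ring_size):
    """Return the starting index of every partition for a given ring size."""
    step = RING_TOP // ring_size
    return [i * step for i in range(ring_size)]


def partition_number(index, ring_size):
    """Map a partition index back to its position (0-based) on the ring."""
    step = RING_TOP // ring_size
    number, remainder = divmod(index, step)
    if remainder or not 0 <= number < ring_size:
        raise ValueError("not a partition index for this ring size")
    return number


# Assumption: the ring size is not given in the slides; 256 is used here.
RING_SIZE = 256

# Partition id taken from the handoff log shown later in the deck.
handed_off = 411047335499316445744786359201454599278231027712
print(partition_number(handed_off, RING_SIZE))
```

Under that assumption the partition moved from gin to aston is number 72 of 256. When a node joins, ownership of some of these fixed partitions is reassigned to it, and handoff copies the data for each reassigned partition.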

Page 20: Riak   a successful failure

Member Status

First let’s peek under the hood.

$ riak-admin member_status
================================= Membership ================================
Status     Ring    Pending    Node
-----------------------------------------------------------------------------
valid       4.3%     16.8%    riak@aston
valid      18.8%     16.8%    riak@esb
valid      19.1%     16.8%    riak@framboise
valid      19.5%     16.8%    riak@gin
valid      19.1%     16.4%    riak@highball
valid      19.1%     16.4%    riak@ipa
-----------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
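The percentages are easier to reason about as partition counts. The sketch below converts the Ring and Pending columns into whole partitions, assuming a ring size of 256. That ring size is not stated in the slides; it is an assumption that happens to make every percentage round to a whole number of partitions.

```python
# Assumption: the ring size is not stated in the slides.
RING_SIZE = 256

# node: (current ring %, pending ring %), copied from the member_status output
member_status = {
    "riak@aston": (4.3, 16.8),
    "riak@esb": (18.8, 16.8),
    "riak@framboise": (19.1, 16.8),
    "riak@gin": (19.5, 16.8),
    "riak@highball": (19.1, 16.4),
    "riak@ipa": (19.1, 16.4),
}


def to_partitions(percent, ring_size=RING_SIZE):
    """Convert a ring percentage to the nearest whole number of partitions."""
    return round(percent / 100 * ring_size)


# node: (partitions owned now, partitions owned after rebalancing)
counts = {
    node: (to_partitions(ring), to_partitions(pending))
    for node, (ring, pending) in member_status.items()
}

for node, (owned, target) in counts.items():
    print(f"{node:15} owns {owned:3}, will own {target:3}, change {target - owned:+d}")
```

Read this way, aston (the new node) still has to receive 32 partitions, and each of the five original nodes has to hand off between 5 and 7.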

Page 21: Riak   a successful failure

Relief

Let’s try to relieve the pressure a bit. Focus on the node with the least disk space left.

gin:~$ riak attach
application:set_env(riak_core, forced_ownership_handoff, 0).
application:set_env(riak_core, vnode_inactivity_timeout, 300000).
application:set_env(riak_core, handoff_concurrency, 1).
riak_core_vnode:trigger_handoff(element(2, riak_core_vnode_master:get_vnode_pid(411047335499316445744786359201454599278231027712, riak_kv_vnode))).

Page 22: Riak   a successful failure

Relief

It took about 21 minutes (1256 seconds) to transfer the vnode.

(riak@gin)7> 19:34:00.574 [info] Starting handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston

gin:~$ sudo netstat -nap | fgrep 10.36.18.245
tcp 0 1065 10.36.110.79:40532 10.36.18.245:8099 ESTABLISHED 27124/beam.smp
tcp 0 0 10.36.110.79:46345 10.36.18.245:53664 ESTABLISHED 27124/beam.smp

(riak@gin)7> 19:54:56.721 [info] Handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston completed: sent 3805730 objects in 1256.14 seconds
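The two log lines carry enough information to cross-check the duration and work out the handoff rate. This is only arithmetic on the numbers in the log; the helper name `log_time` is made up for the example.

```python
from datetime import datetime


def log_time(text):
    """Parse an HH:MM:SS.mmm timestamp as printed in the console log."""
    return datetime.strptime(text, "%H:%M:%S.%f")


# Timestamps of the "Starting handoff" and "Handoff ... completed" lines.
started = log_time("19:34:00.574")
finished = log_time("19:54:56.721")
elapsed = (finished - started).total_seconds()

# Figures reported by the "completed" log line.
objects_sent = 3805730
reported_seconds = 1256.14
rate = objects_sent / reported_seconds

print(f"elapsed from timestamps: {elapsed:.3f} s")
print(f"reported by Riak:        {reported_seconds:.2f} s")
print(f"throughput:              {rate:.0f} objects/s")
```

The timestamps and the reported duration agree to within a few milliseconds, and the rate works out to roughly 3,000 objects per second for this single partition with `handoff_concurrency` set to 1.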

Page 23: Riak   a successful failure

Relief

And the vnode had arrived at Aston from Gin.

aston:/data/riak/bitcask/205523667749658222872393179600727299639115513856-132148847970820$ ls -la
total 7305344
drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
-rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock
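To get a feel for how much disk a single vnode occupies, the file sizes in the listing can be totalled. The sketch below parses the regular-file lines of the `ls -la` output; the function name `sizes_by_suffix` is hypothetical and the listing is copied from the slide.

```python
# Regular-file lines copied from the `ls -la` output on aston.
LISTING = """\
-rw------- 1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r-- 1 riak riak 86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw------- 1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r-- 1 riak riak 55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw------- 1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r-- 1 riak riak 99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw------- 1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r-- 1 riak riak 56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw------- 1 riak riak 119 2011-11-11 17:53 bitcask.write.lock
"""


def sizes_by_suffix(listing):
    """Sum file sizes (field 5 of `ls -l`) grouped by file-name suffix."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("-"):
            continue  # skip directories and the "total" line
        size, name = int(fields[4]), fields[-1]
        suffix = name.rsplit(".", 1)[-1]
        totals[suffix] = totals.get(suffix, 0) + size
    return totals


totals = sizes_by_suffix(LISTING)
grand_total = sum(totals.values())
print(totals)
print(f"{grand_total} bytes = {grand_total / 2**30:.2f} GiB")
```

The data files dominate: close to 7 GiB for this one partition directory, which is why leftover copies of handed-off partitions can exhaust a disk quickly.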

Page 24: Riak   a successful failure

Eureka!

•Data was not being cleaned up after handoff.

•This would eventually eat all disk space!

Page 25: Riak   a successful failure

What’s the solution?

•We already had a bugfix for the next release (1.0.2) that detects the problem

•Tested the bugfix locally before delivering to customer

Page 26: Riak   a successful failure

Hot Patch

We patched their live, production system while still under load.

(on all nodes)
riak attach
l(riak_kv_bitcask_backend).
m(riak_kv_bitcask_backend).
Module riak_kv_bitcask_backend compiled: Date: November 12 2011, Time: 04.18
Compiler options: [{outdir,"ebin"}, debug_info,warnings_as_errors, {parse_transform,lager_transform}, {i,"include"}]
Object file: /usr/lib/riak/lib/riak_kv-1.0.1/ebin/riak_kv_bitcask_backend.beam
Exports:
api_version/0     is_empty/1
callback/3        key_counts/0
delete/4          key_counts/1
drop/1            module_info/0
fold_buckets/4    module_info/1
fold_keys/4       put/5
fold_objects/4    start/2
get/3             status/1
...

Page 27: Riak   a successful failure

Bingo!

And the new code did what we expected.

{ok, R} = riak_core_ring_manager:get_my_ring().
[riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].

(riak@gin)19> [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
22:48:07.423 [notice] Unused data directories exist for partition "11417981541647679048466287755595961091061972992": "/data/riak/bitcask/11417981541647679048466287755595961091061972992"
22:48:07.785 [notice] Unused data directories exist for partition "582317058624031631471780675535394015644160622592": "/data/riak/bitcask/582317058624031631471780675535394015644160622592"
22:48:07.829 [notice] Unused data directories exist for partition "782131735602866014819940711258323334737745149952": "/data/riak/bitcask/782131735602866014819940711258323334737745149952"
[{ok,<0.30093.11>},...

Page 28: Riak   a successful failure

Manual Cleanup

So we backed up those vnodes with unused data on Gin to another system and manually removed them.

gin:/data/riak/bitcask$ ls manual_cleanup/
11417981541647679048466287755595961091061972992
582317058624031631471780675535394015644160622592
782131735602866014819940711258323334737745149952

gin:/data/riak/bitcask$ rm -rf manual_cleanup
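The slides show only the `ls` and `rm -rf` steps. As a sketch of the same idea, the code below finds partition directories that the node no longer owns and moves them into a holding directory rather than deleting them outright. Everything here is an assumption for illustration: the function names are hypothetical, and the set of owned partition indexes is passed in by the caller (on a real node it would have to come from the ring).

```python
import shutil
from pathlib import Path


def find_unused_partition_dirs(bitcask_root, owned_partitions):
    """List partition directories whose index is not in owned_partitions.

    Directory names are expected to be a partition index, optionally followed
    by "-<suffix>"; anything else (e.g. "manual_cleanup") is ignored.
    """
    unused = []
    for entry in sorted(Path(bitcask_root).iterdir()):
        if not entry.is_dir():
            continue
        index_text = entry.name.split("-", 1)[0]
        if not index_text.isdigit():
            continue
        if int(index_text) not in owned_partitions:
            unused.append(entry)
    return unused


def quarantine(dirs, quarantine_dir):
    """Move directories into quarantine_dir and return their new paths."""
    target = Path(quarantine_dir)
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.move(str(d), str(target / d.name))) for d in dirs]
```

Moving first and deleting later keeps the step reversible until a backup has been confirmed, which matches the order of operations described on the slide.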

Page 29: Riak   a successful failure

Gin’s Status Improves

Page 30: Riak   a successful failure

Bedtime

•It was late at night, things were stable and the customer’s users were unaffected.

•We all went to bed, and didn’t reconvene for 12 hours.

Page 31: Riak   a successful failure

Next Day’s Plan

1.Start up handoff on the node with the lowest disk space

  •let it move data 1 partition at a time to other nodes

  •observe that data directories are removed after transfers complete successfully

2.When disk space frees up a bit, start up other nodes, increase handoff concurrency, watch the ring rebalance.

Page 32: Riak   a successful failure

Let’s Get Started

On Gin only: reset to defaults, re-enable handoffs.

on gin:
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).

Page 33: Riak   a successful failure

Gin Moves Data to IPA

Page 34: Riak   a successful failure

Highball’s Turn

Highball was next lowest now that Gin was handing data off, so it was time to restart handoff there too.

on highball:
application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).

on gin:
application:set_env(riak_core, handoff_concurrency, 4). % the default setting
riak_core_vnode_manager:force_handoffs().

Page 35: Riak   a successful failure

Rebalance Starts

Page 36: Riak   a successful failure

and keeps going...

Page 37: Riak   a successful failure

and going...

Page 38: Riak   a successful failure

and going...

Page 39: Riak   a successful failure

Rebalanced

Page 40: Riak   a successful failure

Minimal Impact

•6ms variance at the 99th percentile (32ms to 38ms)

•0.68s variance at the 100th percentile (0.12s to 0.8s)
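For readers unfamiliar with the notation, the 99th percentile is the latency that 99% of requests stay at or below, and the 100th percentile is simply the slowest request. The sketch below uses the nearest-rank definition, which is an assumption (the slides do not say how the figures were computed), and a made-up sample that is not the customer's data.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need at least one sample and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p/100 * n) in integer math
    return ordered[rank - 1]


# Synthetic latencies in milliseconds, purely for illustration.
synthetic_ms = list(range(1, 201))
print(percentile(synthetic_ms, 99), percentile(synthetic_ms, 100))

# The shifts quoted on the slide, in milliseconds.
p99_shift_ms = 38 - 32
p100_shift_ms = 800 - 120
print(p99_shift_ms, p100_shift_ms)
```

The slide's point is that the tail moved by 6 ms at the 99th percentile, while only the single slowest request moved by 680 ms.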

Page 41: Riak   a successful failure

Moral of the Story

•Riak’s resilience under stress resulted in minimal operational impact

•Hot code-patching solved the problem in-situ, without downtime

•We all got some sleep!

Page 42: Riak   a successful failure

Things break, Riak bends.

Page 43: Riak   a successful failure

Thank You

http://basho.com/resources/downloads/

https://github.com/basho/riak/

[email protected]