Riak a successful failure

In Production: Portrait of a Successful Failure

Sean Cribbs, @seancribbs

Riak is...

a scalable,

highly-available,

networked

key/value store.

Riak Data Model

Riak stores values against keys

Encode your data how you like it

Keys are grouped into buckets

Basic Operations

GET /buckets/B/keys/K

PUT /buckets/B/keys/K

DELETE /buckets/B/keys/K
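The three operations above map directly onto HTTP verbs. Here is a minimal sketch in Python of how a client might build them. The host and port (`localhost:8098`) and the bucket and key names are assumptions for illustration, since the slides name none. The requests are only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL; the slides do not name a host or port.
BASE_URL = "http://localhost:8098"


def key_url(bucket, key, base_url=BASE_URL):
    """Build /buckets/B/keys/K, percent-encoding the bucket and key."""
    return f"{base_url}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"


def build_request(method, bucket, key, body=None,
                  content_type="application/octet-stream"):
    """Return an unsent urllib Request for GET, PUT or DELETE on one key."""
    if method not in ("GET", "PUT", "DELETE"):
        raise ValueError(f"unsupported method: {method}")
    # Only requests that carry a value need a Content-Type header.
    headers = {"Content-Type": content_type} if body is not None else {}
    return Request(key_url(bucket, key), data=body, headers=headers, method=method)


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    put = build_request("PUT", "drinks", "gin", body=b'{"kind": "spirit"}',
                        content_type="application/json")
    print(put.get_method(), put.full_url)
```

Passing a request to `urllib.request.urlopen` would actually send it. That step is left out so the sketch runs without a cluster.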

Extras

MapReduce, Link-walking

Value Metadata

Secondary Indexes

Full-text Search

Configurable Storage Engines

Admin GUI

When things go wrong

A Real Customer Story

Situation

You have a cluster

Things are great

It’s time to add capacity

Solution

Add a new node

Hostnames

This customer named nodes after drinks:

Aston

IPA

Highball

Gin

Framboise

ESB

riak-admin join

• With Riak, it’s easy to add a new node.

on aston:
$ riak-admin join [email protected]

• Then you leave for a quick lunch.

This can’t be good...

Quick, what do you do?

1. Add another system!

2. Shut down the entire site!

3. Alert Basho Support via an URGENT ticket

Control the situation

Stop the handoff between nodes. On every node, we ran:

riak attach
application:set_env(riak_core, handoff_concurrency, 0).

Monitor

...for signs of...

Stabilization

Now what?

• What happened?

• Why did it happen?

• Can we fix this situation?

But first

• Are you still operational?

  • Yes.

• Any noticeable changes in service latency?

  • No.

• Have any nodes failed?

  • No, the cluster is still servicing requests.

So what happened?!

1. New node added

2. Ring must rebalance

3. Nodes claim partitions

4. Handoff of data begins

5. Disks fill up

Member Status

First let’s peek under the hood.

$ riak-admin member_status
================================= Membership ================================
Status     Ring    Pending    Node
-----------------------------------------------------------------------------
valid       4.3%     16.8%    riak@aston
valid      18.8%     16.8%    riak@esb
valid      19.1%     16.8%    riak@framboise
valid      19.5%     16.8%    riak@gin
valid      19.1%     16.4%    riak@highball
valid      19.1%     16.4%    riak@ipa
-----------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
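The Ring and Pending columns show how far the rebalance still has to go. The sketch below, in Python, parses the rows from the slide and converts the percentages into partition counts. The ring size of 256 is an inference, not something the slides state: every percentage shown is consistent with a whole number of partitions out of 256.

```python
import re

# Rows copied from the member_status output on the slide.
MEMBER_STATUS = """\
valid       4.3%     16.8%    riak@aston
valid      18.8%     16.8%    riak@esb
valid      19.1%     16.8%    riak@framboise
valid      19.5%     16.8%    riak@gin
valid      19.1%     16.4%    riak@highball
valid      19.1%     16.4%    riak@ipa
"""

# Assumption: 256 partitions (inferred from the percentages, not stated in the slides).
RING_SIZE = 256

ROW = re.compile(r"^(\w+)\s+([\d.]+)%\s+([\d.]+)%\s+(\S+)$")


def parse_member_status(text):
    """Return {node: (ring_pct, pending_pct)} for each table row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match.group(4)] = (float(match.group(2)), float(match.group(3)))
    return rows


def partitions_to_move(rows, ring_size=RING_SIZE):
    """Return {node: pending_count - current_count}; positive means the node gains partitions."""
    return {
        node: round(pending * ring_size / 100) - round(ring * ring_size / 100)
        for node, (ring, pending) in rows.items()
    }


if __name__ == "__main__":
    deltas = partitions_to_move(parse_member_status(MEMBER_STATUS))
    for node, delta in deltas.items():
        print(f"{node}: {delta:+d}")
```

Under that assumption, aston still has to receive 32 partitions, and each of the other five nodes has to give up between 5 and 7.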

Relief

Let’s try to relieve the pressure a bit.

Focus on the node with the least disk space left.

gin:~$ riak attach
application:set_env(riak_core, forced_ownership_handoff, 0).
application:set_env(riak_core, vnode_inactivity_timeout, 300000).
application:set_env(riak_core, handoff_concurrency, 1).
riak_core_vnode:trigger_handoff(element(2, riak_core_vnode_master:get_vnode_pid(411047335499316445744786359201454599278231027712, riak_kv_vnode))).

Relief

It took 20 minutes to transfer the vnode

(riak@gin)7> 19:34:00.574 [info] Starting handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston

gin:~$ sudo netstat -nap | fgrep 10.36.18.245
tcp 0 1065 10.36.110.79:40532 10.36.18.245:8099 ESTABLISHED 27124/beam.smp
tcp 0 0 10.36.110.79:46345 10.36.18.245:53664 ESTABLISHED 27124/beam.smp

(riak@gin)7> 19:54:56.721 [info] Handoff of partition riak_kv_vnode 411047335499316445744786359201454599278231027712 from riak@gin to riak@aston completed: sent 3805730 objects in 1256.14 seconds
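The two log lines give enough to check the transfer time and estimate throughput. A small Python sketch, using only the values shown in the log above (the function names are illustrative):

```python
import re
from datetime import datetime

# Log lines copied from the slide, without the console prompt.
START = ("19:34:00.574 [info] Starting handoff of partition riak_kv_vnode "
         "411047335499316445744786359201454599278231027712 from riak@gin to riak@aston")
DONE = ("19:54:56.721 [info] Handoff of partition riak_kv_vnode "
        "411047335499316445744786359201454599278231027712 from riak@gin to riak@aston "
        "completed: sent 3805730 objects in 1256.14 seconds")


def wall_clock_seconds(start_line, done_line):
    """Elapsed seconds between the leading timestamps of the two lines (same day assumed)."""
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(start_line.split()[0], fmt)
    t1 = datetime.strptime(done_line.split()[0], fmt)
    return (t1 - t0).total_seconds()


def handoff_rate(done_line):
    """Return (objects, seconds, objects_per_second) from the completion line."""
    match = re.search(r"sent (\d+) objects in ([\d.]+) seconds", done_line)
    if match is None:
        raise ValueError("not a handoff completion line")
    objects, seconds = int(match.group(1)), float(match.group(2))
    return objects, seconds, objects / seconds


if __name__ == "__main__":
    objects, seconds, rate = handoff_rate(DONE)
    print(f"wall clock: {wall_clock_seconds(START, DONE):.3f} s")
    print(f"reported:   {seconds:.2f} s, {rate:.0f} objects/s")
```

The timestamps and the reported duration agree to within a few milliseconds, and the rate works out to roughly 3,000 objects per second for this one vnode.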

Relief

And the vnode had arrived at Aston from Gin

aston:/data/riak/bitcask/205523667749658222872393179600727299639115513856-132148847970820$ ls -la
total 7305344
drwxr-xr-x   2 riak riak       4096 2011-11-11 18:05 .
drwxr-xr-x 258 riak riak      36864 2011-11-11 18:56 ..
-rw-------   1 riak riak 2147479761 2011-11-11 17:53 1321055508.bitcask.data
-rw-r--r--   1 riak riak   86614226 2011-11-11 17:53 1321055508.bitcask.hint
-rw-------   1 riak riak 1120382399 2011-11-11 19:50 1321055611.bitcask.data
-rw-r--r--   1 riak riak   55333675 2011-11-11 19:50 1321055611.bitcask.hint
-rw-------   1 riak riak 2035568266 2011-11-11 18:03 1321056070.bitcask.data
-rw-r--r--   1 riak riak   99390277 2011-11-11 18:03 1321056070.bitcask.hint
-rw-------   1 riak riak 1879298219 2011-11-11 18:05 1321056214.bitcask.data
-rw-r--r--   1 riak riak   56509595 2011-11-11 18:05 1321056214.bitcask.hint
-rw-------   1 riak riak        119 2011-11-11 17:53 bitcask.write.lock

Eureka!

• Data was not being cleaned up after handoff.

• This would eventually eat all disk space!

What’s the solution?

• We already had a bugfix for the next release (1.0.2) that detects the problem

• Tested the bugfix locally before delivering to customer

Hot Patch

We patched their live, production system while still under load.

(on all nodes)

riak attach
l(riak_kv_bitcask_backend).
m(riak_kv_bitcask_backend).
Module riak_kv_bitcask_backend compiled: Date: November 12 2011, Time: 04.18
Compiler options: [{outdir,"ebin"}, debug_info, warnings_as_errors, {parse_transform,lager_transform}, {i,"include"}]
Object file: /usr/lib/riak/lib/riak_kv-1.0.1/ebin/riak_kv_bitcask_backend.beam
Exports:
api_version/0      is_empty/1
callback/3         key_counts/0
delete/4           key_counts/1
drop/1             module_info/0
fold_buckets/4     module_info/1
fold_keys/4        put/5
fold_objects/4     start/2
get/3              status/1
...

Bingo!

And the new code did what we expected.

{ok, R} = riak_core_ring_manager:get_my_ring().
[riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].

(riak@gin)19> [riak_core_vnode_master:get_vnode_pid(Partition, riak_kv_vnode) || {Partition,_} <- riak_core_ring:all_owners(R)].
22:48:07.423 [notice] Unused data directories exist for partition "11417981541647679048466287755595961091061972992": "/data/riak/bitcask/11417981541647679048466287755595961091061972992"
22:48:07.785 [notice] Unused data directories exist for partition "582317058624031631471780675535394015644160622592": "/data/riak/bitcask/582317058624031631471780675535394015644160622592"
22:48:07.829 [notice] Unused data directories exist for partition "782131735602866014819940711258323334737745149952": "/data/riak/bitcask/782131735602866014819940711258323334737745149952"
[{ok,<0.30093.11>},...
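The long partition numbers in these notices are positions in Riak's 160-bit hash ring, so they can be reduced to small slot numbers that are easier to track during a cleanup. The Python sketch below assumes a ring of 256 equal partitions. That ring size is an inference from the IDs and percentages in the slides, not a stated fact, and `partition_slot` is an illustrative helper, not a Riak function.

```python
# Riak hashes keys into a 160-bit space; partition IDs are multiples of
# HASH_SPACE / ring_size.
HASH_SPACE = 2 ** 160

# Assumption: 256 partitions (inferred, not stated in the slides).
RING_SIZE = 256

# Partition IDs copied from the "Unused data directories" notices.
UNUSED_PARTITIONS = [
    11417981541647679048466287755595961091061972992,
    582317058624031631471780675535394015644160622592,
    782131735602866014819940711258323334737745149952,
]


def partition_slot(partition_id, ring_size=RING_SIZE):
    """Map a partition ID to its slot number in [0, ring_size)."""
    step = HASH_SPACE // ring_size
    slot, remainder = divmod(partition_id, step)
    if remainder != 0 or not 0 <= slot < ring_size:
        raise ValueError(f"{partition_id} is not a partition of a {ring_size}-slot ring")
    return slot


if __name__ == "__main__":
    for pid in UNUSED_PARTITIONS:
        print(f"slot {partition_slot(pid):3d}  {pid}")
```

Under that assumption, the three leftover directories correspond to slots 2, 102 and 137, and the vnode handed off from Gin to Aston earlier is slot 72.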

Manual Cleanup

So we backed up those vnodes with unused data on Gin to another system and manually removed them.

gin:/data/riak/bitcask$ ls manual_cleanup/
11417981541647679048466287755595961091061972992   782131735602866014819940711258323334737745149952
582317058624031631471780675535394015644160622592

gin:/data/riak/bitcask$ rm -rf manual_cleanup

Gin’s Status Improves

Bedtime

• It was late at night, things were stable and the customer’s users were unaffected.

• We all went to bed, and didn’t reconvene for 12 hours.

Next Day’s Plan

1. Start up handoff on the node with the lowest disk space

  • let it move data 1 partition at a time to other nodes

  • observe that data directories were removed after successful transfers complete

2. When disk space frees up a bit, start up other nodes, increase handoff concurrency, watch the ring rebalance.

Let’s Get Started

On Gin only: reset to defaults, re-enable handoffs

on gin:

application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).

Gin Moves Data to IPA

Highball’s Turn

Highball was the next lowest on disk space now that Gin was handing data off, so it was time to restart handoff there too.

on highball:

application:unset_env(riak_core, forced_ownership_handoff).
application:set_env(riak_core, vnode_inactivity_timeout, 60000).
application:set_env(riak_core, handoff_concurrency, 1).

on gin:

application:set_env(riak_core, handoff_concurrency, 4). % the default setting
riak_core_vnode_manager:force_handoffs().

Rebalance Starts

and keeps going...

and going...

Rebalanced

Minimal Impact

• 6ms variance for 99th percentile (32ms to 38ms)

• 0.68s variance for 100th percentile (0.12s to 0.8s)

Moral of the Story

• Riak’s resilience under stress resulted in minimal operational impact

• Hot code-patching solved the problem in situ, without downtime

• We all got some sleep!

Things break, Riak bends.

Thank You

http://basho.com/resources/downloads/

https://github.com/basho/riak/
