Memory is the new disk, disk is the new tape, Bela Ban (JBoss by Red Hat)
Memory is the new disk, disk is the new tape
Bela Ban, JBoss / Red Hat
Motivation
● We want to store our data in memory
– Memory access is faster than disk access
– Even across a network
– A DB requires network communication, too
● The disk is used for archival purposes
● Not a replacement for DBs!
– Only a key-value store
– NoSQL
Problems
● #1: How do we provide memory large enough to store the data (e.g. 2 TB of memory)?
● #2: How do we guarantee persistence?
– Survival of data between reboots / crashes
#1: Large memory
● We aggregate the memory of all nodes in a cluster into a large virtual memory space
– 100 nodes × 10 GB = 1 TB of virtual memory
#2: Persistence
● We store keys redundantly on multiple nodes
– Unless all nodes on which key K is stored crash at the same time, K is persistent
● We can also store the data on disk
– To prevent data loss in case all cluster nodes crash
– This can be done asynchronously, on a background thread
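The persistence guarantee above can be quantified with a quick back-of-the-envelope sketch. With K independent copies of a key, the key is lost only if all K owners fail at once; assuming independent failures, that probability is p^K. (The per-node failure probability below is an assumed, illustrative number, not from the talk.)

```java
// Sketch: probability that a key is lost, assuming its K owners fail
// independently. p = 0.01 is an assumed, illustrative per-node failure
// probability, not a measured value.
public class LossProbability {
    public static double lossProbability(double perNodeFailure, int copies) {
        return Math.pow(perNodeFailure, copies);
    }

    public static void main(String[] args) {
        double p = 0.01; // assumed per-node failure probability
        System.out.println(lossProbability(p, 1)); // 1 copy   -> ~1e-2
        System.out.println(lossProbability(p, 2)); // 2 copies -> ~1e-4
        System.out.println(lossProbability(p, 3)); // 3 copies -> ~1e-6
    }
}
```

Each extra copy multiplies the loss probability by p, which is why even a small replication count (2 or 3) already makes simultaneous loss very unlikely.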
How do we provide redundancy ?
Store every key on every node
A B C D
K1 K1 K1 K1
K2 K2 K2 K2
K3 K3 K3 K3
K4 K4 K4 K4
● RAID 1
● Pro: data is available everywhere
– No network round trip
– Data loss only when all nodes crash
● Con: we can only use 25% of our memory
Store every key on 1 node only
A B C D
K1 K2 K3 K4
● RAID 0, JBOD
● Pro: we can use 100% of our memory
● Con: data loss on node crash
– No redundancy
Store every key on K nodes
A B C D
K1 K1
K2 K2
K3 K3
K4 K4
● K is configurable (2 in the example)
● Variable RAID
● Pro: we can use a variable % of our memory
– User determines tradeoff between memory consumption and risk of data loss
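The memory tradeoff across the three schemes reduces to a simple formula: with K copies per key, only 1/K of the aggregate cluster memory is usable. K = 1 gives the RAID 0 case (100%), K = number of nodes gives the RAID 1 case (25% with 4 nodes, as in the slide). A small sketch with illustrative numbers:

```java
// Sketch: usable fraction of aggregate cluster memory when every key is
// stored K times. K = 1 reproduces RAID 0, K = cluster size reproduces
// RAID 1, anything in between is the "variable RAID" scheme.
public class UsableMemory {
    public static double usableFraction(int copiesPerKey) {
        return 1.0 / copiesPerKey;
    }

    public static long usableGb(long perNodeGb, int nodes, int copiesPerKey) {
        return perNodeGb * nodes / copiesPerKey;
    }

    public static void main(String[] args) {
        // 4 nodes of 10 GB, K = 4 (RAID 1): 25% usable, as in the slide
        System.out.println(usableFraction(4));     // 0.25
        // K = 2 (variable RAID): half the 40 GB aggregate is usable
        System.out.println(usableGb(10, 4, 2));    // 20
    }
}
```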
So how do we determine on which nodes the keys are stored?
Consistent hashing
● Given a key K and a set of nodes, CH(K) will always pick the same node P for K
– We can also pick a list {P,Q} for K
● Anyone 'knows' that K is on P
● If P leaves, CH(K) will pick another node Q and rebalance the affected keys
● A good CH will rebalance at most 1/N of the keys (where N = number of cluster nodes)
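The picking-and-rebalancing behaviour can be sketched with a toy hash ring (a minimal illustration only; the actual JGroups consistent-hash implementation differs). Nodes sit at fixed positions on a ring of hash values; a key is owned by the first node clockwise from its own hash, and the next distinct node serves as backup. Because node positions do not depend on each other, removing a node only moves the keys that node owned:

```java
import java.util.*;

// Toy consistent-hash ring: CH(K) deterministically maps a key to an
// ordered list of owner nodes {P, Q, ...}. Illustrative only, not the
// JGroups implementation.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node)    { ring.put(node.hashCode() & 0x7fffffff, node); }
    public void removeNode(String node) { ring.remove(node.hashCode() & 0x7fffffff); }

    // Primary owner: first node at or after the key's position (wrapping around).
    public String primary(String key) {
        int h = key.hashCode() & 0x7fffffff;
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    // The first `copies` distinct nodes clockwise from the key's position.
    public List<String> owners(String key, int copies) {
        int h = key.hashCode() & 0x7fffffff;
        List<String> nodes = new ArrayList<>(ring.tailMap(h).values());
        nodes.addAll(ring.headMap(h).values());
        return nodes.subList(0, Math.min(copies, nodes.size()));
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        for (String n : List.of("A", "B", "C", "D")) ring.addNode(n);
        String before = ring.primary("K2");
        ring.removeNode(before);           // the primary owner crashes
        String after = ring.primary("K2"); // the next node on the ring takes over
        System.out.println(before + " -> " + after);
    }
}
```

The key property is determinism: every node can compute CH(K) locally, so "anyone knows that K is on P" without a central directory.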
Example
A B C D
K1 K1
K2 K2
K3 K3
K4 K4
● K2 is stored on B (primary owner) and C (backup owner)
Example
A B C D
K1 K1
K2 K2
K3 K3
K4 K4
● Node B now crashes
Example
● C (the backup owner of K2) copies K2 to D
– C is now the primary owner of K2
● A copies K1 to C
– C is now the backup owner of K1
A B C D
K1 K1 K1
K2 K2 K2
K3 K3
K4 K4
Rebalancing
● Unless all N owners of a key K crash exactly at the same time, K is always stored redundantly
● When less than N owners crash, rebalancing will copy/move keys to other nodes, so that we have N owners again
Enter ReplCache
● ReplCache is a distributed hashmap spanning the entire cluster
● Operations: put(K,V), get(K), remove(K)
● For every key, we can define how many times we'd like it to be stored in the cluster
– 1: RAID 0
– -1: RAID 1
– N: variable RAID
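These per-key semantics can be illustrated with a minimal in-process sketch. This is not the real JGroups ReplCache API (its signatures and placement logic differ); it only shows the idea that each put() carries its own replication count, with -1 meaning "mirror to all nodes":

```java
import java.util.*;

// In-process sketch of the ReplCache idea: a put() carries a per-key
// replication count. 1 = one owner (RAID 0), -1 = every node (RAID 1),
// N = variable RAID. The node list stands in for cluster members;
// the placement rule here is a simplification, not the JGroups one.
public class MiniReplCache {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public MiniReplCache(int clusterSize) {
        for (int i = 0; i < clusterSize; i++) nodes.add(new HashMap<>());
    }

    public void put(String key, String value, int replCount) {
        int copies = (replCount == -1) ? nodes.size() : replCount;
        int start = (key.hashCode() & 0x7fffffff) % nodes.size();
        for (int i = 0; i < copies; i++)                 // consecutive nodes own the key
            nodes.get((start + i) % nodes.size()).put(key, value);
    }

    public String get(String key) {
        for (Map<String, String> node : nodes) {
            String v = node.get(key);
            if (v != null) return v;
        }
        return null;
    }

    public void remove(String key) {
        for (Map<String, String> node : nodes) node.remove(key);
    }

    public long copiesOf(String key) {
        return nodes.stream().filter(n -> n.containsKey(key)).count();
    }
}
```

Usage: `cache.put("session-1", "alice", 2)` stores the value on two nodes, while `cache.put("config", "prod", -1)` mirrors it everywhere.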
Use of ReplCache
[Diagram: HTTP clients reach Apache (mod_jk), which load-balances across several JBoss servlet containers; each container embeds a ReplCache instance, the ReplCache instances form the cluster, and a DB sits behind them.]
Demo
Use cases
● JBoss AS: session distribution using Infinispan
– For data scalability, sessions are stored only N times in a cluster
● GridFS (Infinispan)
– I/O over grid
– Files are chunked into slices; each slice is stored in the grid (redundantly if needed)
– Store a 4 GB DVD in a grid where each node has only 2 GB of heap
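The chunking idea can be sketched as follows (illustrative slice size; Infinispan's GridFS uses its own chunk size and key scheme). Each slice becomes an independent key that the grid places on whichever nodes have room, so no single node ever has to hold the whole file:

```java
import java.util.*;

// Sketch of GridFS-style chunking: cut a file into fixed-size slices.
// Each slice can then be stored under its own key (e.g. "<name>#<index>"),
// redundantly if requested, so a file larger than any one node's heap
// still fits in the grid. Illustrative only.
public class Chunker {
    public static List<byte[]> chunk(byte[] data, int sliceSize) {
        List<byte[]> slices = new ArrayList<>();
        for (int off = 0; off < data.length; off += sliceSize)
            slices.add(Arrays.copyOfRange(data, off, Math.min(off + sliceSize, data.length)));
        return slices;
    }

    public static void main(String[] args) {
        byte[] file = new byte[10_000];          // stand-in for a large file
        List<byte[]> slices = chunk(file, 4096); // slices of 4096, 4096, 1808 bytes
        System.out.println(slices.size());       // 3
    }
}
```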
Use cases
● Hibernate Over Grid (OGM)
– Replaces the DB backend with an Infinispan-backed grid
Conclusion
● Given enough nodes in a cluster, we can provide persistence for data
● Unlike RAID, where everything is stored fully redundantly (even /tmp), we can define persistence guarantees per key
● Ideal for data sets which need to be accessed quickly
– For the paranoid we can still stream to disk
Conclusion
● Data is distributed over a grid
– Cache is closer to clients
– No bottleneck to the DBMS
– Keys are on different nodes
Conclusion
[Diagram: many clients, each connecting to one of several caches in the grid.]
Questions?
● Demo (JGroups): http://www.jgroups.org
● Infinispan: http://www.infinispan.org
● OGM: http://community.jboss.org/en/hibernate/ogm