Scalable Persistent Storage for Erlang: Theory and Practice
![Page 1: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/1.jpg)
Scalable Persistent Storage for Erlang
Theory and Practice
Amir Ghaffari, Jon Meredith, Natalia Chechina, Phil Trinder
London Riak Meetup - October 22, 2013
1
http://www.release-project.eu
![Page 2: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/2.jpg)
Outline
• RELEASE Project
• General principles of scalable DBMSs
• NoSQL DBMSs for Erlang
• Riak 1.1.1 Scalability in Practice
• Investigating the scalability of distributed Erlang
• Riak Elasticity
• Conclusion & Future work
2
![Page 3: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/3.jpg)
RELEASE project
• RELEASE is a European project
aiming to scale Erlang onto
commodity architectures with 100,000
cores.
3
![Page 4: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/4.jpg)
RELEASE project
The RELEASE consortium works at the following levels:
• Virtual machine
• Language
• Scalable computation model
• Scalable in-memory data structures
• Scalable persistent data structures
• Infrastructure levels
• Profiling and refactoring tools
4
![Page 5: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/5.jpg)
General principles of scalable DBMSs
Data Fragmentation
1. Decentralized model (e.g. P2P model)
2. Systematic load balancing (make life easier for developer)
3. Location transparency
e.g. 20 kB of data is fragmented among 10 nodes in 2 kB ranges: 0-2k, 2k-4k, 4k-6k, ..., 16k-18k, 18k-20k
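The 20 kB example above can be sketched as range-based fragmentation. This is only an illustration: the function name `fragment_for` and the equal-sized split are assumptions, and none of the DBMSs discussed later is claimed to work exactly this way.

```python
def fragment_for(offset, total_size=20_000, num_nodes=10):
    """Return the index of the node that owns the byte at `offset`.

    Hypothetical helper: the data set is cut into equal, contiguous
    ranges, one per node (2,000 bytes each with the defaults).
    """
    if not 0 <= offset < total_size:
        raise ValueError("offset outside the data set")
    fragment_size = total_size // num_nodes
    return offset // fragment_size
```

With location transparency, a client would call something like `fragment_for` indirectly and never name the node itself.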
![Page 6: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/6.jpg)
General principles of scalable DBMSs
Replication
1. Decentralized model (e.g. P2P model)
2. Location transparency
3. Asynchronous replication (a write is considered complete as soon
as one node acknowledges it)
e.g. Key X is replicated on three nodes
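A minimal sketch of asynchronous replication, assuming three in-memory replicas and a queue that stands in for the network. The class and method names are hypothetical and do not come from any of the DBMSs in this talk.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy store: a write is acknowledged after one replica has it."""

    def __init__(self, num_replicas=3):
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pending = deque()  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # Store on the first replica and acknowledge immediately.
        self.replicas[0][key] = value
        # The other replicas receive the update later.
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))
        return "ack"

    def propagate(self):
        # Background step: deliver queued updates to the remaining replicas.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)
```

Between `write` and `propagate`, a read from another replica can return stale data, which is the window that eventual consistency has to tolerate.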
![Page 7: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/7.jpg)
General principles of scalable DBMSs
[Figure: CAP diagram relating Consistency, Availability, and Partition Tolerance, with ACID Systems and Eventual Consistency marked]
CAP theorem: a system cannot
simultaneously guarantee:
• Partition tolerance: the system
continues to operate even when nodes
cannot talk to each other
• Availability: every request is
guaranteed to receive a response
• Consistency: all nodes see the
same data at the same time
Not achievable because network failures are inevitable
Solution: Eventual consistency and reconciling conflicts via data versioning
ACID = Atomicity, Consistency, Isolation, Durability
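One common form of data versioning is a version vector (vector clock). The sketch below assumes each version is a plain dict from node name to update counter; the function names are illustrative and not taken from any particular DBMS.

```python
def descends(a, b):
    """True if version `a` has seen every update recorded in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two versions of the same key."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a_newer"
    if descends(b, a):
        return "b_newer"
    return "conflict"  # concurrent updates: reconciliation is needed


def merge(a, b):
    """Version to attach to a value that reconciles `a` and `b`."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}
```

A "conflict" result means neither replica saw the other's update, so the application (or a default rule) has to pick or combine the values.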
![Page 8: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/8.jpg)
NoSQL DBMSs for Erlang

| | Mnesia | CouchDB | Riak | Cassandra |
|---|---|---|---|---|
| Fragmentation | Explicit placement; client-server; automatic by using a hash function | Explicit placement; multi-server; Lounge is not part of each CouchDB node | Implicit placement; peer to peer; automatic by using the consistent hashing technique | Implicit placement; peer to peer; automatic by using the consistent hashing technique |
| Replication | Explicit placement; client-server; asynchronous (dirty operation) | Explicit placement; multi-server; asynchronous | Implicit placement; peer to peer; asynchronous | Implicit placement; peer to peer; asynchronous |
| Partition tolerance | Strong consistency | Eventual consistency; Multi-Version Concurrency Control for reconciliation | Eventual consistency; vector clocks for reconciliation | Eventual consistency; uses timestamps to reconcile |
| Query processing and backend storage | The largest possible Mnesia table is 4Gb | No limitation; supports Map/Reduce queries | Bitcask has a memory limitation; LevelDB has no limitation; supports Map/Reduce queries | No limitation; supports Map/Reduce queries |

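The consistent hashing technique listed for Riak and Cassandra can be sketched generically as a hash ring. This is the textbook version, not either system's actual partitioning scheme; the class name, the SHA-1 choice, and the `vnodes` parameter are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes  # ring positions per physical node
        self._ring = []       # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def preference_list(self, key, n=3):
        """First `n` distinct nodes clockwise from the key's position."""
        start = bisect.bisect(self._ring, (_hash(key),))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners
```

Placement is implicit: the key alone determines its replicas, and removing a node that is not in a key's preference list leaves that key's placement unchanged.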
![Page 9: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/9.jpg)
Initial Evaluation Results
General Principles
Initial Evaluation
• Mnesia
• CouchDB
• Riak
• Cassandra
Scalable persistent storage for SD Erlang can be provided by
Dynamo-style DBMSs such as Riak and Cassandra
9
![Page 10: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/10.jpg)
Riak Scalability in Practice
• Basho Bench: a benchmarking tool for Riak
• We run Basho Bench on the 348-node Kalkyl cluster
• Scalability: How does adding more Riak nodes affect the
throughput?
• There are two kinds of nodes in a cluster:
• Traffic generators
• Riak nodes
10
![Page 11: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/11.jpg)
Node Organisation
11
Heuristic: one traffic generator per 3 Riak nodes
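The heuristic is simple arithmetic. A small sketch, assuming that a partial group of Riak nodes still gets its own generator (the rounding rule and the function name are assumptions, not stated on the slide):

```python
import math


def generators_needed(riak_nodes, riak_per_generator=3):
    """Traffic generators for a cluster, one per 3 Riak nodes, rounded up."""
    return math.ceil(riak_nodes / riak_per_generator)
```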
![Page 12: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/12.jpg)
Traffic Generator
12
![Page 13: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/13.jpg)
Riak 1.1.1 Scalability
Benchmark on 100-node cluster (800 cores)
13
![Page 14: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/14.jpg)
Failures
14
![Page 15: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/15.jpg)
Profiling Resource Usage
15
CPU Usage
![Page 16: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/16.jpg)
Profiling Resource Usage
16
DISK Usage
![Page 17: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/17.jpg)
Profiling Resource Usage
17
Memory Usage
![Page 18: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/18.jpg)
Profiling Resource Usage
18
Network Traffic of Generator Nodes
![Page 19: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/19.jpg)
Profiling Resource Usage
19
Network Traffic of Riak Nodes
![Page 20: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/20.jpg)
Bottleneck for Riak Scalability
CPU, RAM, disk, and network profiling reveal that
none of these resources is the bottleneck for Riak scalability.
Are the Riak scalability limits due to limits in
distributed Erlang?
To find out, let’s measure the scalability of
distributed Erlang.
20
![Page 21: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/21.jpg)
DE-Bench
21
• DE-Bench: a benchmarking tool for distributed
Erlang
• It is based on Basho Bench
• Measures the throughput of a cluster of Erlang nodes
• Records the latency of distributed Erlang commands
individually
![Page 22: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/22.jpg)
Distributed Erlang Commands
22
• Spawn/RPC: peer-to-peer commands
• register_name: updates the global name tables located on every node
• unregister_name: updates the global name tables located on every node
• whereis_name: a lookup in the local table
[Figure: four Erlang VMs, each holding its own copy of the global name table; Register and Unregister reach every copy]
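The cost difference between these commands can be sketched with a toy model in which every node keeps its own copy of the name table. The `Cluster` class and the `table_updates` counter are hypothetical; real distributed Erlang also coordinates the update across nodes, which this sketch leaves out.

```python
class Cluster:
    """Toy model of a replicated global name table."""

    def __init__(self, num_nodes):
        self.tables = [dict() for _ in range(num_nodes)]
        self.table_updates = 0  # how many per-node tables were modified

    def register_name(self, name, pid):
        if name in self.tables[0]:
            return "no"  # name already taken
        for table in self.tables:  # every node's table must change
            table[name] = pid
            self.table_updates += 1
        return "yes"

    def unregister_name(self, name):
        for table in self.tables:
            if table.pop(name, None) is not None:
                self.table_updates += 1

    def whereis_name(self, name, node=0):
        # Purely local: only the asking node's table is consulted.
        return self.tables[node].get(name, "undefined")
```

In this model one registration costs work proportional to the number of nodes, while a lookup costs the same regardless of cluster size.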
![Page 23: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/23.jpg)
DE-Bench’s P2P Design
23
[Figure: P2P design spanning Physical host 1 and Physical host 2]
![Page 24: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/24.jpg)
Frequency of Global Operations

| Frequency of global operations | Max throughput reached at |
|---|---|
| 1% | 30 nodes |
| 0.5% | 50 nodes |
| 0.33% | 70 nodes |
| 0% | 1600 nodes |

Global operations limit the scalability of distributed Erlang
![Page 25: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/25.jpg)
Riak Software Scalability
• Monitoring the global.erl module from the OTP library shows
that Riak does NOT use any global operations.
• Instrumenting the gen_server.erl module reveals that:
Of the 15 most time-consuming operations, only the time of
rpc:call grows with cluster size.
Moreover, of the five Riak RPC calls, only the time of the start_put_fsm
function from module riak_kv_put_fsm_sup grows with cluster
size.
25
![Page 26: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/26.jpg)
Eliminating the Bottlenecks
• Independently, Basho identified that two supervisor
processes, riak_kv_get_fsm_sup and riak_kv_put_fsm_sup, become
bottlenecks under heavy load, exhibiting a build-up in
message queue length.
• To improve Riak scalability, in versions 1.3 and 1.4
Basho applied a number of techniques and introduced
a new library, sidejob
(https://github.com/basho/sidejob).
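The general idea of replacing one overloaded process with several capacity-limited workers can be sketched as follows. This is an illustration of the concept only: the `BoundedPool` class, its parameters, and the "overload" return value are assumptions for the example and are not sidejob's API.

```python
class BoundedPool:
    """Several capacity-limited workers instead of one unbounded mailbox."""

    def __init__(self, num_workers=4, limit_per_worker=2):
        self.limit = limit_per_worker
        self.in_flight = [0] * num_workers  # active requests per worker

    def start(self, request_id):
        """Try to admit a request; return the worker index or "overload"."""
        n = len(self.in_flight)
        first = hash(request_id) % n  # preferred worker for this request
        for offset in range(n):
            worker = (first + offset) % n
            if self.in_flight[worker] < self.limit:
                self.in_flight[worker] += 1
                return worker
        return "overload"  # reject instead of letting a queue build up

    def done(self, worker):
        """Release the slot once the request has finished."""
        self.in_flight[worker] -= 1
```

Rejecting work at a fixed limit keeps any single message queue from growing without bound, which is the symptom described above.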
26
![Page 27: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/27.jpg)
Riak 1.1.1 Elasticity
Timeline shows the Riak cluster losing and gaining nodes
27
![Page 28: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/28.jpg)
Riak 1.1.1 Elasticity
How the Riak cluster deals with nodes leaving and joining
28
![Page 29: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/29.jpg)
Observation
• Number of failures: 37
• Number of successful operations: approximately 3.41
million
• When failed nodes come back up, the throughput grows
again, which shows that Riak 1.1.1 has good elasticity.
29
![Page 30: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/30.jpg)
Conclusion and Future work
Our benchmark confirms that Riak has good elasticity.
We establish scientifically, for the first time, the scalability limit of Riak 1.1.1 as 60 nodes.
We have shown how global operations limit the scalability of distributed Erlang.
Riak scalability bottlenecks are eliminated in Riak 1.3 and upcoming versions.
In RELEASE, we are working to scale up distributed Erlang by grouping nodes in smaller partitions.
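One way to see why grouping helps is to count node-to-node connections. By default, distributed Erlang nodes form a fully connected mesh; the sketch below compares that with disjoint groups that are fully connected only internally. Links between groups are ignored, and the function names are illustrative assumptions rather than part of the RELEASE design.

```python
def full_mesh_connections(nodes):
    """Connections when every node is connected to every other node."""
    return nodes * (nodes - 1) // 2


def grouped_connections(nodes, group_size):
    """Connections when nodes are split into disjoint, fully connected groups."""
    full_groups, remainder = divmod(nodes, group_size)
    return (full_groups * full_mesh_connections(group_size)
            + full_mesh_connections(remainder))
```

For 100 nodes, the full mesh needs 4,950 connections, while ten groups of ten need 450 under these assumptions.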
30
![Page 31: Scalable Persistent Storage for Erlang: Theory and Practice](https://reader034.fdocuments.us/reader034/viewer/2022051817/547e7f26b379594e2b8b5491/html5/thumbnails/31.jpg)
References
Benchmarking Riak: https://github.com/amirghaffari/benchmark_riak
Basho Bench: http://docs.basho.com/riak/latest/ops/building/benchmarking/
DE-Bench: https://github.com/amirghaffari/DEbench
A. Ghaffari, N. Chechina, P. Trinder, and J. Meredith. Scalable Persistent Storage for Erlang: Theory and Practice. In Proceedings of the Twelfth ACM SIGPLAN Workshop on Erlang, pages 73-74, September 2013. ACM Press.
Clusters at UPPMAX: http://www.uppmax.uu.se/hardware
Sidejob: https://github.com/basho/sidejob
31