A new methodology for large-scale benchmarking
A step-by-step methodology
Dory Thibault
UCL
Contact: [email protected]
Sponsor: Euranova
Website: nosqlbenchmarking.com
March 1, 2011
Wikipedia infrastructure
The benchmark VS the real Wikipedia load
The updated methodology

Existing Wikipedia infrastructure
The structured data (revision history, article relations, user
accounts...) is stored in MySQL
Each wiki has its own database, but not necessarily its own cluster
Each cluster is made of several MySQL servers using replication:
There is only one master per cluster
All the writes are handled by the master
The multiple slaves serve the reads
Currently there are 37 servers running MySQL according to
ganglia.wikimedia.org
Each one has:
between 8 and 12 CPUs running at about 2.2 GHz
between 32 and 64 GB of RAM
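The master/slave split above can be sketched as a tiny query router. This is only an illustration of the scheme, not Wikipedia's actual code; the `ReplicatedCluster` class, the server names, and the crude SQL sniffing are all assumptions:

```python
import random

class ReplicatedCluster:
    """Minimal sketch of the replication scheme described above:
    one master takes every write, the slaves share the reads."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, query):
        # Writes must reach the single master so the slaves can replicate them.
        if query.lstrip().split()[0].upper() in ("INSERT", "UPDATE", "DELETE"):
            return self.master
        # Reads are spread over the slaves.
        return random.choice(self.slaves)

cluster = ReplicatedCluster("master-db", ["slave-1", "slave-2", "slave-3"])
print(cluster.route("INSERT INTO revision VALUES (1)"))  # master-db
print(cluster.route("SELECT * FROM revision"))           # one of the slaves
```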
The content of the latest version of an article is stored as a blob on
external storage servers
A replicated cluster of 3 MySQL hosts
This data is stored apart from the main core databases because this content:
Needs a lot of storage space
Is largely unused thanks to the cache servers
The benchmark VS the real Wikipedia load
A very simplified model
The benchmark does not try to reproduce the real load on the
MySQL clusters:
There is no computational work on the structured data
There is no cache other than the one provided by the
database itself
The MySQL clusters run on a few powerful servers, while the
NoSQL clusters will run on many small servers
So why Wikipedia?
The main point of using Wikipedia's data is to use real data: each
entry has a different size, and the MapReduce computation on the
content makes sense.
The new data set
All the articles from the English Wikipedia
The new data set is made of all of the more than 10 million articles
from the English version of Wikipedia
It sums up to 28 GB uncompressed
Each article is treated as an XML blob with all its metadata
and is identified by a unique integer ID
Is that enough data?
Not really for a very big cluster. The solution is simply to insert the
same data set several times, while still using a unique ID for each insert.
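The ID trick above can be sketched as follows; `articles` and `store` are hypothetical stand-ins for the parsed dump and the database client:

```python
def expand_dataset(articles, store, copies):
    """Insert the same articles `copies` times, shifting the integer ID on
    each pass so every stored XML blob still gets a unique key."""
    # Assumes integer IDs; each pass shifts past the largest existing ID.
    offset = max(article_id for article_id, _ in articles) + 1
    for c in range(copies):
        for article_id, xml_blob in articles:
            store[article_id + c * offset] = xml_blob

articles = [(0, "<page>A</page>"), (1, "<page>B</page>")]
store = {}
expand_dataset(articles, store, copies=3)
print(sorted(store))  # [0, 1, 2, 3, 4, 5]: three copies, all IDs unique
```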
The old benchmark architecture
Scaling problem
This architecture does not scale, mainly for bandwidth reasons. The
computational power needed is small, but the whole article is
transmitted for each request.
The distributed benchmark architecture
The new infrastructure
Amazon EC2 infrastructure
I plan to mainly use small standard instances (1 CPU, 1.7 GB of
RAM) on the Amazon EC2 infrastructure.
The biggest cluster should be made of:
Hundreds of small EC2 instances
A few bigger servers for systems that use a master or a load
balancer, like HBase
The measured properties
1. The raw performance: how fast can the system serve all the
requests?
2. The scalability: what is the impact on the performance of
changing the cluster size (number of nodes and data set)?
3. The elasticity: how long does it take to reach a stable state,
with increased performance, when nodes are added to the
cluster?
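The first measure boils down to timing a batch of requests. A minimal sketch, where `make_request` is a hypothetical client call against the cluster:

```python
import time

def raw_performance(make_request, n_requests):
    """Return (elapsed seconds, requests per second) for issuing
    `n_requests` requests back to back."""
    start = time.perf_counter()
    for _ in range(n_requests):
        make_request()
    elapsed = time.perf_counter() - start
    return elapsed, n_requests / elapsed

elapsed, throughput = raw_performance(lambda: None, 10_000)
print(f"{elapsed:.4f}s, {throughput:.0f} req/s")
```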
Measuring the elasticity
The most complex of the three measures
The time needed for the system to stabilize should be different for
each system and for each cluster size. I have chosen to characterize
the elasticity by computing the standard deviation over smaller
benchmark runs.
1. Use a stable cluster to determine the usual standard deviation
of the DB
2. Add the new nodes to the cluster but do not increase the data
set
3. Repeat:
Start a benchmark run and compute the standard deviation
Wait X seconds
4. Until the standard deviation for the last Y runs does not
diverge more than Z percent from the usual standard
deviation
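The convergence loop above can be sketched like this; `run_benchmark` is a hypothetical callable returning one run's per-request latencies, and the parameter names `x_seconds`, `y_runs`, `z_percent` mirror the X, Y, Z placeholders:

```python
import statistics
import time

def measure_elasticity(run_benchmark, usual_std, x_seconds, y_runs, z_percent):
    """Time how long the enlarged cluster takes to stabilize: loop until the
    standard deviation of the last `y_runs` benchmark runs stays within
    `z_percent` of the stable cluster's usual standard deviation."""
    start = time.time()
    recent = []
    while True:
        latencies = run_benchmark()  # one small benchmark run
        recent.append(statistics.stdev(latencies))
        recent = recent[-y_runs:]
        if len(recent) == y_runs and all(
            abs(s - usual_std) <= usual_std * z_percent / 100 for s in recent
        ):
            return time.time() - start
        time.sleep(x_seconds)

# Toy run: latencies are already stable, so it converges after y_runs runs.
elapsed = measure_elasticity(lambda: [1.0, 2.0, 3.0],
                             usual_std=1.0, x_seconds=0, y_runs=2, z_percent=10)
```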
The step-by-step methodology
1. Start up a clean cluster of size 50 and insert all the articles
2. Measure the standard deviation for this cluster once it has
stabilized
3. Choose a total number of requests and a read-only percentage
4. Start the benchmark with the chosen number of requests and
read-only percentage
5. Start the MapReduce benchmark
6. Double the number of nodes in the cluster
7. Start the elasticity test
8. Double the size of the data set inserted
9. Jump to step 4 with a doubled number of requests, until there
are no more servers to add to the cluster
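The loop in steps 4-9 can be sketched as a schedule generator. Every label is a placeholder for a real driver step; the 50-node start, the doubling rule, and the stop condition come from the slide, while the function name and the 80% read-only default are assumptions:

```python
def run_methodology(max_nodes, base_requests, read_only_pct=80):
    """Emit the benchmark schedule: at each scale, run the request and
    MapReduce benchmarks, then double the nodes (with an elasticity
    test), the data set, and the request count until no servers remain."""
    nodes, data_copies, requests = 50, 1, base_requests
    schedule = []
    while True:
        schedule.append(("request benchmark", nodes, data_copies,
                         requests, read_only_pct))
        schedule.append(("mapreduce benchmark", nodes, data_copies))
        if nodes * 2 > max_nodes:
            break  # no more servers to add to the cluster
        nodes *= 2
        schedule.append(("elasticity test", nodes))
        data_copies *= 2
        requests *= 2
    return schedule

for step in run_methodology(max_nodes=200, base_requests=100_000):
    print(step)
```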
Bibliography
www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf
http://meta.wikimedia.org/wiki/Wikimedia_servers
http://ganglia.wikimedia.org/