A new methodology for large-scale benchmarking
A step-by-step methodology
Dory Thibault
UCL
Contact: [email protected]
Sponsor: Euranova
Website: nosqlbenchmarking.com
March 1, 2011
Wikipedia infrastructure
The benchmark VS the real Wikipedia load
The updated methodology

Existing Wikipedia infrastructure
The structured data (revision history, article relations, user
accounts...) is stored in MySQL
Each wiki has its own database, but not necessarily its own cluster
Each cluster is made of several MySQL servers using replication:
There is only one master per cluster
All the writes are handled by the master
The multiple slaves serve the reads
Currently there are 37 servers running MySQL according to
ganglia.wikimedia.org
Each one has:
between 8 and 12 CPUs running at about 2.2 GHz
between 32 and 64 GB of RAM
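The master/slave split above can be sketched as a tiny query router. This is only an illustration of the scheme, not Wikipedia's actual code; the `ReplicatedCluster` class, the server names, and the crude SQL sniffing are all assumptions:

```python
import random

class ReplicatedCluster:
    """Minimal sketch of the replication scheme described above:
    one master takes every write, the slaves share the reads."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, query):
        # Writes must reach the single master so the slaves can replicate them.
        if query.lstrip().split()[0].upper() in ("INSERT", "UPDATE", "DELETE"):
            return self.master
        # Reads are spread over the slaves.
        return random.choice(self.slaves)

cluster = ReplicatedCluster("master-db", ["slave-1", "slave-2", "slave-3"])
print(cluster.route("INSERT INTO revision VALUES (1)"))  # master-db
print(cluster.route("SELECT * FROM revision"))           # one of the slaves
```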
The content of the latest version of an article is stored as a blob on
external storage servers
A replicated cluster of 3 MySQL hosts
This data is stored apart from the main core databases because this content:
Needs a lot of storage space
Is largely unused thanks to the cache servers
The benchmark VS the real Wikipedia load
A very simplified model
The benchmark does not try to reproduce the real load on the
MySQL clusters:
There is no computational work on the structured data
There is no cache other than the one provided by the
database itself
The MySQL clusters run on a few powerful servers, while the
NoSQL clusters will run on many small servers
So why Wikipedia?
The main point of using Wikipedia's data is to use real data: each
entry has a different size, and the MapReduce computation on the
content makes sense.
The new data set
All the articles from the English Wikipedia
The new data set is made of all of the more than 10 million articles
from the English version of Wikipedia
It sums up to 28 GB uncompressed
Each article is treated as an XML blob with all its metadata
and is identified by a unique integer ID
Is that enough data?
Not really for a very big cluster. The solution is simply to insert the
same data set several times, while still using a unique ID for each insert.
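The ID trick above can be sketched as follows; `articles` and `store` are hypothetical stand-ins for the parsed dump and the database client:

```python
def expand_dataset(articles, store, copies):
    """Insert the same articles `copies` times, shifting the integer ID on
    each pass so every stored XML blob still gets a unique key."""
    # Assumes integer IDs; each pass shifts past the largest existing ID.
    offset = max(article_id for article_id, _ in articles) + 1
    for c in range(copies):
        for article_id, xml_blob in articles:
            store[article_id + c * offset] = xml_blob

articles = [(0, "<page>A</page>"), (1, "<page>B</page>")]
store = {}
expand_dataset(articles, store, copies=3)
print(sorted(store))  # [0, 1, 2, 3, 4, 5]: three copies, all IDs unique
```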
The old benchmark architecture
Scaling problem
This architecture does not scale, mainly for bandwidth reasons. The
computational power needed is small, but the whole article is
transmitted for each request.
The distributed benchmark architecture
The new infrastructure
Amazon EC2 infrastructure
I plan to mainly use small standard instances (1 CPU, 1.7 GB of
RAM) on the Amazon EC2 infrastructure.
The biggest cluster should be made of:
Hundreds of small EC2 instances
A few bigger servers for systems that use a master or a load
balancer, like HBase
The measured properties
1. The raw performance: how fast can the system serve all the
requests?
2. The scalability: what is the impact on the performance of
changing the cluster size (number of nodes and data set)?
3. The elasticity: how long does it take to reach a stable state,
with increased performance, when nodes are added to the
cluster?
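The first measure boils down to timing a batch of requests. A minimal sketch, where `make_request` is a hypothetical client call against the cluster:

```python
import time

def raw_performance(make_request, n_requests):
    """Return (elapsed seconds, requests per second) for issuing
    `n_requests` requests back to back."""
    start = time.perf_counter()
    for _ in range(n_requests):
        make_request()
    elapsed = time.perf_counter() - start
    return elapsed, n_requests / elapsed

elapsed, throughput = raw_performance(lambda: None, 10_000)
print(f"{elapsed:.4f}s, {throughput:.0f} req/s")
```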
Measuring the elasticity
The most complex of the three measures
The time needed for the system to stabilize should be different for
each system and for each cluster size. I have chosen to characterize
the elasticity by computing the standard deviation over smaller
benchmark runs.
1. Use a stable cluster to determine the usual standard deviation
of the DB
2. Add the new nodes to the cluster but do not increase the data
set
3. Repeat:
Start a benchmark run and compute the standard deviation
Wait X seconds
4. Until the standard deviation for the last Y runs does not
diverge more than Z percent from the usual standard
deviation
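The convergence loop above can be sketched like this; `run_benchmark` is a hypothetical callable returning one run's per-request latencies, and the parameter names `x_seconds`, `y_runs`, `z_percent` mirror the X, Y, Z placeholders:

```python
import statistics
import time

def measure_elasticity(run_benchmark, usual_std, x_seconds, y_runs, z_percent):
    """Time how long the enlarged cluster takes to stabilize: loop until the
    standard deviation of the last `y_runs` benchmark runs stays within
    `z_percent` of the stable cluster's usual standard deviation."""
    start = time.time()
    recent = []
    while True:
        latencies = run_benchmark()  # one small benchmark run
        recent.append(statistics.stdev(latencies))
        recent = recent[-y_runs:]
        if len(recent) == y_runs and all(
            abs(s - usual_std) <= usual_std * z_percent / 100 for s in recent
        ):
            return time.time() - start
        time.sleep(x_seconds)

# Toy run: latencies are already stable, so it converges after y_runs runs.
elapsed = measure_elasticity(lambda: [1.0, 2.0, 3.0],
                             usual_std=1.0, x_seconds=0, y_runs=2, z_percent=10)
```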
The step-by-step methodology
1. Start up a clean cluster of size 50 and insert all the articles
2. Measure the standard deviation for this cluster once it has
stabilized
3. Choose a total number of requests and a read-only percentage
4. Start the benchmark with the chosen number of requests and
read-only percentage
5. Start the MapReduce benchmark
6. Double the number of nodes in the cluster
7. Start the elasticity test
8. Double the size of the data set inserted
9. Jump to step 4 with a doubled number of requests, until there
are no more servers to add to the cluster
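The loop in steps 4-9 can be sketched as a schedule generator. Every label is a placeholder for a real driver step; the 50-node start, the doubling rule, and the stop condition come from the slide, while the function name and the 80% read-only default are assumptions:

```python
def run_methodology(max_nodes, base_requests, read_only_pct=80):
    """Emit the benchmark schedule: at each scale, run the request and
    MapReduce benchmarks, then double the nodes (with an elasticity
    test), the data set, and the request count until no servers remain."""
    nodes, data_copies, requests = 50, 1, base_requests
    schedule = []
    while True:
        schedule.append(("request benchmark", nodes, data_copies,
                         requests, read_only_pct))
        schedule.append(("mapreduce benchmark", nodes, data_copies))
        if nodes * 2 > max_nodes:
            break  # no more servers to add to the cluster
        nodes *= 2
        schedule.append(("elasticity test", nodes))
        data_copies *= 2
        requests *= 2
    return schedule

for step in run_methodology(max_nodes=200, base_requests=100_000):
    print(step)
```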
Bibliography
www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf
http://meta.wikimedia.org/wiki/Wikimedia_servers
http://ganglia.wikimedia.org/