Performance Evaluation of a MongoDB and Hadoop Platform...

27
1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis Elif Dede, Madhusudhan Govindaraju Lavanya Ramakrishnan, Dan Gunter, Shane Canon Department of Computer Science, Binghamton University (SUNY) Lawrence Berkeley National Laboratory

Transcript of Performance Evaluation of a MongoDB and Hadoop Platform...

Page 1: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

1

Performance Evaluation of a MongoDB and Hadoop Platform for Scientific Data Analysis

Elif Dede, Madhusudhan Govindaraju

Lavanya Ramakrishnan, Dan Gunter, Shane Canon

Department of Computer Science, Binghamton University (SUNY) Lawrence Berkeley National Laboratory

Page 2: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Computa(on  and  Data  are  cri(cal  parts  of  the  scien(fic  process  

Experiment

Theory

Computation

Data (Fourth Paradigm)

Advance  Light  Source  Data  Rates  

2009          65  TB/yr  

2011      312  TB/yr  

2013   1900  TB/yr  

Three Pillars of Science

Page 3: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Materials  Project  

3

Schemaless database

manager.x manager.x manager.x

Brain

www.materialsproject.org Source: Michael Kocher, Daniel Gunter

Page 4: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Data  is  “Big”  

4

Page 5: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Processing  “Big  Data”:  MapReduce  

5

• Introduced in OSDI 2004 by Dean and Ghemawat from Google

• Programming model for processing large data sets

• Exploits large a set of commodity machines

• Characteristics of the model: •  Relaxed synchronization

constraints •  Locality optimization •  Fault-tolerance •  Load balancing OSDI  2004  

Page 6: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Map  and  Reduce    Map/Reduce:

•  The map() function is called on every item in the input set and emits a series of intermediate key/value pairs

•  All values associated with a given intermediate key are grouped together

•  The reduce() function is called on every unique intermediate key, and its value list, and emits a final output value

6

Page 7: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Apache  Hadoop  •   Open-­‐source  MapReduce  implementa;on  in  Java  •   Easy  scalability    •   Built-­‐in  I/O  management  

•   Hadoop  Distributed  File  System(HDFS)  • Data  distribu(on,  management  and  replica(on  

•   Load  balancing  •   Handles  stragglers  

•   Fault  tolerance  •   Commodity  hardware  •   Heartbeats  •   Specula(ve  execu(on  and  data  replica(on  

• Hadoop  Streaming  • Create  and  run  MapReduce  jobs  with  any  executable  or  script  as  the  mapper  and/or  the  reducer  

7

Page 8: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop    Hadoop  provides:  •  Data  Flow  Parallelism    

•  Data  goes  through  different  steps  of  processing  •  Similar  Job  Phases  

•  Data  prepara(on,  transforma(on  and  reduc(on    •  MapReduce:  maps  (transforma(on)  and  reduces  

(reduc(on)  •  Number  of  maps  >>>  Number  of  reduce  

•  Data  transforma(on  is  typically  more  parallel    than  data  reduc(on    

•  Fault  Tolerance  and  Data  Locality  •  Data  intensive  loads  •  Long  running  scien(fic  jobs  

8

Page 9: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop  (Cont.)    

Hadoop  does  not  provide:  •  Java  implementa(on  

•  Legacy  scien(fic  code  mostly  is  not  in  java  and  hard  to  rewrite  as  map  and  reduce  func(ons  

•  Hadoop  Streaming  allows  other  modes    •  HDFS  is  a  non-­‐POSIX  file  system  

•  HDFS  java  library  calls  needed  to  create,  read  and  write  files  •  HDFS  data  locality  good  but  does  not  handle  applica(ons  that  

might  have  mul(ple  data  sets  •  Scien(fic  data  formats  do  not  fit  in  the  line/block  oriented  inputs  of  

typical  Hadoop  jobs  •  Scien(fic  applica(ons  o]en  work  with  files  where  the  logical  

division  of  work  is  per  file  •  New  file  formats  require  addi(onal  java  programming  to  define  

the  format,  appropriate  split  for  a  single  map  task  

9

Page 10: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Scien;fic  Compu;ng  and  Hadoop  (Cont.)    

Hadoop  does  not  provide:  

•  Maps  and  reduces  are  considered  iden(cal  (executables/arguments)  

•  Implemen(ng  different  tasks  requires  logic  in  the    tasks  that  differen(ate  the  func(onality  

•  This  can  cause  worker  processing  (mes  to  vary  widely  an  lead  to  (meouts  and  restarted  tasks  due  to  the  specula(ve  execu(on  in  Hadoop  

•  No  built-­‐in  dynamic  and  itera(ve  applica(on  support  

10

Page 11: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

New  Genera;on  Data  

11

•  Dynamic Data •  Size and Content

•  Structured? •  Semi structured,

unstructured

•  Relational? • Not always

Page 12: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

NoSQL  

12

A broad class of data management systems where the data is partitioned across a set of servers, where no server plays a privileged role

• NoSQL has emerged as an alternative model for this new non-relational data model.

• Address the ``Big Data'' challenge by providing horizontal scalability.

• Lower maintenance costs and flexibility.

• There are various data models that are represented under NoSQL including key-value, column-oriented and document-oriented stores.

• Each of these models has its own interpretation of data storage and makes different tradeoffs within the Consistency, Availability and Performance

Page 13: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

What  is  MongoDB?  

13

• Open source document-oriented database

•  Data is not in tables with rows and columns

• Data is stored as “documents”, each of which is a associative array of scalar values, or nested associative arrays

•  Javascript Object Notation (JSON) format

•  Stored as BSON

• MongoDB uses sharding to split the data evenly across the cluster to parallelize access.

• This is done through front-end “routing servers” and back-end “data servers”

• Provides a built-in MapReduce • Drawbacks • The MapReduce scripts should be written in JavaScript •  Slow and poor analytics libraries • The JavaScript implementation used by the MongoDB is not thread safe

Page 14: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Why  MongoDB?  

14

Materials Project: •  A community accessible data store of calculated materials.

• Data store is complex with hundreds of attributes and constantly evolving.

• MongoDB provides an appropriate data model and query language.

• The project also needs to perform complex statistical data mining to discover patterns in materials and validate/verify correctness.

• These task are difficult with MongoDB but natural for MapReduce

ALS: • Advanced Light Source’s Tomogropy beamline uses MongoDB to store metadata from experiments (Summer’12, LBNL)

Schemaless database

manager.x manager.x manager.x

Brain  

www.materialsproject.org  Source:  Michael  Kocher,  Daniel  Gunter  (LBNL)  

Page 15: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB  Connector  

15

• Input splits are retrieved from a MongoDB server(s)

• Each mapper can read its splits in parallel

• Results are written back to MongoDB by the Hadoop reducer(s)

• It works with single MongoDB server or with a sharding setup

• User determines the split size

Page 16: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB:  Overhead  of  mul;ple  connec;ons  

16

• Test ability to handle large number of simultaneous connections •  768 tasks with different checkpoint intervals compared to when there is no checkpoint

•  overhead

•  Connections increased from 154 to 768 per second, write volume increased to 768 MBs/.

Page 17: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB:  Overhead,  when  using  more  nodes,  tasks  

17

• 10 min per task, •  All tasks run in parallel •  10 sec checkpoint interval •  Overhead observed after 1000 parallel tasks

• Large number of connections is the bottleneck

•  More than the data volume

Page 18: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

MongoDB  MapReduce  vs.  Hadoop-­‐MongoDB    Read/Write  Performance  Comparison  

18

• Data is stored on a single MongoDB server

• Hadoop cluster consists of 2 worker nodes

• The mongo-hadoop plug-in provides roughly five times better performance.

Page 19: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Choosing  the  Split  Size    

19

• Processing 9.3 million input records with Hadoop

• Each mapper reads an input split from the MongoDB server, does processing and sends its intermediate output to the reducer

• Split size varies:16, 32, 64, 128, 254 MBs

•  sweet spot: 128 MB

•  With the default split size of 8MB, Hadoop schedules over 500 mappers; by increasing the split size, this number drops around to 40

Page 20: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Increasing  Data    

20

• For 4.6 million input records, HDFS Hadoop is two times better than MongoDB, and at 37.2 million records it is five times

• At 37.2 million input records mongo-hadoop is more than 3 times slower in reading and more than nine times in writing than Hadoop-HDFS.

• In a sharded setup, mongo-hadoop reading times improve considerably.

2-node Hadoop Cluster and 2 Mongo-DB servers.

Page 21: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Sharding  and  processing  on  local  nodes  vs  different  nodes    

21

• The performance slightly worsened compared to running the servers on different machines.

•  MongoDB uses mmap to aggressively cache data from disk into memory

•  With increasing input size growing memory and CPU usage is observed on the worker/server nodes

• This effects the performance of the MapReduce job

Performance bottleneck is due to memory Contention. Locality has minimal effect.

Page 22: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Increasing  #Workers    

22

•  The performance over increasing cluster sizes from 16 to 64 cores

•  Single to two MongoDB sharded servers

• The write time is bound by the reduce phase for this MapReduce job

• Number of mappers >> number of reducers

• The write performance of MongoDB still remains to be a bottleneck along with the overhead of routing data to be written between sharding servers.

Write performance of MongoDB is a bottleneck.

Page 23: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Different  Setups    (given  that  the  data  is  in  MongoDB)  

23

• Best performance achieved reading from MongoDB and writing the output to HDFS

• Downloading the data to HDFS before running the analysis is the slowest .

Hadoop-HDFS provides the best peformance.

Page 24: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Different  Setups    

24

•  Increasing cluster size (from 8 cores to 64) for 37.2 million input records

• With an increasing number of worker nodes the concurrency of the map phase increases

• The map times get considerably faster

Page 25: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Hadoop-­‐MongoDB:  Fault  Tolerance    

25

• 32 node Hadoop cluster processing ~37 million input records

• After eight faulted worker nodes Hadoop-HDFS loses too many data nodes and fails to complete the MapReduce job

•  Mongo-hadoop gets the input splits from the MongoDB server therefore losing worker nodes does not lead to loss of input data

Page 26: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

Conclusions  •  Sharding  helps  to  improve  MongoDB’s  performance  especially  for  reads.    

•  In  a  sharded  setup,  mongo-­‐hadoop  reading  (mes  improve  considerably,  as  there  are  mul(ple  servers  to  respond  to  parallel  worker  requests  

•  In  cases  where  data  is  stored  in  MongoDB  and  needs  to  be  analyzed,  the  mongo-­‐hadoop  connector  is  a  convenient  way  to  use  Hadoop.    

•  Performance  improves    when  output  is  wriben  to  HDFS  

•  MongoDB  performance  degrada(on  observed  with  the  increasing  number  of  connec(ons,  increasing  write  requests  per  second,  as  well  as  the  increase  in  total  write  volume  

•  The  mongo-­‐hadoop  plug-­‐in  provides  roughly  five  (mes  beber  performance  compared  to  using  MongoDB’s  na(ve  MapReduce  implementa(on.    

•  The  performance  gain  from  using  mongo-­‐hadoop  increases  linearly  with  input  size.  

26

Page 27: Performance Evaluation of a MongoDB and Hadoop Platform ...datasys.cs.iit.edu/events/ScienceCloud2013/s02.pdf1 Performance Evaluation of a MongoDB and Hadoop Platform for Scientific

27

Contact

Madhu Govindaraju [email protected] Binghamton University State University of New York (SUNY)

Dan Gunter [email protected] Lawrence Berkeley National Laboratory