
Performance Considerations of Data Acquisition in Hadoop System

Baodong Jia, Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger

Namrata Patil
Department of Computer Science & Engg

Contents…
• Introduction
• Sub-projects of Hadoop
• Two solutions for data acquisition
• Workflow of the Chukwa system
• Primary components
• Setup for performance analysis
• Factors influencing performance comparison
• Conclusion


Introduction
• Oil and gas industry
• Drilling is performed by service companies

http://bub.blicio.us/social-media-for-oil-and-gas/


Continued…
Service companies collect drilling data by placing sensors on drill bits and platforms, and make the data available on their servers.

Advantages
• Drilling status can be monitored
• Operators can extract useful information from the historical data

Problems
• Vast amounts of data are accumulated
• Performing reasoning over it is infeasible or very time consuming

Solution
Investigate the application of Hadoop, a MapReduce-based system.

http://bub.blicio.us/social-media-for-oil-and-gas/


Sub-projects of Hadoop
1. Hadoop Common
2. Chukwa
3. HBase
4. HDFS

HDFS - a distributed file system that stores application data in a replicated way and provides high throughput.
Chukwa - an open source data collection system designed for monitoring large distributed systems.
http://hadoop.apache.org/


Two solutions for data acquisition…

Solution 1
Acquire data from the data sources, then copy the data files to HDFS.

Solution 2
Chukwa-based solution.


Solution 1
Hadoop runs MapReduce jobs on the cluster and stores the results on HDFS.

Steps (a minimal sketch of these steps follows the list)
1. Prepare the required data set for the job
2. Copy it to HDFS
3. Submit the job to Hadoop
4. Store the result in a user-specified directory on HDFS
5. Get the result out of HDFS
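A minimal sketch of steps 2 and 5 using the HDFS Java API (FileSystem). The local and HDFS paths are hypothetical, and job submission (steps 3-4) is indicated only by a comment, since its details depend on the analysis job being run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Solution1Sketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 2: copy the prepared data set from local disk to HDFS
        fs.copyFromLocalFile(new Path("/local/drilling-data"),
                             new Path("/input/drilling-data"));

        // Steps 3-4: submit the MapReduce job; Hadoop writes the result to
        // the user-specified output directory, e.g.:
        //   hadoop jar analysis.jar AnalysisJob /input/drilling-data /output/result

        // Step 5: copy the result back out of HDFS
        fs.copyToLocalFile(new Path("/output/result"), new Path("/local/result"));
    }
}
```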


Pros & Cons…

Pros…
• Works efficiently for a small number of files with large file size

Cons…
• Takes a lot of extra time for a large number of files with small file size
• Does not support appending file content


Solution 2
• Overcomes the extra time caused by copying large files to HDFS
• Exists on top of Hadoop
• Chukwa feeds the organized data into the cluster
• Uses a temporary file to store the data collected from the different agents

http://incubator.apache.org/chukwa/


Chukwa
• Open source data collection system built on top of Hadoop
• Inherits Hadoop's scalability and robustness
• Provides a flexible and powerful toolkit to display, monitor, and analyze results
http://incubator.apache.org/chukwa/


Workflow of the Chukwa system
[Figure: workflow of the Chukwa system]


Primary components…

Agents - run on each machine and emit data.
Collectors - receive data from the agents and write it to stable storage.
MapReduce jobs - parse and archive the data.
HICC - the Hadoop Infrastructure Care Center, a web-portal style interface for displaying data.
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html


Continued… Agents
• Collect data through their adaptors
• Adaptors - small dynamically-controllable modules that run inside the agent process (a sketch of adding one follows this list)
• An agent can run several adaptors
• Agents run on every node of the Hadoop cluster
• Different hosts may generate different kinds of data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
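Adaptors can be started at runtime by sending a command to the agent's control port. A sketch under assumptions: the default control port 9093 and the file-tailing adaptor name are taken from the Chukwa 0.4 documentation, while the log path and the "DrillingData" type are hypothetical.

```java
import java.io.PrintWriter;
import java.net.Socket;

public class AddAdaptorSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Chukwa agent's control port (9093 is the documented
        // default; treat it as an assumption for this sketch).
        try (Socket s = new Socket("localhost", 9093);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            // Ask the agent to start tailing a hypothetical drilling log;
            // chunks are tagged with data type "DrillingData", from offset 0.
            out.println("add filetailer.FileTailingAdaptor DrillingData /var/log/drilling.log 0");
        }
    }
}
```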


Collectors
• Gather the data through HTTP
• Receive data from up to several hundred agents
• Write all this data to a single Hadoop sequence file called the sink file
• Periodically close their sink files, rename them to mark them available for processing, and resume writing a new file (a sketch of this pattern follows)

Advantages
• Reduces the number of HDFS files generated by Chukwa
• Hides the details of the HDFS file system in use, such as its Hadoop version, from the adaptors

http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
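A minimal sketch of the sink-file pattern using Hadoop's SequenceFile API. The real collector writes Chukwa's own key and chunk types; LongWritable and Text stand in here, and the paths and the ".done" rename convention are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SinkFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path sink = new Path("/chukwa/logs/collector-1.chukwa"); // hypothetical

        // Append incoming chunks to one open sequence file (timestamp -> chunk).
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, sink, LongWritable.class, Text.class);
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("chunk payload received from an agent"));
        writer.close();

        // Close and rename the sink file to mark it available for processing;
        // the collector would then resume writing to a fresh sink file.
        fs.rename(sink, new Path("/chukwa/logs/collector-1.done"));
    }
}
```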


MapReduce processing

Aim: organizing and processing incoming data

MapReduce jobs
• Archiving - takes chunks from its input and outputs new sequence files of chunks, ordered and grouped
• Demux - takes chunks as input and parses them to produce ChukwaRecords (key-value pairs); see the mapper sketch below
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
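A demux-style mapper sketch showing how a chunk of raw input could be parsed into key-value records. The real Demux job emits ChukwaRecords; Text stands in here, and the "key=value" line format is a hypothetical example, not Chukwa's actual chunk format.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses each raw input line into a key-value record, in the spirit of
// Chukwa's Demux job (which produces ChukwaRecords; Text stands in here).
public class DemuxLikeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text chunk, Context context)
            throws IOException, InterruptedException {
        // Assume a hypothetical "key=value" line format for illustration.
        String[] parts = chunk.toString().split("=", 2);
        if (parts.length == 2) {
            context.write(new Text(parts[0].trim()), new Text(parts[1].trim()));
        }
    }
}
```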


HICC - Hadoop Infrastructure Care Center
• Web interface for displaying data
• Fetches the data from a MySQL database
• Makes it easier to monitor data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html


Setup for Performance Analysis
• Hadoop cluster consisting of 15 Unix hosts in the Unix lab at UiS
• One host is tagged as the name node; the others are used as data nodes (see the sketch below)
• Data is stored on the data nodes in a replicated way
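One way to confirm such a topology from code is to list the data nodes the name node reports, via the HDFS client API. A sketch; the 14-data-node expectation comes from the slide above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterCheckSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // With 15 hosts and one name node, 14 data nodes are expected.
            DatanodeInfo[] nodes = dfs.getDataNodeStats();
            System.out.println("data nodes: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                System.out.println(node.getHostName());
            }
        }
    }
}
```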


Factors Influencing Performance Comparison
1. Quality of the data acquired in different ways
2. Time used for data acquisition for small data size
3. Data copying to HDFS for big data size


Quality of the Data Acquired in Different Ways

[Figure: the size of data acquired over time]
• Sink file size = 1 GB
• The Chukwa agent checks the file content every 2 seconds


Time Used for Data Acquisition for Small Data Size

[Figure: actual time used for acquisition in a certain time window]
• Time used to acquire data from the servers
• Time used to put the acquired data into HDFS


Data Copying to HDFS for Big Data Size

[Figure: time used to copy the data set to HDFS with different replica numbers]
• The slope of the line is steeper when the replica number is bigger (a sketch of the measurement follows)
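A sketch of timing such a copy for a given replica number, setting dfs.replication on the client configuration before writing. The paths and data-set name are hypothetical; the replica numbers 2 and 3 come from the measurements below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimedCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replica number under test (2 and 3 in the experiments below).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        long start = System.currentTimeMillis();
        // Hypothetical paths: copy the big data set into HDFS and time it.
        fs.copyFromLocalFile(new Path("/local/dataset"), new Path("/input/dataset"));
        System.out.println("copy took " + (System.currentTimeMillis() - start) + " ms");
    }
}
```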


Critical Value of Generating Time Differences

Size of data set    Time used
20 MB               2 s
30 MB               3 s
40 MB               3 s
50 MB               8 s

Time used for copying according to the size of the data set, with a replica number of 2.

Critical value: the data-set size at which a time difference in data acquisition starts to appear.


Continued…

Size of data set    Time used
10 MB               2 s
15 MB               2 s
20 MB               8 s
30 MB               10 s
40 MB               21 s

Time used for copying according to the size of the data set, with a replica number of 3.


Conclusion…..

Chukwa was demonstrated to work more efficiently for big data sizes, while for small data sizes there was no difference between the two solutions.


Thanks..