
Performance Considerations of Data Acquisition in Hadoop System

Baodong Jia, Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger

Namrata Patil
Department of Computer Science & Engg

Contents…
• Introduction
• Sub-projects of Hadoop
• Two solutions for data acquisition
• Workflow of the Chukwa system
• Primary components
• Setup for performance analysis
• Factors influencing performance comparison
• Conclusion


Introduction
• Oil and gas industry
• Drilling is performed by service companies

http://bub.blicio.us/social-media-for-oil-and-gas/


Continued…
Service companies collect drilling data by placing sensors on drill bits and platforms, and make the data available on their servers.

Advantages
• Drilling status can be monitored
• Operators can extract useful information from the historical data

Problems
• Vast amounts of data are accumulated
• Performing reasoning over it is infeasible or very time consuming

Solution
Investigate the application of Hadoop, a MapReduce-based system.

http://bub.blicio.us/social-media-for-oil-and-gas/


Sub-projects of Hadoop
1. Hadoop Common
2. Chukwa
3. HBase
4. HDFS

HDFS - a distributed file system that stores application data in a replicated way and provides high throughput.
Chukwa - an open source data collection system designed for monitoring large distributed systems.
http://hadoop.apache.org/


Two solutions for data acquisition…

Solution 1
Acquire data from the data sources, then copy the data files to HDFS.

Solution 2
Chukwa-based solution.


Solution 1
Hadoop runs MapReduce jobs on the cluster and stores the results on HDFS.

Steps (a minimal sketch of these steps follows the list)
1. Prepare the required data set for the job
2. Copy it to HDFS
3. Submit the job to Hadoop
4. Store the result in a user-specified directory on HDFS
5. Get the result out of HDFS
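A minimal sketch of steps 2 and 5 using the HDFS Java API (FileSystem). The local and HDFS paths are hypothetical, and job submission (steps 3-4) is indicated only by a comment, since its details depend on the analysis job being run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Solution1Sketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 2: copy the prepared data set from local disk to HDFS
        fs.copyFromLocalFile(new Path("/local/drilling-data"),
                             new Path("/input/drilling-data"));

        // Steps 3-4: submit the MapReduce job; Hadoop writes the result to
        // the user-specified output directory, e.g.:
        //   hadoop jar analysis.jar AnalysisJob /input/drilling-data /output/result

        // Step 5: copy the result back out of HDFS
        fs.copyToLocalFile(new Path("/output/result"), new Path("/local/result"));
    }
}
```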


Pros & Cons…

Pros…
• Works efficiently for a small number of files with large file size

Cons…
• Takes a lot of extra time for a large number of files with small file size
• Does not support appending file content


Solution 2
• Overcomes the extra time caused by copying large files to HDFS
• Exists on top of Hadoop
• Chukwa feeds the organized data into the cluster
• Uses a temporary file to store the data collected from the different agents

http://incubator.apache.org/chukwa/


Chukwa
• Open source data collection system built on top of Hadoop
• Inherits Hadoop's scalability and robustness
• Provides a flexible and powerful toolkit to display, monitor, and analyze results
http://incubator.apache.org/chukwa/


Workflow of the Chukwa system
[Figure: workflow of the Chukwa system]


Primary components…

Agents - run on each machine and emit data.
Collectors - receive data from the agents and write it to stable storage.
MapReduce jobs - parse and archive the data.
HICC - the Hadoop Infrastructure Care Center, a web-portal style interface for displaying data.
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html


Continued… Agents
• Collect data through their adaptors
• Adaptors - small dynamically-controllable modules that run inside the agent process (a sketch of adding one follows this list)
• An agent can run several adaptors
• Agents run on every node of the Hadoop cluster
• Different hosts may generate different kinds of data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
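Adaptors can be started at runtime by sending a command to the agent's control port. A sketch under assumptions: the default control port 9093 and the file-tailing adaptor name are taken from the Chukwa 0.4 documentation, while the log path and the "DrillingData" type are hypothetical.

```java
import java.io.PrintWriter;
import java.net.Socket;

public class AddAdaptorSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Chukwa agent's control port (9093 is the documented
        // default; treat it as an assumption for this sketch).
        try (Socket s = new Socket("localhost", 9093);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            // Ask the agent to start tailing a hypothetical drilling log;
            // chunks are tagged with data type "DrillingData", from offset 0.
            out.println("add filetailer.FileTailingAdaptor DrillingData /var/log/drilling.log 0");
        }
    }
}
```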


Collectors
• Gather the data through HTTP
• Receive data from up to several hundred agents
• Write all this data to a single Hadoop sequence file called the sink file
• Periodically close their sink files, rename them to mark them available for processing, and resume writing a new file (a sketch of this pattern follows)

Advantages
• Reduces the number of HDFS files generated by Chukwa
• Hides the details of the HDFS file system in use, such as its Hadoop version, from the adaptors

http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
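A minimal sketch of the sink-file pattern using Hadoop's SequenceFile API. The real collector writes Chukwa's own key and chunk types; LongWritable and Text stand in here, and the paths and the ".done" rename convention are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SinkFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path sink = new Path("/chukwa/logs/collector-1.chukwa"); // hypothetical

        // Append incoming chunks to one open sequence file (timestamp -> chunk).
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, sink, LongWritable.class, Text.class);
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("chunk payload received from an agent"));
        writer.close();

        // Close and rename the sink file to mark it available for processing;
        // the collector would then resume writing to a fresh sink file.
        fs.rename(sink, new Path("/chukwa/logs/collector-1.done"));
    }
}
```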


MapReduce processing

Aim: organizing and processing incoming data

MapReduce jobs
• Archiving - takes chunks from its input and outputs new sequence files of chunks, ordered and grouped
• Demux - takes chunks as input and parses them to produce ChukwaRecords (key-value pairs); see the mapper sketch below
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
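A demux-style mapper sketch showing how a chunk of raw input could be parsed into key-value records. The real Demux job emits ChukwaRecords; Text stands in here, and the "key=value" line format is a hypothetical example, not Chukwa's actual chunk format.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses each raw input line into a key-value record, in the spirit of
// Chukwa's Demux job (which produces ChukwaRecords; Text stands in here).
public class DemuxLikeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text chunk, Context context)
            throws IOException, InterruptedException {
        // Assume a hypothetical "key=value" line format for illustration.
        String[] parts = chunk.toString().split("=", 2);
        if (parts.length == 2) {
            context.write(new Text(parts[0].trim()), new Text(parts[1].trim()));
        }
    }
}
```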


HICC - Hadoop Infrastructure Care Center
• Web interface for displaying data
• Fetches the data from a MySQL database
• Makes it easier to monitor data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html


Setup for Performance Analysis
• Hadoop cluster consisting of 15 Unix hosts in the Unix lab at UiS
• One host is tagged as the name node; the others are used as data nodes (see the sketch below)
• Data is stored on the data nodes in a replicated way
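One way to confirm such a topology from code is to list the data nodes the name node reports, via the HDFS client API. A sketch; the 14-data-node expectation comes from the slide above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterCheckSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // With 15 hosts and one name node, 14 data nodes are expected.
            DatanodeInfo[] nodes = dfs.getDataNodeStats();
            System.out.println("data nodes: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                System.out.println(node.getHostName());
            }
        }
    }
}
```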


Factors Influencing Performance Comparison
1. Quality of the data acquired in different ways
2. Time used for data acquisition for small data size
3. Data copying to HDFS for big data size


Quality of the Data Acquired in Different Ways

[Figure: the size of data acquired over time]
• Sink file size = 1 GB
• The Chukwa agent checks the file content every 2 seconds


Time Used for Data Acquisition for Small Data Size

[Figure: actual time used for acquisition in a certain time window]
• Time used to acquire data from the servers
• Time used to put the acquired data into HDFS


Data Copying to HDFS for Big Data Size

[Figure: time used to copy the data set to HDFS with different replica numbers]
• The slope of the line is steeper when the replica number is bigger (a sketch of the measurement follows)
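A sketch of timing such a copy for a given replica number, setting dfs.replication on the client configuration before writing. The paths and data-set name are hypothetical; the replica numbers 2 and 3 come from the measurements below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TimedCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replica number under test (2 and 3 in the experiments below).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        long start = System.currentTimeMillis();
        // Hypothetical paths: copy the big data set into HDFS and time it.
        fs.copyFromLocalFile(new Path("/local/dataset"), new Path("/input/dataset"));
        System.out.println("copy took " + (System.currentTimeMillis() - start) + " ms");
    }
}
```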


Critical Value of Generating Time Differences

Size of data set    Time used
20 MB               2 s
30 MB               3 s
40 MB               3 s
50 MB               8 s

Time used for copying according to the size of the data set, with a replica number of 2.

Critical value: the data-set size at which a time difference in data acquisition starts to appear.


Continued…

Size of data set    Time used
10 MB               2 s
15 MB               2 s
20 MB               8 s
30 MB               10 s
40 MB               21 s

Time used for copying according to the size of the data set, with a replica number of 3.


Conclusion…..

Chukwa was demonstrated to work more efficiently for big data sizes, while for small data sizes there was no difference between the two solutions.


Thanks..