Performance Considerations of Data Acquisition in Hadoop System
Baodong Jia, Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger
Presented by Namrata Patil, Department of Computer Science & Engg
Contents…
• Introduction
• Sub-projects of Hadoop
• Two solutions for data acquisition
• Workflow of the Chukwa system
• Primary components
• Setup for performance analysis
• Factors influencing performance comparison
• Conclusion
Introduction
• Oil and gas industry
• Drilling is performed by service companies
http://bub.blicio.us/social-media-for-oil-and-gas/
Continued…
Companies collect drilling data by placing sensors on drill bits and platforms, and make the data available on their servers.
Advantages
• Drilling status is visible
• Operators can extract useful information from the historical data
Problems
• Vast amounts of data are accumulated
• It is infeasible or very time-consuming to perform reasoning over it
Solution
Investigate application of the MapReduce system Hadoop
Sub-projects of Hadoop
1. Hadoop Common
2. Chukwa
3. HBase
4. HDFS

HDFS – a distributed file system that stores application data in a replicated way and provides high throughput.
Chukwa – an open-source data collection system designed for monitoring large distributed systems.
http://hadoop.apache.org/
Two solutions for data acquisition..
Solution 1 – acquiring data from the data sources, then copying the data files to HDFS
Solution 2 – Chukwa-based solution
Solution 1
Hadoop runs MapReduce jobs on the cluster and stores the results on HDFS.
Steps
1. Prepare the required data set for the job
2. Copy it to HDFS
3. Submit the job to Hadoop
4. Store the result in a user-specified directory on HDFS
5. Get the result out of HDFS
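The steps above can be sketched as a sequence of standard Hadoop CLI invocations; the file paths and job jar name below are hypothetical placeholders, not values from the paper:

```python
# Sketch of Solution 1 as Hadoop CLI invocations (paths/jar are placeholders).
import subprocess

def solution1_commands(local_data, hdfs_in, hdfs_out, local_out, job_jar):
    """Return the shell commands for the copy -> submit -> fetch workflow."""
    return [
        ["hadoop", "fs", "-put", local_data, hdfs_in],   # copy data set to HDFS
        ["hadoop", "jar", job_jar, hdfs_in, hdfs_out],   # submit the MapReduce job
        ["hadoop", "fs", "-get", hdfs_out, local_out],   # fetch the result from HDFS
    ]

def run(commands, dry_run=True):
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:           # only execute on a real cluster
            subprocess.run(cmd, check=True)

run(solution1_commands("drill.log", "/input/drill.log",
                       "/output/run1", "result/", "job.jar"))
```

With `dry_run=True` the sketch only prints the commands, which makes the workflow visible without a running cluster.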
Pros & Cons…
Pros
• Works efficiently for a small number of files with large file sizes
Cons
• Takes a lot of extra time for a large number of files with small file sizes
• Does not support appending file content
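The small-file penalty can be illustrated with a toy cost model: each HDFS copy pays a fixed per-file overhead (metadata, block allocation) plus a cost proportional to its bytes. The overhead and throughput constants below are illustrative assumptions, not measured Hadoop numbers:

```python
# Toy cost model for the small-files problem (constants are assumptions).
PER_FILE_OVERHEAD_S = 0.5   # assumed fixed seconds of overhead per file
THROUGHPUT_MB_S = 50.0      # assumed sustained copy throughput

def copy_time(num_files, total_mb):
    """Estimated seconds to copy num_files totalling total_mb to HDFS."""
    return num_files * PER_FILE_OVERHEAD_S + total_mb / THROUGHPUT_MB_S

one_big    = copy_time(1, 1024)       # one 1 GB file
many_small = copy_time(10000, 1024)   # 10,000 files of ~100 KB each

print(f"one big file: {one_big:.1f}s, many small files: {many_small:.1f}s")
```

Even though both cases move the same number of bytes, the per-file overhead dominates when the file count is large, which is the behaviour the slide describes.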
Solution 2
• Overcomes the extra time incurred by copying large files to HDFS
• Exists on top of Hadoop
• Chukwa feeds the organized data into the cluster
• Uses a temporary file to store the data collected from the different agents
http://incubator.apache.org/chukwa/
Chukwa
• Open-source data collection system built on top of Hadoop
• Inherits Hadoop's scalability and robustness
• Provides a flexible and powerful toolkit to display, monitor, and analyze results
http://incubator.apache.org/chukwa/
Primary components…
• Agents – run on each machine and emit data
• Collectors – receive data from the agents and write it to stable storage
• MapReduce jobs – parse and archive the data
• HICC (Hadoop Infrastructure Care Center) – a web-portal-style interface for displaying data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
Continued…
Agents
• Collect data through their adaptors
• Adaptors – small, dynamically controllable modules that run inside the agent process
• Several adaptors exist
• Agents run on every node of the Hadoop cluster
• Different hosts may generate different kinds of data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
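A file-tailing adaptor's core behaviour can be sketched as follows; this is an illustration in Python of the idea (remember an offset into a log file, emit any newly appended bytes as a chunk), not Chukwa's actual Java implementation:

```python
# Illustrative file-tailing adaptor: track an offset, emit appended data.
import os

class FileTailingAdaptor:
    def __init__(self, path):
        self.path = path
        self.offset = 0          # resume point into the tailed file

    def poll(self):
        """Return bytes appended since the last poll (one 'chunk')."""
        size = os.path.getsize(self.path)
        if size <= self.offset:
            return b""           # nothing new
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read(size - self.offset)
        self.offset = size
        return chunk
```

Because only the offset is tracked, repeated polls emit each appended region exactly once, which is why an adaptor can keep up with a continuously growing log.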
Collectors
• Gather the data through HTTP
• Each collector receives data from up to several hundred agents
• Writes all this data to a single Hadoop sequence file called a sink file
• Collectors periodically close their sink files, rename them to mark them available for processing, and resume writing a new file
Advantages
• Reduce the number of HDFS files generated by Chukwa
• Hide details of the HDFS file system in use, such as its Hadoop version, from the adaptors
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
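The close-rename-resume cycle above can be sketched like this; the file-name suffixes and the 1 MB threshold are assumptions for the sketch, and real collectors write Hadoop sequence files to HDFS rather than local byte files:

```python
# Illustrative sink-file rotation: append chunks to an open sink file and,
# past a size threshold, rename it ".done" to mark it ready for processing.
import os

class Collector:
    def __init__(self, directory, max_bytes=1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes   # assumed rotation threshold
        self.seq = 0
        self._open_new_sink()

    def _open_new_sink(self):
        self.sink = os.path.join(self.directory, f"sink{self.seq}.chukwa")
        self.seq += 1
        open(self.sink, "wb").close()

    def write(self, chunk: bytes):
        with open(self.sink, "ab") as f:
            f.write(chunk)
        if os.path.getsize(self.sink) >= self.max_bytes:
            self.rotate()

    def rotate(self):
        done = self.sink.replace(".chukwa", ".done")
        os.rename(self.sink, done)   # mark available for MapReduce processing
        self._open_new_sink()        # resume writing a new sink file
```

Renaming is what makes a sink file visible to the downstream MapReduce jobs, so a file is never processed while it is still being appended to.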
MapReduce processing
Aim: organizing and processing incoming data
MapReduce jobs
• Archiving – takes chunks from its input and outputs new sequence files of chunks, ordered and grouped
• Demux – takes chunks as input and parses them to produce ChukwaRecords (key-value pairs)
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
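The demux step can be illustrated with a minimal parser; the "timestamp level message" log format below is an assumption for the sketch, and Chukwa's real Demux runs as a MapReduce job emitting ChukwaRecords:

```python
# Illustrative demux step: parse raw text chunks into key-value records.
def demux(chunk: str):
    records = []
    for line in chunk.splitlines():
        parts = line.split(" ", 2)       # assumed "timestamp level message"
        if len(parts) == 3:
            timestamp, level, message = parts
            # key: timestamp; value: the parsed fields
            records.append((timestamp, {"level": level, "message": message}))
    return records

print(demux("1696000000 INFO drilling started\n1696000005 WARN pressure high"))
```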
HICC - Hadoop Infrastructure Care Center
• Web interface for displaying data
• Fetches the data from a MySQL database
• Makes it easier to monitor data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
Setup for Performance Analysis
• A Hadoop cluster consisting of 15 Unix hosts in the Unix lab at UiS
• One host is tagged as the name node; the others are used as data nodes
• Data is stored at the data nodes in a replicated way
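The replication factor varied in the experiments below (2 vs. 3 replicas) is controlled by the `dfs.replication` property in Hadoop's `hdfs-site.xml`; a minimal fragment, assuming a Hadoop version of that era:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- number of copies HDFS keeps of each block -->
    <value>3</value>
  </property>
</configuration>
```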
Factors Influencing Performance Comparison
1. Quality of the data acquired in different ways
2. Time used for data acquisition for small data sizes
3. Data copying to HDFS for big data sizes
Quality of the Data Acquired in Different Ways
(Figure: the size of data acquired over time)
• Sink file size = 1 GB
• The Chukwa agent checks the file content every 2 seconds
Time Used for Data Acquisition for Small Data Size
(Figure: actual time used for acquisition within a given time window)
• Time used to acquire data from the servers
• Put the acquired data into HDFS
Data Copying to HDFS for Big Data Size
(Figure: time used to copy the data set to HDFS with different replica numbers)
• The slope of the line is steeper when the replica number is larger
Critical Value of Generating Time Differences

Size of data set    Time used
20 MB               2 s
30 MB               3 s
40 MB               3 s
50 MB               8 s

Time used for copying according to the size of the data set, with a replica number of 2.
Critical value: the size of the data file at which a time difference in data acquisition appears.
Continued…

Size of data set    Time used
10 MB               2 s
15 MB               2 s
20 MB               8 s
30 MB               10 s
40 MB               21 s

Time used for copying according to the size of the data set, with a replica number of 3.
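Reading the two tables together, the copy time per megabyte grows noticeably faster with replica number 3; a quick check using only the figures reported above:

```python
# Copy times reported in the slides (size in MB -> seconds).
replica2 = {20: 2, 30: 3, 40: 3, 50: 8}
replica3 = {10: 2, 15: 2, 20: 8, 30: 10, 40: 21}

def seconds_per_mb(times):
    """Per-megabyte copy cost for each measured data-set size."""
    return {mb: t / mb for mb, t in times.items()}

r2 = seconds_per_mb(replica2)
r3 = seconds_per_mb(replica3)
print(f"40 MB: replica 2 -> {r2[40]:.3f} s/MB, replica 3 -> {r3[40]:.3f} s/MB")
```

At 40 MB the per-megabyte cost with three replicas is several times that with two, consistent with the steeper slope noted for the larger replica number.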
Conclusion…
Chukwa was demonstrated to work more efficiently for big data sizes, while for small data sizes there was no difference between the two solutions.