Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment
S Saravanan, B Uma Maheswari
Department of Computer Science and Engineering,
Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bangalore, India.
[email protected], [email protected]
Abstract
Analysing web log files has become an important task for E-Commerce companies that want to predict customer behaviour and improve their business. Each click on an E-Commerce web page creates on the order of 100 bytes of data in the web log file. Large E-Commerce websites such as flipkart.com, amazon.in and ebay.in are visited by millions of customers simultaneously, and these customers generate petabytes of data in the web log files. Because the web log files are so large, we require parallel processing and a reliable data storage system; the Hadoop framework provides both. Hadoop offers the Hadoop Distributed File System (HDFS) and the MapReduce programming model for processing huge datasets efficiently and effectively. In this paper, a NASA web log file is analysed with the Hadoop framework: the total number of hits received by each web page in the website and the total number of hits received by the website in each hour are calculated, and it is shown that the Hadoop framework takes little response time to produce accurate results.
Keywords - Hadoop, MapReduce, Log Files, Parallel Processing, Hadoop Distributed File System, E-Commerce
1. Introduction

E-Commerce is a rapidly growing industry all over the world. The biggest challenge for most E-Commerce businesses is to collect, store, analyse and organize data from multiple data sources. There is certainly a lot of data waiting to be analysed, and making sense of it all is a daunting task for some E-Commerce businesses [1]. One kind of data that has to be analysed in an E-Commerce business is the web log file. A web log file contains details such as the IP address of the computer making the request (i.e. the visitor), the date and time of the hit, the request method, the location and name of the requested file, the HTTP status code and the size of the requested file. Mining the web log file is always helpful to E-Commerce companies that want to increase their profits, because it lets them predict the behaviour of their online customers. Mining the web log file is called Web Usage Mining. With these predictions, E-Commerce companies can offer an online customer a personalized experience, including content and promotions, and can provide product recommendations to customers based on their browsing behaviour. E-Commerce companies can do a lot more by mining the web log file. As the number of customers visiting E-Commerce websites increases, the size of the web log file also increases, and nowadays web log files reach petabytes. Pattern discovery data mining techniques are already available to analyse web log files; these techniques store the web log file in a traditional DBMS and analyse it there. But in the current scenario, the number of online customers grows day by day and each click on a web page creates on the order of 100 bytes of data in a typical website log file [2]. Consequently, large websites handling millions of simultaneous visitors can generate petabytes of logs. For example, eBay processes petabytes of data stored in web log files to create a better shopping experience. So, to analyse such big web log files efficiently and effectively, we need faster parallel and scalable data mining algorithms, a cluster of storage devices to store petabytes of web log data, and a parallel computing model for the analysis. The Hadoop framework provides a reliable cluster storage facility to keep large web log file data in a distributed manner, and a parallel processing feature to process the data efficiently and effectively. The remainder of the paper is organized as follows. Section 2 summarizes
the related work. In Section 3, the system architecture is discussed. Section 4 presents the proposed scheme. Section 5 discusses the experimental results, and Section 6 concludes the paper.
2. Related work

In [3], parallel SQL DBMSs and Hadoop MapReduce are compared on large-scale data analysis tasks, with Hadoop standing out for its ease of setup and fault tolerance. In [4], it is noted that a traditional DBMS cannot handle very large datasets, so Big Data technologies such as the Hadoop framework are needed. Hadoop MapReduce [4][5][6] is used in many areas for big data analysis. Hadoop is a good platform for analysing web log files, since the size of web log files keeps increasing nowadays [7][8]. Apache Hadoop is an open-source project created by Doug Cutting and developed by the Apache Software Foundation. The Hadoop platform allows us to store large-scale data across thousands of nodes and analyse it. As described in [5], a Hadoop cluster may have thousands of nodes which store multiple blocks of log files: Hadoop fragments log files into blocks, distributes these blocks evenly over the nodes in the cluster, and replicates each block across multiple nodes to achieve reliability and fault tolerance. MapReduce achieves parallel computation by breaking the analysis job into a number of tasks.
3. System architecture
Figure 1. Two-node Hadoop cluster system architecture
Figure 1 shows the cluster configuration of the Hadoop system implemented in this paper. There are two nodes in the cluster: one master node and one slave node. The architecture is divided into two layers: the HDFS layer and the MapReduce layer. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage designed to span large clusters of commodity servers [9]. The MapReduce layer reads data from and writes data to HDFS storage, and processes the data in parallel. The NameNode keeps track of how the web log file is broken down into file blocks and which nodes store those blocks. The secondary NameNode periodically reads the HDFS file system change log and applies it to the fsimage file. The DataNodes store the replicated blocks of the web log file. The JobTracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and keeps track of all tasks as they run. A TaskTracker is responsible for the execution of individual tasks on each slave node.
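As an illustration of how a client is pointed at this two-node layout, the sketch below sets the two Hadoop 1.x configuration properties that name the NameNode and JobTracker endpoints on the master node from Figure 1. It is a minimal sketch, not the authors' actual configuration: the port numbers are common defaults and are assumed here; in a real deployment these properties would live in conf/core-site.xml and conf/mapred-site.xml.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConf {
        // Minimal sketch: point a Hadoop 1.x client at the master node
        // (192.168.2.1) from Figure 1. Ports 9000/9001 are conventional
        // defaults and are assumptions, not values reported in the paper.
        public static Configuration twoNodeClusterConf() {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://192.168.2.1:9000");  // NameNode on the master
            conf.set("mapred.job.tracker", "192.168.2.1:9001");      // JobTracker on the master
            return conf;
        }
    }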
4. Proposed scheme
4.1. Calculating the total number of hits
received by each URL
Figure 2. Calculating total number of hits received by each URL
Figure 2 depicts the MapReduce process of analysing the web log file and calculating the total number of hits received by each URL. The input to this function is a web log file. For each hit on the website, a line is added to the web log file. A line in the web log file contains the following fields: client IP address, user name, server name, date, time, request method, requested resource, HTTP version, HTTP status and bytes sent. An example line from the NASA web log file: in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839. The web log file is split into blocks by the Hadoop framework and stored in the 2-node cluster. Each block of the web log file is given as input to a map function, which parses each line using a regular expression and emits the URL as a key along with the value 1: (URL1,1), (URL2,1), (URL3,1), …, (URLn,1). After mapping, the shuffle phase collects all the (key, value) pairs that have the same URL from the different map functions and forms a group. After this process, Group 1 contains (URL1,1), (URL1,1), (URL1,1) and so on; Group 2 contains (URL2,1), (URL2,1) and so on. Then, the reduce function calculates the sum for each URL group. The result of the reduce function is (URL1,SUM), (URL2,SUM), …, (URLn,SUM).
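The following sketch shows how this map and reduce logic could be written against the (new) MapReduce API available in Hadoop 1.2.1. It is an illustration, not the authors' actual code: the class names, the regular expression and the input/output paths are assumptions.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UrlHitCount {

        public static class UrlMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            // Matches a Common Log Format line such as:
            // in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400]
            //   "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839
            private static final Pattern LOG_LINE = Pattern.compile(
                "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+)");

            private static final IntWritable ONE = new IntWritable(1);
            private final Text url = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Matcher m = LOG_LINE.matcher(value.toString());
                if (m.find()) {               // skip malformed lines
                    url.set(m.group(6));      // group 6 is the requested URL
                    context.write(url, ONE);  // emit (URL, 1)
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();           // add up the 1s for this URL
                }
                context.write(key, new IntWritable(sum));  // (URL, SUM)
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "url hit count");   // Hadoop 1.x constructor
            job.setJarByClass(UrlHitCount.class);
            job.setMapperClass(UrlMapper.class);
            job.setCombinerClass(SumReducer.class);     // pre-sum on the map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // web log file in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, such a job could be launched with, for example, hadoop jar urlhitcount.jar UrlHitCount <input path> <output path>, where the output path becomes the directory shown later in Figure 4 (the exact paths are assumptions).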
4.2. Calculating the total number of hits
received by a website in each hour
Figure 3. Calculating total number of hits received in every hour
Figure 3 depicts the MapReduce process of analysing the web log file and calculating the total number of hits received in each hour. The input to this function is a web log file, which is split into blocks. Each block of the web log file is given as input to a map function, which parses each line using a regular expression and emits the hour as a key along with the value 1: (hour0,1), (hour1,1), …, (hour23,1). After mapping, the shuffle phase collects all the (key, value) pairs that have the same hour from the different map functions and forms a group. After this, Group 1 contains (hour0,1), (hour0,1), (hour0,1) and so on; Group 2 contains (hour1,1), (hour1,1) and so on. The reduce function calculates the sum for each hour group. The result of the reduce function is (hour0,SUM), (hour1,SUM), …, (hour23,SUM).
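Relative to the per-URL job sketched above, only the mapper changes; the same summing reducer can be reused. The sketch below is again an illustration with an assumed regular expression: it extracts the two-digit hour from the bracketed timestamp field of each log line.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HourMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Captures the two-digit hour from a timestamp like [01/Aug/1995:00:00:01 -0400]
        private static final Pattern HOUR =
                Pattern.compile("\\[\\d{2}/\\w{3}/\\d{4}:(\\d{2}):");

        private static final IntWritable ONE = new IntWritable(1);
        private final Text hour = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = HOUR.matcher(value.toString());
            if (m.find()) {
                hour.set("hour" + m.group(1));  // e.g. emit (hour09, 1)
                context.write(hour, ONE);
            }
        }
    }

Wiring this into a job mirrors the driver above, with job.setMapperClass(HourMapper.class) and the same SumReducer as combiner and reducer.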
5. Experimental results

This section discusses the results obtained from the experiment.
5.1. Experimental setup

To calculate the total number of hits received by each URL and by the website in each hour, a 2-node Hadoop cluster is set up with the configuration shown in Table 1.
Table 1. System configuration
Operating System: Ubuntu 14.04
Hadoop Version: Hadoop 1.2.1
Number of nodes in the cluster: 2 (192.168.2.1, 192.168.2.2)
Dataset: NASA access log (July 1 - July 31, 1995)
Dataset Size: 195 MB
5.2. Results of calculating the total number of hits received by each URL

Before executing the MapReduce code in the 2-node cluster environment, the web log file is loaded into HDFS. The total number of hits in the web log file is 1,891,715. The log was collected from 00:00:00 July 1, 1995 through 23:59:59 July 31, 1995, a total of 31 days [10].
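Loading the file into HDFS can be done with the HDFS shell or programmatically; a minimal sketch of the latter is shown below. The local and HDFS paths are assumptions for illustration, not the paths used by the authors.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadLog {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml
            FileSystem fs = FileSystem.get(conf);
            // Copy the NASA log from local disk into HDFS (paths assumed),
            // equivalent to: hadoop fs -put NASA_access_log_Jul95 /user/hadoop/input
            fs.copyFromLocalFile(new Path("NASA_access_log_Jul95"),
                                 new Path("/user/hadoop/input/NASA_access_log_Jul95"));
            fs.close();
        }
    }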
Figure 4 shows the contents of the output directory named no_of_hits_by_URL in HDFS. The output is stored in a file called part-r-00000. Figure 5 shows a chunk of the output file which is generated when the
MapReduce code for calculating the number of hits
received by each URL is executed on the input web
log file.
Figure 4. no_of_hits_by_URL output directory in HDFS
When the MapReduce function to calculate the total number of hits received by each URL is executed, the CPU time spent is 42,420 milliseconds. Three map tasks and one reduce task are launched. The map tasks take 32 seconds and the reduce task takes 44 seconds.
Figure 5. A chunk of the number of hits received by
each URL output file in HDFS
5.3. Results of calculating the total number of hits received by a website in each hour
Figure 6. no_of_hits_by_Hour output directory in HDFS
When the MapReduce function to calculate the total number of hits received by the website in each hour is executed, the CPU time spent is 48,390 milliseconds. Three map tasks and one reduce task are launched to process the dataset. The map tasks take 38 seconds and the reduce task takes 23 seconds.
Figure 7. Output: Number of hits received in each hour
Figure 6 shows the contents of the output directory named no_of_hits_by_Hour in HDFS. The output is stored in a file called part-r-00000. Figure 7 shows the number of hits received by the website in each hour. This output is generated in HDFS storage after executing the MapReduce code on the input web log file.
Figure 8. Pictorial representation of number of hits received in each hour
Figure 8 shows a pictorial representation of the number of hits received by the website in each hour. From the graph, it can be seen that the maximum number of hits is received during the 9th hour.
6. Conclusion

A web log file is stored in a 2-node Hadoop distributed cluster environment and analysed. The response time taken to analyse the web log file is low because the file is broken into blocks, stored across the 2-node cluster and analysed in parallel. The MapReduce programming model of the Hadoop framework is used to analyse the web log file in parallel. In this paper, the total number of hits received by each URL and the total number of hits received by the website in each hour are calculated. In the future, the number of nodes in the cluster can be increased, and data mining techniques such as recommendation, clustering and classification can be applied to the web log file stored in the Hadoop file system to extract useful patterns. E-Commerce companies can then provide a better shopping experience to their online customers and increase their profits.
7. References

[1] "Why Big Data is a must in E-Commerce", guest post by Jerry Jao, CEO of Retention Science. http://www.bigdatalandscape.com/news/why-big-data-is-a-must-in-ecommerce
[2] "3 approaches to big data analysis with Apache Hadoop" by Dave Jaffe. http://www.dell.com/learn/us/en/19/power/ps1q14-20140158-jaffe
[3] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, (2009) "A Comparison of Approaches to Large-Scale Data Analysis", ACM SIGMOD '09.
[4] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012) "Big Data Processing using Apache Hadoop in Cloud System", National Conference on Emerging Trends in Engineering & Technology.
[5] Tom White, (2009) "Hadoop: The Definitive Guide", O'Reilly, Sebastopol, California.
[6] Apache Hadoop, http://hadoop.apache.org
[7] Jeffrey Dean and Sanjay Ghemawat, (2004) "MapReduce: Simplified Data Processing on Large Clusters", Google Research Publication.
[8] Sayalee Narkhede and Tripti Baraskar, (2013) "HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce", International Journal of UbiComp (IJU), Vol. 4, No. 3, July 2013.
[9] http://hortonworks.com/hadoop/hdfs/
[10] http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html