
DATA MINING OVER LARGE DATASETS USING HADOOP IN CLOUD ENVIRONMENT

V. Nappinna Lakshmi 1, N. Revathi 2*

1PG Scholar, 2Assistant Professor

Department of Information Technology,

Sri Venkateswara College of Engineering, Sriperumbudur – 602105, Chennai, INDIA.

[email protected] 2 [email protected]*

*Corresponding author

Abstract- Data in web applications and social networking is growing drastically; such data is referred to as Big Data. Hive queries integrated with Hadoop are used to generate report analyses over thousands of datasets, but retrieving those datasets is very time-consuming and the approach lacks performance. To overcome this problem, Market Basket Analysis, a very popular data mining algorithm, is run in the Amazon cloud environment by integrating it with the Hadoop ecosystem and HBase. The objective is to store the data persistently, along with the past history of the dataset, and to perform report analysis on it. The main aim of this system is to improve performance through parallelization of operations such as loading the data, building indexes, and evaluating queries. The performance analysis is done with a minimum of three nodes within the Amazon cloud environment. HBase is an open-source, non-relational, distributed database that runs on top of Hadoop; it stores a single key with multiple values. Looping is avoided when retrieving a particular record from huge datasets, which reduces execution time. The HDFS file system stores the data after the MapReduce operations, and execution time decreases as the number of nodes increases. The performance analysis is tuned with parameters such as the HBase heap memory and the caching parameter.

Keywords- HBase, cloud computing, Hadoop ecosystem, mining algorithm

I INTRODUCTION

Currently, dataset sizes for applications are growing at an incredible rate, and for datasets growing beyond a few hundred terabytes there are no convenient solutions for managing and analysing the data. Services such as social networking aim for goals like a minimum amount of effort in terms of software, CPU, and network. Cloud computing is a new paradigm for provisioning computing infrastructure: it shifts the location of the infrastructure to the network to reduce the costs associated with managing hardware and software resources. Cloud computing can be described as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Hadoop is a framework for running a large number of applications and includes HDFS for storing large datasets. HadoopDB tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop. The main aim of these systems is to improve performance through parallelization of operations such as loading the datasets, building indexes, and evaluating queries. These systems are usually designed to run on top of a shared-nothing architecture, where data is stored in a distributed fashion and input/output speeds are improved by using multiple CPUs and disks in parallel together with high-bandwidth network links.

HadoopDB tries to achieve the performance of parallel databases by doing most of the query processing inside the database engine. Hadoop itself is an open-source framework used in cloud environments for efficient data analysis and storage. It supports data-intensive applications by implementing the MapReduce framework, inspired by Google's architecture. The integration of Hadoop and Hive is used to store and retrieve datasets efficiently. For greater efficiency of data storage and retail-business transactions, the Hadoop ecosystem is integrated with HBase in the cloud environment to store and retrieve the datasets persistently. The performance analysis is done with MapReduce parameters such as the HBase heap memory and the caching parameter.


A. Cloud computing

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Fig 1.1 Hadoop Ecosystem (components shown: Pig, HCatalog, Ambari, Oozie, Sqoop, Flume)

B. Hadoop

Hadoop is designed to run on cheap commodity hardware. It automatically handles data replication and node failure, so the developer can focus on processing the data. It is used to store copies of internal log and dimension data sources, and for reporting, analytics, and machine learning.

C. Hive

Hive is a system for managing and querying structured data, built on top of Hadoop for data warehousing and analytics.

D. Hive QL

HiveQL is the query language of the Hive data warehouse. Its parser compiles each query into MapReduce programs and stores the large datasets in Hadoop's HDFS.
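As a hedged illustration, the sketch below submits a HiveQL query from Java through the HiveServer2 JDBC driver; the host, port, and the `ratings` table with a `movieid` column (in the spirit of the MovieLens dataset used in Section III) are assumptions, not details given in the paper.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host/port and table layout are assumed
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // The HiveQL query is compiled into MapReduce jobs over data in HDFS
        ResultSet rs = stmt.executeQuery(
                "SELECT movieid, COUNT(*) AS n FROM ratings GROUP BY movieid");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}
```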

E. Map Reduce

MapReduce is a programming model for processing large datasets, originally implemented at Google. It is typically used for distributed computing on clusters of computers. MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster when all nodes are on the same local network and use similar hardware. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing the data on or near the storage assets to decrease data transmission.

"Map" step - The master node takes the input file, divides it into smaller sub-problems, and distributes them to the worker nodes. A worker node may repeat this in turn, yielding a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.

"Reduce" step - The master node collects the answers to all the sub-problems and combines them to form the output: the answer to the problem it was originally trying to solve.
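To make the two steps concrete, here is a minimal, self-contained word-count job against the standard Hadoop MapReduce Java API. It is a generic illustration of the model, not the paper's market basket job (a sketch of that appears in Section IV).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // "Map" step: split each input line into words and emit <word, 1>
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // "Reduce" step: sum the counts collected for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```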

F. HBase

HBase is a distributed, versioned, column-oriented database built on top of Hadoop and ZooKeeper. The HBase master determines the placement of regions on the nodes, so that the data is configured and split correctly; mapper jobs are run on the region servers containing the regions in order to ensure data locality. The HBase server reads incoming requests using one listener thread and a pool of deserialization threads, and input/output is done asynchronously; the listener sleeps on a set of socket descriptors.
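As a hedged sketch of the single-key, multiple-value model described in the abstract, the following Java fragment writes and reads one row through the classic (0.9x-era) HBase client API; the table name `products`, the column family `items`, and the row key are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "products");      // assumed table name

        // One row key carries multiple column values (single key, multiple values)
        Put put = new Put(Bytes.toBytes("txn-0001"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("milk"), Bytes.toBytes("1"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("bread"), Bytes.toBytes("2"));
        table.put(put);

        // Direct keyed lookup: no looping over the whole dataset is needed
        Result row = table.get(new Get(Bytes.toBytes("txn-0001")));
        byte[] milk = row.getValue(Bytes.toBytes("items"), Bytes.toBytes("milk"));
        System.out.println("milk = " + Bytes.toString(milk));
        table.close();
    }
}
```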

G. Zookeeper

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (these registers are called znodes), organized much like a file system. Unlike a normal file system, ZooKeeper provides its clients with high throughput, low latency, high availability, and strictly ordered access to the znodes. The performance of ZooKeeper has been tested with large numbers of distributed processes.

The main differences between ZooKeeper and a standard file system are that every znode can have data associated with it, and that znodes are limited in the amount of data they can hold. ZooKeeper was designed to store coordination data such as status information, configuration, and location information. This kind of metadata is usually measured in kilobytes, if not bytes. ZooKeeper has a built-in sanity check of 1 MB to prevent it from being used as a huge data store, but in general it is used to store much smaller pieces of data, split across a large number of znodes.
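A minimal sketch of znode usage with the ZooKeeper Java client follows; the ensemble address and the `/app/status` path are assumptions, and the empty watcher is used only to keep the example short.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed ensemble; 3000 ms session timeout, no-op watcher
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Create a persistent znode carrying a small piece of coordination data
        zk.create("/app/status", "ready".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; znode payloads are meant to stay in the kilobyte range
        byte[] data = zk.getData("/app/status", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```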

H. HDFS

A Hadoop Distributed File System cluster consists of a single name node, a master server that manages the file system namespace and regulates access to files by clients, and a number of data nodes, usually one per node in the cluster. The data nodes manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files; a single file is split into one or more blocks, and the set of blocks is stored on data nodes.

Data nodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the name node.

[Fig. 1.2 HDFS architecture: a client reads from and writes to the data nodes; the name node holds the metadata (name, replicas, ...) and issues block operations and replications across Rack 1 and Rack 2.]

The name node maintains the file system; any metadata change to the file system is recorded by the name node. An application can specify the number of replicas a file needs: the replication factor of the file. This information is stored by the name node. HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks, and all blocks in a file except the last are the same size. Blocks are replicated for fault tolerance, and the block size and replication factor are configurable per file. The name node receives a heartbeat and a block report from each data node in the cluster; a block report lists all the blocks on a data node. The placement of the replicas is critical to HDFS performance, and optimizing replica placement distinguishes HDFS from other distributed file systems.
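The per-file block size and replication factor described above can be set through the HDFS Java API; the sketch below is illustrative, with the file path assumed and the 64 MB block size chosen to match the block size used later in Section IV.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/transactions.csv"); // assumed path

        // Create the file with replication factor 3 and 64 MB blocks
        FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeBytes("milk,bread,butter\n");
        out.close();

        // The replication factor can also be changed after creation
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```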

I. Rack-aware replica placement

Goal - improve reliability, availability, and network bandwidth utilization.

Topology - there are many racks, and communication between racks passes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The name node determines the rack id of each data node.

Replica placement - replicas are placed on nodes across different local racks. Replica selection - for a READ operation, HDFS tries to minimize bandwidth consumption and latency: if there is a replica on the reader's node, it is preferred, and when an HDFS cluster spans multiple data centers, a replica in the local data center is preferred over a remote one.

File system metadata - the HDFS namespace is stored by the name node. The name node uses a transaction log called the EditLog to record every change that occurs to the file system metadata. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, kept in the name node's local file system.

II RELATED WORKS

D. Abadi [1] performs large-scale data analysis with a traditional DBMS. The data management is scalable, but data is replicated; this replication is what provides fault tolerance.

Y. Xu, P. Kostamaa, and L. Gao [3] deploy the Teradata parallel DBMS for large data warehouses. In recent years data volumes have increased, and some data, such as web logs and sensor data, is not managed by the Teradata EDW (Enterprise Data Warehouse). Researchers agree that both the parallel DBMS and Hadoop MapReduce paradigms have advantages and disadvantages for various business applications, and that both will coexist for a long time. Integrating such optimization opportunities is not possible for a DBMS running on a single node.

Farah Habib Chanchary [6] shows that large datasets are efficiently stored across clusters of machines in cloud storage systems, so that keeping the same information on more than one system lets the datasets remain usable even if one system's power fails.

J. Abadi and A. Silberschatz [8] analyse massive datasets on very large clusters within the HadoopDB architecture for real-world applications such as business data warehousing. It approaches parallel databases in performance, but it still lacks scalability and consumes a large amount of execution time.

III HBASE-MINING ARCHITECTURE

[Fig. 3.1 HBase-Mining Architecture: Hive dataset queries and the data report analysis sit on top of Hadoop (HDFS and HBase) running the market basket algorithm; the Hadoop cluster in the Amazon cloud hosts the name node/data nodes (NN/DN), job tracker/task trackers (JT/TT), HBase master/region servers (HM/HRS), secondary name node (SN), and Hive query processing (HQP).]

Hive queries are used to store and retrieve the MovieLens datasets. Efficient database storage with the market basket algorithm is used to sustain the heavy transaction load of a supermarket retail business. The aim of this system is to improve performance through parallelization of operations such as loading the datasets, building indexes, and evaluating queries; the Hadoop database is integrated with HBase to store and retrieve the huge datasets without any loss of transactions. The Amazon cloud hosts the HDFS file system holding the name node and data nodes, along with the region servers where the datasets are stored after being split from the HBase table. Hive Query Processing (HQP) is also treated as one of the data nodes.

A. Advantage of having the HADOOP database

An RDBMS can store only a limited amount of data, uses SQL operations, and cannot execute over the data concurrently: there is no parallel processing, it is a single-threaded process, and it can handle only one dataset at a time. Hadoop with HBase data storage, in contrast, can store huge amounts of data, up to petabytes; it uses NoSQL operations and executes over the data concurrently, so parallelization is achieved and the work is split according to the number of processors. The map function emits <key, value> pairs, and the data is then processed in the reduce function to produce the result.

Table 1 Differences between SQL (RDBMS) and NoSQL (Hadoop/HBase) storage

  Aspect          RDBMS (SQL)                       Hadoop + HBase (NoSQL)
  Data volume     Limited                           Up to petabytes
  Operations      SQL                               NoSQL
  Execution       Single-threaded, no parallelism   Concurrent, parallel across processors
  Data handled    Structured only                   Structured and unstructured

An RDBMS stores only structured data, whereas Hadoop data storage systems store both structured and unstructured data (examples of unstructured data are mail, audio, video, medical transactions, etc.).

IV MARKET BASKET ALGORITHMS

Market basket analysis is one of the most popular data mining algorithms. It supports cross-selling promotional programs by generating combinations of items, and association rules are used to identify pairs of items that occur together. The advantages of the algorithm are that its computations are simple and that different forms of data can be analysed; it offers broad scope for selecting promotions in purchasing and for joint promotional opportunities.

The actionable information identified by market basket analysis includes the profitability of each purchase profile; its marketing uses include layouts and catalogs, selecting products for promotions, space allocation, and product placement. Purchase patterns are identified from items that tend to be purchased together and items that are purchased sequentially.


A. Market Basket Analysis for Mapper

The input file is loaded into the Hadoop distributed file system and stored in 64 MB blocks along with <key, value> pairs; the mapping function is then performed, and each mapped record is stored in its respective HBase table.

B. Algorithm

1. The input file is loaded into HDFS.
2. The file is split into blocks of 64 MB.
3. The mapping function is performed, emitting <k1, v1> pairs.
4. The output is stored in the HBase tables (a sketch follows this list).
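A hedged Java sketch of step 3 follows: each input line is treated as one transaction, and the mapper emits an item pair as k1 with a count of one as v1. The comma-separated input format and the pair-per-key encoding are assumptions; the paper does not fix these details.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BasketMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input: one transaction per line, items separated by commas
        String[] items = line.toString().trim().split(",");
        Arrays.sort(items); // canonical order so (a,b) and (b,a) become one key

        // Emit every item pair in the basket as <k1, v1> = <"a,b", 1>
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                context.write(new Text(items[i] + "," + items[j]), ONE);
            }
        }
    }
}
```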

C. Market Basket Analysis for Reducer

The HBase market basket table is split into region servers, which store the datasets with the help of ZooKeeper, and <k1, v1> pairs are allotted to each record involved in the transactions. The reduce function is performed to produce the output file as <k2, v2> pairs, and the grouping and sorting functions then yield the final <k3, v3> result in the HBase table.

D. Algorithm

1. The HBase market basket table is split into region servers that store the datasets with <k1, v1> pairs.
2. The <k2, v2> output pairs are obtained by the reduce function, processed in alphabetical order.
3. Sorting and grouping are performed by counting the values to obtain the final <k3, v3> result in the HBase table (a sketch follows this list).
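A hedged companion to the mapper sketch: this reducer sums the counts for each item pair and writes the final <k3, v3> result into an HBase table via the TableReducer helper from the HBase-MapReduce integration. The column family `stats` and qualifier `count` are assumptions; the destination table is whatever the job configures (e.g., through TableMapReduceUtil).

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BasketReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the co-occurrence counts emitted by the mapper for this item pair
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }

        // Final <k3, v3>: row key = "itemA,itemB", value = total count
        Put put = new Put(Bytes.toBytes(pair.toString()));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("count"),
                Bytes.toBytes(String.valueOf(sum)));
        context.write(null, put); // written to the HBase table set on the job
    }
}
```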

V RESULTS

Fig. 5.1 Performance analysis of data sets

The HBase product table holds around 1 million records, and about 5.1 million records are processed during the algorithm analysis.

Performance results (1 million records per run):

  Nodes   Execution time
  1       4 min 37 s
  2       3 min 31 s
  3       2 min 56 s
  5       2 min 00 s

Performance parameters tuned: HBase heap memory and the caching parameter.
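As a hedged illustration of the second knob, scanner caching controls how many rows a scan fetches per RPC; the value 500 below is only an example. The heap setting itself is normally raised through the HBASE_HEAPSIZE variable in hbase-env.sh rather than in code.

```java
import org.apache.hadoop.hbase.client.Scan;

public class ScanTuning {
    public static Scan tunedScan() {
        Scan scan = new Scan();
        scan.setCaching(500);       // rows per RPC; larger values cut round trips
        scan.setCacheBlocks(false); // skip the block cache on full-table MR scans
        return scan;
    }
}
```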

VI CONCLUSION

The integration of the Hadoop ecosystem with HBase is used to store and retrieve huge datasets without any loss, and parallelization is achieved in loading the datasets, building the indexes, and evaluating the queries. The Hadoop ecosystem can store both structured and unstructured datasets, and it can add and delete datasets in parallel. The MapReduce function splits the datasets and stores them according to the number of processors, and the market basket analysis mapper and reducer functions store and retrieve millions of records through the <key, value> pairs allotted to each record. The resulting datasets are stored in the HBase table, and the performance analysis is carried out with five nodes.

VII FUTURE WORK

The Hadoop ecosystem integrated with the HBase database can be applied in fields such as telecommunications, banking, insurance, and medicine, to maintain public records efficiently and to prevent fraud.

VIII REFERENCES

[1] D. Abadi, "Data Management in the Cloud: Limitations and Opportunities," IEEE Data Engineering Bulletin, Vol. 32, No. 1, March 2009.
[2] D. Warneke and O. Kao, "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud," IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 6, June 2011.
[3] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and Parallel DBMS," Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2010.
[4] H. Xu, Z. Li, et al., "CloudVista: Interactive and Economical Visual Cluster Analysis for Big Data in the Cloud," IEEE Conference on Cloud Computing, Vol. 5, No. 12, August 2012.
[5] R. M. Losee and L. Church Jr., "Information Retrieval with Distributed Databases: Analytic Models of Performance," IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 1, January 2004.
[6] F. H. Chanchary, "Data Migration: Connecting Databases in the Cloud," IEEE Journal on Computer Science, Vol. 40, pp. 450-455, March 2012.
[7] K. Bajda-Pawlikowski et al., "Efficient Processing of Data Warehousing Queries in a Split Execution Environment," ACM, Vol. 35, June 2011.
[8] J. Abadi and A. Silberschatz, "HadoopDB in Action: Building Real World Applications," ACM SIGMOD Conference, Vol. 44, USA, September 2011.
[9] S. Chen, "Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce," Proceedings of VLDB, Vol. 23, pp. 922-933, September 2010.
[10] R. Vernica, M. Carey, and C. Li, "Efficient Parallel Set-Similarity Joins Using MapReduce," Proceedings of SIGMOD, Vol. 56, pp. 165-178, March 2010.
[11] Aster Data, "SQL MapReduce Framework," http://www.asterdata.com/product/advanced-analytics.php.
[12] Apache HBase, http://hbase.apache.org/.
[13] J. Lin and C. Dyer, "Data-Intensive Text Processing with MapReduce," Morgan & Claypool Publishers, 2010.
[14] GNU Cord, http://www.coordguru.com/.
