The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear...

31
Lijiulie Chen 1207-2725 Melfa Rd. Vancouver BC V6T1N4 July 28, 2009 Jane Pavelich Kaiser 3036 The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4 Dear Ms Pavelich: Re: EECE 496 Formal Report Enclosed is a copy of my EECE 496 final report on “A Distributed Graph Analysis Tool to Study Implicit Relations in Tagging Systems”. This report presents the design and implementation of a software package. It also reveals work done for exploring parallel database systems used for data processing in this tool. The challenges and future development are discussed as well. Yours sincerely, Keven Chen Enclosure

Transcript of The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear...

Page 1: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

Lijiulie Chen

1207-2725 Melfa Rd.

Vancouver BC V6T1N4

July 28, 2009

Jane Pavelich

Kaiser 3036

The Fred Kaiser Building

2332 Main Mall Vancouver BC

V6T 1Z4

Dear Ms Pavelich:

Re: EECE 496 Formal Report

Enclosed is a copy of my EECE 496 final report on “A Distributed Graph Analysis Tool

to Study Implicit Relations in Tagging Systems”. This report presents the design and

implementation of a software package. It also reveals work done for exploring parallel

database systems used for data processing in this tool. The challenges and future

development are discussed as well.

Yours sincerely,

Keven Chen

Enclosure

Page 2: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

A Distributed Graph Analysis Tool to Study

Implicit Relations in Tagging Systems

Name: Lijiulie Chen – 57007064

Technical Supervisor: Matei Ripeanu

Page 3: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

ii

Abstract

Tagging systems have been widely used to help people search, browse and share online

content. For my EECE 496 project, I designed and implemented a software package that

helps to characterize the implicit relationship between users, items and tags. These

relationships are often modeled as a graph, where nodes are users, items or tags and

edges connect these elements if they are similar according to some metric such as the

Jaccard Index. The project focuses on the creation of an extensible software package that

enables the analysis of these graphs. Since the size of the dataset of the tagging systems is

commonly very large in a research setting, the software package is also implemented to

fulfill a time-efficient requirement by utilizing a computer cluster to speed up the graph

building and analyzing process. In this project, MySQL [1] (i.e., a traditional open source

relational database solution) and Hive [2] (i.e., a data store solution based on MapReduce

[3] that allow applications to express queries in SQL) are used and evaluated for their

performance. In summary, we observe that MySQL achieves superior performance

compared to Hive in graph creation workload. Other database systems should be explored

in the future for performance evaluation such as HadoopDB [4], which provides a hybrid

solution (i.e., traditional databases and MapReduce paradigms).

Page 4: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

Table of Contents

A Distributed Graph Analysis Tool to Study Implicit Relations in Tagging Systems ..................... i

Abstract .................................................................................................................................. ii

Glossary ..................................................................................................................................... iii

List of Tables .............................................................................................................................. v

List of Figures ............................................................................................................................ vi

List of Abbreviations ................................................................................................................. vii

1.0 Introduction ........................................................................................................................... 1

2.0 Tools and Methodology ......................................................................................................... 3

2.1 Tools ................................................................................................................................. 3

2.1.1 Software Tools ............................................................................................................ 3

2.1.2 Deployment Environment ........................................................................................... 4

2.2 Design Concepts ................................................................................................................ 5

2.2.1 Design Patterns ........................................................................................................... 5

3.0 DESIGNS ............................................................................................................................. 8

3.1 Data Preparation & Database Setup ................................................................................... 9

3.2 Data Access Layer ........................................................................................................... 10

3.4 Graph Analysis ................................................................................................................ 14

4.0 RESULTS ........................................................................................................................... 16

4.1 Project Summary ............................................................................................................. 16

4.2 Performance Evaluation ................................................................................................... 17

4.4 Project Difficulties........................................................................................................... 19

4.5 Future Development ........................................................................................................ 20

5.0 CONCLUSION ................................................................................................................... 21

7.0 REFERENCES .................................................................................................................... 23

Glossary

Page 5: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

iv

Computer Cluster a group of linked computers that act like a single system and enable high

availability, load balancing and parallel processing

Node (machine) a device attached to a computer network or other telecommunications

network, such as a computer or router, or a point in a network topology

at which lines intersect or branch

Tag (metadata) a tag is a non-hierarchical keyword or term assigned to a piece of

information. This kind of metadata helps describe an item and allows it

to be found again by browsing or searching

Hadoop a free Java software framework that supports data intensive distributed

applications

Hive a data warehouse infrastructure build on top of Hadoop that provides

tools to enable easy data summarization, adhoc querying and analysis of

large datasets data store in Hadoop files

MapReduce a software framework introduced by Google to support distributed

computing on large data sets on clusters of computers

MySQL a relational database management system (RDBMS)

SQL a database computer language designed foro managing data in relational

database management systems

HadoopDB A hybrid of DBMS and MapReduce technologies that targets analytical

workloads

Tag a non-hierarchcal keyword or term assigned to a piece of information

(such as an internet bookmark, digital image or computer file)

Page 6: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

v

List of Tables

Table 1: Development Tools Used ………...………………………………………..4

Table 2: Cluster Node Specification Chart .…………………………………………5

Page 7: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

vi

List of Figures

Figure 1: two common data-oriented parallel-processing schemes ………………………6

Figure 2: system diagram …………………………………………………………………9

Figure 3: UML diagram of the data access layer ………………………………………..10

Figure 4: UML diagram of the graph factory module …………………………………. 12

Page 8: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

vii

List of Abbreviations

JDBC Java Database Connector

UBC the University of British Columbia

NetSysLab Networked Systems Laboratory

API Application Programming Interface

Page 9: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

1.0 Introduction

This report presents the design and evaluation of a data analysis tool for tagging systems.

The objective of this project is to deliver a software package that would process activity

traces of a tagging system (i.e., del.icio.us) and prepare it for further graph analysis. The

main focus of my investigation is on how to accomplish large-scale data processing more

efficiently. This software package will play a significant role in future tagging-system

data analysis research in NetSysLab at UBC.

Considering the increasingly large amount of data generated daily by users in tagging

systems, the graph analysis tool should be designed to cope with large-scale activity

traces. Hence, the goal of this project is to evaluate the efficiency of leveraging a parallel

processing paradigm like Hadoop to a traditional MySQL standalone solution. To achieve

this, the software is designed to be agnostic of data storage solution, so that it is easy to

experiment with different database application.

I have successfully built and tested a Java software package that processes tagging

activity records and constructs graph representations. The software is able to utilize either

MySQL or Hive as its underlying data storage to operate in either single-node or parallel-

processing (i.e., using a computer cluster) mode. I have also written two graph analysis

modules as samples for this software package. The package is extensible enough to allow

its users to exploit different data warehouse infrastructure and develop their own graph

analysis modules based on their needs.

Page 10: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

2

The scope of this project includes all design and implementation of data processing and

deployment. Moreover, modification of existing parallel processing implementations to

improve performance is outside of the scope of the project because of time constraints.

This report is divided into three main sections. First, it provides an overview of the tools

and methodology used. Next, it explains the design and implementation of the software

package in detail. Finally, it discusses the results and future directions of this project.

Page 11: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

3

2.0 Tools and Methodology

The implementation of the graph analysis package makes use of the computer cluster

located in the NetSysLab. This section provides an explanation of the equipments I used

and the methodology behind this approach.

2.1 Tools

The tools used in this project are split up into two categories: software and hardware. I

will describe each category separately in this section.

2.1.1 Software Tools

Most of the data analysis is written in Java, since it is architecturally neutral and provides

good integration with SQL and Hadoop. The code I wrote on a 32-bit machine can be

seamlessly compiled and run on a 64-bit system. Another big advantage of using Java is

because it is widely supported by the open source community. Thus, a variety of libraries

are available that may help support the development of the data analysis package. I can

easily integrate my project with the existing project that others have already developed.

As alternative data storage layers, I used both Hive/Hadoop to leverage a cluster of

computers to run the data analysis in parallel, and MySQL that provides a traditional

relational database backend data store.

Apache Hadoop (Hadoop) is an open-source implementation of the MapReduce

framework that enables applications to run on a cluster. Hive is a data warehouse

framework built on top of Hadoop. It provides users a SQL like query language which

Page 12: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

4

allows them to query data. Hive also implemented a JDBC interface which enables

programmers to write Java class to access the file system.

Table 1 describes the tools used for this project and their corresponding version.

Tool Version Description

Eclipse 3.3.2 Development environment for Java

Java 1.6.0_04 Core programming language

MySQL Database 5.0.45 Database for storing dataset

Hadoop 0.19.0 Java software framework for creating parallel

processing in the cluster

Hive 0.19 Data warehouse infrastructure for saving data to

Hadoop File System

Table 1 – Development Tools Used

2.1.2 Deployment Environment

The hardware environment for the deployment of the project is a 22 nodes computer

cluster provided by the NetSysLab. All the nodes are interconnected by a

PowerConnectTM

6248 48-port Gigabit Ethernet Layer 3 switch, which should provide

high availability and improved performance compared to a single computer. The table

below describes the basic specification of each node in the cluster.

Intel® Xeon® CPU E5354 @ 2.33GHz

(quad-core machine)

Page 13: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

5

4 GB RAM

800 GB hard drive

Table 2 - Cluster Node Specification Chart

2.2 Design Concepts

In this project, I use some software design patterns in the project architecture. This is to

ensure that I am following a pragmatic software design approach. Also, I have attempted

to use parallel processing for this project to utilize more computing power. These design

concepts will be explained in the following sections.

2.2.1 Design Patterns

In the design of the project I apply two design patterns to help make the data analysis tool

easy to extend and its modules properly decoupled. These design patterns are Factory

Method and Singleton [5]. I will explain the intent and application of these patterns in the

sections immediately following. However, I will only explain my implementation of

these patterns in more details in the section 3.0 of the report.

2.2.1.1 Factory Method

This design pattern is intended to let a client obtain instances of an object type without

knowledge about its concrete implementation. In other words, an object that implements

an interface can be created without specifying the exact class of the object that will be

created. [5] It allows greater extensibility because new subclasses can easily be added

later, without making a lot of changes to the existing code. Also, it decouples the abstract

interface with concrete implementations.

Page 14: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

6

2.2.1.2 Singleton

This design pattern is intended to ensure that there is only one instance of a certain class

during the execution of the program. This instance should be able to be accessed globally

by outside classes. This pattern is able to restrict some class to have one and only one

instance. It allows strict control to one particular instance and is able to permit refinement

operations on the instance. [5]

2.2.2 Parallel Processing

The target data sets used in the graph analysis can easily reach over ten Gigabyte; I have

considered utilizing a computer cluster to do parallel processing. Research [6] shows that

there are two common data-oriented parallel-processing schemes. One uses centralized

data storage and another uses distributed data storage. In this section of the report, I will

explain the general concept of parallel processing and the difference between the two

parallel-processing approaches. Figure 1 illustrates the two schemes.

Figure 1 - two common data-oriented parallel-processing schemes

Page 15: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

7

In the centralized data storage scheme, data is stored on a master node, but is processed

on each individual node in a cluster. The master node has to first send the data over the

network to a child node, and then let the child nodes complete required computations. In

the end, all the child nodes have to send back their results to the master node. The

performance of this approach is contingent on how fast the network is rather than how

fast each individual node can compute. In contrast, the distributed data storage scheme

relies more on computing power of the individual nodes.

In the distributed scheme, data is stored and computed on individual nodes. The master

node has prior knowledge of which node is hosting a particular part of the data set. The

master node will issue a computation command to each individual node upon a data

processing request. Each node then accesses the data stored on the node itself, completes

requested computation and sends the results back to the master node for result assembly.

Unlike the previous approach, the performance of the distributed scheme is contingent on

the number of child nodes in the cluster. For this project, I am using the distributed data

storage approach because it fits the application performance requirement.

Page 16: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

8

3.0 DESIGNS

The overall design of the data analysis tool is divided into four main modules: database

storage layer, database access layer, graph creator and graph analyzer.

In general, the data set is imported into either the MySQL database or Hadoop File

System [7]. The database layer implements a factory method that creates a connection to

either of the data storage depending on the user’s configuration and returns a single

database interface to the client modules. The interface allows the client modules (i.e.,

Graph analyzer, Graph factory) to perform basic operations against the database (i.e.,

load data, execute queries). This architecture offers developers the freedom to implement

new java classes to access other types of data storage without changing the client

modules, as long as it keeps the same interface. Furthermore, the queries are saved in

separate files and are accessed by the classes from their file paths. In this way, users can

easily modify the queries without changing the code at all. Figure 2 illustrates the system

architecture.

I will talk about each module of the system in the next few sections, starting with initial

data preparation and database setup.

Page 17: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

9

Figure 2 - System Diagram

3.1 Data Preparation & Database Setup

In order to start the graph analysis phase, raw data sets must be first processed into

workable forms and stored into a database. In this section, I will explain how I have

prepared raw traces of activity from popular tagging systems for processing. Also, I will

present some optimization techniques used in setting up the database structure that I have

used. This is a crucial initial step for the project, since data processing at this scale

requires timely efficient data storage.

To start, I obtained the dataset from a research lab’s website [8] in CSV (comma

separated values) format. The dataset used strings of hash codes as indexes. I converted

to integers for faster processing.

Page 18: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

10

Two data storage layers are configured: MySQL and Hive. This enables different use

case scenarios and performance comparison.

To ensure each database is structured efficiently, in the MySQL database setup I have

chosen the B-TREE indexing method, which is an implementation of a tree data structure

that enables faster queries based on relational operations (i.e., =, >, etc). In the Hive

database setup, I have used a feature called replication, which makes multiple copies of

the same data set. Replication improves the availability of the dataset and reduces data-

request queuing, as independent queries that read the same data blocks can run in

different data nodes in parallel.

After the initial data preparation and database setup, the data access layer will come into

play for the next stage of data processing. In the next section, I will explain the role of

data access layer and its implementation.

3.2 Data Access Layer

The data access layer can create connection to various databases and returns one single

object to allow the client modules (i.e., GraphCreator and GraphAnalyzer) to invoke data

storage operations. This layer consists of three parts, which are the database connection

pool, the database factory and the concrete database implementations. The class diagram

of the data access layer is shown in Figure 3.

Page 19: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

11

+executeBatch(in queryfile : String, in _conn : Connection) : void

+executeQuery(in queryfile : String, in _conn : Connection) : ResultSet

+excuteQuery(in queryfile : String, in parameters : Map, in _conn : Connection) : ResultSet

+executeUpdate(in queryfile : String, in _conn : Connection) : void

+executeUpdateWithParameter(in queryfile : String, in parameters : Map, in _conn : Connection) : int

+getConnectionPool() : ConnectionPool

+shutdown() : void

<<interface>>

Database

#ConnectionPool() : ConnectionPool

#createDBConnection() : Connection

+getAvailableConnection() : Connection

-getConnectionIfAvailable() : Connection

-init() : void

+releaseConnection() : void

+shutdown() : void

#DEFAULT_NUMBER_OF_CONNECTIONS : int

-driver : String

-numberOfConnections : int

-password : String

-pool : Map

-url : String

-user : String

ConnectionPool

+DatabaseFactory() : DatabaseFactory

+getDatabaseInstance() : Database

-DELICIOUS : String

-HIVE : String

DatabaseFactory

#createDBConnection() : Connection

+getExpresssionFromFile(in filename : String) : String

+prepareaStatement(in _query : String, in _map : Map) : String

#dbAlias : String

#dbConnection : Connection

#dbConnectionPool : ConnectionPool

#dbMachine : String

#pooledConnection : PooledConnection

-serialVerionUID : long

#url : String

<<implementation class>>

HiveDatabaseLayer

#createDBConnection() : Connection

+getExpresssionFromFile(in filename : String) : String

+prepareaStatement(in _query : String, in _map : Map) : String

#dbAlias : String

#dbConnection : Connection

#dbConnectionPool : ConnectionPool

#dbMachine : String

#pooledConnection : PooledConnection

-serialVerionUID : long

#url : String

<<implementation class>>

MySQLDatabaseLayer

<<uses>>

Figure 3 - UML diagram of the data access layer

As I mentioned in the design concepts section, a factory method is implemented to create

an instance of the database object. The users are able to configure which database

storage to use from the config.properties file. If one instance of the database is already

created, that particular instance is returned. This singleton object makes sure that there is

only one database object instantiated. Thus the number of connections to the database

storage can be easily controlled within the database layer. The ConnectionPool class

keeps track of the number of the connections that are connecting to the database. The

Page 20: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

12

upper limit of the connections is set by users. Once the number of connections exceeds

the upper limit, the process that requires the new connection will be put into sleep mode

until another process releases one connection. A synchronized method is implemented in

this class to prevent the race condition that could potentially be created in multi-threaded

processes. The third module in the data access layer is the concrete database

implementation. There are currently two database classes implemented, MySQL and

Hive. MysqlDatabaseLayer is the class used to access the data saved in MySQL

databases. It is limited to access only one machine at a time. On the other hand, the

implementation of the HiveDatabaseLayer accesses Hive data storage which is highly

scalable in the number of machines. It is able to utilize the computational power of the

whole cluster. However, this setup does not really give the expected throughput when we

tested it. I will talk about the performance of the program in the latter part of the report.

3.3 Graph Factory

The graph factory module is responsible for computing the edges and edge weights of the

graph based on the activity trace. The graph factory module consists of three main parts,

which are the graph factory, the edge creator and the concrete graph implementation. The

class diagram of the graph factory is illustrated in the Figure 4 below.

Page 21: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

13

+buildEdges(in type : EdgeType) : void

+buildNodes(in type : NodeType) : void

+getEdges(in type : EdgeType) : Set

+getEdgeWeight(in startNode : Node, in endNode : Node) : double

+getinstance() : IGraph

+getNodeCount() : int

+getNodeCount(in type : NodeType) : int

+getNodes(in type : NodeType) : Set

+getRelatedNodes(in type : Node) : Set

+getRelatedNodesNum(in type : Node) : int

<<interface>>

IGraph

+Node(in nodeType : NodeType, in nodeId : int) : Node

+getNodeId() : int

+getNodeType() : NodeType

-id : int

-type : int

Node

+valueOf() : NodeType

+values() : NodeType

+ITEM

+TAG

+USER

<<enumeration>>

NodeType

+valueOf() : EdgeType

+values() : EdgeType

+TagItemEdge

+TagTagEdge

+UserTagEdge

<<enumeration>>

EdgeType

+valueOf() : GraphType

+values() : GraphType

+tagPathwayGraph

+

<<enumeration>>

GraphType

+EdgetCreator(in tempNode : Set, in edgeType : EdgeType) : EdgeCreator

-populateEdgeTable() : void

-connection : Connection

-connectionPool : ConnectionPool

-db : Database

-parameter : Map

-rs : ResultSet

-startNodes : Set

-tempEdgeType : EdgeType

-tempQuery : String

<<implementation class>>

EdgeCreator

+run() : void

<<interface>>

Runnable

+TagPathwayGraph()

+addNewNodes(in tempQuery : String, in colNum : int, in type : NodeType, in nodeSet : Set) : void

-connection : Connection

-connectionPool : ConnectionPool

-db : Database

-items : Set

-tags : Set

-tempQuery : String

-tiEdges : Set

-ttEdges : Set

-users : Set

-utEdges : Set

TagPathwayGraph+GraphFactory()

-createGraph(in type : GraphType) : IGraph

GraphFactory

<<uses>>

<<uses>>

<<uses>>

End1

End2

End3

End4

Figure 4 - UML diagram of the graph factory module

The graph factory uses the same design pattern as Database factory does. It creates and

returns an instance of the graph object which implements the IGraph interface.

TagpathwayGraph is the class that implements the IGraph interface. This graph type is

defined to illustrate the network formulation that connects users to items via a path

composed of tags. (See Figure 3). It is a graph where the nodes are represented by the set

of users U, set of tags T and set of items I, edges are directed and divided into three

Page 22: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

14

disjoint sets. More formally, G = (U ∪ T ∪ I, Eut ∪ Ett ∪ Eti). In particular, Eut contains

edges that connect users to tags, Ett contains edges that connect tags, and Eti is the set of

edges that connects tags to items. Additionally, the edge weights can be defined,

respectively, as the frequency a user uses a tag, the co-occurrence frequency between tags

and the frequency in which a tag is assigned to an item. [9]

Figure 5 - the network composed of users that are connected to items of items through tag paths

The dataset we analyze consists of the tagging assignments from del.icio.us during 2003-

2006.

3.4 Graph Analysis

The graph analysis module is responsible for computing characteristics of the graph built

by the Graph Factory module. The graph analyzer is well separated from the concrete

graph implementation. Therefore, it can perform the analysis on the graph regardless the

type of the graph. This is achieved by the definition of the IGraph interface and the

factory methods.

Page 23: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

15

I have currently implemented two graph analyzers - node centrality analysis and weight

distribution analysis. The node centrality analysis counts the degree centrality of each and

returns an average or the distribution of degree centrality of the graphs. This analysis is

an example of performing the analysis on the basis of the IGraph interface. On the other

hand, the weight distribution analysis is an example of doing analysis against the

database layer directly. It enables the analyzer to perform query against the database

through the database layer.

Page 24: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

16

4.0 RESULTS

The project was successful in achieving most of the objectives. I have completed the

development of a software package that will enable future researchers to conduct graph

analysis easily. This software package not only abstracted away a lot of the low-level

database operations involved in processing tagging activity traces, but it also provided a

simple abstract graph model for users without any database knowledge to conduct simple

graph analysis. Furthermore, efforts were made to ensure the software package is

extensible enough for users to add in their own graph model implementations easily and

run customized graph analysis with special SQL/HiveQL instructions. Most importantly,

I have explored ways to achieve all of the above in a time-efficient manner for large data

sets.

In this section, I will first summarize project deliverables as a whole. Then I will evaluate

the performance of this software package on different database infrastructures. Lastly, I

will briefly state the difficulties this project faced and discuss its future directions.

4.1 Project Summary

A database layer has been successfully implemented and is able to access two types of

database storage systems: MySQL and Hive. It is also extensible to other storage types by

implementing the Database interface without changing any existing codes. Furthermore,

a graph factory and an analyzer interface have been defined. One graph building tool and

two graph analyzing tool have been implemented and successfully run against MySQL

database. Due to a bug in the Hive API, the project cannot be fully conducted. However,

Page 25: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

17

by using a smaller dataset, we are able to measure the performance of the Hive platform.

I also investigated the performance of the Hive infrastructure on a cluster. In the

following paragraph, I will talk about the performance of the project.

4.2 Performance Evaluation

The performance of the system is divided into two parts; the graph creation part and the

graph analysis part.

For the test of the graph creation module, a tag trace dataset from del.icio.us was used,

which is 1.2GB in size. The task is to extract the three types of edges as I described

before from the trace and save it back to the database. The MySQL SQL command that is

used for building edges for one node is shown below. This particular statement is for

building a user-tag-edge. It finds distinct uid and tagid pairs for one uid, finds the

frequency of appearance of each pair and then inserts the results back to the database.

Insert into userToTag (uid, tagid, weight)

Select A.uid, A.tagid, A.Count1 * 1.0 / B.Count2 As Freq

From

(Select uid, tagid, Count(*) As Count1

From delicious2003_2006_main Where uid=?

Group By tagid

Order by Count1 DESC) As A

Inner Join

(Select uid, Count(*) As Count2

From delicious2003_2006_main

Where uid=?

Group By uid) As B On A.uid = B.uid;

Page 26: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

18

The SQL query used by Hive is similar to the one shown above in logic but with a

slightly different syntax.

This task is performed against both MySQL on one single machine and Hive on a cluster

of 17 machines. It takes MySQL less than 1 second to finish one query like the one

shown above and about 3 days to finish this task. On the other hand, it may take Hive 55

seconds to finish the same query and up to 6 months to complete the whole task, which is

estimated by the time to build one edge multiplied by the number of edges to build.

For the test of graph analysis module, the task is to find out the distribution of the weight

of the edges from a dataset in size of 24GB. The query used is shown below.

This query only involves a where clause which is simpler than the one used to build the

graph. To execute this query once, it takes MySQL about 180 seconds and Hive about

150 seconds. Hive performs slightly better than MySQL this time. It is important to

highlight, though, that Hive used 16 more nodes.

There are several reasons why Hive on a cluster of 17 machines is outperformed by

MySQL on a single machine. First, to execute a query, Hive’s compute an execution plan

that attempts to divide the work into smaller parallel sub-tasks. This process takes several

seconds to complete, which becomes a huge overhead when executing a relatively small

task. Second, the data transfer time between nodes is also a huge overhead. Hive splits

one query task into several jobs and assigns them to each machine. The result of each

task is then sent back to central node for assembly of the either intermediary results,

which leads to a new job submission, or the final result. This MapReduce process creates

SELECT COUNT(weight) FROM useruserweight WHERE weight <= ?

Page 27: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

19

more overhead as it incurs in synchronization points that cannot be parallelized. When

the time to perform each task is small, that overhead becomes significant. To minimize

this overhead, I configured the HDFS layer to use three replicas of the dataset to increase

the data availability and decrease the need for reducing stages. This method only

increases the performance by 5% in contrast to using 3 times of disk space.

I have hypothesized that Hive will have a greater advantage over MySQL for even larger

datasets. However, for the datasets available to our investigation (i.e. around 20-50 GB),

there are probably not that much of a difference in performance time-wise because the

overhead of data distribution and setup is simply too big. For smaller datasets (i.e. under

15 GB), it is clear that MySQL on a single node is much faster. This performance

evaluation will serve as a rough benchmark for future users of this software package to

determine which database structure to use.

4.4 Project Difficulties

The main difficulty I encountered was how to parallelize the computing process. It

involves utilizing the computing power of multiple CPUs and distributing the dataset

storage on multiple hard drives. Other than program design and database setup involved

in parallelization, database optimization was also a challenge for me. Although there

were many attempts made on Hive, including ones mentioned in the newly published

paper [10], no significant performance improvement was shown. These difficulties have

delayed the intended project timeline considerably; however, I think the solution that I

came up with meets or even exceeds the expectations set for the project.

Page 28: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

20

4.5 Future Development

Future development for this project can be done from two different aspects. One is to

enhance the current code base and database setup. Another is to extend the current

software to conduct complex graph analysis.

The first aspect involves enhancing the performance of existing database setups. I believe

Hive and MySQL single node are not the best available. Other database systems,

especially in the parallel processing domain, should be explored for their performances in

various tasks (e.g., loading and executing). One of such systems is HadoopDB[4]. These

systems should be compared to MySQL single node and Hive. These comparisons should

be documented and will help users pick the most appropriate database for their data sets.

The second aspect involves extending the current code base. Users can potentially use

the software package to perform more complex graph analysis. They can write their own

graph analysis implementations using the current interfaces. Also, users may extend this

software package to build their own graph models (i.e. write their own graph factories).

Overall, this software package is designed with further extensions in mind. Thus, future

development should be smooth with the provided code structure and database setup.

Page 29: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

21

5.0 CONCLUSION

This report investigated the details of developing a software package that enables future

users to conduct complex graph analysis. It explained the details of setting up different

database systems and the design. The end product of this project will provide a shortcut

for researchers for tagging system data analysis. It also includes a basic graph factory and

two simple graph analysis implementations, which can both serve as example extensions

of the package. Users of this package could follow these examples to build their own

graph factories and graph analysis implementations. Challenges of database performances

are partially solved by using a parallel processing solution, Hive. However, its

performance for research-sized data sets is not very different from a MySQL database,

which is non-parallel. Hence, further explorations in other parallel systems, such as

MySQL Cluster or a hybrid approach (i.e., HadoopDB), are needed to find a more

suitable system for research-sized dataset. I will be performing this task if time permits.

The objective of delivering an efficient software package that will assist graph analysis

research is successfully achieved.

Page 30: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

22

6.0 ACKNOWLEDGMENTS

I would like to thank Professor Matei Ripeanu for providing technical guidance for this

project. I would also like to thank Elizeu Santos-Neto for his continuous support and his

insightful comments and feedback for this report.

Page 31: The Fred Kaiser Building 2332 Main Mall Vancouver BC Dear ...matei/496/OldProjects/2009.08-MapReduce-KevenChen.pdf · The Fred Kaiser Building 2332 Main Mall Vancouver BC V6T 1Z4

23

7.0 REFERENCES

[1] MySQL. http://www.mysql.com

[2] Hive. http://hadoop.apache.org/Hive/

[3] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on

Large Clusters, 2004

[4] Azza Abouzeid1, Kamil BajdaPawlikowski1, Daniel Abadi1, Avi Silberschatz1,

Alexander Rasin2, HadoopDB: An Architectural Hybrid of MapReduce and

DBMS Technologies for Analytical Workloads, 2009

[5] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns:

Elements of Reusable Object-Oriented Software, pp 107-127 Addison Wesley,

2005

[6] PC farm for ATLAS Tier 3 analysis. http://atlas-service-enews.web.cern.ch/atlas-

service-enews/2009/features_09/features_pcfarm.php

[7] Hadoop. http://hadoop.apache.org/

[8] http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/

Research/DataSets/PINTSExperimentsDataSets/index_html

[9] Santos-Neto et al. "Tracking User Attention in Collaborative Tagging

Communities <http://www.ece.ubc.ca/%7Eelizeus/CAMA07_elizeu.pdf>". In

CAMA'2007.

[10] Andrew Pavlo Erik Paulson Alexander Rasin

Daniel J. Abadi David J. DeWitt Samuel Madden Michael Stonebraker,

A Comparison of Approaches to Large-scale Data Analysis, 2009d