DATA COLLECTION AND CLASSIFICATION TO FIND … · menghasilkan output yang sama ada lawatan...

DATA COLLECTION AND CLASSIFICATION TO FIND PRODUCTIVITY OF USER BY USING NAÏVE BAYES ALGORITHM MUHAMMAD MURSHID BIN RAMLAN BACHELOR OF COMPUTER SCIENCE (NETWORK SECURITY) UNIVERSITI SULTAN ZAINAL ABIDIN 2018


DATA COLLECTION AND CLASSIFICATION TO

FIND PRODUCTIVITY OF USER

BY USING NAÏVE BAYES ALGORITHM

MUHAMMAD MURSHID BIN RAMLAN

BACHELOR OF COMPUTER SCIENCE

(NETWORK SECURITY)

UNIVERSITI SULTAN ZAINAL ABIDIN

2018


DATA COLLECTION AND CLASSIFICATION TO FIND PRODUCTIVITY OF

USER

BY USING NAÏVE BAYES ALGORITHM

MUHAMMAD MURSHID BIN RAMLAN

Bachelor of Computer Science (Network Security)

Faculty of Informatics and Computing

Universiti Sultan Zainal Abidin, Terengganu, Malaysia

MAY 2018


DECLARATION

I hereby declare that this report is based on my original work except for quotations

and citations, which have been duly acknowledged. I also declare that it has not been

previously or concurrently submitted for any other degree at Universiti Sultan Zainal

Abidin or other institutions.

________________________________

Name : ..................................................

Date : ..................................................


CONFIRMATION

This is to confirm that:

The research conducted and the writing of this report were under my supervision.

________________________________

Name : ..................................................

Date : ..................................................


DEDICATION

In the Name of Allah, Most Gracious and Most Merciful. I am grateful because He has given me the strength to complete my final year project report, entitled "Data Collection and Classification to Find Productivity of User by Using Naïve Bayes Algorithm". I would like to express my sincere thanks and appreciation to my supervisor, Dr. Mohamad Afendee Bin Mohamed, for his guidance and understanding in imparting his knowledge and constructive comments during the course of this project. I would also like to express my gratitude to my beloved family and friends for their moral support and encouragement. Last but not least, I would like to thank everyone who contributed to my project and guided me throughout its preparation.


ABSTRACT

Internet use is increasing, including among people who work in office environments, since many companies provide computers for their employees. It is therefore hard to determine whether what we browse on the internet is productive or not. A further problem is that there is no system that tracks what the user is browsing. Furthermore, the admin cannot see the activities of the user/client, since the admin tends to have limited time to supervise all clients at one time. The admin also does not know how long users are really doing their job, so it is hard to tell productive from unproductive subordinates. This project is therefore carried out to design a good, user-friendly system that collects data and classifies it using a classifier algorithm. Data collection is based on the history log of the user's computer. The collected data is then classified to produce an output stating whether the user's browsing history is productive or not. The next objectives are to develop the Naive Bayes algorithm in the system and to test the algorithm on classifying the data. The expected output will be shown in the form of a pie chart. In conclusion, by developing this system, we can measure the productivity of an employee and of ourselves. We can then take further action, or reflect on ourselves and improve to become better and more productive.


ABSTRAK

Nowadays, internet use is increasing. This includes people who work in office environments, since most companies provide computers for their staff. It is therefore difficult for us to determine whether what we do on the internet is productive or not. A further problem is that there is still no system that can track users' internet browsing. In addition, the admin cannot see the activities of the user/client, since the admin tends to have limited time to supervise all clients at one time. The admin also does not know how long users are really doing their tasks, so it is hard to know which subordinates are productive and which are not. This project is therefore carried out to design a better, user-friendly system that collects data and classifies it using a classifier algorithm. We will collect data based on the visit log of the user's computer. We will then classify the collected data and produce an output stating whether the user's browsing visits are productive or not. The next objectives are to develop the Naive Bayes algorithm in the system and to test the algorithm on classifying the data. The expected result will be shown in the form of a pie chart. In conclusion, by developing this system, we can measure the productivity of employees and of ourselves. We can then take further action, or reflect on ourselves and improve to become better and more productive.


CONTENTS

DECLARATION
CONFIRMATION
DEDICATION
ABSTRACT
ABSTRAK
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER I INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Scopes
1.5 Limitation Of Work
1.6 Expected Result
1.7 Report Organization

CHAPTER II LITERATURE REVIEW
2.1 Introduction
2.2 Techniques
2.2.1 Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
2.2.2 A Survey of Online Activity Recognition Using Mobile Phones
2.2.3 A Goal-based Classification of Web Information Task
2.2.4 A Hybrid Learning System for Recognizing User Tasks from Desktop Activities and Email Messages
2.2.5 Self-Adaptive Attribute Weighting for Naive Bayes Classification
2.2.6 A Review Article On Naive Bayes Classifier With Various Smoothing Techniques
2.2.7 Efficient Manageability and Intelligent Classification of Web Browsing History Using Machine Learning
2.3 Summary Of Literature Review
2.4 Summary

CHAPTER III RESEARCH METHODOLOGY
3.1 Introduction
3.2 System Model
3.2.1 Requirement and Specification
3.2.2 Design
3.2.3 Development
3.2.4 Integration and Testing
3.2.5 Deployment of System
3.2.6 Maintenance
3.3 System Requirement and Specification
3.3.1 Hardware
3.3.2 Software
3.4 Framework
3.5 List Of Website
3.6 Summary

REFERENCES


LIST OF TABLES

TABLE TITLE

2.3 Summary Of Literature Review

3.3.1 Hardware

3.3.2 Software


LIST OF FIGURES

FIGURE TITLE

2.2.6 Naïve Bayes Classifier

3.2 Waterfall Model

3.4 Framework

3.4.3 Example of converted JSON format into readable XML format

3.4.4 History Export

3.4.5 Example of output after conversion into Excel format


LIST OF ABBREVIATIONS / TERMS / SYMBOLS

FYP Final year project

GA Genetic algorithm

HCI Human computer interface


LIST OF APPENDICES

APPENDIX TITLE

A Appendix 1

B Appendix 2

C Appendix 3

D Appendix 4


CHAPTER I

INTRODUCTION

1.1 Background

Data collection is the process of gathering and measuring information on targeted variables in an established, systematic way, which then enables one to answer relevant questions and evaluate the outcomes. Data collection is a component of research in all fields of study, including the social sciences, humanities, and business. The goal of all data collection is to capture quality evidence that allows analysis to lead to convincing and credible answers to the questions posed. In this system, the data is collected from the browser history log of the user's computer and classified according to the categories set for the Naïve Bayes algorithm. The result shows whether the user's browsing activities are productive or unproductive.
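As a rough illustration of how a browser history record could be prepared for classification, the sketch below turns one record into word tokens. The function name `history_tokens` and the choice of using the host name plus the page title as features are assumptions made for this example, not the design used in the project.

```python
import re
from urllib.parse import urlsplit


def history_tokens(url: str, title: str) -> list[str]:
    """Turn one history record into lowercase word tokens (hypothetical helper)."""
    # hostname is None for strings that are not absolute URLs, so fall back to "".
    host = urlsplit(url).hostname or ""
    text = f"{host} {title}".lower()
    # Keep runs of letters and digits; dots and punctuation act as separators.
    return re.findall(r"[a-z0-9]+", text)


print(history_tokens("https://www.example.com/watch?v=1", "Funny Cat Video"))
```

Tokens of this kind could then serve as the features that a Naïve Bayes classifier scores against the "productive" and "unproductive" categories.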

Bayesian classification is a supervised learning method as well as a statistical method for classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the model in a principled way by determining the probabilities of the outcomes. It can solve diagnostic and predictive problems. The classification is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined, and it offers a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in the input data [6].
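For reference, Bayes' theorem and the resulting Naïve Bayes decision rule can be written as follows. The symbols are introduced here for illustration only: $c$ is a class (for example productive or unproductive) and $x_1, \dots, x_n$ are the observed features of one record.

```latex
% Bayes' theorem for a class c given features x_1, ..., x_n
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}

% Naive assumption: features are conditionally independent given the class,
% so the predicted class is the one that maximizes prior times likelihoods
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator is the same for every class, which is why it can be dropped when only the most probable class is needed.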


1.2 Problem Statement

The problems that motivate the development of this program are:

i) The admin cannot see the activities of the user/client, since the admin tends to have limited time to supervise all clients at one time.

ii) The admin does not know how long users are really doing their job, so it is hard to tell productive from unproductive subordinates.

1.3 Objectives

The objectives of this program are:

i) To study the activities carried out by the target user and to categorize them as productive or not.

ii) To develop the Naive Bayes algorithm in the system.

iii) To test the algorithm on classifying the data.


1.4 Scope

This system involves the user and the admin.

Admin:

i) Monitor and analyze the information from the client.

ii) Request data from the user/client.

User:

i) The target user, who gives their data to the admin for further analysis.

ii) Send the requested information to the admin.

1.5 Limitation Of Work

Every system has its limitations. The limitations of this system are:

i) The application has very limited functionality: it can only classify the data obtained from the user/client.

ii) The system cannot detect browsing done in incognito windows.

iii) The system cannot yet determine whether the history log has been deleted from the browser.


1.6 Expected Result

i) A system that helps users know how productive they are while browsing the internet.

ii) A system that helps an employer know the performance of his or her subordinates.

iii) A system that is easy to use.


1.7 Report Organization

This report consists of five chapters. Each chapter serves a different purpose in describing the project.

Chapter 1: Introduction

This chapter describes and defines the data collection application. It also presents the objectives and the project scope.

Chapter 2: Literature Review

This chapter presents the related research papers. The different technologies are compared in a table. The chapter describes research on existing systems, analyzes their difficulties and other problems for improvement, and studies the methods, techniques, equipment, and appropriate technologies needed to develop the application.

Chapter 3: Methodology

This chapter presents the type of methodology implemented in this application, the technique or algorithm proposed and implemented, and the framework of data collection. The methodology acts as a guide for the development process and helps to make sure the project runs smoothly as planned. The chapter also includes the system requirements and specifications that will be used to assist the development of the project.

Chapter 4: Implementation and Result

This chapter covers implementation and testing: how the application was developed, how the method or algorithm was implemented, and how the application was tested.

Chapter 5: Conclusion

This chapter discusses the results and draws the conclusion. It also describes the achievement of the expected results, together with expectations and suggestions for improving and enhancing the results of the proposed project.


CHAPTER II

LITERATURE REVIEW

2.1 Introduction

This section discusses the literature on the Naïve Bayes classifier as used in previous research. A literature review examines past and recent research in order to illustrate the research problem, its possible solutions, and the importance of seeking a solution. A literature review is not merely information gathering: for a chosen topic area, it shows an in-depth grasp of prior research linked to the research subject and summarizes it. It involves reading journals, articles, books, and research papers, and then analysing, summarizing, and evaluating the reading based on its connection to the project. It serves as a guideline that establishes the credibility of the project.


2.2 Techniques

Many technologies and techniques can be used to classify data. Their efficiency depends on the technology used and the environment in which it is applied, so an algorithm must be chosen to suit its setting. Below are some of the research and development efforts carried out by other developers, along with the different approaches they took to complete their projects.


2.2.1 Performance Analysis of Naive Bayes and J48 Classification Algorithm for

Data Classification

This article explains that classification is an important data mining technique with broad applications, used to classify the various kinds of data found in nearly every field of life. Classification assigns an item to one of a predefined set of classes according to the features of the item. The paper evaluates performance based on the correctly and incorrectly classified instances produced by the Naïve Bayes and J48 classification algorithms. The Naive Bayes algorithm is based on probability, and the J48 algorithm is based on a decision tree. Using the WEKA tool, the paper makes a comparative evaluation of the Naive Bayes and J48 classifiers on a bank dataset, aiming to maximize the true positive rate and minimize the false positive rate of defaulters rather than only achieving higher classification accuracy. The experimental results reported are classification accuracy, sensitivity, and specificity. On this dataset, the results show that the efficiency and accuracy of J48 are better than those of Naïve Bayes.
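The evaluation measures named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are made up for illustration and are not figures from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. the data contains
    at least one positive and one negative instance.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Illustrative counts only.
print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

A classifier can have high accuracy and still a poor true positive rate when one class is rare, which is why the paper looks at these measures separately.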


2.2.2 A Survey of Online Activity Recognition Using Mobile Phones

This paper explains that physical activity recognition using embedded sensors has enabled many context-aware applications in different areas, such as healthcare. Initially, one or more dedicated wearable sensors were used for such applications. Recently, however, many researchers have started using mobile phones for this purpose, since these ubiquitous devices are equipped with various sensors, ranging from accelerometers to magnetic field sensors. In most current studies, sensor data collected for activity recognition are analyzed offline using machine learning tools. There is now a trend towards implementing activity recognition systems on these devices in an online manner, since modern mobile phones have become more powerful in terms of available resources such as CPU, memory and battery. Research on offline activity recognition has been reviewed in detail in several earlier studies, but work on online activity recognition is still in its infancy and has yet to be reviewed. The authors review the studies done so far that implement activity recognition systems on mobile phones using only their on-board sensors. They also discuss various aspects of these studies, their limitations, and recommendations for future research.


2.2.3 A Goal-based Classification of Web Information Task

This paper observes that people now conduct research using search engines and online library portals, read the daily news and their favourite comics online, communicate with others increasingly through email and blogs, and have become accomplished fact checkers thanks to Google. However, the authors find that researchers still lack a solid understanding of the types of activities and tasks in which users engage on the Web. There are several reasons for this lack of understanding. First, the Web is a moving target that is continually changing and evolving. As an example, the typical user has changed substantially since the early 1990s, when the average web user was a young, technically inclined male (Hawkey and Inkpen, 2005b). Second, the Web now supports a much wider range of activities and uses. Examples include the increase in web-based email, new sophisticated web-based travel and map applications, and the popularity of online support and blog communities.


2.2.4 A Hybrid Learning System for Recognizing User Tasks from Desktop

Activities and Email Messages

This paper builds on an application named TaskTracer, a system that seeks to help multi-tasking users manage the resources that they create and access while carrying out their work activities. It does this by associating with each user-defined activity the set of files, folders, email messages, contacts, and web pages that the user accesses when performing that activity. The initial TaskTracer system relies on the user to notify the system each time the user changes activities. However, this is burdensome, and users often forget to tell TaskTracer what activity they are working on. The paper introduces TaskPredictor, a machine learning system that attempts to predict the user's current activity. TaskPredictor has two components: one for general desktop activity and another specifically for email. TaskPredictor achieves high prediction precision by combining three techniques:

(a) feature selection via mutual information,

(b) classification based on a confidence threshold, and

(c) a hybrid design in which a Naive Bayes classifier estimates the classification confidence but the actual classification decision is made by a support vector machine.
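Techniques (b) and (c) can be illustrated with a small sketch in which a prediction is only reported when the Naive Bayes confidence is high enough. This is a simplified reading of the description above, not TaskPredictor's actual implementation: the function name `hybrid_predict`, its inputs, and the 0.9 threshold are all assumptions made for this example.

```python
from typing import Optional


def hybrid_predict(nb_probs: dict[str, float], svm_label: str,
                   threshold: float = 0.9) -> Optional[str]:
    """Return the SVM's label only when naive Bayes is confident enough.

    nb_probs  : class probabilities estimated by a naive Bayes model
    svm_label : label chosen by a separately trained SVM
    threshold : hypothetical confidence cut-off
    """
    confidence = max(nb_probs.values())
    # Below the threshold the system abstains instead of guessing.
    return svm_label if confidence >= threshold else None


print(hybrid_predict({"report": 0.95, "email": 0.05}, svm_label="report"))
print(hybrid_predict({"report": 0.55, "email": 0.45}, svm_label="report"))
```

Abstaining on uncertain cases trades coverage for precision: fewer predictions are made, but those that are made are more likely to be right.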


2.2.5 Self-Adaptive Attribute Weighting for Naive Bayes Classification

This paper proposes a new Artificial Immune System (AIS) based self-adaptive attribute weighting method for Naive Bayes classification. The proposed method, named AISWNB, uses immunity theory in artificial immune systems to search for optimal attribute weight values, where self-adjusted weight values alleviate the conditional independence assumption and help calculate the conditional probability accurately. One noticeable advantage of AISWNB is that its immune-system-based evolutionary computation process, including initialization, clone, selection, and mutation, ensures that AISWNB can adjust itself to the data without explicit specification of the functional or distributional form of the underlying model. As a result, AISWNB can obtain good attribute weight values during the learning process. Experiments and comparisons on 36 machine learning benchmark data sets and six image classification data sets demonstrate that AISWNB significantly outperforms its peers in classification accuracy, class probability estimation, and class ranking performance.
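A common way to write attribute-weighted Naive Bayes is shown below. This is the general form used in the attribute weighting literature and is given here as an assumption; the exact formulation in the AISWNB paper should be checked against the paper itself. The symbol $w_i$ denotes the weight of attribute $i$.

```latex
% Attribute-weighted naive Bayes: each conditional probability is raised
% to the power of its attribute weight w_i
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)^{w_i}

% Setting w_i = 1 for every attribute recovers standard naive Bayes
```

A weight below 1 reduces the influence of an attribute on the decision, which is how weighting can soften the effect of the conditional independence assumption.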


2.2.6 A Review Article On Naive Bayes Classifier With Various Smoothing

Techniques

Figure 2.2.6 Naïve Bayes Classifier

This paper explains that Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. However, there are several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. The authors discuss five different versions of Naive Bayes and compare them on six new, non-encoded datasets that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which the authors make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories and emulate the varying proportion of spam and ham messages that users receive over time. The paper covers various aspects of the Naïve Bayes classifier and of smoothing techniques for the extraction of useful data, along with the criteria used to evaluate them.
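To show what smoothing does in practice, the sketch below implements a tiny multinomial Naive Bayes with add-alpha (Laplace) smoothing. It is a minimal illustration under stated assumptions, not code from the reviewed paper: the function names, the toy training data, and the default `alpha=1.0` are all chosen for this example.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (tokens, label). Returns (class counts, word counts per class, vocabulary)."""
    class_counts, word_counts, vocab = Counter(), {}, set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def log_posterior(tokens, label, model, alpha=1.0):
    """Unnormalized log P(label | tokens) with add-alpha smoothing."""
    class_counts, word_counts, vocab = model
    counts = word_counts[label]
    denom = sum(counts.values()) + alpha * len(vocab)
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for t in tokens:
        # Without alpha, a word unseen in this class would give log(0).
        score += math.log((counts[t] + alpha) / denom)
    return score


def classify(tokens, model):
    """Pick the class with the highest smoothed log posterior."""
    return max(model[0], key=lambda label: log_posterior(tokens, label, model))


# Toy data for illustration only.
model = train([(["free", "prize"], "spam"),
               (["meeting", "notes"], "ham"),
               (["free", "meeting"], "ham")])
print(classify(["prize"], model), classify(["meeting"], model))
```

With `alpha` set to 0, any word that never appeared with a class would make that class's score undefined, which is the unseen-word problem that smoothing techniques address.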


2.2.7 Efficient Manageability and Intelligent Classification of Web Browsing

History Using Machine Learning

This paper presents a workable solution, implemented using machine learning and natural language processing techniques, for efficient manageability of the user's browsing history. The significance of adding such a capability to a Web browser is that it ensures efficient and quick information retrieval from browsing history, which is currently very challenging. The proposed solution aims to make any important website visited in the past easily accessible through intelligent and automatic classification.

In a nutshell, the paper provides an implementation as a browser extension that intelligently classifies the browsing history into the most relevant category automatically, without any user intervention. According to the authors, this guarantees that no information is lost and increases productivity by saving the time spent revisiting websites that were of much importance.


2.3 Summary Of Literature review

No. | Title | Author | Description

1 | Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification | Tina R. Patil, Mrs. S. S. Sherekar | This article explains how to classify data using Naive Bayes and compares it with the J48 decision tree. Its aim is to compare the two techniques.

2 | A Survey of Online Activity Recognition Using Mobile Phones | Muhammad Shoaib, Stephan Bosch, Ozlem Durmaz Incel, Hans Scholten and Paul J.M. Havinga | Activity recognition using embedded sensors has enabled many context-aware applications in different areas, such as healthcare. The paper reviews the studies done so far that implement activity recognition systems on mobile phones using only their on-board sensors.

3 | A Goal-based Classification of Web Information Tasks | Melanie Kellar, Carolyn Watters, Michael Shepherd | Based on their analysis of the tasks participants recorded during the field study, as well as previous research, the authors developed a goal-based classification of information tasks that describes user activities on the Web.

4 | A Hybrid Learning System for Recognizing User Tasks from Desktop Activities and Email Messages | Jianqiang Shen, Lida Li, Thomas G. Dietterich, Jonathan L. Herlocker | The paper builds on the TaskTracer system, which seeks to help multi-tasking users manage the resources that they create and access while carrying out their work activities. TaskTracer rests on two main ideas: the behavior of the user at the desktop is a mixture of different activities, and each activity is associated with a set of resources relevant to that activity.

5 | Self-Adaptive Attribute Weighting for Naive Bayes Classification | Jia Wu, Shirui Pan, Xingquan Zhu, Zhihua Cai, Peng Zhang, Chengqi Zhang | The paper searches for optimal attribute weight values, where self-adjusted weight values alleviate the conditional independence assumption and help calculate the conditional probability accurately. It proposes a new Artificial Immune System (AIS) based self-adaptive attribute weighting method for Naive Bayes classification.

6 | A Review Article On Naive Bayes Classifier With Various Smoothing Techniques | Gurneet Kaur, Er. Neelam Oberai | The paper notes that various classification methods have been developed, but the choice among them mainly depends on the type of data collection. Several classifiers are discussed. Few methods, such as Naive Bayes, perform well on both numerical and text data, while neural networks handle both discrete and continuous data. KNN is a time-consuming method, and finding its optimal value is always an issue. A decision tree reduces complexity but fails to handle continuous data. Naïve Bayes, along with its simplicity, is also computationally cheap. The second section of the paper discusses the Naïve Bayes classifier in detail. One of the major drawbacks of Naïve Bayes is unseen words, which can be handled by applying smoothing techniques. The third section discusses various smoothing methods applied to Naïve Bayes and compares their performance.

7 | Efficient Manageability and Intelligent Classification of Web Browsing History Using Machine Learning | Suraj G., Sumantha Udupa U | This paper deals with ways of improving browser capability by efficiently managing browsing history. The endeavor is to provide technology solutions that enable, extend, and differentiate the transformation of a browser in maintaining the websites visited by users. This historical information is an essential part of our everyday operations, but its huge quantity and very poor organization make it difficult and time-consuming to retrieve according to the user's preferences. Web page browsing is one of the most important ways for people to obtain information. Every visit has some motivation and reflects a certain interest of the user. Managing the browsing history also helps in developing web personalization.

Table 2.3 Summary Of Literature Review
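The unseen-word problem and the smoothing fix described in the review by Kaur and Oberai can be illustrated with a short sketch. It is written in Python for brevity (the project itself is developed in Java) and is only an illustration under stated assumptions, not the method of any paper in the table: the toy page titles, the two labels and the function names `train`, `log_posterior` and `classify` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count words per class. docs is a list of (list_of_words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def log_posterior(words, label, model, alpha=1.0):
    """Unnormalised log P(label | words) with add-alpha (Laplace) smoothing.

    With alpha=0 a word never seen in the class has probability zero,
    so the whole score collapses to minus infinity.
    """
    class_counts, word_counts, vocab = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    total_words = sum(word_counts[label].values())
    for w in words:
        num = word_counts[label][w] + alpha
        den = total_words + alpha * len(vocab)
        if num == 0:
            return float("-inf")
        score += math.log(num / den)
    return score


def classify(words, model, alpha=1.0):
    """Pick the label with the highest smoothed score."""
    return max(model[0].keys(),
               key=lambda c: log_posterior(words, c, model, alpha))


# Hypothetical training data: words from page titles with a hand-given label.
training = [
    (["python", "tutorial", "course"], "productive"),
    (["research", "paper", "tutorial"], "productive"),
    (["free", "online", "games"], "unproductive"),
    (["chat", "rooms", "online"], "unproductive"),
]
model = train(training)
query = ["online", "tutorial", "course"]
```

Without smoothing (`alpha=0`), the query scores minus infinity for both classes, because each class is missing at least one of its words. With `alpha=1`, every word receives a small non-zero probability and `classify(query, model)` can still compare the two classes.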

2.4 Summary

This chapter reviewed past research on the types of classifier used, existing systems and applications, and articles from newspapers and websites. Studying the different techniques is important to ensure that the most suitable one is applied in the system. The review guides the development of the project and supports the goal of producing a new system that benefits its users.

Chapter 3

Research Methodology

3.1 Introduction

This chapter explains the methodology used to develop this project. A methodology keeps the project on the right path by guiding it to completion as planned. Different types of application call for different methodologies, so choosing a suitable one requires an understanding of the functionality of the application itself.

This project uses the waterfall model, which has several advantages. First, the model is simple and easy to understand and use. Second, it is easy to manage because of its rigidity: each phase has specific deliverables and a review process. Furthermore, the phases are processed and completed one at a time, so they do not overlap. Lastly, the waterfall model works well for smaller projects where the requirements are very well understood.

3.2 System Model

Figure 3.2 Waterfall Model

3.2.1 Requirement and Specification

During this phase, all possible requirements of the system to be developed are gathered from the client and documented in a requirement specification document.

3.2.2 Design

Design is the technical part of system development. In this phase, the requirement specifications from the first phase are studied and the system design is prepared. The design helps to specify the hardware and system requirements best suited to developing the system, and to define the overall system architecture.

3.2.3 Development

With inputs from the system design, the system is first developed in small programs called units, which are integrated in the next phase. Each unit is developed and tested for its functionality, which is referred to as unit testing. By this point the information has been gathered and the design has been created. The system is written in Java. This phase is successful if the code has no errors and follows all the specifications of the system.
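As an illustration of what testing a single unit can look like, the sketch below checks one small function in isolation. It is written in Python with the standard `unittest` module for brevity, whereas the project itself is written in Java. The function `visit_share` and its test cases are hypothetical examples, not part of the actual system.

```python
import unittest


def visit_share(category_visits, category):
    """Return the fraction of all visits that belong to one category."""
    total = sum(category_visits.values())
    if total == 0:
        return 0.0
    return category_visits.get(category, 0) / total


class VisitShareTest(unittest.TestCase):
    def test_share_of_known_category(self):
        # 3 of the 4 visits are academic.
        self.assertAlmostEqual(
            visit_share({"Academic": 3, "Gaming": 1}, "Academic"), 0.75)

    def test_empty_history(self):
        # An empty history must not cause a division by zero.
        self.assertEqual(visit_share({}, "Academic"), 0.0)


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(VisitShareTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test exercises one behaviour of the unit, so a failure points directly at the function that needs fixing before integration begins.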

3.2.4 Integration and Testing

All the units developed in the development phase are integrated into a system after each unit has been tested. After integration, the entire system is tested for faults, failures and errors. In this phase, the software is tested on real hardware to verify that it is built according to the specifications given by the client.

3.2.5 Deployment of System

When functional and non-functional testing is done, the product is ready to be used. The product is deployed in the customer environment or released into the market.

3.2.6 Maintenance

Once the system is in use, the code may later need to be changed at the customer's request if issues come up in the client environment. Better versions are also released to enhance the product. Maintenance is done to deliver these changes in the customer environment.

All these phases depend on each other, and progress is seen as flowing steadily downwards (like a waterfall) through the phases. The next phase is started only after the defined set of goals for the previous phase has been achieved and signed off, hence the name "Waterfall Model". If the previous phase is not settled, the next phase cannot begin, so the phases do not overlap.

3.3 System Requirement and Specification

System requirements, in both hardware and software, are needed to accomplish this project and to assist its development. These requirements are related to each other and together ensure that the system can be developed smoothly.

3.3.1 HARDWARE

No.  Hardware                  Description

1    Laptop (HP Pavilion g4)   Processor: Intel Core i5, 7th Generation; RAM: 6 GB; OS version: Windows 32/64-bit

2    Printer (HP inkjet)       Used for printing documents

Table 3.3.1 Hardware

3.3.2 SOFTWARE

No.  Software                   Description

1    Web Historian              Reads the browsing history/log from the computer.

2    History Export             Exports the history from the computer in JSON format.

3    Notepad++                  A source code editor and Notepad replacement that supports several languages. It runs in the MS Windows environment and its use is governed by the GPL License.

4    Google Chrome              Used to access the localhost server and run the web-based system.

5    Microsoft Word 2013        Used for word processing, such as creating and editing the report and documentation.

6    Microsoft PowerPoint 2013  Used to present the results and findings of this project.

7    Snipping Tool              Used to capture screenshots.

8    NetBeans IDE 8.0.2         Platform for writing the code used to develop the system.

9    CodeBeautify               Used to convert the JSON output into XML format.

10   Microsoft Excel 2013       Used to hold the converted data in a tabular form that the code can read.

11   Adobe Photoshop CS6        A software program that photographers, graphic designers, web designers, videographers, and 3D artists use to enhance and manipulate photos, making them clearer and more attractive.

12   MATLAB                     Platform for writing the code used to develop the system.

Table 3.3.2 Software

3.4 FRAMEWORK

Figure 3.4.1 Framework

Figure 3.4.2: Example Web Historian graph for today's activities.

Figure 3.4.3: Example of JSON format converted into readable XML format.

Figure 3.4.4: The three options available in History Export: last day, last week, and all history.

Figure 3.4.5: Example of output after conversion into Excel format.
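The figures above outline a pipeline in which the browsing history is exported as JSON, converted, and laid out as a table. A minimal sketch of that idea follows, written in Python for brevity (the project itself uses Java and converts through XML and Excel). It goes straight from JSON to CSV, which is a simplification rather than the procedure used in this project. The field names `url`, `title` and `visitCount`, the sample entries, and the function names are assumptions made for illustration, so the real export format may differ.

```python
import csv
import io
import json
from urllib.parse import urlparse

# Hypothetical sample of an exported history file (field names are assumed).
SAMPLE_EXPORT = """
[
  {"url": "https://www.khanacademy.org/math", "title": "Math", "visitCount": 3},
  {"url": "https://www.lazada.com.my/", "title": "Lazada", "visitCount": 1}
]
"""


def history_to_rows(json_text):
    """Turn exported history JSON into flat rows, one per visited page."""
    rows = []
    for entry in json.loads(json_text):
        host = urlparse(entry["url"]).hostname or ""
        rows.append({
            "host": host,
            "title": entry.get("title", ""),
            "visits": int(entry.get("visitCount", 1)),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, which a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["host", "title", "visits"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = history_to_rows(SAMPLE_EXPORT)
csv_text = rows_to_csv(rows)
```

Keeping only the host name of each URL is a design choice: it makes the later step of matching a visit against a list of known websites much simpler than working with full page addresses.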

3.5 LIST OF WEBSITES

Gaming:

• Kongregate

• MiniClip

• Addicting Games

• Armor Games

• Newgrounds

• Crazy Monkey Games

• PopCap

• Yahoo Games

• Bgames

Chatting:

• Second Life

• Paltalk

• IMVU

• Badoo.com

• Charmdate.com

• Enterchatroom

Academic:

• Khan Academy

• Tutorials Point

• Google Scholar

• Wikipedia

• Codecademy

• HTML Dog's Beginning HTML Guide

• Ruby on Rails Tutorial

• Mozilla Developer Network

Shopping:

• Lazada

• Zalora

• Next

• Uniqlo

• AliExpress

• 11 Street
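The website lists above can serve as a lookup table that assigns each visited host to a category. The sketch below shows one way to do this in Python (the project itself is written in Java). It rests on several assumptions: only a few sites from the lists are included, matching is done on a keyword inside the host name, and only the Academic category is treated as productive. That last mapping is a guess made for illustration, not something stated in this chapter.

```python
# Keyword -> category, taken from a subset of the lists above.
SITE_CATEGORIES = {
    "kongregate": "Gaming",
    "miniclip": "Gaming",
    "badoo": "Chatting",
    "imvu": "Chatting",
    "khanacademy": "Academic",
    "wikipedia": "Academic",
    "lazada": "Shopping",
    "zalora": "Shopping",
}

# Assumption: only academic browsing counts as productive.
PRODUCTIVE_CATEGORIES = {"Academic"}


def categorize(host):
    """Return the category whose keyword appears in the host name."""
    host = host.lower()
    for keyword, category in SITE_CATEGORIES.items():
        if keyword in host:
            return category
    return "Unknown"


def productive_share(hosts):
    """Fraction of categorised visits that fall in a productive category."""
    known = [c for c in map(categorize, hosts) if c != "Unknown"]
    if not known:
        return 0.0
    return sum(c in PRODUCTIVE_CATEGORIES for c in known) / len(known)
```

For example, a session that visits two academic sites, one shopping site and one gaming site has a productive share of 0.5. Category labels of this kind could also be used as features for the Naïve Bayes classifier.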

3.6 Summary

Methodology is very important in system and application development. Many different software development methodologies are available and can be used to develop any kind of application. The right methodology helps the project to be completed within the specified time. The activities in each phase of the chosen methodology have been explained in this chapter so that they can be understood easily.

REFERENCES

[1] https://www.finder.com/my/online-shopping

[2] https://www.upwork.com/blog/2014/03/10-best-web-development-tutorials-beginners/

[3] http://thegeekdesire.com/best-free-chat-rooms-to-make-new-friends.html

[4] https://www.quora.com/What-is-the-best-online-games-site

[5] Suraj G. and Sumantha Udupa U., Efficient Manageability and Intelligent Classification of Web Browsing History Using Machine Learning

[6] Tina R. Patil, Mrs. S. S. Sherekar, Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

[7] Muhammad Shoaib et al., A Survey of Online Activity Recognition Using Mobile Phones

[8] Melanie Kellar and Carolyn Watters, A Goal-Based Classification of Web Information Tasks

[9] Jianqiang Shen et al., A Hybrid Learning System for Recognizing User Tasks from Desktop Activities and Email Messages

[10] Jia Wu et al., Self-Adaptive Attribute Weighting for Naive Bayes Classification

[11] Gurneet Kaur and Er. Neelam Oberai, A Review Article on Naive Bayes Classifier with Various Smoothing Techniques