A Proposed Methodology for E-Business Intelligence Measurement Using Data Mining Techniques

Stavros Valsamidis, Ioannis Kazanidis, Sotirios Kontogiannis, Alexandros Karakos

PCI 2014

Outline

Introduction

Method

Results

Discussion

Limitations

Conclusions

Introduction (1/7)

E-business

Business Intelligence

Knowledge Data Discovery

Data Mining

Introduction (2/7)

E-business

E-business refers to any business that uses the Internet and related technologies. It is the conduct of business on the Internet: not only buying and selling, but also servicing customers and collaborating with business partners.

Intelligence

Luhn defined intelligence as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal".

Introduction (3/7)

Business Intelligence

Business Intelligence (BI) is the emerging discipline that aims at combining corporate data with textual user-generated content (UGC) to let decision-makers analyze their business based on the trends perceived from the environment.

Introduction (4/7)

Knowledge Data Discovery

The term Knowledge Discovery in Databases (KDD), also called Knowledge Data Discovery, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the "high-level" application of particular data mining (DM) methods.

Introduction (5/7)

Data Mining

The main goal of data mining is to search for relationships and distinct patterns that exist in datasets but are "hidden" among the vast amount of data.

Introduction (6/7)

Authors have proposed indexes and metrics for the usage of web applications.

There are no metrics specifically for measuring e-business usage in terms of BI.

This study contributes to the area of web usage analysis for e-business intelligence by 'marrying' e-business with data mining.

Four metrics are applied for the first time in the field of e-business.

Introduction (7/7)

This paper proposes an iterative method for designing and maintaining BI applications that reorganizes the activities and tasks normally carried out by practitioners.

The method is complemented by a case study in the consumer goods area, aimed at showing that adopting a structured methodology has a positive impact on project success.

Method

[Figure: flow diagram of the method, with the stages Logging data, Purchase records, Data pre-processing, Measures calculation, Data mining techniques, and E-business usage assessment]

Method – Steps (1/5)

Logging data: specific data are logged from e-business systems.

Specifically, twelve (12) fields (request_time_event, remote_host, request_uri, remote_logname, remote_user, request_method, request_time, request_protocol, status, bytes_sent, referer, agent) and user requests for different products.

Pre-processing: the data contain noise such as URLs, emoticons, and symbols like asterisks and hashes.
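As an illustration of the logging step, the sketch below parses one Apache access-log line into most of the fields named on this slide. It assumes the standard Apache "combined" log format, which the slides do not confirm; the sample line and the names LOG_PATTERN and parse_line are hypothetical. The field request_time_event has no direct counterpart in that format and is left out.

```python
import re

# Assumption: Apache "combined" log format. Group names follow the slide's field names.
LOG_PATTERN = re.compile(
    r'(?P<remote_host>\S+) (?P<remote_logname>\S+) (?P<remote_user>\S+) '
    r'\[(?P<request_time>[^\]]+)\] '
    r'"(?P<request_method>\S+) (?P<request_uri>\S+) (?P<request_protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no bytes were sent.
    record["bytes_sent"] = 0 if record["bytes_sent"] == "-" else int(record["bytes_sent"])
    return record


# Hypothetical log line, made up for illustration only.
SAMPLE = ('192.0.2.10 - - [10/Oct/2014:13:55:36 +0300] '
          '"GET /product/PID105/specs.html HTTP/1.1" 200 2326 '
          '"http://shop.example/" "Mozilla/5.0"')

record = parse_line(SAMPLE)
print(record["request_uri"], record["status"])
```

Lines that do not match the pattern return None, so malformed entries can be dropped during pre-processing.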

Method – Steps (2/5)

Indexes, metrics and rates:

Attribute name    Description of the attribute
Sessions          The number of sessions per product viewed by users
Pages             The number of pages per product viewed by users
Unique pages      The number of unique pages per product viewed by users
UPPS              Unique Pages per ProductID per Session: the number of unique pages per product viewed by users per session
Homogeneity       The homogeneity of products
Enrichment        The enrichment of products
Disappointment    The disappointment of users when they view pages of the products
Interest          The complement of Disappointment to one (1 - Disappointment)
Mean rate         The mean rate of usage, combining Enrichment, Homogeneity and Interest
Score             The score of the product usage

Method – Steps (3/5)

Indexes, metrics and rates:

Enrichment = 1 - (Unique Pages / Total Pages)

Disappointment = Sessions / Total Pages

Interest = 1 - Disappointment

Homogeneity = Unique Pages / Total Sessions

Mean rate = (Enrichment + Homogeneity + Interest) / 3

Score = Mean rate * UPPS
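The formulas on this slide can be checked against a row of the results table. The following is a minimal sketch, assuming the raw counts per product are already available; the names ProductUsage and usage_metrics are illustrative and not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class ProductUsage:
    sessions: int      # Sessions
    pages: int         # Pages (total pages viewed)
    unique_pages: int  # Unique pages
    upps: int          # Unique Pages per ProductID per Session


def usage_metrics(p: ProductUsage) -> dict:
    """Apply the slide's formulas to one product."""
    enrichment = 1 - p.unique_pages / p.pages
    disappointment = p.sessions / p.pages
    interest = 1 - disappointment
    homogeneity = p.unique_pages / p.sessions
    mean_rate = (enrichment + homogeneity + interest) / 3
    return {
        "enrichment": enrichment,
        "homogeneity": homogeneity,
        "disappointment": disappointment,
        "interest": interest,
        "mean_rate": mean_rate,
        "score": mean_rate * p.upps,
    }


# Raw counts of product PID105, taken from the results table.
pid105 = ProductUsage(sessions=94, pages=299, unique_pages=12, upps=218)
metrics = usage_metrics(pid105)
print({name: round(value, 3) for name, value in metrics.items()})
```

Rounded to three decimals, the values agree with the PID105 row of the results table (Enrichment 0.960, Homogeneity 0.128, Disappointment 0.314, Interest 0.686, Mean rate 0.591, Score 128.849).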

Method – Steps (4/5)

Data mining techniques:

Data mining techniques are applied so that the relevant data can be analyzed. Classification, clustering and association rule mining are used, based on the metrics of the third step.

In the classification step, the 1R algorithm may be applied.

The clustering step includes product clustering, which is based on the Purchases attribute.

Clustering of user visits is performed with the k-means algorithm.

Method – Steps (5/5)

Data mining techniques:

Association rule mining finds relationships among attributes in databases, revealing if-then statements about attribute values.

An association rule X → Y shows a close correlation among items in a database: in the transactions in which X occurs, there is also a high probability of Y occurring. X and Y are named the antecedent and the consequent of the rule, respectively.
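The two measures usually attached to a rule X → Y, support and confidence, can be sketched as follows. The transactions are hypothetical and serve only to show the arithmetic; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule cannot be evaluated
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical transactions: one set of attribute=value items per product.
transactions = [
    {"Score=high", "Purchases=high"},
    {"Score=high", "Purchases=high"},
    {"Score=high", "Purchases=low"},
    {"Score=low", "Purchases=low"},
]

rule_support = support(transactions, {"Score=high", "Purchases=high"})
rule_confidence = confidence(transactions, {"Score=high"}, {"Purchases=high"})
print(rule_support, rule_confidence)
```

Here the rule Score=high → Purchases=high holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions that contain the antecedent (confidence 2/3).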

Results (1/6)

Study population and context

The data of the 40 products are ranked in descending order of the Score column; ten rows are shown.

Product ID  Sessions  Pages  Unique pages  UPPS  Enrichment  Homogeneity  Disappointment  Interest  Mean rate  Score    Purchases
PID105      94        299    12            218   0.960       0.128        0.314           0.686     0.591      128.849  58
PID35       89        339    9             182   0.973       0.101        0.263           0.737     0.604      109.930  54
PID132      158       235    8             198   0.966       0.051        0.672           0.328     0.448      88.720   53
PID36       76        219    8             134   0.963       0.105        0.347           0.653     0.574      76.903   61
PID129      78        211    7             132   0.967       0.090        0.370           0.630     0.562      74.224   48
PID125      96        166    9             136   0.946       0.094        0.578           0.422     0.487      66.242   49
PID41       101       188    9             132   0.952       0.089        0.537           0.463     0.501      66.176   57
PID66       59        148    9             109   0.939       0.153        0.399           0.601     0.564      61.515   34
PID17       55        221    12            92    0.946       0.218        0.249           0.751     0.638      58.727   38
PID111      35        144    9             81    0.938       0.257        0.243           0.757     0.651      52.693   24

Results (2/6)

Data pre-processing and calculation of the metrics and rates

The data are in ASCII form and are obtained from the Apache server log file.

Application of data mining techniques

The attributes of the table were imported into Weka in .csv format.

The attributes Product ID and Disappointment were removed: Product ID is different for each instance, and Disappointment is the complement of the Interest attribute. All the remaining attributes were discretized.
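Discretization turns each numeric attribute into a small set of labelled intervals. The slides do not state which binning scheme or how many bins were used, so the sketch below assumes equal-width binning with three bins; the function name and the labels are illustrative. It is applied to the ten Score values shown in the results table.

```python
def equal_width_bins(values, labels=("low", "medium", "high")):
    """Assign each value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: a single bin
        else:
            # The maximum falls on the upper edge, so clamp it into the last bin.
            index = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


# Score column of the ten table rows shown in the results.
scores = [128.849, 109.930, 88.720, 76.903, 74.224,
          66.242, 66.176, 61.515, 58.727, 52.693]
score_bins = equal_width_bins(scores)
print(list(zip(scores, score_bins)))
```

With these ten values, the two highest scores fall into "high", the third into "medium", and the remaining seven into "low".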

Results (3/6)

Classification

In the classification step, the 1R algorithm is applied.

The Purchases attribute is used as the class.

Score is the attribute that best describes the classification.
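1R builds one rule per attribute (each attribute value predicts its most frequent class) and keeps the attribute whose rule makes the fewest errors. The following is a minimal sketch of that idea on hypothetical discretized rows; it is not Weka's implementation, and the data are made up for illustration.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (best_attribute, errors, rule) for the 1R classifier."""
    best = None
    for attr in attributes:
        # Count the class distribution for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best


# Hypothetical discretized products (not the study's data).
rows = [
    {"Score": "high", "Sessions": "high", "Purchases": "high"},
    {"Score": "high", "Sessions": "low", "Purchases": "high"},
    {"Score": "low", "Sessions": "high", "Purchases": "low"},
    {"Score": "low", "Sessions": "low", "Purchases": "low"},
    {"Score": "medium", "Sessions": "high", "Purchases": "high"},
]

best_attribute, best_errors, best_rule = one_r(rows, ["Score", "Sessions"], "Purchases")
print(best_attribute, best_errors, best_rule)
```

In this toy data the rule built on Score classifies every row correctly, while the rule built on Sessions misclassifies two rows, so Score is selected.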

Results (4/6)

Clustering

The clustering step consists of product clustering based on the Purchases attribute, using the SimpleKMeans algorithm.
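SimpleKMeans is Weka's k-means implementation. The sketch below is a plain one-dimensional k-means, not Weka's code: it uses a deterministic initialisation, and the number of clusters (k = 2) is an assumption, since the slides do not state it. The input is the Purchases column of the ten table rows shown in the results.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster numbers into k groups; returns (centroids, clusters)."""
    ordered = sorted(values)
    step = max(k - 1, 1)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [float(ordered[i * (len(ordered) - 1) // step]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


# Purchases column of the ten table rows shown in the results.
purchases = [58, 54, 53, 61, 48, 49, 57, 34, 38, 24]
centroids, clusters = kmeans_1d(purchases, k=2)
print(centroids, clusters)
```

With k = 2, the products with 24, 34 and 38 purchases form one cluster and the seven products with 48 to 61 purchases form the other.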

Results (5/6)

Association rule mining

The Apriori algorithm was used to find association rules over the discretized data.

Because the attributes Enrichment, Interest and Homogeneity obviously depend on the attributes Sessions, Pages and Unique Pages, the former group of attributes was removed from the data table.

Weka lists 6 rules, obtained with a minimum support of 0.1 for the antecedent together with the consequent (relative to the total number of items) and a minimum rule confidence of 0.9.
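The level-wise search behind Apriori can be sketched compactly. The code below is a simplified illustration rather than Weka's implementation, and it runs on hypothetical transactions; only the thresholds (minimum support 0.1, minimum confidence 0.9) are taken from the slide.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        found = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                found[c] = s
        return found

    level = frequent({frozenset([item]) for t in transactions for item in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


def rules(frequent_sets, min_confidence):
    """Return (antecedent, consequent, confidence) for all confident rules."""
    found = []
    for itemset, supp in frequent_sets.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                conf = supp / frequent_sets[antecedent]
                if conf >= min_confidence:
                    found.append((antecedent, itemset - antecedent, conf))
    return found


# Hypothetical discretized transactions (not the study's data).
transactions = [
    {"Score=high", "Purchases=high"},
    {"Score=high", "Purchases=high"},
    {"Score=low", "Purchases=low"},
    {"Score=low", "Purchases=low"},
    {"Score=low", "Purchases=high"},
]

found_rules = rules(apriori(transactions, min_support=0.1), min_confidence=0.9)
for antecedent, consequent, conf in found_rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

On this toy data two rules pass both thresholds: Score=high → Purchases=high and Purchases=low → Score=low. Their reversed counterparts have a confidence of only 2/3 and are filtered out.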

Results (6/6)

Association rule mining

Some rules are uninteresting, such as rule 1.

Some rules are similar: they have the same elements in the antecedent and the consequent, but interchanged, such as the rule pairs (3, 4) and (5, 6).

The rules indicate that the purchases of the products depend on their scores.

Discussion (1/2)

The indication that many pages within useful paths contribute to increased usage is fairly obvious. The more and better content a site has, the more a user might visit it. So administrators should add useful and helpful pages to a site.

If a site is essentially blank but customers are required to visit it every day and contribute a comment, usage will necessarily be high. On the other hand, if a very elaborate web site with rich content is not required reading, limited usage of the site would be expected.

Discussion (2/2)

Rule 2 is highly actionable for administrators, since they can pay more attention to the products with low values of Score and Sessions.

An increase in sessions means that more users (customers) are using the e-business system.

Of course, it cannot be denied that a certain number of customers only read the product information just before making their purchases.

Limitations

The fact that only 40 products in one e-business system were investigated is a limitation of the study, especially for data mining techniques, which demand large datasets.

However, this was unavoidable, since the e-business system of the case study had this number of active online products.

Conclusions (1/3)

The proposed iterative method uses existing tools and techniques in a novel way to perform e-business systems usage analysis.

The metrics enrichment, homogeneity, disappointment and interest are used.

It incorporates clustering, classification and association rule mining.

Conclusions (2/3)

Advantages

I. It is independent of a specific e-business system, since it is based on the Apache log files and not on the e-business system itself. Thus, it can be easily implemented for every e-business system.

II. It uses indexes and metrics in order to facilitate the evaluation of each product.

III. It offers a company useful information for determining which parts of its web site to improve.

Conclusions (3/3)

I. This approach may be applied after a long period of data tracking.

II. The proposed approach may also be applied to other web applications such as e-government, e-learning, e-banking, blogs, social networks, etc.

Thank You!

Stavros Valsamidis, Ioannis Kazanidis, Sotirios Kontogiannis, Alexandros Karakos

TEI of Kavala

Kavala, Greece