Personalized Web recommendation based on path clustering

Personalized Web recommendation based on path clustering

Yijun Yu, Huaizhong Lin, Chun Chen Computer Institute Zhejiang University

Yu Gu Road 38,Hangzhou,Zhejiang,310027 China

Abstract: - Each user accesses a Web site with certain interests. The interest can be manifested by the sequence of each Web user access. The access paths of all Web users can be clustered. The effectiveness and efficiency are two problems in clustering algorithms. This paper provides a clustering algorithm for personalized Web recommendation. It is path clustering based on competitive agglomeration (PCCA). The path similarity and the center of a cluster are defined for the proposed algorithm. The algorithm relies on competitive agglomeration to get best cluster numbers automatically. Recommending based on the algorithm doesn't disturb users and needn't any registration information. Experiments are performed to compare the proposed algorithm with two other algorithms and the results show that the improvement of recommending performance is significant. Key-Words: - Competitive agglomeration; Path clustering, User interest, Personalized, Web recommendation 1 Introduction

Currently World Wide Web is developing rapidly. The managers of Web sites need good auto-subsidiary tool, which can adjust the configuration of Web page dynamically according to the user's interest. Thus they can improve service and develop peculiar electronic business to satisfy the demands of visitors much better. For the visitors, what they want is a characteristic Web page with much better service that can satisfy various demands and they also expect to receive an illumination from other users who have a similar access interest. From some points of view, visitors are also not very clear about these demands. Moreover, an important method to solve these two demands is to use Web Usage mining to recommend Web personalized page.

The term 'Web Usage Mining'[1] was introduced by Cooley et al., in 1997, in which they define Web usage mining as the 'automatic discovery of user access patterns from Web Servers'. Web Usage mining has gained much attention in the literature as a potential approach to fulfill the requirement of Web personalization. Personalized Web recommendation is one form of Web personalization[2] that could find important applications in e-business (such as google.com and Amazon.com) and e-learning sectors. Web pages are recommended to a user according to user interest that anticipates the user's needs. In this paper, we focus on the personalized Web recommendation of Web pages that are adapted according to user interest. User access pattern is important in personalized Web recommendation system. The drawbacks of previous methods[3,4] are that not it ignores the facts that the

available element will be close to zero with the increase of the user’s access time and the number of clustering is given by people, which can not dynamically adjust according to personal access. Most of the research in personalized Web recommendation recently is collaborative filtering [5,6], which needs extra information to distinguish personal activities. However, this input may have some errors and may be outdated. Authors in Paper [7,8] use K-means clustering to recommend Web personalized page, which does not consider the relationship between the user’s personality and the access path. The algorithm presented in this paper can overcome these drawbacks. 2 Path clustering based on competitive agglomeration

Path clustering based on competitive agglomeration (PCCA) is a partition algorithm, not a hierarchical clustering algorithm. The algorithm clusters according to path similarity, improves K-paths algorithm [9] , and can get best cluster numbers automatically by competitive agglomeration method.

2.1 Definitions

We should consider user access sequence of Web page in clustering. Web users access Web page according to their interest, of course, their interest is different. So, the sequence can present user access interest.

Proceedings of the 10th WSEAS International Conference on COMPUTERS, Vouliagmeni, Athens, Greece, July 13-15, 2006 (pp148-154)

Definition 1 The user access transaction t is defined as:

t=< 1 .tl url ,…, .trl url > , (1)

Here, L is the user’s access log set, l∈L, .til url is

hyperlink address of the Web page. For urls in current accessing Web page, people

always have been more interested in url accessed earlier than url accessed later. It is obviously that there is relation between user interest and user access path. So, we have user interest definition as follow. Definition 2 The user’s interest is defined as:

I( .til url )=m-i+1，1 i m≤ ≤ (2)

Here, .til url is hyperlink address of the Web page,

m is the number of Web page in Web site. It is obviously that there is relation as below.

I( 1 .tl url )>I( 2.tl url )>…> I( .til url ), 1 i m≤ ≤

Definition 3 Given two user access transaction t and s, the similarity between them sim(t,s) is defined as:

sim(t,s)=( . ) ( . )

( . ) ( . )

t si i

t si i

I l url I l url

I l url I l url

∩

∪ , 1 i m≤ ≤ (3)

Here, I(·)is the user’s interest of accessing Web page. Definition 4 Considering user access interest, the user access transaction t can be defined as:

t=< 1tu , 2

tu ,…, tiu > , 1 i m≤ ≤ (4)

Where, ( . ) ( . )

0

t tt j ji

I l url if l url tu

otherwise ∈

=

Here, L is the user’s access log set, l∈L, .til url is

hyperlink address of the Web page. We cluster the matrix of t. Definition 5 Cluster center center ( )C c is the longest public similarity path, which is the most typical path started from the starting point. The result set of the cluster is:

C={ 1c , 2c ,…, kc }， 1 k n≤ ≤ (5) For every cluster c∈C, the number of transaction t

belonging to cluster c is n, therefore the definition of cluster centers is:

center ( )C c =< '

1.l url , '

2 .l url ,…, ' .kl url >, (6) Where

'. max( ( . ))), ,. arg ( kkk l url count l url t cl url ∈=

' '

1 1. , . { . , ..., . }.k k kl url t l url l url l url−

∈ ∉ Here, count( .kl url )is the number of

different .kl url .

Given a user access transaction set T, every row is composed of user transaction, there are n transactions. The algorithm divides T into k clusters to make the total sum of the center similarity of the transaction the smallest. We can get the cluster center of the every cluster by formula (5). This process can be described as such mathematic problem:

Minimize:

center1 1

( , , ,) sim( ( ))k n

ij jj i

P W T C w t C c= =

= ∑∑ ；(7)

Satisfy:

11 0 1,..., , 1,...,, , .

k

ij ijj

w w i n j k=

= == ≥∑ (8)

Here, W is a division matrix of n k× , C={ 1c ，

2c …， nc }is the result set of the clustering, and sim(·,·) is the similarity of two paths defined in definition 3.

2.2 Competitive agglomeration

The principle of the competitive agglomeration [10] is as follows. It divides the data set into many small clusters, as the algorithm is processing, the neighboring clusters will compete with each other. The cluster who loses the competition will become smaller until it vanishes, which makes the number of the clusters to be an optimized number. Because formula (7) is combined with competitive agglomeration, the objective function is defined as:

center1 1 1 1

( , , ) sim( , ( ))k n k n

ij j ijj i j i

P W T C w t C c wη= = = =

= −∑∑ ∑∑Where:

1

1 0 {1,..., }, {1,..., }, , .k

ij ijj

w w i N j k=

∈= ≥ ∈∑ (9)

The number of the cluster K in competitive agglomeration is refreshed dynamically in formula (9). The right side of the equation is used to control the scale of the cluster to get a compact cluster. When the number of cluster equals to the number of the transactions, which means every transaction becomes a cluster, the first item becomes minimal. The second item of the equals is the total sum of the number of the cluster base, which is used to control the number of the cluster. When all the transactions are put into one cluster and other clusters are empty, the second item becomes minimal. The combination of the two items and the selection of an appropriate η will minimize the similarity of the path, and put the data set into the cluster which has the smallest number. Lagrange coefficient is used in formula (9) to get the membership grade min

stw [10] , which can minimize the


formula (9). minstw = KP

stw + Biasstw ， (10)

Here, minstw is the minimization of membership

grade; KPstw is the membership grade of the K-paths;

Biasstw is the deviation of the membership grade.

KPstw =

1center center

1 1/( )

sim ( , ( )) sim ( , ( )),

k

js jt C c t C c=∑ (11)

Biasstw =

center

( )sim( , ( ))

,s tjC

N Nt c

η− (12)

sN is the base of the cluster sc in formula (12).

1

N

s sjm

N w=

=∑ ， (13)

tN is defined as：

1 1center center

1 1( ) /( ).

sim( , ( )) sim ( , ( ))

k k

t jj jj j

N Nt C c t C c= =

= ∑ ∑ (14)

Formula (14) is a simple weighted average. The weight of the every cluster means how close all the transactions of that cluster are to the cluster center.

In formula (14), the first item of the right hand of KPstw has the same membership grade with the

k-paths algorithm. They only consider the relative distance of the path similarity. The second item of the right hand is a deviation with a symbol. The cluster is positive when the base of the cluster is larger than the average base, therefore the membership grade will increase. On the contrary, the cluster is negative when the base of the cluster is smaller than the average base, therefore the membership grade will decrease. When the base of the cluster is below the threshold, the cluster will be discarded and the number of the clusters will be refreshed. As a result, the fake cluster will vanish gradually. Because the number of cluster is larger than the number really needed, every real cluster is divided into many small clusters. By the process of the algorithm, the second item expands every cluster, including data object as much as possible. In the mean time, the restriction in formula (11) makes the neighboring clusters compete with one another and only a small number of the clusters win in the competition while the others become smaller and smaller until they finally vanish.

In the process of competitive agglomeration, it is very important to choose appropriate η in formula (9), which shows that how important the second item on the right side is to the first item. In order to make the competitive agglomeration independent of the center similarity of the cluster, the value of η should have some relation with these two items:

center

1 1

1 1

sim( , ( ))k n

ij ii j

k n

iji j

w t C c

wη = =

= =

∝∑∑

∑∑, (15)

Using the value of η in paper[10]:

center1 1

1 1

sim( , ( ))

( ) ( )

k n

ij ii j

k n

iji j

w t C c

k kw

η ω = =

= =

=∑∑

∑∑, (16)

Here, η is a function with k iteration, ω is defined as follows [10] :

/

0( ) e kk τω ω −= . (17) Here, 0ω is an initialization value, τ is a timed constant.

The new cluster algorithm can know the number of the clusters automatically. When we do search in personalized Web recommendation, the threshold we set at first is lager than the numbers of clusters really needed, using the competitive agglomeration to decrease the numbers of the cluster until getting an optimal number. Furthermore, we can put the noisy data into a noisy cluster to reduce the bad impact on the performance of the cluster and thus enhance the result of the cluster.

2.3 Problem solving

Problem P can be solved by solving the following two sub-questions based on competitive agglomeration method:

In first sub-question, the fixed variable C is used and marked as C .The question has been changed to P(W,T, C ), where the cluster center is fixed and the partition matrix is changed. In sub-question 1, the value of W is according to the similarity between the user’s access transaction and the cluster center.

In second sub-question, the fixed variable W is used and marked as W .Solve the changed problem P(W ,T,C), where partition matrix is fixed and cluster center is changed. To solve sub-question 2, we can use method used to get the cluster center in the formula (5) in definition 5.

So the PCCA algorithm to solve the problem P is as follows:

(1)Initialization: set n=0, given a initial value of 0C , get the value of 0W through solving

0( , , )P W T C . (2)Change the cluster center:


set W = iW , get the value of iC through solving ( , , )P W T C .

If ( , , )iP W T C )= 1( , , )iP W T C + ), then output C = 1iC + , terminate the protocol; Else go to the third step.

(3) Designating a cluster: set C = 1iC + , get the value of 1iW + through solving ( , , )P W T C .

If ( , , )iP W T C = 1( , , )iP W T C+ , then output iW and C , terminate the protocol; Else i=i+1, go to the second step.

Both PCCA algorithm and K-paths algorithm are partition clusters. They usually terminate in a local optimal cluster. Because the function P(·,·) is not prominent and it descends strictly, so after finite steps, the algorithm will terminate at a local minimal point. The time complexity of the algorithm that is the same as the K-paths algorithm’s, is O(Iknm), I is the number of the iteration. PCCA algorithm pays much attention to the user’s access interest, using competitive agglomeration method, so the result of clustering is much better than K-paths algorithm. The latter experiment proves that the result of clustering can be improved significantly for personalized Web recommendation. 3 Recommendation Set

The recommendation set is composed of useful links based on the user navigation activities on the Web site. Before the Web page is sent to the client, the recommendation link is added to the last demanding page of the sessions which the users will access. We use association rule method [11] to generate recommendation set in a cluster transaction. The confidence[12] of Web page is larger than the value of association Web page set. The Web page set is this cluster’s recommendation set. Definition 6 the support of accessing Web page s is defined as:

support (s)= count( )count( )

sC

, (18)

Here, C is a cluster, count(s) is the number of

transaction containing Web page s in C, count(C) is the number of transactions in C. Definition 7 the confidence of accessing Web page s is defined as:

confidence (s,s’)= support( ')

support( )s s

s∩

, (19)

Here, s’ is a Web page. If confidence (s,s’) is larger than threshold θ which is a constant, then 's s→ is an association rule. The recommendation set of s in C is composed by all s’. 4 Experiments

The data set used in this experiment is from the Web site—“NASA”. The NASA data set contains two months worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. The log was collected from 00:00:00 August 1,1995 through 23:59:59 August 31,1995. In that period there were 1,569,898 requests. There were a total of 18,688 unique IPs visiting pages, having a total of 171,529 effective sessions. A total of 15,429 unique pages were visited.

Doing experiment with PCCA algorithm, we have selected randomly some transactions (30,51,70, 95,108) as starting cluster center in the experiment from the user’s transaction data set. Let confidence threshold θ to 5, the comparison of run time between PCCA algorithm and K-path algorithm proposed by Wang[9] is shown as figure 1.

In this paper, the value of k is determined by experiment. The experiment result shows that the correct rate is good when k equal to 70, as figure 2 shows. When the value of k is too large or too small, the correct rate of recommendation will decrease. η is a function with k iteration. It is also an important variable of competitive agglomeration method.

Fig.1 The comparison of run time between PCCA

algorithm and K-path algorithm We should have an appropriate selection of the

value of n. The value of n is very important to the cluster, so in order to have a good efficient algorithm, we choose a minor n that is less than 8 in

2 0

4 0

6 0

8 0

1 0 0

0 2 0 4 0 6 0 8 0 1 0 0 1 2 0

K

Run

tim

e(s)

P C CA

K - pa t h


Personalized Web recommendation based on path clustering

Documents

Transcript of Personalized Web recommendation based on path clustering