
Effective Prediction of Web-user Accesses: A Data Mining Approach

Alexandros Nanopoulos Dimitris Katsaros Yannis Manolopoulos∗

Data Engineering Lab, Department of Informatics, Aristotle University, Thessaloniki 54006, Greece
{alex,dimitris,manolopo}@delab.csd.auth.gr

Abstract

The problem of predicting web-user accesses has recently attracted significant attention. Several algorithms have been proposed, which find important applications, like user profiling, recommender systems, web prefetching, design of adaptive web sites, etc. In all these applications the core issue is the development of an effective prediction algorithm. In this paper, we focus on web prefetching, because of its importance in reducing the user-perceived latency present in every Web-based application. The proposed method can be easily extended to the other aforementioned applications.

Prefetching refers to the mechanism of deducing forthcoming page accesses of a client, based on access log information. We examine a method that is based on a new type of association patterns which, unlike existing approaches, considers all the specific characteristics of Web-user navigation. Experimental results indicate its superiority over existing methods.

Index Terms — Prediction, Web Log Mining, Prefetching, Association Rules, Data Mining.

1 Introduction

The problem of modeling and predicting a user's accesses on a web site has attracted a lot of research interest. It has been used [18] to improve web performance through caching [5, 12, 36] and prefetching [31, 20, 32, 25, 34, 35], recommend related pages [17, 33], improve search engines [9] and personalize the browsing in a web site [34].

The core issue in prediction is the development of an effective algorithm that deduces the future user

∗Contact author. email: [email protected], tel: +3031 996363, fax: +3031 998419

requests. The most successful approach towards this goal has been the exploitation of the user's access history to derive predictions. A thoroughly studied field in this area is Web prefetching. It shares all the characteristics that identify Web prediction applications, like the ones previously mentioned. Nevertheless, web prefetching presents some particularities involving the underlying communication medium, i.e., the Internet. However, developments in predictive Web prefetching algorithms can be easily adapted to the related topics, e.g., user profiling, recommender systems, design of adaptive web sites, etc.

The objective of prefetching is the reduction of the user-perceived latency. Since the Web's popularity resulted in heavy traffic on the Internet, the net effect of this growth was a significant increase in the user-perceived latency. Potential sources of latency are the web servers' heavy load, network congestion, low bandwidth, bandwidth underutilization and propagation delay. The obvious remedy, that is, to increase the bandwidth, does not seem viable, since the Web's infrastructure (the Internet) cannot be easily changed without significant economic cost. Moreover, propagation delay cannot be reduced beyond a certain point, since it depends on the physical distance between the communicating end points. The caching of web documents at various points in the network (client, proxy, server [5, 12, 36]) has been developed to reduce latency. Nevertheless, the benefits reaped from caching can be limited [24]: when Web resources tend to change very frequently, when resources cannot be cached (dynamically generated web documents) or contain cookies, and when request streams do not exhibit high temporal locality.

On the other hand, prefetching refers to the process of deducing a client's future requests for Web objects and fetching those objects into the cache, in the background, before an explicit request is made for them. Prefetching capitalizes on the spatial locality present in request streams [1], that is, correlated references to different documents, and exploits the client's idle time, i.e., the time between successive requests. The main advantage of employing prefetching is that it prevents bandwidth underutilization and hides part of the latency. Nevertheless, an over-aggressive scheme may cause excessive network traffic. Additionally, without a carefully designed prefetching scheme, several transferred documents may not be used by the client at all, thus wasting bandwidth. Nevertheless, an effective prefetching scheme, combined with a transport-rate control mechanism, can shape the network traffic, significantly reducing its burstiness and thus improve network performance [11].

In general, there exist two prefetching approaches. Either the client informs the system about its future requirements [30] or, in a more automated manner and transparently to the client, the system makes predictions based on the sequence of the client's past references [14, 31]. The first approach is characterized as informed and the latter as predictive prefetching. In the design of a prefetching scheme for the Web, its special characteristics must be taken into account. Two of them seem to heavily affect such a design: a) the client-server paradigm of computing that the Web implements, b) its hypertextual nature. Therefore, informed prefetching seems inapplicable in the Web, since a user does not know in advance his/her future requirements, due to the "navigation" from page to page by following hypertext links.

1.1 Mechanism of Predictive Prefetching

Web servers are in a better position to make predictions about future references, since they log a significant1 part of the requests by all Internet clients for the resources they own. The prediction engine can be implemented through an exchange of messages between the server and clients, with the server piggybacking information about the predicted resources onto regular response messages, avoiding the establishment of any new TCP connections [13]. Such a mechanism has been implemented in [13, 19] and seems the most appropriate, since it requires relatively few enhancements to the current request-response protocol and no changes to the HTTP 1.1 protocol.

In what follows in this article, we assume that there is a system implementing a server-based predictive prefetcher, which piggybacks its predictions as hints to its clients. Figure 1 illustrates how such an enhanced Web server could cooperate with a prefetch engine to disseminate hints every time a client requests a document of the server.

1They only miss the requests satisfied by browser or proxy caches.

[Figure 1 shows a Web client exchanging document requests and responses over the Internet with an enhanced Web server; the server's request-servicing module logs requests to the server log files, which the prediction engine reads to supply predictions, and responses are returned together with prefetching hints.]

Figure 1: Proposed architecture of a prediction-enabled Web server.

1.2 Paper contribution

In this paper we focus on predictive prefetching. First, we identify three factors that characterize the performance of predictive web prefetching algorithms: a) the order of dependencies between page accesses, b) the noise present in user access sequences due to random accesses of a user (i.e., accesses that are not part of a pattern), c) the ordering2 of accesses within access sequences. We develop a new algorithm that takes into account the above factors. An extensive experimental comparison of all algorithms, for the first time (the performance of existing algorithms has been examined only independently), indicates that the proposed algorithm outperforms existing ones, by combining their advantages without presenting their deficiencies.

The rest of the paper is organized as follows. Section 2 reviews related work and outlines the motivation of this work, and Section 3 describes the proposed algorithm. Section 4 provides the experimental results and, finally, Section 5 contains the conclusions.

2 Related Work

Research on predictive web prefetching has involved the important issue of log file processing and the determination of user transactions (sessions) from it3. Several approaches have been proposed in this direction [15, 16]. Since it is a necessary step for every web prefetching method, more or less similar approaches to transaction formation from log files have

2The term ordering refers to the arrangement of accesses in a sequence, whereas order to the dependencies among them.

3Notice that this issue is not required for prefetching in the context of file systems.

been proposed in [31, 32, 25]. However, the most important factor for any web prefetching scheme is the prediction algorithm, which is used to determine the actual documents to be prefetched.

The prediction scheme described in [31] uses a prefetching algorithm proposed for prefetching in the context of file systems [21]. It constructs a data structure, called the Dependency Graph (DG), which maintains the pattern of accesses to the different documents stored at the server. The left part of Figure 2 illustrates an example of a dependency graph. For a complete description of the scheme see [21, 31]. The work described in [7] essentially uses the dependency graph, but makes predictions by computing the transitive closure of this graph. This method was tested and did not show significantly better results compared to the simple dependency graph.
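As a concrete illustration, the DG construction can be sketched in a few lines of Python; the class, the window and threshold values, and the confidence-based `predict` rule are our own simplifications of the scheme in [21, 31], not the authors' exact implementation. Trained on the two request streams of Figure 2 (ABCACBD and CCABCBCA, lookahead window 2), it reproduces the node counts A/4, B/4, C/6, D/1:

```python
from collections import defaultdict

class DependencyGraph:
    """Simplified dependency-graph (DG) predictor with a lookahead window."""

    def __init__(self, window=2, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.node_count = defaultdict(int)   # accesses per document
        self.arc_count = defaultdict(int)    # (a, b) -> times b followed a within window

    def train(self, stream):
        for i, doc in enumerate(stream):
            self.node_count[doc] += 1
            # every document seen within the next `window` positions gets an arc
            for follower in stream[i + 1 : i + 1 + self.window]:
                self.arc_count[(doc, follower)] += 1

    def predict(self, doc):
        """Documents whose arc confidence from `doc` exceeds the threshold."""
        return [b for (a, b), c in self.arc_count.items()
                if a == doc and c / self.node_count[a] >= self.threshold]

dg = DependencyGraph(window=2, threshold=0.5)
for s in ("ABCACBD", "CCABCBCA"):
    dg.train(s)
# node counts now match Figure 2: A/4, B/4, C/6, D/1
```

Prediction for a document returns its graph successors whose arc weight, divided by the document's own count, clears the confidence threshold.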

The scheme described in [32, 20] also uses a prefetching algorithm from the context of file systems [14]. It is based on the notion of an m-order Prediction-by-Partial-Match (PPM) predictor. An m-order PPM predictor maintains Markov predictors of order j, for all 1 ≤ j ≤ m. This scheme is also called the All-mth-Order Markov model [18]. The right part of Figure 2 illustrates a 2nd-order PPM predictor, where paths emanate from the tree root with maximum length equal to m+1 (= 3). For a complete description of the scheme see [14, 23, 32, 20].
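A minimal sketch of an m-order PPM predictor can likewise be written by counting all subsequences of consecutive accesses of length up to m+1 (the root-emanating trie paths); the dictionary representation and the longest-matching-suffix `ppm_predict` rule are our own simplifications, not the exact scheme of [14, 32]:

```python
from collections import defaultdict

def build_ppm(streams, m=2):
    """Count every run of consecutive accesses with length up to m+1,
    i.e., the paths of a trie whose branches start at the root."""
    counts = defaultdict(int)
    for s in streams:
        for i in range(len(s)):
            for length in range(1, m + 2):
                if i + length <= len(s):
                    counts[tuple(s[i : i + length])] += 1
    return counts

def ppm_predict(counts, context, threshold=0.5):
    """Predict successors of the longest matching suffix of `context`
    whose conditional frequency clears the threshold."""
    for j in range(len(context), 0, -1):
        ctx = tuple(context[-j:])
        if counts.get(ctx):
            preds = [p[-1] for p, c in counts.items()
                     if len(p) == j + 1 and p[:-1] == ctx
                     and c / counts[ctx] >= threshold]
            if preds:
                return preds
    return []

counts = build_ppm(["ABCACBD", "CCABCBCA"], m=2)
# the 1-grams agree with the root counts of Figure 2 (A/4, C/6, ...)
```

For the context "CA", no 2nd-order successor clears the 0.5 threshold, so the sketch falls back to the 1st-order context "A".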

Recently, several algorithms have been proposed for mining patterns from web logs [16, 8, 10, 15, 29, 27]. Although these patterns can be characterized as descriptive, since they indicate regularities discovered from user access information, algorithms for web log mining and for predictive web prefetching share the common objective of determining statistically significant user access sequences, i.e., access patterns. The web prefetching strategy proposed in [25] develops a specialized association rule mining algorithm to discover the prefetched documents. It discovers dependencies between pairs of documents (association rules with one item in the head and one item in the body).

Other related work includes [26], which describes a prefetching algorithm that is also based on association rule mining. However, the subject of that paper is Web-server caching, and more particularly the prefetching of documents from the Web server's disk to its main memory. This approach differs from web prefetching, which concerns the prefetching of documents from the server into the client's cache. Improvements in the efficiency of PPM are examined in [18] (which uses the name All-mth-Order Markov model for PPM). Three pruning criteria are proposed: a) support-pruning, b) confidence-pruning, c) error-pruning. The subject of [18] is mainly efficiency (experimental results indicated only marginal improvements in the effectiveness of PPM). Nevertheless, support-pruning is also examined in [25, 26] and in this paper as well. The other two criteria are used in a post-processing step, on the set of discovered rules, and can be applied to any prefetching scheme; thus they are orthogonal to the subject examined in this paper. Finally, two variations of the PPM prefetcher are described in [34, 35]. The first one is a subset of PPM, whereas in the second one the selection of prefetching rules to activate is determined by "weights" assigned to them.

2.1 Motivation

Most of the existing web prefetching schemes differ from the corresponding ones proposed in the context of file systems only because they use techniques for the identification of user transactions. For the core issue in prefetching, i.e., the prediction of requests, existing algorithms from the context of file systems have been utilized. Consequently, existing web prefetching algorithms do not recognize the specialized characteristics of the web. More precisely, two important factors are:

• The order of dependencies among the documents of the patterns.

• The interleaving of documents belonging to patterns with random visits within user transactions.

These factors arise from both the contents of the documents and the structure of the site (the links among documents), and are described as follows.

The choice of forthcoming pages can depend, in general, on a number of previously visited pages. This is also described in [18]. The DG, the 1-order PPM and the scheme in [25] consider only first-order dependencies. Thus, if several visits have to be considered and there exist patterns corresponding to higher-order dependencies, these algorithms do not take them into account in making their predictions. On the other hand, the higher-order PPM algorithms use a constant maximum value for the considered orders. However, no method for the determination of the maximum order is provided in [32, 20]. A choice of a small maximum may have a similar disadvantage as in the former case, whereas a choice of a large maximum may lead to unnecessary computational cost, due to the maintenance of a large number of rules.

A web user may follow, within a session, links to pages that belong to one of several patterns. However, during the same session, the user may also navigate to other pages that do not belong to this pattern


Figure 2: Left: Dependency graph (lookahead window 2) for the two request streams ABCACBD and CCABCBCA. Right: PPM predictor of 2nd order for the two request streams ABCACBD and CCABCBCA.

(or that may not belong to any pattern at all). Hence, a user transaction can contain both documents belonging to patterns and others that do not, and these documents are interleaved in the transaction. However, the PPM (of first or higher order) and [25] prefetchers consider only subsequences of consecutive documents inside transactions. On the other hand, the DG algorithm does not require the documents which comprise patterns to be consecutive in the transactions. However, since the order of the DG algorithm is one, only subsequences with two pages (not necessarily consecutive in the transactions) are considered4.

Consequently, none of the existing algorithms considers all the previously stated factors. The objective of this paper is the development of a new algorithm that considers all these factors and, subsequently, the evaluation of their significance.

3 Proposed Method

3.1 Type of prefetching rules

Associations consider rules of several orders [3] (not of one only). The maximum order is derived from the data and does not have to be specified as an arbitrary constant value [3]. For the support counting, a transaction T supports sequences that do not necessarily contain consecutive documents in T. However, although the ordering of document accesses inside a transaction is important for the purpose of prefetching, it is ignored by association rule mining algorithms [3]. The approach in [16] takes into account the ordering within access sequences; however, similar to the PPM algorithm [32], it considers only subsequences with consecutive accesses within transactions.

4It has to be noticed that the use of thresholds for statistical significance, e.g., support values [3], does not address the interleaving of random accesses within the patterns. The patterns cannot be retrieved with larger values of support, since they will still remain broken into pieces. The problem is addressed with an appropriate candidate generation procedure (cf. Section 3).

The required approach involves a new definition of the candidate generation procedure and of the containment criterion (for the support counting procedure). At the k-th phase, the candidates are derived from the self-join Lk−1 ⋈ Lk−1 [3]. However, to take the ordering of documents into account, the joining is done as follows. Let two access sequences be S1 = 〈p1, . . . , pk−1〉 and S2 = 〈q1, . . . , qk−1〉, both in Lk−1. If p1 = q1, . . . , pk−2 = qk−2, then they are combined to form two candidate sequences: c1 = 〈p1, . . . , pk−2, pk−1, qk−1〉 and c2 = 〈p1, . . . , pk−2, qk−1, pk−1〉 (i.e., c1 and c2 are not considered identical, as they would be in [3]). For instance, sequences 〈A,B,C〉 and 〈A,B,D〉 are combined to produce 〈A,B,C,D〉 and 〈A,B,D,C〉. The same holds for the second phase (k = 2). For instance, from 〈A〉 and 〈B〉, 〈A,B〉 and 〈B,A〉 are produced. The containment criterion is defined as follows:
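The order-preserving self-join can be sketched with a small hypothetical helper (sequences as Python tuples); because the double loop is symmetric, each qualifying pair of sequences yields both candidates c1 and c2:

```python
def ordered_join(Lk_1):
    """Order-preserving self-join sketch: two distinct large (k-1)-sequences
    agreeing on their first k-2 elements are combined; the symmetric loop
    produces both orderings of the two differing last elements."""
    candidates = []
    for s1 in Lk_1:
        for s2 in Lk_1:
            if s1 != s2 and s1[:-1] == s2[:-1]:
                candidates.append(s1 + (s2[-1],))   # <p1,...,pk-2, pk-1, qk-1>
    return candidates

# <A,B,C> joined with <A,B,D> produces <A,B,C,D> and <A,B,D,C>
L3 = [("A", "B", "C"), ("A", "B", "D")]
```

In the second phase the shared prefix is empty, so 〈A〉 and 〈B〉 produce both 〈A,B〉 and 〈B,A〉, as in the text.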

Definition 1 If T = 〈p1, . . . , pn〉 is a transaction, an access sequence S = 〈p′1, . . . , p′m〉 is contained by T iff there exist integers 1 ≤ i1 < . . . < im ≤ n such that p′k = pik, for all k, where 1 ≤ k ≤ m. □

A sequence, S, of documents contained in a transaction, T, with respect to Definition 1 is called a subsequence of T and the containment is denoted as S ⊑ T.
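Definition 1 translates directly into code; the following hypothetical `contains` helper checks order-preserving containment with gaps allowed:

```python
def contains(T, S):
    """Definition 1 as code: S is contained by transaction T iff the
    elements of S occur in T in the same order, not necessarily
    consecutively."""
    it = iter(T)
    # `p in it` advances the iterator, so positions strictly increase
    return all(p in it for p in S)
```

For example, `contains(("B", "D", "A"), ("B", "A"))` holds even though B and A are not consecutive, whereas `contains(("B", "D", "A"), ("A", "B"))` does not, because the ordering is violated.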

Related to the above scheme are the ones presented in [27, 28]. By considering user sessions as customer sequences, the problem can also be transformed to the setting of [4] for mining sequential patterns (each request corresponds to a customer transaction). Nevertheless, the special phases of [4] for the processing of customer transactions (i.e., the elements of customer sequences) are not applicable, since the latter are of

length one when considering user sessions. Therefore, the direct application of the approach in [4] is not efficient. The scheme proposed in [22] uses a similar approach, called mining of path fragments. The main objective of path fragments is to consider subsequences with non-consecutive accesses within transactions (for handling noise), i.e., the issue that is addressed by Definition 1. The method in [22] is based on discovering patterns containing regular expressions with the ∗ wild-card between accesses of a sequence.

Although the use of wild-cards presents differences at a semantic level (it may distinguish the sequences that explicitly do not contain consecutive accesses), for the purpose of web prefetching, the use of Definition 1 addresses noise within transactions without the need for wild-cards. Path fragments [22] were examined only against the approach in [16] (subsequences with consecutive accesses), although [4, 27] had already addressed the handling of subsequences with non-consecutive accesses5. Moreover, the use of wild-cards requires the modification of the frequency counting procedure. The candidate-trie [3] should store, in addition to ordinary candidates, the ones containing wild-cards. Consequently, a significant space and time overhead is incurred (since the wild-cards may appear in a number of combinations that grows rapidly with the size of candidates). Also, during the probing of the trie by a transaction, at each trie node with the wild-card ∗, every child of the node has to be followed. This increases the time complexity of support counting. However, [22] does not present any method for the support counting phase that addresses the above issues, and no experimental results are provided to examine its performance.

In the sequel, the web prefetching scheme that is based on the type of rules determined by Definition 1 is denoted as WMo (o stands for ordered web mining).

3.2 Algorithm

The candidate generation procedure of WMo, due to the preservation of ordering, impacts the number of candidates. For instance, for two "large" documents 〈A〉 and 〈B〉, both candidates 〈A,B〉 and 〈B,A〉 will be generated in the second phase (differently, Apriori [3] would produce only one candidate, i.e., {A,B}). The same holds for candidates of larger length, for each of which the number of different permutations is large (nevertheless, ordering has to be preserved to provide correct prefetching).

5In [22] the approaches in [4] and [16] are treated as identical, despite the difference that [4] handles subsequences with non-consecutive accesses.

In order to reduce the number of candidates, pruning can be applied according to the structure of the site [27]. This type of pruning is based on the assumption that navigation is performed by following the hypertext links of the site, which form a directed graph. Therefore, an access sequence, and thus a candidate, has to correspond to a path in this graph. The candidate generation procedure and the apriori-pruning criterion [3] have to be modified appropriately, and are depicted below.

Algorithm genCandidates(Lk, G)
// Lk: the set of large k-paths, G: the site graph
begin
  foreach L = 〈ℓ1, . . . , ℓk〉, L ∈ Lk {
    N+(ℓk) = {v | ∃ arc ℓk → v ∈ G}
    foreach v ∈ N+(ℓk) {
      // apply modified apriori-pruning
      if v ∉ L and L′ = 〈ℓ2, . . . , ℓk, v〉 ∈ Lk {
        C = 〈ℓ1, . . . , ℓk, v〉
        if (∀S ⊑ C with |S| = k, S ≠ L′ ⇒ S ∈ Lk)
          insert C in the candidate-trie
      }
    }
  }
end

For the example depicted in Figure 3a, candidate 〈B,E,C〉 corresponds to a path in the graph. On the other hand, candidate 〈B,C,E〉 does not, thus it can be pruned. The reason for pruning the latter candidate is that no user transaction will contain 〈B,C,E〉, since there are no links to produce an access sequence that contains (according to Definition 1) this candidate and increases its support. For instance, assuming the example database of Figure 3b, candidate 〈B,E,C〉 will have support equal to two (it is contained in the first and fourth transactions), whereas candidate 〈B,C,E〉 is not supported by any transaction. The same applies to the example presented earlier in this section. Among candidates 〈A,B〉 and 〈B,A〉, the former can be pruned. Evidently, 〈B,A〉 is contained in three transactions, whereas 〈A,B〉 in none. Containment is tested with respect to Definition 1. For example, candidate 〈B,A〉 (which corresponds to a path) is contained in the third transaction, i.e., 〈B,D,A〉, although documents B and A are not consecutive in the transaction.
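Using the example database of Figure 3b, Definition 1 containment and a hypothetical `support` helper reproduce the support values claimed above (2 for 〈B,E,C〉, 0 for 〈B,C,E〉, 3 for 〈B,A〉, 0 for 〈A,B〉):

```python
def contains(T, S):
    # Definition 1: ordering preserved, gaps allowed
    it = iter(T)
    return all(p in it for p in S)

def support(S, database):
    # number of transactions containing S per Definition 1
    return sum(contains(T, S) for T in database)

# the example database of Figure 3b
database = [("B", "E", "C"), ("D", "B", "A"), ("B", "D", "A"),
            ("D", "B", "E", "C"), ("B", "A", "C"), ("D", "A", "C")]
```

Note that 〈B,E,C〉 is supported by the fourth transaction 〈D,B,E,C〉 even though D intervenes, which is exactly the noise-handling behavior of Definition 1.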

As illustrated, instead of generating every possible candidate and then pruning with respect to the site structure, a much more efficient approach is followed. More particularly, candidate generation, candidate storage and representation, and support counting over the transaction database take the structure of the site into account during all these steps

[Figure 3a depicts an example site structure over the documents A, B, C, D and E; Figure 3b depicts an example database of access sequences: 〈B,E,C〉, 〈D,B,A〉, 〈B,D,A〉, 〈D,B,E,C〉, 〈B,A,C〉, 〈D,A,C〉.]

Figure 3: a: An example of site structure. b: A database of access sequences.

and not in a post-processing step, thus improving efficiency. As shown, candidate generation is performed by extending candidates according to their outgoing edges in the graph, thus with respect to the site structure. Consequently, the ordering is preserved and, moreover, only paths in the graph are considered. Additionally, the apriori-pruning criterion of [3] is modified, since, for a given candidate, only its subsequences have to be tested and not any arbitrary subset of documents, as is the case for basket data [3]. Candidates are stored in a trie structure. Each transaction that is read from the database is decomposed into the paths it contains, and each of them is examined against the trie, thus updating the support of the candidates. Due to space restrictions, more details can be found in [27, 28].
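The graph-pruned candidate generation can be sketched as follows; the function is a simplified rendering of genCandidates that checks only the shifted-suffix condition of the pseudocode (the full algorithm additionally tests the remaining large subsequences), and the two-edge graph is a hypothetical fragment corresponding to the path 〈B,E,C〉 of Figure 3a:

```python
def gen_candidates(Lk, graph):
    """Simplified genCandidates: a large k-path L = <l1,...,lk> is extended
    only by successors v of its last node in the site graph (so every
    candidate is itself a path), and the shifted suffix <l2,...,lk,v>
    must already be large."""
    large = set(Lk)
    candidates = []
    for L in Lk:
        for v in graph.get(L[-1], ()):      # outgoing links of the last node
            if v not in L and L[1:] + (v,) in large:
                candidates.append(L + (v,))
    return candidates

# hypothetical site fragment with links B -> E and E -> C
graph = {"B": ("E",), "E": ("C",)}
L2 = [("B", "E"), ("E", "C")]
```

Because extensions follow only outgoing edges, a non-path candidate such as 〈B,C,E〉 is never generated in the first place.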

The overall execution time required for the support counting procedure is significantly affected by the number of candidates [3]; hence, its efficiency is improved by this pruning criterion. Although several heuristics have been proposed for reducing the number of candidates for the Apriori algorithm, they involve basket data. Pruning with respect to the site structure is required for the particular problem, due to the ordering preservation and the large increase in the number of candidates. The effectiveness of pruning is verified by the experimental results in Section 4.

4 Performance Results

This section presents the experimental results on the performance of predictive web prefetching algorithms. We focus on the DG, PPM, LBOT6, WM (denoting the plain association rule mining algorithm [3]) and WMo algorithms. Both synthetic and real

6In the experiments the algorithm proposed in [25] will be referenced as LBOT.

data were used. The performance measures used are the following (their description can also be found in [32]). Usefulness (also called Recall or Coverage): the fraction of requests provided by the prefetcher. Accuracy (also called Precision): the fraction of the prefetched requests offered to the client that were actually used. Network traffic: the number of documents that the clients get when prefetching is used, divided by the corresponding number when prefetching is not used. First, we briefly describe the synthetic data generator. Then, we present the results, and finally we provide a discussion.
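The three measures can be computed for a single run as sketched below; the set-based formulation and the traffic formula (prefetched documents plus missed requests, over the documents transferred without prefetching) are our reading of the definitions, not code from the paper:

```python
def prefetch_metrics(prefetched, requested, docs_without_prefetching):
    """Compute usefulness, accuracy and network traffic for one run.
    `prefetched` and `requested` are sets of document ids;
    `docs_without_prefetching` is the number of documents that would be
    transferred with prefetching disabled."""
    hits = prefetched & requested
    usefulness = len(hits) / len(requested)      # recall / coverage
    accuracy = len(hits) / len(prefetched)       # precision
    # assumption: with prefetching, the client transfers every prefetched
    # document plus the requested documents the prefetcher missed
    traffic = (len(prefetched) + len(requested - prefetched)) / docs_without_prefetching
    return usefulness, accuracy, traffic
```

For example, prefetching {A, B, C} against actual requests {B, C, D} (three documents without prefetching) gives usefulness 2/3, accuracy 2/3 and network traffic 4/3.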

4.1 Generation of Synthetic Work-loads

In order to evaluate the performance of the algorithms over a large range of data characteristics, we generated synthetic workloads. Each workload is a set of transactions. Our data generator implements a model for the documents and the linkage of the Web site, as well as a model for user transactions.

We choose all site documents to have links to other documents, that is, they correspond to HTML documents. The fanout of each node, that is, the number of its outgoing links to other nodes of the same site, is a random variable uniformly distributed in the interval [1..NFanout], where NFanout is a parameter of the model. The target nodes of these links are uniformly selected from the site nodes. If some nodes have no incoming links after the termination of this procedure, then they are linked to the node with the greatest fanout. With respect to document sizes, following the model proposed in [6], we set the maximum size equal to 133KB and assign sizes drawn from a lognormal distribution7 with mean value equal to 9.357KB and variance equal to 1.318KB.
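The linkage model described above can be sketched as follows; the function name and the fixed seed are our own choices, and document sizes are omitted:

```python
import random

def generate_site(N, NFanout, seed=0):
    """Sketch of the site model: each node gets a uniformly distributed
    fanout in [1..NFanout] with uniformly chosen targets; orphan nodes
    are finally linked from the node with the greatest fanout."""
    rng = random.Random(seed)
    links = {v: [rng.randrange(N) for _ in range(rng.randint(1, NFanout))]
             for v in range(N)}
    hub = max(links, key=lambda v: len(links[v]))
    targets = {t for out in links.values() for t in out}
    for v in range(N):
        if v not in targets:
            links[hub].append(v)   # ensure every node has an incoming link
    return links
```

After the orphan-fixing pass, every node of the generated graph has at least one incoming and one outgoing link.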

In simulating user transactions, we generated a pool of P paths ("pattern paths", in the sequel). Each path is a sequence of pairwise-distinct Web server documents that are linked in the site, and will be used as a "seed" for generating the transactions. Each of these paths is comprised of 4 nodes (documents), simulating the minimum length of a transaction. The paths are created in groups. Each group comprises a tree, and the paths are actually the full-length paths found in these trees. The fanout of the internal tree nodes is controlled by the parameter bf. By varying this parameter we are able to control the 'interweaving' of the paths. The nodes of these trees are selected either using the 80-20 fractal law or from the nodes

7Without loss of generality, we assume that HTML files are small files. Thus, according to [6], their sizes follow a lognormal distribution.

that were used in the trees created so far. The percentage of these nodes is controlled by the parameter order, which determines the percentage of node dependencies that are non-first-order dependencies. For example, 60% order means that 60% of the dependencies are non-first-order dependencies. Thus, by varying this parameter, we can control the order of the dependencies between the nodes in the path. The use of the fractal law results in some nodes being selected more frequently than others. This fact reflects the different popularity of the site documents, creating the so-called "hot" documents.

In order to create the transactions, we first associate a weight with each path in the pool. This weight corresponds to the probability that this path will be picked as the "seed" for a transaction. The weight is picked from an exponential distribution with unit mean, and is then normalized so that the sum of the weights for all the paths equals 1. A transaction is created as follows. First, we pick a path, say 〈A,B,C, x〉, by tossing a P-sided weighted coin, where the weight for a side is the probability of picking the associated path. Then, starting from node A, we try to find a path leading to node B, or with probability corProb to node C, whose length is determined by a random variable following a lognormal distribution whose mean and variance are parameters of the model. This procedure is repeated for every node of the initial path, except for those that, with probability corProb, were excluded from the path. The mean and variance of the lognormal distribution determine the "noise" inserted in each transaction. Low values for the mean and variance leave the transaction practically unchanged with respect to its pattern path, whereas larger values increase its length with respect to the pattern path. Table 1 summarizes the parameters of the generator.
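The seed-selection step can be sketched as follows; the pool of pattern paths shown is hypothetical, and `random.choices` plays the role of the P-sided weighted coin:

```python
import random

def assign_weights(pool, rng):
    """Each pattern path gets a weight drawn from an exponential
    distribution with unit mean, normalized so all weights sum to 1."""
    w = [rng.expovariate(1.0) for _ in pool]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
# hypothetical pool of 4-node pattern paths
pool = [("A", "B", "C", "D"), ("D", "B", "E", "C"), ("B", "A", "C", "E")]
probs = assign_weights(pool, rng)
# tossing the P-sided weighted coin picks the seed path of a transaction
seed = rng.choices(pool, weights=probs, k=1)[0]
```

Each transaction is then grown from its seed path by inserting noise between consecutive pattern nodes, as described above.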

N          Number of site nodes
NFanout    Maximum number of a node's links
T          Number of transactions
P          Number of pattern paths
bf         Branching factor of the trees
order      Order of the dependencies
noiseMean  Mean value of the noise
noiseVar   Variance of the noise
corProb    Probability of excluding a node of the pattern path

Table 1: The parameters of the generator.

4.2 Comparison of all algorithms

In order to carry out the experiments, we generated a number of workloads. Each workload consisted of T = 100,000 transactions; 35,000 of these were used to train the algorithms and the rest to evaluate their performance. The number of documents of the site for all workloads was fixed to N = 1000 and the maximum fanout to NFanout = 100, so as to simulate a dense site. The branching factor was set to bf = 4 to simulate relatively low correlation between the paths. The number of paths in the pool for all workloads was fixed to P = 1000. Several experiments, not shown in this report, indicated that varying the values of the parameters P and N does not affect the relative performance of the considered algorithms. For all the experiments presented here, the order of the PPM algorithm was set to 5, so as to capture both low- and higher-order dependencies, and the lookahead window of the DG algorithm was set equal to the length of the processed transaction, in order to be fair with respect to the other three algorithms. Also, in order to decouple the performance of the algorithms from the interference of the cache, we flushed the cache after the completion of each transaction. Each measurement in the figures that follow is the average of 5 different runs.

At this point, we must note that the performance measures used are not independent. For example, there is a strong dependence between usefulness and network traffic: an increase in the former cannot be achieved without an analogous increase in the latter. This characteristic is very important for the interpretation of the figures presented in the sequel. In order to make safe judgments about the relative performance of the algorithms, we must always examine two measures while keeping the value of the third, usually the network traffic, fixed for all the considered algorithms.
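The interplay of the three measures can be illustrated with a small sketch. The definitions below follow one common convention in the prefetching literature (accuracy as the fraction of prefetched documents that are subsequently requested, usefulness as the fraction of requests covered by a prior prefetch, and traffic as total transferred documents relative to the actual requests); the paper's exact definitions are given earlier and may differ in details. The `predict` callback and all names are illustrative.

```python
def prefetch_metrics(transactions, predict):
    """Assumed (hedged) versions of the three evaluation measures.
    `predict(prefix)` returns the set of documents prefetched after
    observing the request prefix of the current transaction."""
    prefetched = requested = useful = 0
    for trans in transactions:
        fetched = set()                      # cache flushed per transaction
        for i, doc in enumerate(trans):
            if doc in fetched:
                useful += 1                  # request covered by a prefetch
            requested += 1
            fetched |= predict(trans[: i + 1])
        prefetched += len(fetched)
    accuracy = useful / prefetched if prefetched else 0.0
    usefulness = useful / requested if requested else 0.0
    # On-demand fetches plus prefetches, relative to the actual requests;
    # a value of 1.5 corresponds to 50% overhead in network traffic.
    traffic = (requested - useful + prefetched) / requested
    return accuracy, usefulness, traffic
```

Under these definitions, raising usefulness by prefetching more documents necessarily raises the traffic term, which is why the text insists on comparing two measures at a fixed value of the third.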

The first set of experiments assessed the impact of noise on the performance of the prediction schemes. For this set of experiments, we selected a confidence value such that each scheme incurred about 50% overhead in network traffic (i.e., all methods were normalized to the same network traffic, which is the measure of resource utilization; the same applies to the forthcoming experiments). The confidence value for DG and LBOT was set to 0.10, for PPM to 0.25, and for the other two methods to 0.275 (for WMo and WM the support threshold was set to 0.1%, which was used as the default value). The results of this set of experiments are reported in the left part of Figure 4. From these figures, it can be seen that WMo clearly outperforms all algorithms in terms of

[Figure 4 appears here: six panels plotting accuracy, usefulness, and network traffic against the mean noise (left, with order = 0.6 and noise variance = 0.8) and against the higher-order percentage (right, with mean noise = 2.15 and variance = 0.8), for DG, PPM, WM, WMo and LBOT; confidence: DG = LBOT = 0.1, PPM = 0.25, WM = WMo = 0.275.]

Figure 4: Performance as a function of noise (left) and order (right).

accuracy and usefulness for the same network traffic. It is better than PPM by as much as 40%, and than DG by as much as 100%, in terms of accuracy. PPM is better than WMo for small values of noise, but it achieves this at the expense of larger network traffic. It can also be seen that the accuracy of WMo

is not affected by the increasing noise at all, implying that WMo makes correct predictions even in the presence of high noise. The usefulness of all algorithms drops with noise. In particular, the usefulness of PPM drops more steeply for noise mean values of up to 2.3; beyond this value, the usefulness of PPM seems to stabilize, but this is achieved at the expense of higher network traffic.

The second set of experiments evaluated the impact of varying order on the performance of the methods. For this set of experiments, the confidence value for each method was the same as in the previous set, whereas the mean value and variance of the noise were set to 2.15 and 0.8, respectively. The results of this set of experiments are reported in the right part of Figure 4. The general result is that only DG and LBOT are affected by the varying order, since, in order to keep their usefulness and accuracy at the same values, they increase their network traffic. The rest of the algorithms seem insensitive to the varying order, with WMo performing best among them in terms of both accuracy and usefulness.

Next, we evaluated the benefits of the prefetching algorithms for an LRU cache and compared them with the performance of the same cache with no prefetching at all. For this experiment, the cache size was selected to be in the range of a few hundred KBytes, to reflect the fact that not all, but only a small part, of the Web client cache is “dedicated” to the Web server's documents. The confidence was set so that all algorithms incur 150% network traffic. The results of this experiment are reported in Figure 5. From this figure, it is clear that prefetching is beneficial, helping a cache to improve its hit ratio by as much as 50%. The figure shows that the hit ratio increases steadily; this is due to the fact that, when the cache size becomes large enough to hold all the site documents, any future reference will be satisfied by the cache and the hit ratio will approach 100%. The same figure shows that interference due to the cache does not “blur” the relative performance of the prefetching algorithms: WMo outperforms all other algorithms, and the performance gap would be wider in environments with higher noise and higher-order dependencies between the accesses. For small cache sizes, WMo, PPM and DG have similar performance, because for these sizes the cache is not large enough to hold all prefetched documents, and thus

many of them are replaced before they can be used. On the other hand, for very large cache sizes, the performance of all algorithms converges, since almost all the site documents are cached, as explained before.
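The cache experiment above can be illustrated with a toy simulator: an LRU cache of unit-size documents into which a prediction callback injects prefetched documents after each request. This is a hedged sketch, not the paper's simulator (which accounts for document sizes and the 150% traffic normalization); the `predict` callback and all names are illustrative.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache over unit-size documents."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def touch(self, doc):
        """Return True on a hit; insert or refresh `doc` either way."""
        hit = doc in self.store
        if hit:
            self.store.move_to_end(doc)          # mark most recently used
        else:
            self.store[doc] = None
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)   # evict the LRU document
        return hit

def hit_ratio(requests, cache, predict=lambda prefix: ()):
    """Replay `requests` against `cache`, prefetching the documents
    suggested by `predict` after each request."""
    hits = 0
    for i, doc in enumerate(requests):
        hits += cache.touch(doc)
        for p in predict(requests[: i + 1]):     # prefetch into the cache
            cache.touch(p)
    return hits / len(requests)
```

Replaying the same request stream with and without a `predict` callback mimics the comparison of Figure 5: prefetching raises the hit ratio as long as the cache is large enough to retain the prefetched documents until they are used.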

[Figure 5 appears here: hit ratio against cache size (in KBytes) for DG, PPM, WM, WMo, LBOT and no prefetching, with order = 0.5, noise mean = 2.30, variance = 0.8; confidence: DG = LBOT = 0.1, PPM = 0.25, WM = WMo = 0.27.]

Figure 5: Cache hits as a function of the cache size.

4.3 Real Datasets

We conclude the evaluation of the examined prefetching algorithms by presenting some experiments that were conducted using real web server traces. Due to space limitations, we present the results obtained from one trace, namely ClarkNet (available at http://ita.ee.lbl.gov/html/traces.html). We used the first week of requests and cleansed the log (e.g., by removing CGI scripts, stale requests, etc.). The user session time was set to 6 hours, and 75% of the resulting transactions were used for training. The average transaction length was 7, with the majority of the transactions having length smaller than 5, so we set the order of the PPM prefetcher to 5 and the lookahead window of the DG prefetcher to 5. For WMo we turned off the structure-based pruning criterion, since the site structure for this dataset is not provided (for a real application, however, the site structure is easily obtainable).
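Splitting a cleansed log into user transactions with the 6-hour session window can be sketched as follows. This is an illustrative reading of the preprocessing step, not the paper's code; the record layout (user identifier, timestamp, URL) and all names are assumptions.

```python
def sessionize(log, timeout=6 * 3600):
    """Split a cleansed access log into user transactions, closing a
    session after `timeout` seconds of inactivity (the 6-hour window
    used above). `log` is a list of (user_id, timestamp, url) records,
    assumed sorted by timestamp."""
    last_seen, open_sessions, sessions = {}, {}, []
    for user, ts, url in log:
        # If the user's previous request is older than the timeout,
        # close the current session and start a new one.
        if user in open_sessions and ts - last_seen[user] > timeout:
            sessions.append(open_sessions.pop(user))
        open_sessions.setdefault(user, []).append(url)
        last_seen[user] = ts
    sessions.extend(open_sessions.values())    # flush the open sessions
    return sessions
```

The resulting sessions would then be filtered (e.g., dropping CGI requests beforehand) and split 75%/25% into training and evaluation sets, as described above.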

Table 2 presents the results of this experiment. The measurements were made so that the network traffic incurred was the same for all algorithms. As illustrated, WMo achieves better performance than all the other algorithms, in all cases, in terms of accuracy and usefulness. This also verifies the performance results obtained from synthetic data. It has to be noted that publicly available log files like ClarkNet (this log is from 1995) are not characterized by high connectivity and many alternative navigation possibilities, which impact the number and the characteristics of the navigation patterns. Thus, synthetic data can better simulate the latter case, for which the performance gain obtained from prefetching is more significant, because such applications (e.g., e-commerce sites) are highly demanding.

       DG (w=5)      LBOT          5-order PPM   WM            WMo
  T    A      U      A      U      A      U      A      U      A      U
  1.9  0.13   0.12   0.08   0.075  0.19   0.17   0.07   0.06   0.20   0.18
  1.8  0.16   0.13   0.09   0.072  0.20   0.17   0.05   0.04   0.22   0.17
  1.7  0.19   0.13   0.10   0.070  0.22   0.14   0.06   0.04   0.24   0.17
  1.6  0.16   0.11   0.05   0.035  0.23   0.13   0.07   0.04   0.27   0.17
  1.5  0.11   0.06   0.06   0.032  0.25   0.12   0.08   0.04   0.29   0.15

Table 2: Comparison of the prefetchers with real data (A: accuracy, U: usefulness).

4.4 Efficiency of WMo

Finally, we examined the effectiveness of the proposed pruning criterion. We used a synthetic dataset with the same characteristics as the ones used in the experiments of Section 4.2. This experiment compares the proposed WMo algorithm with a version that does not use pruning with respect to the site structure, denoted as WMo/wp (WMo without pruning). Moreover, we examined the WM algorithm (the Apriori algorithm for basket data), in order to provide a comparative evaluation of the results. Figure 6 illustrates the number of candidates for these methods with respect to the support threshold (given as a percentage). As depicted, WMo significantly outperforms both WMo/wp and WM for lower support thresholds, where the number of candidates is larger. For larger support thresholds, WMo still outperforms WMo/wp and produces a number of candidates comparable with that of WM. Nevertheless, as shown by the previous experiments, the performance of WM for the overall prefetching purpose is significantly lower than that of WMo.

The number of candidates significantly impacts the performance of this type of algorithm, in accordance with related work on association rule mining [3]. Therefore, the efficiency of WMo is improved by the proposed pruning. Moreover, pruning is efficiently applied by considering the site structure in every step of the rule discovery algorithm (see Section 3.2). Detailed experimental results on the execution time of this procedure can be found in [28].
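One hedged reading of structure-based pruning is the following sketch: a candidate access sequence is discarded unless each of its consecutive documents is connected by some path in the site's link graph, so candidates that no visitor could generate by navigation are never counted or supported. The exact pruning rule is defined in Section 3.2 and [28]; this sketch and its names are illustrative.

```python
from collections import deque

def reachable(graph, src, dst):
    """Breadth-first reachability test in the site link graph."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def prune_candidates(candidates, graph):
    """Keep only candidate sequences whose consecutive documents are
    connected by some path in the site structure."""
    return [c for c in candidates
            if all(reachable(graph, a, b) for a, b in zip(c, c[1:]))]
```

Applied at every candidate-generation step, such a check shrinks the candidate set before support counting, which is where the reduction reported in Figure 6 would come from.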

[Figure 6 appears here: number of candidates against the support threshold (percentage) for WM, WMo/wp and WMo, with N = 1000, T = 100,000, P = 1000, average transaction size = 12, noiseMean = 2.0, noiseVar = 0.8.]

Figure 6: Number of candidates w.r.t. the support threshold.

5 Conclusions

We considered the problem of predictive Web prefetching, that is, of deriving users' future requests for Web documents based on their previous requests. Predictive prefetching suits the Web's hypertextual nature and significantly reduces the perceived latency. Web prefetching has common characteristics with other Web applications that involve prediction of user accesses, like user profiling, recommender systems and the design of adaptive web sites. Hence, the proposed method can be easily extended to these kinds of applications.

We presented the important factors that affect the performance of Web prefetching algorithms. The first factor is the order of the dependencies between Web document accesses. The second is the interleaving, within user transactions, of requests belonging to patterns with random ones, and the third is the ordering of requests. None of the existing approaches considers all of these factors together.

We proposed a new algorithm called WMo, whose characteristics include all the above factors. It compares favorably with previously proposed algorithms, like PPM and DG, and with existing approaches from the application of Web log mining to Web prefetching. Further, the algorithm efficiently addresses the increased number of candidates.

Using a synthetic data generator, the performance results showed that, for a variety of order and noise distributions, WMo clearly outperforms the existing algorithms. In fact, for many workloads WMo achieved high prediction accuracy with quite low overhead in network traffic. Additionally, WMo proved very efficient in reducing the number of candidates. These results were validated with experiments on a real Web trace.

In summary, WMo is an effective and efficient predictive Web prefetching algorithm. Future work includes:

• Investigation of the interaction between caching and prefetching. The goal of such an effort would be the development of adaptive Web caching policies that dynamically evaluate the benefits of prefetching and incorporate the prefetching decisions into the replacement mechanism.

• The extension of the present work in the direction of using another data mining method, namely clustering, as proposed in [37], to deal with Web users' access paths.

References

[1] V. Almeida, A. Bestavros, M. Crovella and A. de Oliveira: “Characterizing Reference Locality in the WWW”, Proceedings Conference on Parallel and Distributed Information Systems (PDIS'96), pp. 92–103, Dec. 1996.

[2] P. Atzeni, G. Mecca and P. Merialdo: “To Weave the Web”, Proceedings 23rd Conference on Very Large Data Bases (VLDB'97), pp. 206–215, Aug. 1997.

[3] R. Agrawal and R. Srikant: “Fast Algorithms for Mining Association Rules”, Proceedings 20th Conference on Very Large Data Bases (VLDB'94), pp. 487–499, Sep. 1994.

[4] R. Agrawal and R. Srikant: “Mining Sequential Patterns”, Proceedings IEEE Conference on Data Engineering (ICDE'95), pp. 3–14, 1995.

[5] C. Aggarwal, J. Wolf and P.S. Yu: “Caching on the World Wide Web”, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1, pp. 95–107, Feb. 1999.

[6] P. Barford and M. Crovella: “Generating Representative Web Workloads for Network and Server Performance Evaluation”, Proceedings ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'98), pp. 151–160, Jun. 1998.

[7] A. Bestavros: “Speculative Data Dissemination and Service to Reduce Server Load, Network Traffic and Service Time”, Proceedings IEEE Conference on Data Engineering (ICDE'96), pp. 180–189, Feb. 1996.

[8] J. Borges and M. Levene: “Mining Association Rules in Hypertext Databases”, Proceedings Conference on Knowledge Discovery and Data Mining (KDD'98), pp. 149–153, Aug. 1998.

[9] S. Brin and L. Page: “The Anatomy of a Large-scale Hypertextual Web Search Engine”, Proceedings Int. World Wide Web Conference (WWW'98), pp. 107–117, 1998.

[10] B. Berendt and M. Spiliopoulou: “Analysis of Navigation Behavior in Web Sites Integrating Multiple Information Systems”, The VLDB Journal, Vol. 9, No. 1, pp. 56–75, May 2000.

[11] M. Crovella and P. Barford: “The Network Effects of Prefetching”, Proceedings IEEE Conference on Computer and Communications (INFOCOM'98), pp. 1232–1240, Mar. 1998.

[12] P. Cao and S. Irani: “Cost-Aware WWW Proxy Caching Algorithms”, Proceedings 1997 USENIX Symposium on Internet Technology and Systems (USITS'97), pp. 193–206, Jan. 1997.

[13] E. Cohen, B. Krishnamurthy and J. Rexford: “Improving End-to-End Performance of the Web Using Server Volumes and Proxy Filters”, Proceedings ACM SIGCOMM Conference, pp. 241–253, Aug. 1998.

[14] K.M. Curewitz, P. Krishnan and J.S. Vitter: “Practical Prefetching via Data Compression”, Proceedings ACM SIGMOD Conference, pp. 257–266, Jun. 1993.

[15] R. Cooley, B. Mobasher and J. Srivastava: “Data Preparation for Mining World Wide Web Browsing Patterns”, Knowledge and Information Systems, Vol. 1, No. 1, pp. 5–32, Feb. 1999.

[16] M.S. Chen, J.S. Park and P.S. Yu: “Efficient Data Mining for Path Traversal Patterns”, IEEE Transactions on Knowledge and Data Engineering, Vol. 10, No. 2, pp. 209–221, Apr. 1998.

[17] J. Dean and M. Henzinger: “Finding Related Pages in the World Wide Web”, Proceedings Int. World Wide Web Conference (WWW'99), pp. 1467–1479, 1999.

[18] M. Deshpande and G. Karypis: “Selective Markov Models for Predicting Web-Page Accesses”, Proceedings SIAM Int. Conference on Data Mining (SDM'2001), Apr. 2001.

[19] D. Duchamp: “Prefetching Hyperlinks”, Proceedings USENIX Symposium on Internet Technologies and Systems (USITS'99), Oct. 1999.

[20] L. Fan, P. Cao, W. Lin and Q. Jacobson: “Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance”, Proceedings ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'99), pp. 178–187, Jun. 1999.

[21] J. Griffioen and R. Appleton: “Reducing File System Latency Using a Predictive Approach”, Proceedings 1994 USENIX Conference, pp. 197–207, Jan. 1995.

[22] W. Gaul and L. Schmidt-Thieme: “Mining Web Navigation Path Fragments”, Proceedings Workshop on Web Usage Analysis and User Profiling (WEBKDD'00), 2000.

[23] T. Kroeger and D.D.E. Long: “Predicting Future File-System Actions from Prior Events”, Proceedings USENIX Annual Technical Conference, pp. 319–328, Jan. 1996.

[24] T. Kroeger, D.E. Long and J. Mogul: “Exploring the Bounds of Web Latency Reduction from Caching and Prefetching”, Proceedings USENIX Symposium on Internet Technologies and Systems (USITS'97), pp. 13–22, Jan. 1997.

[25] B. Lan, S. Bressan, B.C. Ooi and Y. Tay: “Making Web Servers Pushier”, Proceedings Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), Aug. 1999.

[26] B. Lan, S. Bressan, B.C. Ooi and K. Tan: “Rule-Assisted Prefetching in Web-Server Caching”, Proceedings ACM Int. Conference on Information and Knowledge Management (CIKM'00), pp. 504–511, Nov. 2000.

[27] A. Nanopoulos and Y. Manolopoulos: “Finding Generalized Path Patterns for Web Log Data Mining”, Proceedings East-European Conference on Advances in Databases and Information Systems (ADBIS'00), pp. 215–228, Sep. 2000.

[28] A. Nanopoulos and Y. Manolopoulos: “Mining Patterns from Graph Traversals”, Data and Knowledge Engineering (DKE), Vol. 37, No. 3, pp. 243–266, Jun. 2001.

[29] J. Pei, J. Han, B. Mortazavi-Asl and H. Zhu: “Mining Access Patterns Efficiently from Web Logs”, Proceedings Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), Apr. 2000.

[30] H. Patterson, G. Gibson, E. Ginting, D. Stodolsky and J. Zelenka: “Informed Prefetching and Caching”, Proceedings ACM Symposium on Operating Systems Principles (SOSP'95), pp. 79–95, Dec. 1995.

[31] V. Padmanabhan and J. Mogul: “Using Predictive Prefetching to Improve World Wide Web Latency”, ACM SIGCOMM Computer Communications Review, Vol. 26, No. 3, Jul. 1996.

[32] T. Palpanas and A. Mendelzon: “Web Prefetching Using Partial Match Prediction”, Proceedings 4th Web Caching Workshop, Mar. 1999.

[33] P. Pirolli, J. Pitkow and R. Rao: “Silk from a Sow's Ear: Extracting Usable Structures from the Web”, Proceedings ACM Conference on Human Factors and Computing Systems (CHI'96), pp. 118–125, Apr. 1996.

[34] J. Pitkow and P. Pirolli: “Mining Longest Repeating Subsequences to Predict World Wide Web Surfing”, Proceedings USENIX Symposium on Internet Technologies and Systems (USITS'99), Oct. 1999.

[35] R. Sarukkai: “Link Prediction and Path Analysis Using Markov Chains”, Computer Networks, Vol. 33, No. 1–6, pp. 377–386, Jun. 2000.

[36] J. Shim, P. Scheuermann and R. Vingralek: “Proxy Cache Algorithms: Design, Implementation, and Performance”, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, pp. 549–562, Aug. 1999.

[37] T.W. Yan, M. Jacobsen, H. Garcia-Molina and U. Dayal: “From User Access Patterns to Dynamic Hypertext Linking”, Computer Networks, Vol. 28, No. 7–11, pp. 1007–1014, May 1996.