Clustering Techniques Utilized in Web Usage...

Clustering Techniques Utilized in Web Usage Mining

RENATA IVANCSY andFERENCKOVACS

Department of Automation and Applied Informaticsand HAS-BUTE Control Research Group

Budapest University of Technology and EconomicsH-1111, Goldmann Gy. ter 3.

Budapest, Hungaryemail:{renata.ivancsy, ferenc.kovacs}@aut.bme.hu

Abstract: - Clustering is the process of grouping objects together in such a way that the objects belonging tothe same group are similar and those belonging to different groups are dissimilar. Clustering technique canbe used in many applications for example biological, financial applications and many more. One of theseapplication types is Web clustering where different types of objects can be clustered into different groups forvarious purposes. This paper deals with the different aspects of Web data mining and provides an overviewabout the various techniques used in this field.

Key-Words:- Data Mining, Clustering, Web usage mining, Web usage clustering

1 Introduction

The expansion of the World Wide Web (Web forshort) has resulted in a large amount of data that isnow in general freely available for user access. Thedifferent types of data have to be managed and or-ganized that they can be accessed efficiently. There-fore, the application of data mining techniques on theWeb is now the focus of an increasing number of re-searchers.

Several data mining methods are used to dis-cover the hidden information in the Web. However,Web mining does not only mean applying data min-ing techniques to the data stored in the Web. The al-gorithms have to be modified to better suit the de-mands of the Web. New approaches should be usedbetter fitting to the properties of Web data. Fur-thermore, not only data mining algorithms, but alsoartificial intelligence, information retrieval and nat-ural language processing techniques can be used ef-ficiently. Thus, Web mining has been developed intoan autonomous research area.

Web mining involves a wide range of applica-tions that aim at discovering and extracting hiddeninformation in data stored on the Web. Another im-portant purpose of Web mining is to provide a mech-anism to make the data access more efficiently andadequately. The third interesting approach is to dis-cover the information which can be derived from theactivities of users, which are stored in log files forexample for predictive Web caching [1]. Thus, Webmining can be categorized into three different classesbased on which part of the Web is to be mined [2],

[3], [4]. These three categories are (i) Web contentmining, (ii) Web structure mining and (iii) Web usagemining. For detailed surveys of Web mining pleaserefer to [5], [3], [2], [6].

Web content mining [7], [6] is the task of dis-covering useful information available on-line. Thereare different kinds of Web content which can pro-vide useful information to users, for example multi-media data, structured (i.e. XML documents), semi-structured (i.e. HTML documents) and unstructureddata (i.e. plain text). The aim of Web content miningis to provide an efficient mechanism to help the usersto find the information they seek. Web content min-ing includes the task of organizing and clustering thedocuments and providing search engines for access-ing the different documents by keywords, categories,contents.

Web structure mining [8], [9], [10], [11] is theprocess of discovering the structure of hyperlinkswithin the Web. Practically, while Web content min-ing focuses on the inner-document information, Webstructure mining discovers the link structures at theinter-document level. The aim is to identify the au-thoritative and the hub pages for a given subject.

Web usage mining is the task of discovering theactivities of the users while they are browsing andnavigating through the Web. The aim of understand-ing the navigation preferences of the visitors is to en-hance the quality of electronic commerce services (e-commerce), to personalize the Web portals [12] orto improve the Web structure and Web server per-formance [13]. For this reason a model of the users

Proceedings of the 5th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Madrid, Spain, February 15-17, 2006 (pp237-242)

(User Model - UM) have to be built based on the in-formation gained from the log data.

The organization of this paper is as follows.Section 3 introduces the basic concepts of cluster-ing and enumerates some important clustering algo-rithms. Section 2 describes the aspects of Web usagemining. In Section 4 the different aspects of clas-sifying the clustering methods utilized in web usagemining are described. Conclusion can be found inSection 5.

2 Web Usage Mining

The aim of Web usage mining is to discover pat-terns of user activities in order to better serve theneeds of the users for example by dynamic link han-dling, by page recommendation, etc. The aim of aWeb site or Web portal is to supply the user the in-formation which is useful for him. There is a greatcompetition between the different commercial portalsand Web sites because every user means eventuallymoney (through advertisements, etc.). Thus the goalof each owner of a portal is to make his site more at-tractive for the user. For this reason the response timeof each single site have to be kept below 2s. More-over some extras have to be provided such as supply-ing dynamic content or links or recommending pagesfor the user that are possible of interest of the givenuser. Clustering of the user activities stored in differ-ent types of log files is a key issue in the Web com-munity.

There are three types of log files that can be usedfor Web usage mining [14]. Log files are stored onthe server side, on the client side and on the proxyservers. By having more than one place for storingthe information of navigation patterns of the usersmakes the mining process more difficult. Really re-liable results could be obtained only if one has datafrom all three types of log file. The reason for thisis that the server side does not contain records ofthose Web page accesses that are cached on the proxyservers or on the client side. Besides the log file onthe server, that on the proxy server provides addi-tional information. However, the page requests storedin the client side are missing. Yet, it is problematic tocollect all the information from the client side. Thus,most of the algorithms work based only the serverside data.

Web usage mining consists of three main steps:(i) preprocessing, (ii) pattern discovery and (iii) pat-tern analysis [15]. Figure 1 shows the block diagram

of the process of Web usage mining.

Figure 1. Process of Web usage mining

In the preprocessing phase the data have to becollected from the different places it is stored (clientside, server side, proxy servers). After identifying theusers, the click-streams of each user has to be splitinto sessions. In general the timeout for determininga session is set to 30 minute [15].

The pattern discovery phase means applyingdata mining techniques on the preprocessed log data.It can be frequent pattern mining, association rulemining or clustering. In this paper we are dealingonly with the task of clustering web usage log. Inweb usage mining there are two types of clusters tobe discovered: usage clusters and page clusters.

The aim of clustering users is to establishgroups of users having similar browsing behavior.The users can be clustered based on several infor-mation. In the one hand, the user can be requestedfilling out a form regarding their interests, for exam-ple when registrating on the web portal. The clus-tering of the users can be accomplished based on theforms. On the other hand, the clustering can be madebased on the information gained from the log datacollected during the user was navigating through theportal. Different types of user data can be collectedusing these methods, for example (i) characteristicsof the user (age, gender, etc.), (ii) preferences and in-terests of the user, (iii) user’s behavior pattern.

The aim of clustering web pages is to havegroups of pages that have similar content. This in-formation can be useful for search engines or for ap-plications that create dynamic index pages.

The last step of the whole web usage miningprocess is to analyze the patterns found during thepattern discovery step. The irrelevant patterns haveto be filtered out, and the resulted patterns or clusters


have to be validated. Some visualization techniquescan help this process for the user.

3 Clustering Algorithms

Clustering is a process of discovering groups of ob-jects such that the objects belonging to the samegroup are similar in a certain manner, and the ob-jects belonging to different groups are dissimilar. Themain problems one faces when creating a clusteringalgorithm are the following:

• The objects can have hundreds of attributes thathave to be taken into consideration for cluster-ing. One of the key issues is how to reduce thisnumber to achieve an efficient algorithm.

• The type of the attributes can be diverse, andnot only numerical attributes has to be handled.

• Because of the first two problem defining asimilarity function between the objects is nota trivial task. Many features and many types ofattributes has to be handled efficiently.

• The main feature of clustering in a data min-ing application is the high number of objectsthat have to be clustered. Thus the processingtime or the memory requirement of the algo-rithm can be huge, that has to be reduced usingsome heuristics.

• Validating the resulting clusters is also a hardtask. In case of low dimensionality, when theclusters can be represented visually, the valida-tion can be made by a human, but in case hav-ing large number of objects with high dimen-sionality statistical methods have to be usedand indices have to be defined which can becomputationally expensive.

There are many algorithms in the literature thatdeal with the problem of clustering large number ofobjects. The different algorithms can be classified re-garding different aspects. One of the key issue, whichalso determines another features of the algorithm isthe basic approach of the clustering algorithm.

The aim of thepartition-based algorithms isto decompose the set of objects into a set of disjointclusters where the number of the resulting clustersis predefined by the user. The algorithm uses an it-erative method, and based on a distance measure it

updates the cluster of each object. It is done un-til any changes can be made. The most representa-tive partition-based clustering algorithms are the k-means and the k-mediod, and in the data mining fieldthe CLARANS [16]. The advantage of the partition-based algorithms that they use an iterative way to cre-ate the clusters, but the drawback is, that the numberof clusters have to be determined in advance and onlyspherical shapes can be determined as clusters.

Hierarchical algorithms provides a hierarchi-cal grouping of the objects. There exist two ap-proaches, the bottom-up and the top-down approach.In case of bottom-up approach, at the beginning ofthe algorithm each object represents a different clus-ter and at the end all objects belong to the same clus-ter. In case of top-down method at the start of the al-gorithm all objects belong to the same cluster whichis split, until each object constitute a different cluster.The steps of the algorithms can be represented usinga dendrogram. The resulting clusters are determinedby cutting the dendrogram by a certain level. A keyaspect in these kind of algorithms is the definition ofthe distance measurements between the objects andbetween the clusters. Many definitions can be usedto measure distance between the objects, for exampleEucledian, City-Block, Minkowski and so on. Be-tween the clusters one can determine the distance asthe distance of the two nearest objects in the two clus-ter, or as the two farest or as the distance between themediods of the clusters. The drawback of the hierar-chical algorithm is that after an object is assigned toa given cluster it cannot be modified later. Further-more, like in partition-based case, also only sphericalclusters can be obtained. The advantage of the hierar-chical algorithms is that the validation indices (corre-lation, inconsistency measure), which can be definedon the clusters, can be used for determining the num-ber of the clusters. The best known hierarchical clus-tering methods are CHAMELEON [17], BIRCH [18]and CURE [19].

Density-based algorithmsstart by searchingfor core objects, and they are growing the clustersbased on these cores and by searching for objects thatare in a neighborhood within a radiusε of a given ob-ject. The advantage of these type of algorithms is thatthey can detect arbitrary form of clusters and it canfilter out the noise. DBSCAN [20] and OPTICS [21]are density-based algorithms.

Grid-based algorithms The grid-based algo-rithms use a hierarchical grid structure to decom-pose the object space into finite number of cells. For


each cell statistical information is stored about theobjects and the clustering is achieved on these cells.The advantage of this approach is the fast processingtime that is in general independent of the number ofdata objects. Grid-based algorithms are STING [22],CLIQUE [23] and WaweCluster [24].

Model-based algorithmsuse different distribu-tion models for the clusters which should be verifiedduring the clustering algorithm. A model-based clus-tering method is MCLUST [25].

Fuzzy algorithms suppose that no hard clus-ters exist on the set of objects, but one object canbe assigned to more than one cluster. The bestknown fuzzy clustering algorithm is FCM (Fuzzy C-MEANS) [26].

4 Classifying the Different Web ClusteringAlgorithms

There exists a great variety of Web usage clusteringalgorithms that can be categorized regarding severalaspects. In this paper the aspects for classifying thedifferent algorithms are the following: (i) The type ofthe objects to be clustered, (ii) The purpose of cluster-ing, (iii) The clustering algorithm used, (iv) The typeof the clusters discovered (cluster overlap handling)and (v) Similarity measure.

In the rest of the paper the different aspects ofclassifying is described in detail, and the most impor-tant and the best known web usage clustering algo-rithms are categorized based on the proposed aspects.

4.1 Object type

One aspect of categorizing a web usage clustering al-gorithm is the type of the objects to be clustered. Asmentioned in the previous section the target of theclustering can be various.

• Web pages: the most common task is to clusterweb pages based on the navigation behavior ofthe users.

• Web page sequences: the basis of the cluster-ing algorithm can be not only the frequency ofvisiting a web page, but also the frequency ofvisiting a sequence of web pages. In this casethe order of the pages also plays an importantrole.

• User rating results: if the users have the pos-sibility rating the different documents and web

pages, then the pages can be clustered basedon this information. This type of clustering iseasier than clustering from log data because inthis case the questions can be set up in such away that the resulting answer vector suits theinformation searched for.

• Registration information : Another type ofuser feature collection is the registration infor-mation in which the user supply many informa-tion about himself and about his interests.

From the first three type of objects the task ofweb page clustering can be accomplished, while fromthe last type of objects the users can be clustered.

4.2 Clustering purpose

It plays a key role what the purpose of the cluster-ing algorithm is. There exist many of purposes fromwhich we just mention some frequently used. Themost frequently used goal is to make a dynamic por-tal with page recommendation. For this reason thepages, documents or even user rating results can beclustered. Another important aspect is to have a por-tal with personalized pages or with user profiles. Forthis reason a user model have to be created regard-ing their navigational behavior. Page categorizing orpage indexing makes the navigation the users easierbecause in this case pages that are similar in a givenmanner are enumerated near to each other.

4.3 Clustering algorithm

Because each clustering method performs differentlyfor different purposes, the web usage clustering algo-rithms uses different basic clustering algorithms re-garding the tasks they are accomplishing. The differ-ent basic algorithms that can be used are described inSection 3.

Moreover, beside the basic algorithms, somenew approaches are used in Web usage mining aswell. Markov models and Fuzzy clustering algo-rithms are frequently used in this field. Semantic La-tence Analysis (LSA) is another approach that can beused for Web usage mining.

4.4 Cluster overlap handling

It is an interesting question how the boundary of acluster is defined. In several cases finding hard clus-ters are the objective of the clustering algorithms. In


this cases one object belongs to only one cluster. Inother cases, however, it is enabled to have objects thatbelong to more than one clusters at the same time. Inthis case the algorithm discovers overlapping or fuzzyclusters.

4.5 Similarity measure

Because of the non-numerical feature of the web us-age clustering problem, it is an important aspect howthe similarity of the objects to be clustered is defined.In some cases the non numerical attributes are omit-ted, or transformed to numerical values, and Eucle-dian or Minkowski distance is used. In other casesnew metrics are introduced in order to get a betterclustering result.

5 Conclusion

This paper deals with the problem of discoveringhidden information from large amount of log data,namely, with Web usage mining. The focus ofthis paper is clustering among the different miningprocesses. After describing the task of clustering,the most common clustering methods were enumer-ated based on their fundamental approach. These al-gorithms serve as basis for the web usage clusteringthat was described in detail. The different aspects ofclassifying the web usage clustering algorithms wasdescribed and a classification based on these aspectswas provided as well. In Table 1 some Web usageclustering algorithms are enumerated, and classifiedbased on the various aspects described so far.

Acknowledgment

This work has been supported by the Mobile InnovationCenter, Hungary, by the fund of the Hungarian Academyof Sciences for control research and the Hungarian Na-tional Research Fund (grant number: T042741).

References

[1] Q. Yang and H. H. Zhang, “Web-log mining forpredictive web caching.,”IEEE Trans. Knowl. DataEng., vol. 15, no. 4, pp. 1050–1053, 2003.

[2] Kosala and Blockeel, “Web mining research: A sur-vey,” SIGKDD: SIGKDD Explorations: Newsletterof the Special Interest Group (SIG) on KnowledgeDiscovery and Data Mining, ACM, vol. 2, 2000.

[3] S. K. Madria, S. S. Bhowmick, W. K. Ng, and E.-P.Lim, “Research issues in web data mining,” inData

Warehousing and Knowledge Discovery, pp. 303–312, 1999.

[4] J. Borges and M. Levene, “Data mining of user nav-igation patterns,” inWEBKDD, pp. 92–111, 1999.

[5] M. N. Garofalakis, R. Rastogi, S. Seshadri, andK. Shim, “Data mining and the web: Past, presentand future,” inACM CIKM’99 2nd Workshop on WebInformation and Data Management (WIDM’99),Kansas City, Missouri, USA, November 5-6, 1999(C. Shahabi, ed.), pp. 43–47, ACM, 1999.

[6] S. Chakrabarti, “Data mining for hypertext: A tu-torial survey.,” SIGKDD: SIGKDD Explorations:Newsletter of the Special Interest Group (SIG)on Knowledge Discovery and Data Mining, ACM,vol. 1, no. 2, pp. 1–11, 2000.

[7] M. Balabanovic and Y. Shoham, “Learning informa-tion retrieval agents: Experiments with automatedweb browsing,” inProceedings of the AAAI SpringSymposium on Information Gathering from Het-erogenous, Distributed Resources, pp. 13–18, 1995.

[8] S. Chakrabarti, B. Dom, and P. Indyk, “Enhanced hy-pertext categorization using hyperlinks,” inSIGMOD’98: Proceedings of the 1998 ACM SIGMOD inter-national conference on Management of data, (NewYork, NY, USA), pp. 307–318, ACM Press, 1998.

[9] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Ra-jagopalan, and A. S. Tomkins, “The Web as a graph:Measurements, models and methods,”Lecture Notesin Computer Science, vol. 1627, pp. 1–18, 1999.

[10] J. Hou and Y. Zhang, “Effectively finding relevantweb pages from linkage information.,”IEEE Trans.Knowl. Data Eng., vol. 15, no. 4, pp. 940–951, 2003.

[11] H. Han and R. Elmasri, “Learning rules for concep-tual structure on the web,”J. Intell. Inf. Syst., vol. 22,no. 3, pp. 237–256, 2004.

[12] M. Eirinaki and M. Vazirgiannis, “Web miningfor web personalization,”ACM Trans. Inter. Tech.,vol. 3, no. 1, pp. 1–27, 2003.

[13] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu, “Min-ing access patterns efficiently from web logs,” inPADKK ’00: Proceedings of the 4th Pacific-AsiaConference on Knowledge Discovery and Data Min-ing, Current Issues and New Applications, (London,UK), pp. 396–407, Springer-Verlag, 2000.

[14] J. Srivastava, R. Cooley, M. Deshpande, and P.-N.Tan, “Web usage mining: Discovery and applicationsof usage patterns from web data,”SIGKDD Explo-rations, vol. 1, no. 2, pp. 12–23, 2000.

[15] P. Batista, M. ario, and J. Silva, “Mining web accesslogs of an on-line newspaper,” 2002.

[16] R. T. Ng and J. Han, “Clarans: A method for cluster-ing objects for spatial data mining,”IEEE Transac-tions on Knowledge and Data Engineering, vol. 14,no. 5, pp. 1003–1016, 2002.


Algorithm Object Purpose Basic Overlap Similaritytype algorithm handling measure

Fu et. al Page URL Page BIRCH hard Eucledian[27] categorization metricShahabi et al User session K-means hard Angle among[28] user visitsNasraoui et al User session Web perzonalization Relational Fuzzy fuzzy Novel metric[29]Manco et al Web Personalized Novel algorithm, hard content[30] transactions recommendation centroid-based similarityCadez et al URL Interactive exploration Model-based hard[31] and visualization (First order Markov-model)

of Web logs

Table 1. Classification of Web usage mining algorithms

[17] G. Karypis, E.-H. S. Han, and V. K. NEWS,“Chameleon: Hierarchical clustering using dynamicmodeling,” Computer, vol. 32, no. 8, pp. 68–75,1999.

[18] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH:an efficient data clustering method for very largedatabases,” pp. 103–114, 1996.

[19] S. Guha, R. Rastogi, and K. Shim, “Cure: An ef-ficient clustering algorithm for large databases,” inSIGMOD 1998, Proceedings ACM SIGMOD Inter-national Conference on Management of Data, June2-4, 1998, Seattle, Washington, USA(L. M. Haas andA. Tiwary, eds.), pp. 73–84, ACM Press, 1998.

[20] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu,“A density-based algorithm for discovering clustersin large spatial databases with noise.,” inKDD,pp. 226–231, 1996.

[21] M. Ankerst, M. M. Breunig, H.-P. Kriegel, andJ. Sander, “Optics: ordering points to identify theclustering structure,” inSIGMOD ’99: Proceedingsof the 1999 ACM SIGMOD international conferenceon Management of data, (New York, NY, USA),pp. 49–60, ACM Press, 1999.

[22] W. Wang, J. Yang, and R. Muntz, “Sting: A statisti-cal information grid approach to spatial data mining,”1997.

[23] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Ragha-van, “Automatic subspace clustering of high dimen-sional data for data mining applications,” pp. 94–105, 1998.

[24] G. Sheikholeslami, S. Chatterjee, and A. Zhang,“WaveCluster: A multi-resolution clustering ap-proach for very large spatial databases,” inProc. 24th

Int. Conf. Very Large Data Bases, VLDB, pp. 428–439, 24–27 1998.

[25] C. Fraley and A. Raftery, “Mclust: Softwarefor model-based cluster and discriminant analysis,”1999.

[26] J. C. Bezdeck, R. Ehrlich, and W. Full, “Fcm: Fuzzyc-means algorithm,”Computers and Geoscience,vol. 10, no. 2-3, pp. 191–203, 1984.

[27] Y. Fu, K. Sandhu, and M. Shih, “Clustering of webusers based on access patterns,” 1999.

[28] C. Shahabi, A. M. Zarkesh, J. Adibi, and V. Shah,“Knowledge discovery from users web-page naviga-tion,” in RIDE ’97: Proceedings of the 7th Inter-national Workshop on Research Issues in Data En-gineering (RIDE ’97) High Performance DatabaseManagement for Large-Scale Applications, (Wash-ington, DC, USA), pp. 1–20, IEEE Computer Soci-ety, 1997.

[29] O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapu-ram, “Mining web access logs using relational com-petitive fuzzy clustering.”

[30] G. Manco, R. Ortale, and D. Saccà,“Similarity-based clustering of web transac-tions,” in SAC ’03: Proceedings of the 2003 ACMsymposium on Applied computing, (New York, NY,USA), pp. 1212–1216, ACM Press, 2003.

[31] I. Cadez, D. Heckerman, C. Meek, P. Smyth, andS. White, “Model-based clustering and visualizationof navigation patterns on a web site,”Data Min.Knowl. Discov., vol. 7, no. 4, pp. 399–424, 2003.


Clustering Techniques Utilized in Web Usage...

Documents

Transcript of Clustering Techniques Utilized in Web Usage...