[IEEE 2007 2nd International Conference on Digital Information Management - Lyon, France...



Transaction Clustering of web log data files using genetic algorithm

Daisy Jacobs University of Zululand

Natal, South Africa [email protected]

S.Sarasvady

Amrita Vishwa Vidyapeetham Ettimadai

Coimbatore 641105, TN, India

[email protected]

Pit.Pichappan Imam Mohammad Bin Saud University

Riyadh, Saudi Arabia [email protected]

Abstract

Web applications are increasingly found to impact numerous environments. Web log data offer considerable promise, and the application of genetic algorithms is particularly significant because it represents the relations between different data components. We have applied simple genetic algorithms to log files and found the preliminary results promising, thereby opening more avenues for future research.

1. Introduction

The growth of the web has dramatically changed the way information is accessed and managed, thereby opening the door to exciting new scenarios for the widespread consumption and exchange of information. Along with the excitement, there is also the recognition of an urgent need for effective and efficient tools for information users, who must be able to easily locate, manage and exchange disparate information, ranging from unstructured documents and pictures to structured, but often hidden, record-oriented data.

The web is rapidly evolving into a ubiquitous computing platform for a new generation of information systems. Increasingly, web applications strive to manage data, documents or application services, spread worldwide and accessed through

diverse devices from heterogeneous environments. The requirements of these applications, as well as emerging technologies, have created new challenges and opportunities for database technology. This new landscape has compelled the web community to revisit its approach to data models, query languages, storage support, query optimization, as well as data and application services integration. It also demands that web research further interact with information retrieval, programming languages, artificial intelligence, distributed computing, workflows, and other areas of information science. The popularity of the Internet as a "first choice" source raises many interesting issues, two of which are: What types of information and advice are consumers getting from the Internet? What, if anything, should information professionals do in response to such findings? This study is part of our larger effort to understand patterns of information access using web logs in general and genetic algorithms in particular.

A large number of companies, organizations, and users are exploiting the opportunities offered by Internet-based information solutions and many more are expected to follow. Companies have put their databases and product catalogues on the Web, search engines allow electronic market participants around the globe to locate potential trading partners, and a

1-4244-1476-8/07/$25.00 ©2007 IEEE.

set of different protocols has been established to exchange goods and services using different models and following different strategies. There is a pressing need for effective tools for information users, who must be able to easily locate information of interest on the Web, ranging from unstructured documents and pictures to structured, record-oriented data. Effective and efficient access to web information has become a critical research area. This has led to the development of new e-mining tools and technology, Web data models and query languages, web site management systems, auction and negotiation systems, etc.

2. Web mining concept

Web mining refers to the discovery and analysis of data, documents, and multimedia from the World Wide Web, including content, hyperlink structure, and access statistics. The explosion of information available on the web has increased the need for tools and technologies for efficient document and information extraction and management. The term web mining has been used in two distinct ways. The first, referred to as Web content mining in this paper, describes the process of information or resource discovery from millions of sources across the World Wide Web. The second, Web usage mining, is the process of mining Web access logs or other user information to discover user browsing and access patterns on one or more Web localities [1]. The Web usage mining approach allows the discovery of anonymous and implicit user behavioral patterns for a site without relying on subjective assessments.
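Web usage mining starts from raw server access logs such as those named above. As a minimal illustrative sketch (not the authors' code; field names are our own), a line in NCSA common log format (CLF) can be parsed into the fields that usage mining needs:

```python
import re

# Regex for one NCSA common log format (CLF) line; the named groups
# (ip, time, method, url, status, size) are our own labels.
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Return a dict of the fields needed for usage mining, or None."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["status"] = int(d["status"])
    return d

sample = '127.0.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_clf_line(sample)
```

The same pattern extends to the combined log format by appending referrer and user-agent groups.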

The web itself and the search engine indices contain information about the documents. Documents have different types of relationships among themselves. Hyperlinks add depth to documents, providing the multi-dimensionality that characterizes the web. Documents have an address, a URL, which represents a logical location on a server and may provide information about the relationship of this document to others on the server. There are also relationships to other documents on the "web unknown" or "web unidentified"; the search engine index may discover such relationships. Web mining is interdisciplinary in nature, spanning fields such as information retrieval, natural language processing, information extraction, machine learning, databases, data mining, data warehousing, knowledge management, user interface design, and visualization.

3. Web log analysis

The web log analysers are tools that graphically generate advanced web, FTP or mail server statistics. The log analysers work as a CGI or from the command line and show all possible information the log contains in a few graphical web pages. Most of them use a partial information file to be able to process large log files often and quickly. They can analyse log files from IIS (W3C log format), Apache (NCSA combined/XLF/ELF or common/CLF log format), WebStar, and most other web, proxy, WAP and streaming servers, as well as mail servers and some FTP servers.

4. Background

The process comprises the data preparation tasks, resulting in a user transaction file, and the specific usage mining tasks, which involve the discovery of clusters from user transactions and the derivation of URL clusters from the transaction clusters. Transactions arriving through search engines or by any other mechanism as a result of current navigational activity lead to cluster formation, and identifying such clusters exposes their linkages. Clustering of user transactions allows for the discovery of effective aggregate usage profiles. Web log mining has been extensively applied in recent years to applications such as topic identification [4], Web warehousing [5], and target group identification [6].

5. Genetic Algorithm (GA)

Genetic algorithms are probabilistic search methods based on the mechanisms of natural selection and genetics. They are generally quite effective for rapid global search when finding solutions to non-deterministic problems [3]. Genetic algorithms operate on a population, a large pool of individuals, usually in an iterative process. In web log processing, the individuals are the usages that characterize a search or access pattern. On each iteration, all members of the population are evaluated according to a fitness function. A new population is then generated by probabilistically selecting the fit individuals from the current population.
Some of these selected individuals are carried forward intact into the next-generation population [6]. Genetic operators such as selection, crossover and mutation generate new offspring from the fittest individuals. These operators, which recombine and mutate selected members of the current population, determine the generation of successors in a GA. They correspond to idealized versions of the genetic operations found in biological evolution. The most common operators are crossover and mutation. The

crossover operator produces two new offspring from two parents by copying selected genes from each parent. The mutation operator makes small random changes to the chromosome by choosing a single bit at random and changing its value. Mutation is often performed after crossover has been applied [6]. The genetic algorithm used in this work is shown in Figure 1. The following sections describe several aspects of the proposed algorithm, the structure of web pages, and the fitness function.

6. Structure of web page usage pattern

The structure of a web page is presented before describing the GA used in this work. Every page is given a page_id value and recorded to the database for the processing task. The page ids are extracted from the log files. These page_id values are used for creating the data set of usage patterns. The length of a usage pattern is set as the number of pages that are accessed sequentially. If groups of three pages accessed in order are searched for, the usage pattern takes the form seen in Figure 1.
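As a hedged sketch of these operators (the tuple-of-page_id encoding, function names and parameters are our own illustrative assumptions, not the paper's implementation), selection, one-point crossover and mutation on usage patterns can be written as:

```python
import random

def roulette_select(population, fitness, k):
    """Fitness-proportionate (roulette wheel) selection of k individuals."""
    weights = [fitness(ind) for ind in population]
    return random.choices(population, weights=weights, k=k)

def crossover(parent1, parent2):
    """One-point crossover: each child copies a segment from each parent."""
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chromosome, page_id_range, rate=0.05):
    """With probability `rate`, replace a gene with a random page_id
    between the first and last page number."""
    lo, hi = page_id_range
    return tuple(random.randint(lo, hi) if random.random() < rate else g
                 for g in chromosome)

random.seed(0)
# Toy population of trio usage patterns (tuples of page_id values).
pop = [(1, 4, 7), (2, 5, 8), (3, 6, 9)]
parents = roulette_select(pop, fitness=lambda ind: sum(ind), k=2)
child1, child2 = crossover(parents[0], parents[1])
mutated = mutate(child1, page_id_range=(1, 9))
```

In a real run the fitness function would be the support-and-similarity measure defined later in the paper, not the toy `sum` used here.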

Figure 1. Flow chart of the genetic algorithm

7. Genetic operators

Genetic algorithms introduce many operators to express gene-species relationships, and these operators have recently found value in web page processing. They express the relationships between the parents and offspring of a given population. For the current study, we have employed three genetic operators: selection, crossover and mutation. We explain their potential applications below.

7.1 Selection

The selection operator chooses the target web pages in the current population according to the fitness function and transfers them without changes into the new population. We implement roulette wheel selection (as described in Emine Tug et al. [7]) for the selection phase, where individuals are selected randomly from the current population. The appropriateness of a web page is determined by the fitness function.

7.2 Crossover

The crossover operator produces two web pages from two selected pages by swapping segments of the pages and inserting them into the population.

7.3 Mutation

The mutation operator is employed for identifying content relationships. During the mutation phase, according to the mutation probability, the value of each link in each selected page is changed by generating a random number between the first page number and the last page number.

7.4 Fitness function

In this section, our fitness function is described. To formalize the function, three new values are introduced: session, duration and similarity rate.

7.5 Session

In this experiment, a session is defined as all requests that each visitor made to the server in a

The steps of the algorithm (Figure 1) are:

1. The target population is identified from the log files.
2. A usage threshold is fixed for each usage in the population.
3. Usages are placed on the roulette wheel according to fitness value and N usages are selected; the mating pool consists of these usages.
4. Specific usages are selected from the mating pool based on the threshold and crossover is performed, yielding new transactions.
5. Mutation is applied to the new transactions and candidate files are created.
6. The fitness value of each transaction log in these new candidate files is evaluated and compared to the threshold value.

specific time frame. We have used the original log data, which show sequences of requests by IP address and request time. Basically, the boundary of a session needs to be marked so as to mine the usage. In most mining sessions, a session is characterised by a unique IP address and a request time. When an IP address is marked, the entry time is noted; the entry time is normally the beginning of a session. A time span is then defined as the session period; in the current work, this time is fixed as the mean value of the duration. A session is considered terminated when the IP number changes or the visitor logs out, and a new session is generated. When the sessions are identified, they are separated into subsets.

7.6 Support

We define the support as the probability of the specific visitor being present in all sessions: the number of genes compatible with the session subsets of an individual visitor, divided by the total number of session subsets.

7.7 Similarity rate

The similarity rate measures how compatible the gene rate is with the session subsets of an individual. We define the fitness function as follows. The support is

    Support = p(S_r),

where p is the probability of the records defined over the subsets and S_r denotes the session subsets. The similarity rate is

    Similarity Rate: P = (1/m) * SUM(s_i, i = 1..m),

where s_i is the gene rate compatible with the i-th session subset and m denotes the total number of session subsets. The fitness is then defined as

    Fitness (specific log) = Support x Similarity Rate.

8. The experimental results

For the current work, web access log data from the designated URLs were used for the experiments. The file size used for transaction clustering is 694 MB. This server data was analysed using the genetic algorithm. We have used two distinct fitness functions for the log studies: the first uses the fittest individual of each generation, and the second uses the support value. The experimental results from these two functions are compared with each other. The performance comparison of these

functions for finding transaction pages is performed using both the specific log numbers in each generation and the fittest individual of each generation, after a certain number of generations has been processed. The performance comparison of the datasets is effected by examining the data presented in Table 1, which shows the web page groups found from 100 runs of the datasets according to a 50% threshold value. Each run represents a generation, and each web page group represents an individual of the genetic algorithm. As seen from the table, the datasets were found to have varying numbers of pages and page groups.
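The session boundary rule of Section 7.5 (a unique IP address plus a time window) underlies the transactions clustered here. It can be sketched as follows; the 30-minute window is our own illustrative choice, whereas the paper fixes the window from the mean visit duration:

```python
from datetime import datetime, timedelta

# Illustrative time window; the paper derives it from the mean duration.
SESSION_GAP = timedelta(minutes=30)

def sessionize(requests):
    """requests: list of (ip, datetime, url) tuples sorted by time.
    A new session starts when the gap since the previous request
    from that IP exceeds SESSION_GAP. Returns a list of url lists."""
    sessions = []
    last_seen = {}  # ip -> (session index, time of last request)
    for ip, t, url in requests:
        if ip in last_seen and t - last_seen[ip][1] <= SESSION_GAP:
            idx = last_seen[ip][0]
            sessions[idx].append(url)
        else:
            sessions.append([url])
            idx = len(sessions) - 1
        last_seen[ip] = (idx, t)
    return sessions

reqs = [
    ("10.0.0.1", datetime(2007, 5, 1, 9, 0), "/a"),
    ("10.0.0.1", datetime(2007, 5, 1, 9, 10), "/b"),
    ("10.0.0.2", datetime(2007, 5, 1, 9, 15), "/a"),
    ("10.0.0.1", datetime(2007, 5, 1, 10, 30), "/c"),  # gap > 30 min: new session
]
sessions = sessionize(reqs)
```

The resulting sessions are the units that are split into subsets when computing the support and similarity rate.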

Table 1 – cluster results of web pages
---------------------------------------------------------------
Generation   Number of generated clusters   Mean distance of
             (pages in clusters)            web pages
                  1          2
---------------------------------------------------------------
25               332        301                  8.81
55               237        198                  6.57
85                75        137                  9.33
95                41         40                  6.67
---------------------------------------------------------------
Average pages in clusters                       11.97
---------------------------------------------------------------

Table 1 shows the fittest individual of each generation. As can be seen from the table, data fitness varies rapidly over the two generations; furthermore, for the web page groups, the average value varies considerably across the population. In the next stage, we resorted to domain partitioning using web page classification.

8.1 Domain Partitions

In the following part of the experiment, we show that genetic algorithms are useful for partitioning web pages using web logs. The web access logs are mined to classify the web pages. For disciplinary partitioning we use the ACM subject descriptors. The current partitioning of web pages is based on ACM headings, using the mutation operator to specify the collections for the sub-disciplinary repositories. Each web page has an average of 12.36 key data fields identified by the selection operator; of these, an average of 4.6 key data fields is indicated by the mutation operator. We selected a sample of 4,65,7900 web pages with those concepts (ACM) which best describe the domain. These web pages were used to determine into which domain to place the particular set of concepts. Since some web pages do not have main

concepts assigned to them, a small fraction of web pages is dropped from the final partitions. We have employed both an inclusive and an exclusive partition; they denote the degree to which the pages reflect the concepts in the main ACM descriptors. Table 2 shows the top 10 largest domains in the inclusive and exclusive partition sets. In the exclusive set, the largest domain is Artificial Intelligence with 856375 web pages, while the largest domain in the inclusive set is Operating Systems with 634544 web pages. A similar exercise was performed in the biological sciences in [8].
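The inclusive/exclusive distinction can be illustrated with a small sketch (the descriptor names, weights and page data below are invented for illustration; the paper assigns pages via ACM subject descriptors mined from access logs):

```python
def partition(pages):
    """pages: dict page_id -> list of (descriptor, weight) pairs.
    Inclusive: a page joins every domain whose descriptor it carries.
    Exclusive: a page joins only its highest-weighted domain.
    Pages with no main concept are dropped from both partitions."""
    inclusive, exclusive = {}, {}
    for page, descs in pages.items():
        if not descs:
            continue
        for d, _ in descs:
            inclusive.setdefault(d, set()).add(page)
        best = max(descs, key=lambda dw: dw[1])[0]
        exclusive.setdefault(best, set()).add(page)
    return inclusive, exclusive

pages = {
    "p1": [("Artificial Intelligence", 0.9), ("Pattern recognition", 0.4)],
    "p2": [("Operating Systems", 0.8)],
    "p3": [],  # no main concept: dropped from both partitions
}
inc, exc = partition(pages)
```

Under this rule the inclusive partition counts a page once per matching domain, which is why the two partitions in Table 2 rank the domains differently.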

Table 2. Top 10 inclusive and exclusive domains
----------------------------------------------------------------------
    Domain                              Exclusive partition      Inclusive partition
                                        # of web pages   Rank    # of web pages
----------------------------------------------------------------------
 1. Operating Systems                        54435         7         634544
 2. Network architecture and design         533536         6         163444
 3. Input and output and data
    communication                           633417         8          32423
 4. Programming Languages                    96574         4          66456
 5. Artificial Intelligence                 856375         1          60978
 6. Software Engineering                     23345         9           8654
 7. Natural language processing              74334         3          88789
 8. Pattern recognition                      63543        10          64563
 9. Memory structures                        26446         2          45345
10. Fuzzy theory                             46744         5          32440
----------------------------------------------------------------------

In our experiment, the exclusive partition set was used to build the subject repositories. The selection and mutation operators have potential in web page partitioning, particularly when subject descriptors are employed.

9. Conclusion

The simple genetic algorithm used here shows considerable promise. Transaction logs enable web usage pattern detection and ensure future improvement in web log studies. The genetic algorithm, used as a comparison algorithm, uses the usual support value known from data mining. The application implemented in this work can be significant especially for pages used in information transaction measurement. Semantic analyses based on web logs offer further promise, and we believe that they are scalable.

10. References

[1] Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", http://www-users.cs.umn.edu/~mobasher/webminer
[2] S. Kim and B. Zhang, "Genetic mining of HTML structures for effective web-document retrieval", Applied Intelligence 18, 2003, pp. 243–256.
[3] T.M. Mitchell, "Machine Learning", McGraw-Hill, Tokyo, 1997.
[4] H. Cenk Ozmutlu and Fatih Cavdur, "Application of automatic topic identification on Excite Web search engine data logs", Information Processing and Management 41, 2005, pp. 1243–1262.
[5] Xin Tan, David C. Yen and Xiang Fang, "Web warehousing: Web technology meets data warehousing", Technology in Society 25, 2003, pp. 131–148.
[6] Sandro Araya, Mariano Silva and Richard Weber, "A methodology for web usage mining and its application to target group identification", Fuzzy Sets and Systems 148, 2004, pp. 139–152.
[7] Emine Tug, Merve Sakiroglu and Ahmet Arslan, "Automatic discovery of the sequential accesses from web log data files via a genetic algorithm", Knowledge-Based Systems 19, 2006, pp. 180–186.
[8] Yi-Ming Chung, Qin He, Kevin Powell and Bruce Schatz, "Semantic Indexing for a Complete Subject Discipline", http://www.canis.uiuc.edu