Prioritizing Web Links Based on Web Usage and Content Data


Kamika Chaudhary
Department of Computer Science & Engineering
Krishna Institute of Engineering & Technology
Ghaziabad-201206, India
[email protected]

Santosh Kumar Gupta
Department of Computer Science & Engineering
Krishna Institute of Engineering & Technology
Ghaziabad-201206, India
[email protected]

978-1-4799-2900-9/14/$31.00 ©2014 IEEE

Abstract- The Web has grown enormously and is still growing rapidly day by day. With this huge amount of information on the web, it has become difficult for search engines to retrieve the required and relevant information efficiently. Web mining techniques, using different approaches, have contributed a lot to providing relevant information in response to user queries. This paper introduces a new method for prioritizing web pages based on web usage and web content data. The proposed method uses a Genetic Algorithm to provide good-quality web pages as the result of a user query. Prioritization of web pages falls in the category of NP-complete problems, and a genetic algorithm is used to deal with this. The method includes parameters from both web usage mining and web content mining. Experimental results show that the proposed approach performs better than the existing approach.

Keywords--Genetic algorithm; web usage mining; web content mining; common entry and exit points

I. INTRODUCTION

The World Wide Web has brought revolutionary changes in the popularity of the internet. It has grown into a huge, global information space. The information present on the web is distributed in nature and its volume is growing at an exponential rate. Getting the desired information without wandering through the pages of a website has become an irksome job. Different types of methods are required to organize and manage the information so that it can be used efficiently for business purposes. There is therefore a need for web mining techniques to explore such a gigantic information base. Web mining is the process of uncovering user-desired information from web documents by applying data mining techniques. It aims to develop new methods for effective retrieval of potentially useful information. A large amount of information on the web is redundant, resulting in multiple pages carrying similar content. The data present on websites is also heterogeneous.

Based on the type of data present in web documents, web mining is divided into three classes: web content mining, web structure mining and web usage mining. Web content mining searches for information in the structured, semi-structured or unstructured content of the web. Web pages contain a number of links which connect and organize the information; these hyperlink structures are utilized by web structure mining for retrieval of information. Web usage mining discovers the usage patterns of visitors by mining the log files. It works by preprocessing the initial log data, which removes redundancy, then detecting patterns, and finally analyzing these patterns in order to find out user behavior.

Several optimization techniques have been used to find the most useful pages of a web site by using web usage and web content mining. The proposed approach uses a natural optimization technique called genetic algorithm to explore the search space by using both content and usage mining. The inspiration behind the genetic algorithm is the process of natural selection and genetic dynamics [5]. The genetic algorithm has its roots in Darwin's theory of survival of the fittest; it is a search algorithm based upon the process of natural selection and population genetics [19]. The proposed approach applies a genetic algorithm to the data collected by integrating web usage mining and web content mining in order to find the pages of a web site which are of utmost importance to the user. Our approach is compared with the approach in [20], hereafter named EA, and the results are found to be better.

Section II presents the literature review. Section III introduces the concept of the Genetic Algorithm and the proposed approach. In Section IV the proposed algorithm is presented. Implementation details of the proposed approach are given through an example in Section V. Sections VI and VII describe the experimentation and the conclusion respectively.

II. RELATED WORK

Web usage mining is among the most crucial fields of web mining. A lot of research has been done in this area, which shows the importance of web usage mining to search engines. Speed and precision are the most desirable characteristics of search engines. Evolutionary algorithms, and more specifically genetic algorithms, play a vital role in achieving these characteristics. These algorithms also play an important role in the mining of web usage data. In [1] the authors discuss the use of a genetic algorithm for mining information from the web. They found that the results of queries provided by search engines suffered from the problem of poor information and irrelevant pages. They provide a genetic strategy for search engines and consider web search as a standard optimization problem. The efficiency of a search engine can be improved through web usage mining by using the MASEL (matrix analysis on search engine log) algorithm proposed in [2]. The relationship among user, query and resource is the central idea of this algorithm. MASEL considers a resource to be good if it is accessed by many good users. Improving search engine retrieval performance is dealt with in [3]. The authors propose a genetic programming based framework for discovering ranking functions, which improves retrieval performance by prioritizing web pages in decreasing order of relevance. The results are compared with, and found to be better than, other existing ranking functions for information retrieval. In [4] grammar based genetic programming is used as a data mining optimization technique in an e-learning system. A group of useful education prediction (EP) rules is developed and provided to courseware authors to improve adaptive systems for web based education (ASWE).

The Genetic Algorithm is an algorithm based on the theory of natural selection and is used for solving optimization problems. It is an adaptive heuristic search algorithm based on the concept of survival of the fittest. Selection, crossover, mutation and acceptance are the main steps used for finding the solution to a problem. A fitness function is used for measuring the goodness of any solution, and mutation helps the population escape from local optima [5]. A probabilistic web user model based on a genetic algorithm for improving web site structure is proposed in [6]. Adjacency matrices are used for representing the genetic population, and ranking acts as a parameter for fitness scaling. A random binary vector is created by using scattered crossover. The results show an improvement over another method.

Web usage mining works on the data collected from client-server interaction. It utilizes secondary data present in web server logs, browser logs, proxy server logs, registration data, user profiles, cookies or any other source for mining interesting patterns. It mainly consists of three phases: data preprocessing, pattern discovery and pattern analysis [7]. Pattern discovery is performed in order to draw useful patterns from preprocessed data [8]. A system called WebSIFT is designed to perform usage mining. It utilizes data from the web server log in order to perform the mining task. This data suffers from real world challenges; a framework dealing with all these challenges is discussed in [9]. A number of soft computing techniques have been used for retrieving information in the field of web mining [10]. A soft computing technique called self organizing map (SOM) is applied to preprocessed data in web usage mining in order to find visitors' navigation behavior [11]. This behavior is used for discovering useful knowledge from secondary data [12, 21]. An optimization technique called ant colony clustering algorithm (ACLUSTER) has been proposed for detecting useful trends, with linear genetic programming used for the analysis of user trends. ACLUSTER is applied to the preprocessed and cleaned data; the number of objects in an area and their similarity are used as independent thresholds to form clusters of web usage patterns. It is important to improve the structure of a web site from time to time, as sites outgrow their design when links and pages accumulate. In [13] websites are reorganized by using a 0-1 programming approach. This method is based on the co-occurrence frequencies between web pages, which are obtained from user access patterns. In order to reduce the search depth and information overload for users, two constraints are used: the number of outward links from each page and the length of the shortest path from the home page to each page. Web personalization is the way of providing a service to web visitors for retrieving the information of their interest. This is achieved by predicting the next page accessed by the user. An accurate recommendation system for predicting the next page access by using web usage mining is discussed in [14]. Pairwise nearest neighbor clustering is used for identifying similar access patterns. The method provides good prediction accuracy and minimizes state space complexity. A two step strategy to improve retrieval effectiveness for personalizing the web is presented in [15]. In the first step the user's queries are categorized automatically by the system based on his or her search history, and in the second step these categories are used for performing the web search.

An intelligent miner (i-miner) framework has been used for analyzing user trends [16]. A hybrid evolutionary approach called FCM is used for forming clusters that separate users with similar interests, and the Takagi-Sugeno fuzzy inference system is used for analyzing the trends. Another approach for exploring navigational patterns is to discover the relationships existing among users and web objects. A system based on probabilistic latent semantic analysis (PLSA) has been developed for automatically characterizing user preferences and interests [17]. Probabilistic inference is used for performing the analysis tasks. The authors in [18] have proposed a workflow that shows how usage data can be extracted and processed for a real world tourism web site.

III. THE PROPOSED APPROACH

The Genetic Algorithm (GA) is a natural optimization and adaptive heuristic search technique whose basic idea is derived from the process of natural evolution. The mechanism of evolution is parallel in nature and has been used for solving several computational problems [19]. GA is used for solving general purpose optimization problems [5].

In a computational problem, the genetic algorithm begins by selecting an initial population in the form of chromosomes and then applying a fitness function, which evaluates the cost, to the selected chromosomes. Two parent chromosomes having greater fitness are then selected. Crossover is performed between the two parent strings and results in offspring strings. Mutation is another operator, applied after crossover, which alters the genetic material of the offspring. The process is repeated until the best solution among the current population is retrieved. On the basis of Darwinism, the offspring which survives the longest is chosen as the fittest [20].

In the web usage mining problem, a chromosome is represented by a collection of web pages. GA is used in this approach to find the web pages that are of utmost importance to the user. A unique number is assigned to each web page.

2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT) 547


These pages are indexed by assigning an ID to each of them, and thus the chromosomal representation looks as shown in Fig. 1.

Chromosome 1 = {set of web links} = {P1, P2, P3, P4, P5, P6, P7, P8, P9, P10}, where P1, P2, ..., P10 represent web links.

| 43 | 48 | 10 | 13 | 37 | 38 | 44 | 14 | 36 | 49 |

Fig.1. Representation of web links in chromosomal form

A. Chromosome Representation

Chromosomes are used for representing the initial population. Each chromosome is a candidate solution. To represent a web page, we assign a unique numeric ID to each unique URL taken from the web server log. In further processing this unique ID is used instead of the URL of the page visited by the user.

TABLE I. UNIQUE ID ASSIGNED TO WEB PAGE URLS

Unique Id    URL
1            192.168.30.95:51854
2            192.168.30.15:45682
3            192.168.1.5:32773
4            192.168.1.5:32773
5            192.168.30.128:60339
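The ID assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_page_ids` and the sample URLs are hypothetical, and IDs are assumed to be handed out in order of first appearance in the log.

```python
def assign_page_ids(urls):
    """Map each distinct URL to a 1-based integer ID, in order of first appearance."""
    ids = {}
    for url in urls:
        if url not in ids:
            ids[url] = len(ids) + 1
    return ids


# Hypothetical log entries; a real log would supply the URLs.
log_urls = ["/home", "/products", "/home", "/contact"]
page_ids = assign_page_ids(log_urls)

# From here on, the log can be processed using IDs instead of URLs.
encoded_log = [page_ids[url] for url in log_urls]
```

A dictionary keeps the lookup constant-time, which matters when the same URL appears many times in a large log.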

B. Fitness Function

The fitness function is an objective function used for selecting the best individual among all individuals. It quantifies the optimality of a solution and measures its goodness by ranking the solutions [21]. The parameters required for calculating the fitness of a solution are presented below.

1) Access frequency: Access frequency measures the number of times a particular page is visited, irrespective of the user ID. In web usage mining, the usefulness of any particular page can be measured by calculating its access frequency: the higher the access frequency, the more useful the page is likely to be. Table II shows the URLs and their related access frequency.

TABLE II. URLS AND THEIR RELATED ACCESS COUNT

URL               Access Count
192.168.30.95     33
192.168.30.15     12
192.168.1.5       1
192.168.1.5       24
192.168.30.127    19
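Counting access frequency from log entries can be sketched as below. The `(user, url)` tuple layout and the sample entries are assumptions made for illustration; the paper does not specify its log format.

```python
from collections import Counter


def access_frequency(entries):
    """Count how often each URL appears, regardless of which user requested it."""
    return Counter(url for _user, url in entries)


# Hypothetical (user, url) log entries.
entries = [("u1", "/home"), ("u2", "/home"), ("u1", "/products"), ("u1", "/home")]
freq = access_frequency(entries)
```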

2) Number of unique visitors: This factor shows the importance of a web page on the basis of the unique visitors who visited it. A URL is considered more popular among users if it is visited by a larger number of distinct visitors. Table III shows the unique visitors.

TABLE III. UNIQUE VISITORS AND THEIR CORRESPONDING USER ID

Unique Id    URL               Number of Unique Users
1            192.168.30.95     23
2            192.168.30.15     15
3            192.168.1.5       35
4            192.168.1.5       24
5            192.168.30.127    19
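Unlike access frequency, this parameter counts each visitor at most once per URL. A minimal sketch, again assuming hypothetical `(user, url)` log entries:

```python
from collections import defaultdict


def unique_visitors(entries):
    """Return, for each URL, the number of distinct users who requested it."""
    visitors = defaultdict(set)
    for user, url in entries:
        visitors[url].add(user)  # a set ignores repeat visits by the same user
    return {url: len(users) for url, users in visitors.items()}


# Hypothetical (user, url) log entries.
visits = [("u1", "/home"), ("u2", "/home"), ("u1", "/products"), ("u1", "/home")]
unique_counts = unique_visitors(visits)
```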

3) Time duration: The amount of time spent on a page shows the relevance of the page for the user. If a user spends a larger amount of time on a particular page, then that page is considered to be useful for the user. Table IV shows the duration for each URL.

TABLE IV. AMOUNT OF TIME USER STAYED ON THE PAGE WITH RESPECT TO URL

Unique Id    URL               Duration (seconds)
1            192.168.30.95     45
2            192.168.30.15     217
3            192.168.1.5       84
4            192.168.1.5       24
5            192.168.30.127    0
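The paper does not say how stay duration is derived from the log. One common approach, shown here purely as an assumption, takes the gap between consecutive requests in a visitor's session and records 0 for the last page, since no later request is available to bound it. The function name and the sample session are hypothetical.

```python
from collections import defaultdict


def page_durations(session):
    """Estimate seconds spent on each page of one visitor's session.

    `session` is a list of (timestamp_seconds, url) pairs sorted by time.
    """
    totals = defaultdict(int)
    for (t, url), (t_next, _next_url) in zip(session, session[1:]):
        totals[url] += t_next - t
    if session:
        # Last page of the session: no following request, so no measurable stay.
        totals[session[-1][1]] += 0
    return dict(totals)


# Hypothetical session of a single visitor.
session = [(0, "/home"), (30, "/products"), (100, "/contact")]
durations = page_durations(session)
```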

4) Number of bytes received: The quantity of data downloaded by the user from a web page indicates that the page has content which is relevant for the user. The number of bytes received by the user is recorded in each web server log entry. From this entry we can deduce whether a page is important or not.

TABLE V. NUMBER OF BYTES RECEIVED BY USER

Unique Id    Amount of bytes received
1            270
2            2254
3            1059
4            124
5            1609

5) Common entry and exit points: A visitor begins his or her search by clicking on a link which leads to a page of the website. This page is considered the entry point of the user. The exit point signifies the destination of the visitor. It tells what visitors are looking for on the website.

6) Number of advertisements: The importance of a web page can also be recognized by analyzing the number of advertisements present on it. If a page carries a larger number of advertisements, then that page is thought to be visited by a larger number of visitors. Advertisements are placed on pages which have a higher frequency of visits by users, so they signify the importance of the page.
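Entry and exit points (parameter 5 above) can be read off per-visitor sessions: the first page of a session is its entry point and the last page is its exit point. The sketch below assumes the log has already been split into sessions; the session data and function name are hypothetical.

```python
from collections import Counter


def entry_exit_counts(sessions):
    """Count how often each page starts and ends a session."""
    entry_counts = Counter(s[0] for s in sessions if s)
    exit_counts = Counter(s[-1] for s in sessions if s)
    return entry_counts, exit_counts


# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["/home", "/products", "/checkout"],
    ["/home", "/contact"],
    ["/products", "/checkout"],
]
entry_counts, exit_counts = entry_exit_counts(sessions)

# The most frequent entry and exit pages are the "common" points.
top_entry = entry_counts.most_common(1)[0][0]
top_exit = exit_counts.most_common(1)[0][0]
```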

1. Access frequency of each page
2. Number of unique users
3. The amount of time the user stayed on the page
4. Number of bytes received
5. Common entry and exit points
6. Number of advertisements

$$\mathrm{Cost}_{\text{Access frequency}}(AF) = \sum_{i=1}^{n} AF_i$$

where n = number of entries in the web log and AF is the number of times a page is accessed by visitors.

$$\mathrm{Cost}_{\text{Unique user}}(UNQ) = \sum_{i=1}^{n} UNQ_i$$

where n = number of entries in the web log and UNQ is the number of different users who visited a URL.

$$\mathrm{Cost}_{\text{Duration}}(DUR) = \sum_{i=1}^{n} DUR_i$$

where n = number of entries in the web log and DUR is the amount of time the user stayed on a web page.

$$\mathrm{Cost}_{\text{Bytes received}}(BR) = \sum_{i=1}^{n} BR_i$$

where n = number of entries in the web log and BR is the amount of data the user fetched from a web page.

$$\mathrm{Cost}_{\text{Common entry exit point}}(EP) = \sum_{i=1}^{n} EP_i$$

where n = number of entries and EP shows the pages at the beginning and finish of a user access session.

$$\mathrm{Cost}_{\text{Number of advertisement}}(AD) = \sum_{i=1}^{n} AD_i$$

where n = number of entries and AD signifies the number of advertisements present on a web page.

Cost function:

$$C(x) = C_1 \sum_{i=1}^{n} AF_i + C_2 \sum_{i=1}^{n} UNQ_i + C_3 \sum_{i=1}^{n} DUR_i + C_4 \sum_{i=1}^{n} BR_i + C_5 \sum_{i=1}^{n} EP_i + C_6 \sum_{i=1}^{n} AD_i$$

where C1, C2, C3, C4, C5 and C6 are constants used for adjusting the values of the different parameters.

Fig.2. Cost function and its parameters
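The weighted sum in Fig. 2 can be sketched in code as follows. The per-entry dictionaries, the weight values and the function names are hypothetical and chosen only to make the arithmetic easy to follow; they are not the paper's data or constants.

```python
PARAMETERS = ("AF", "UNQ", "DUR", "BR", "EP", "AD")


def parameter_totals(entries):
    """Sum each parameter over all n entries (the inner sums of Fig. 2)."""
    return {name: sum(entry[name] for entry in entries) for name in PARAMETERS}


def cost(entries, weights):
    """C(x): weighted sum of the six parameter totals."""
    totals = parameter_totals(entries)
    return sum(weights[name] * totals[name] for name in PARAMETERS)


# Hypothetical entries and weights, for illustration only.
entries = [
    {"AF": 3, "UNQ": 2, "DUR": 10, "BR": 100, "EP": 1, "AD": 0},
    {"AF": 1, "UNQ": 1, "DUR": 5, "BR": 50, "EP": 0, "AD": 2},
]
weights = {"AF": 2.0, "UNQ": 1.0, "DUR": 0.5, "BR": 0.01, "EP": 1.0, "AD": 1.0}
total_cost = cost(entries, weights)  # 8 + 3 + 7.5 + 1.5 + 1 + 2
```

Keeping the weights in a dictionary keyed by parameter name makes it explicit which constant scales which parameter.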

An example of calculating the cost from the various parameters is shown below:

F(x) = C1.CostAccessFrequency + C2.CostDuration + C3.CostUniqueUser + C4.CostBytesReceived + C5.CostCommonPoints + C6.CostAdvertisements

C1, C2, C3, C4, C5 and C6 are constants whose function is to normalize the values of the parameters:

C1 = 2.4, C2 = 0.05, C3 = 0.6, C4 = 0.003, C5 = 0.6, C6 = 1.5

Note that in this example C2 weights the stay duration and C3 the number of unique users, whereas Fig. 2 lists these two parameters in the opposite order.

In this example the values for calculating the cost are taken from the tables above:

CostAccessFrequency = 33 (Table II)
CostStayDuration = 217 (Table IV)
CostUniqueUser = 35 (Table III)
CostBytesReceived = 2254 (Table V)
CostCommonPoints = 20
CostAdvertisements = 30

C(x) = 2.4*33 + 0.05*217 + 0.6*35 + 0.003*2254 + 0.6*20 + 1.5*30 = 174.812
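For readers who want to check the arithmetic, the same calculation can be written out term by term, using only the constants and values given in the example:

```latex
\begin{aligned}
C(x) &= 2.4(33) + 0.05(217) + 0.6(35) + 0.003(2254) + 0.6(20) + 1.5(30) \\
     &= 79.2 + 10.85 + 21 + 6.762 + 12 + 45 \\
     &= 174.812
\end{aligned}
```

The access frequency term (79.2) and the advertisement term (45) dominate the total, which shows how strongly the choice of constants shapes the ranking.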

C. Selection

Selection is the process of choosing the fitter chromosomes from the population. The main objective of selection is to favor good solutions and discard bad ones. In our approach we use binary tournament selection, which picks two individuals at random from the population and keeps the fitter of the two.
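Binary tournament selection can be sketched as below. Whether a higher or a lower cost counts as fitter is exposed as the `prefer_higher` flag, which is an assumption of this sketch, since the paper speaks both of minimizing cost and of converging to the highest observed cost. The two chromosomes and their costs are taken from the example in Section V.

```python
import random


def binary_tournament(population, cost, prefer_higher=True, rng=random):
    """Pick two distinct individuals at random and return the fitter one."""
    first, second = rng.sample(population, 2)
    if prefer_higher:
        return first if cost(first) >= cost(second) else second
    return first if cost(first) <= cost(second) else second


population = [[16, 7, 15, 17, 6], [10, 7, 4, 14, 12]]
costs = {(16, 7, 15, 17, 6): 509, (10, 7, 4, 14, 12): 310}


def lookup_cost(chromosome):
    return costs[tuple(chromosome)]


# With only two individuals, both are always drawn, so the result is deterministic.
winner = binary_tournament(population, lookup_cost)
loser = binary_tournament(population, lookup_cost, prefer_higher=False)
```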

D. Crossover

Crossover is the method which exchanges the genetic material of both parents to produce new offspring. The main function of crossover is to recombine two strings to get a new, better string. Various types of crossover exist; among them, cyclic crossover is used in the proposed work.

Parent 1:    | 43 | 48 | 10 | 13 | 37 | 38 | 44 | 14 | 36 | 49 |
Parent 2:    | 49 | 14 | 38 | 13 | 48 | 44 | 37 | 10 | 43 | 36 |

After Cyclic Crossover

Offspring 1: | 43 | 48 | 10 | 13 | 37 | 49 | 14 | 38 | 13 | 48 |
Offspring 2: | 49 | 14 | 38 | 13 | 48 | 43 | 48 | 10 | 13 | 37 |

Fig.3. Process of crossover
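For reference, the textbook cycle crossover (CX) operator can be sketched as follows. It requires both parents to be permutations of the same set of links, which holds for the two parents in Fig. 3. The offspring it produces differ from those printed in Fig. 3, which repeat some links, so this sketch is an assumption about the intended operator and not a reproduction of the figure.

```python
def cycle_crossover(p1, p2):
    """Cycle crossover for two permutations of the same set of genes.

    Positions are partitioned into cycles; cycles are copied alternately
    from the first and the second parent, so no gene is duplicated.
    """
    n = len(p1)
    pos_in_p1 = {gene: i for i, gene in enumerate(p1)}
    c1, c2 = [None] * n, [None] * n
    visited = [False] * n
    take_from_first = True
    for start in range(n):
        if visited[start]:
            continue
        i = start
        while not visited[i]:
            visited[i] = True
            if take_from_first:
                c1[i], c2[i] = p1[i], p2[i]
            else:
                c1[i], c2[i] = p2[i], p1[i]
            i = pos_in_p1[p2[i]]  # follow the cycle
        take_from_first = not take_from_first
    return c1, c2


parent1 = [43, 48, 10, 13, 37, 38, 44, 14, 36, 49]
parent2 = [49, 14, 38, 13, 48, 44, 37, 10, 43, 36]
child1, child2 = cycle_crossover(parent1, parent2)
# child1 == [43, 14, 38, 13, 48, 44, 37, 10, 36, 49]
# child2 == [49, 48, 10, 13, 37, 38, 44, 14, 43, 36]
```

Because every position takes its gene from one of the two parents at that same position, each offspring remains a valid permutation of the original links.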

E. Mutation

Mutation is the third operator of GA. It maintains diversity in the population by altering some bits present in the chromosome. It randomly redistributes genetic information and reduces the probability of the algorithm getting stuck in local optima [20]. There are many types of mutation operator: flip bit, boundary, uniform, non-uniform and Gaussian. Mutation lets the algorithm explore the search space more thoroughly and results in a better solution.

Flip Bit Mutation

Before Mutation: | 49 | 48 | 10 | 13 | 37 | 38 | 44 | 14 | 43 | 36 |
After Mutation:  | 49 | 48 | 10 | 13 | 54 | 38 | 44 | 14 | 43 | 36 |

Fig.4. Process of flip bit mutation
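Fig. 4 shows a single link being replaced by another page ID. A minimal sketch of such a mutation on integer-coded chromosomes is given below. Drawing the replacement from pages not already in the chromosome, and the size of the page pool, are assumptions of this sketch.

```python
import random


def mutate(chromosome, page_pool, rate, rng=random):
    """With probability `rate`, replace one randomly chosen link with a new page ID."""
    child = list(chromosome)
    if rng.random() < rate:
        # Assumption: avoid duplicates by choosing only pages not yet present.
        candidates = [p for p in page_pool if p not in child]
        if candidates:
            child[rng.randrange(len(child))] = rng.choice(candidates)
    return child


parent = [49, 48, 10, 13, 37, 38, 44, 14, 43, 36]
pool = list(range(1, 61))  # hypothetical set of 60 page IDs
mutated = mutate(parent, pool, rate=1.0, rng=random.Random(42))
unchanged = mutate(parent, pool, rate=0.0)
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing crossover or mutation rates.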


IV. PROPOSED ALGORITHM

The proposed GA based algorithm (PGA) applies a fitness function to a randomly selected initial population to produce a set of web links which are of higher priority (TopLink-P) than the other existing links. The fitness function includes a number of parameters from both the content and the usage pattern of web links. PGA starts by randomly selecting an initial population and then applies the crossover and mutation operators to the population for several generations, until the population converges and the result is produced. The whole process is represented as a sequence of steps in Fig. 5.

Input:
  • Initial Population Size, PopSize
  • Number of generations, N
  • Crossover Rate, CR
  • Mutation Rate, MR

Output: Set of Top Priority Web links, TopLink-P

Cost Function:
  • Access frequency of each page (AF_i)
  • The amount of time user stayed on the page (DUR_i)
  • Number of unique users (UNQ_i)
  • Number of bytes received (BR_i)
  • Common entry and exit points (EP_i)
  • Number of advertisements (AD_i)

$$\mathrm{Cost}(C) = C_1 \sum_{i=1}^{n} AF_i + C_2 \sum_{i=1}^{n} UNQ_i + C_3 \sum_{i=1}^{n} DUR_i + C_4 \sum_{i=1}^{n} BR_i + C_5 \sum_{i=1}^{n} EP_i + C_6 \sum_{i=1}^{n} AD_i$$

Method:

1. Generate Initial Population: set of randomly selected web links, WebLinks[PopSize]
2. Evaluate each Top-P Web Link in the set of Top-P web links, WebLinks[PopSize], using the cost function
3. While Generation ≤ N Do
   a) Perform Binary Tournament Selection, CrossoverWebLinks[PopSize]
   b) Apply Cyclic Crossover among CrossoverWebLinks[PopSize]
      (WLinkParent1, WLinkParent2) = RandomlyChoose(CrossoverWebLinks[PopSize])
      (WLinkOffspring1, WLinkOffspring2) = CyclicCrossover(WLinkParent1, WLinkParent2)
   c) Copy (WLinkOffspring1, WLinkOffspring2) to NewWebLinks
      NewWebLinks[] = (WLinkOffspring1, WLinkOffspring2)
   d) Perform mutation with mutation rate, MR
   e) Copy NewWebLinks to Initial set of Top-P WebLinks
      WebLinks[] = NewWebLinks[]
   End While
4. TopLink-P WebLinks = LowCostWebLink(WebLinks[])
5. Return TopLink-P

Fig.5. Proposed GA based Algorithm (PGA)
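The loop in Fig. 5 can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the authors' Java implementation: each page is given a single precomputed score (`page_scores`, hypothetical), a chromosome's cost is the sum of its page scores, a higher cost is treated as better by default (`maximize`), and one-point crossover is swapped in for cyclic crossover because chromosomes drawn as random subsets of pages are generally not permutations of the same set.

```python
import random


def run_pga(page_scores, pop_size, generations, crossover_rate, mutation_rate,
            chromosome_len=5, maximize=True, seed=None):
    """Evolve fixed-length lists of page IDs and return (best chromosome, cost)."""
    rng = random.Random(seed)
    pages = list(page_scores)
    sign = 1 if maximize else -1

    def cost(chromosome):
        return sum(page_scores[p] for p in chromosome)

    def tournament(population):
        a, b = rng.sample(population, 2)
        return a if sign * cost(a) >= sign * cost(b) else b

    def one_point(p1, p2):
        # Stand-in for cyclic crossover; may duplicate a page within a child.
        cut = rng.randrange(1, chromosome_len)
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

    def mutate(chromosome):
        child = list(chromosome)
        if rng.random() < mutation_rate:
            child[rng.randrange(chromosome_len)] = rng.choice(pages)
        return child

    # Step 1: random initial population.
    population = [rng.sample(pages, chromosome_len) for _ in range(pop_size)]

    # Step 3: selection, crossover and mutation for N generations.
    for _ in range(generations):
        new_population = []
        while len(new_population) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                c1, c2 = one_point(p1, p2)
            else:
                c1, c2 = list(p1), list(p2)
            new_population += [mutate(c1), mutate(c2)]
        population = new_population[:pop_size]

    # Steps 4 and 5: return the best chromosome of the final population.
    best = max(population, key=lambda c: sign * cost(c))
    return best, cost(best)


# Hypothetical scores for 19 pages; the settings mirror Section V.
page_scores = {pid: float(pid) for pid in range(1, 20)}
best, best_cost = run_pga(page_scores, pop_size=10, generations=50,
                          crossover_rate=0.75, mutation_rate=0.1, seed=1)
```

Fixing `seed` gives repeatable runs, so different crossover rates can be compared on the same initial population.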

V. AN EXAMPLE

An example depicting the procedure of the proposed GA based algorithm (PGA) is shown in Fig. 6. PGA was implemented in the Java programming language, and the program was run for 50 generations with an initial population of 10 chromosomes. Each chromosome consists of a set of 5 pages; the cost of each chromosome is calculated and the genetic operators are applied in every generation. The program runs until the last generation, by which point the cost has converged. In our experiment the population converges at cost 509. Fig. 6 shows the chromosomes with their fitness cost at generations 1, 2, 3 and 50 with a crossover rate of 75%.

Step 1: First Generation

Chromosome    Web links           Cost
CR1           18  15  19  14   1  215
CR2           18   2   9  19  12  310
CR3           16  13   9  11   1  215
CR4           13   9  15  17   6  509
CR5           10  16   9   8  14  335
CR6           10   2  17  12   1  215
CR7            2  15  11   4   1  215
CR8           16   7  15  17   6  509
CR9           10  16  11  15   4  351
CR10          10   7   4  14  12  310

Selection, Crossover, Mutation

Step 2: Second Generation

Chromosome    Web links           Cost
CR1           16   7  15  17   6  509
CR2            2   7  15  17   6  509
CR3           10  16  11  15   4  351
CR4           10  16   9   8  14  335
CR5           16  13   9  11   1  215
CR6           16  13   9  11   1  215
CR7           10  16   9   8  14  335
CR8           10   7   4  14  12  310
CR9           16  13   9  11   1  215
CR10          10   7   4  14  12  310

Step 3: Third Generation

Chromosome    Web links           Cost
CR1           16   7  15  17   6  509
CR2           16   7  15  17   6  509
CR3           10   7   4  14  12  310
CR4            2   7  10  17   6  509
CR5           16   7  15  17   6  509
CR6           16   7  15  17   6  478
CR7           16   7  15  17   6  509
CR8           16   7  15  17   6  509
CR9           16   7  15  17   6  233
CR10          16   7  15  17   6  509

Step n: Fiftieth Generation

Chromosome    Web links           Cost
CR1           16  12  11  17   6  509
CR2           16   7  15  17   6  509
CR3           14   7  15   3   6  509
CR4           16   7  15  17   6  509
CR5           16   7  15  17   6  509
CR6           16   7  15  17   6  509
CR7           16   7  15  17   6  509
CR8           16   7  15  17   6  509
CR9           14   7  15   3   6  509
CR10          16   7  15  17   6  509

VI. EXPERIMENTATION AND RESULTS

The results produced by implementing PGA are presented in graphical form. Our program includes parameters from both the content of the web and the usage pattern of the web pages. The comparison of the results of PGA and the existing approach EA is shown in Fig. 7. We ran the program for 50 generations with crossover rates ranging from 50% to 75%. The cost of the web links varies with the crossover rate. We also studied the quality of the web pages over different numbers of generations: we increased the number of generations up to 400 while keeping the crossover rate constant at 75%, and compared the fitness scores. The findings show that the cost varies on moving from one generation to the next. We also studied the effect of the crossover rate on the cost. The experimental results show that in most cases the cost of the TopLink-P web pages is better for the proposed approach than for the existing approach.


Fig.7. PGA vs EA at crossover rates of 50%, 60%, 70% and 75%

VII. CONCLUSION

As the amount of information present on the internet has become gigantic, it has become necessary to increase the efficiency of search engines. Web mining works in this direction: it helps in mining information on the basis of the content, structure and usage of web pages. The proposed GA based approach combines information from both the content and the usage of a web page in order to provide the required and relevant pages to the user. We calculated the cost of the web pages until the value converged in order to get the most optimized result. This cost is used as the parameter for determining the relevance of the TopLink-P web pages. We have presented the experimental results in graphical form. These results show the superiority of the proposed approach over the existing approach.

REFERENCES

[1] F. Picarougne, N. Monmarche, A. Oliver and G. Venturini, "GeniMiner: Web Mining with a Genetic-Based Algorithm," ICWI, pp. 263-270, 2002.

[2] D. Zhang and Y. Dong, "A novel web usage mining approach for search engines," Computer Networks, vol. 39(3), pp. 303-310, 2002.

[3] W. Fan, M. Gordon and P. Pathak, "Genetic programming-based discovery of ranking functions for effective web search," Journal of Management Information Systems, vol. 21(4), pp. 37-56, 2005.

[4] C. Romero, S. Ventura, C. Hervas and P. Gonzalez, "Rule Discovery in web-based educational systems using Grammar-Based Genetic Programming," Data Mining XI: Data Mining, Text Mining and Their Business Applications, pp. 205-214, 2005.

[5] R. C. Chakraborty, "Fundamentals of Genetic Algorithms," Artificial Intelligence, 2010.

[6] E. Andaur, S. Rios, P. Roman and J. Velasquez, "Best Web Site Structure for Users Based on a Genetic Algorithm Approach," University of Chile, 2010.

[7] J. Srivastava, R. Cooley, M. Deshpande and P. N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," ACM SIGKDD Explorations Newsletter, 1(2), pp. 12-23, 2000.

[8] R. L. Haupt, "Practical Genetic Algorithms," John Wiley & Sons Inc., Chapter 1-7, pp. 1-251, 2004.

[9] J. Srivastava, R. Cooley, M. Deshpande and P. N. Tan, "Web usage mining: Discovery and applications of usage patterns from web data," ACM SIGKDD Explorations Newsletter, 1(2), pp. 12-23, 2000.

[10] S. P. Nina, M. Rahman, K. I. Bhuiyan and K. Ahmed, "Pattern discovery of web usage mining," in Computer Technology and Development, ICCTD 09 International Conference on, vol. 1, pp. 499-503, IEEE, 2009.

[11] O. Nasraoui, M. Soliman, E. Saka, A. Badia and R. Germain, "A web usage mining framework for mining evolving user profiles in dynamic web sites," Knowledge and Data Engineering, IEEE Transactions, vol. 20(2), pp. 202-215, 2008.

[12] S. K. Pal, V. Talwar and P. Mitra, "Web mining in soft computing framework: Relevance, state of the art and future directions," Neural Networks, IEEE Transactions, vol. 13(5), pp. 1163-1177, 2002.

[13] K. Etminani, A. R. Delui, N. R. Yanehsari and M. Rouhani, "Web usage mining: Discovery of the users' navigational patterns using SOM," IEEE First International Conference in Networked Digital Technologies, NDT'09, pp. 224-249, 2009.

[14] A. Abraham and V. Ramos, "Web usage mining using artificial ant colony clustering and linear genetic programming," in IEEE Evolutionary Computation, CEC'03, vol. 2, pp. 1384-1391, 2003.

[15] C. C. Lin, "Optimal Web site reorganization considering information overload and search depth," European Journal of Operational Research, 173(3), pp. 839-848, 2006.

[16] X. Jin, Y. Zhou and B. Mobasher, "Web usage mining based on probabilistic latent semantic analysis," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 197-205, 2004.

[17] A. Pitman, M. Zanker, M. Fuchs and M. Lexhagen, "Web usage mining in tourism - a query term analysis and clustering approach," Information and Communication Technologies in Tourism, pp. 393-403, 2010.

[18] M. Mitchell, "An Introduction to Genetic Algorithms," MIT Press, Chapter 1-6, pp. 1-203, 1998.

[19] T. V. Mathew, "Genetic Algorithm," Indian Institute of Technology Bombay, Mumbai, pp. 1-15, 2012.

[20] A. R. Simpson, G. C. Dandy and L. J. Murphy, "Genetic algorithms compared to other techniques for pipe optimization," Journal of Water Resources Planning and Management, 120(4), pp. 423-443, 1994.

[21] A. K. Mishra, M. K. Mishra, V. Chaturvedi, S. K. Gupta and J. Singh, "Web usage mining using self organized maps," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3(6), pp. 532-539, 2013.
