Finding related pages in the World Wide Web
A review by:
Liang Pan, Nick Hitchcock, Rob Elwell, Kirtan Patel and Lian Michelson
Content
• Introduction
• Algorithms
– Companion
– Co-citation
– Netscape’s
• Evaluations
• Critique
• Conclusion
Introduction
Searching on the World Wide Web
• Common search tools include Google and Yahoo
Traditional Approach
• Keyword-query based
• You must specify your information need by supplying relevant keywords
• Prone to errors!
Question!
What do I do if I don’t know exactly what I am looking for?
Introduction
• Another way…
– Use a URL as the search input instead of a phrase of text,
e.g. www.nytimes.com
• What are the requirements?
– Fast
– High precision
– Little input data
Introduction
How does it work?
• Exploits the Web graph structure
• Two algorithms are proposed:
Companion
• Derived from the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg for ranking search queries.
• Makes use of edge weights and hub and authority scores.
Co-citation
• Finds pages that are frequently co-cited with an input URL u.
[Figure: sites A, B and C link to u and also to sites X, Y and Z, so X, Y and Z are found as related pages.]
Companion Algorithm
• Takes a starting URL u as input, e.g. www.awebsite.com
• Made up of 4 steps:
– Build the vicinity graph of u
– Contract duplicates and near-duplicates in the graph
– Compute edge weights based on host-to-host connections
– Compute a hub score and an authority score for each node in the graph, and return the top-ranked authority nodes
Companion Algorithm
• Uses 5 values* to help determine relevant pages:
• Back (B): how many parent sites of the page to consider, i.e. going from u1 to p1
• Back-Forward (BF): how many child sites of each parent to consider, i.e. going from u1 to p2 and then to u2 (or u1)
• Forward (F): how many children of the site (pages it links to) to consider, i.e. u1 to c1
• Forward-Back (FB): how many parent sites of each child to consider, i.e. u1 to c1 to u3
• STOP list: websites considered not to be relevant to the page’s content
[Figure: a web graph of websites p1, p2, u1, u2, u3, c1 and c2 connected by hyperlinks.]
STOP list:
• http://validator.w3.org/check?uri=referer
• www.microsoft.com/ie/dowload.html
• www.yahoo.com
*These values are determined before the algorithm is executed
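As a rough sketch, the vicinity-graph construction (step 1, using the B, BF, F and FB limits above) might look like the following. The toy `LINKS` graph, the helper names and the default limits are illustrative assumptions, not the paper's actual code; a real system would query a connectivity server rather than an in-memory dictionary.

```python
# Toy web graph: each page maps to the pages it links to.
# LINKS, STOP and the limit defaults are illustrative only.
LINKS = {
    "p1": ["u1", "u3"],
    "p2": ["u1", "u2"],
    "u1": ["c1"],
    "u2": ["c2"],
    "u3": [], "c1": [], "c2": [],
}
STOP = {"www.yahoo.com"}

def parents(url):
    return [p for p, outs in LINKS.items() if url in outs]

def children(url):
    return LINKS.get(url, [])

def vicinity(u, B=50, BF=50, F=8, FB=50):
    """Step 1: gather up to B parents of u, up to BF children of each
    parent, up to F children of u, and up to FB parents of each child,
    skipping anything on the STOP list."""
    nodes = {u}
    for p in parents(u)[:B]:
        if p in STOP:
            continue
        nodes.add(p)
        nodes.update(c for c in children(p)[:BF] if c not in STOP)
    for c in children(u)[:F]:
        if c in STOP:
            continue
        nodes.add(c)
        nodes.update(q for q in parents(c)[:FB] if q not in STOP)
    return nodes
```

On this toy graph, `vicinity("u1")` picks up the parents p1 and p2, their other children u2 and u3, and u1's child c1.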
Companion Algorithm
• Step 1 – Build the vicinity graph of u
• If u itself appears in the STOP list, its entry is ignored; otherwise every site on the STOP list is excluded from the graph
[Figure: the vicinity graph after step 1, containing p1, p2, u2, u3, c1 and c2.]
Companion Algorithm
• Step 2 – Eliminate duplication
– If one of the nodes (websites) in the graph has 10 or more links and shares 95% of its links with another node*
• Combine the links from both nodes (union) to create one node
– This removes sites that are likely to be the same (e.g. mirror sites, or the same site under different names)
• Step 3 – Assign edge weights
– If two nodes are on the same host, the edge between them is set to zero
– If k links go to one site (i.e. many-to-one), each edge’s authority weight is set to 1/k
– If there are L links from one site (i.e. one-to-many), each edge’s hub weight is set to 1/L
• The vicinity graph of u has now been constructed!
*This clearly has its problems!!!
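The step-3 edge weighting above can be sketched as follows. The toy edge list and the `host` helper are assumptions for illustration; the weighting rule itself (same-host edges zeroed, authority weight 1/k for k links from one host to a page, hub weight 1/L for L links from one page to a host) follows the slide.

```python
from collections import defaultdict

# Toy vicinity-graph edges (source page, target page); host() just
# takes the part of the URL before the first slash.  Illustrative only.
EDGES = [("a.com/1", "b.com/x"), ("a.com/2", "b.com/x"),
         ("a.com/1", "c.com/y"), ("b.com/x", "c.com/y"),
         ("c.com/y", "c.com/z")]  # last edge is same-host

def host(url):
    return url.split("/")[0]

def edge_weights(edges):
    """Step 3: zero weight for same-host edges; otherwise authority
    weight 1/k (k pages on one host link to the target) and hub
    weight 1/L (the source links to L pages on the target's host)."""
    in_count = defaultdict(int)   # (source host, target page) -> k
    out_count = defaultdict(int)  # (source page, target host) -> L
    for s, t in edges:
        if host(s) != host(t):
            in_count[(host(s), t)] += 1
            out_count[(s, host(t))] += 1
    auth_w, hub_w = {}, {}
    for s, t in edges:
        if host(s) == host(t):
            auth_w[(s, t)] = hub_w[(s, t)] = 0.0
        else:
            auth_w[(s, t)] = 1.0 / in_count[(host(s), t)]
            hub_w[(s, t)] = 1.0 / out_count[(s, host(t))]
    return auth_w, hub_w

auth_w, hub_w = edge_weights(EDGES)
```

Here the two links from host a.com to b.com/x each get authority weight 1/2, while the same-host edge inside c.com gets weight 0.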
Companion Algorithm
• Step 4 – Compute hub and authority scores
• Nodes (websites) with a high authority score are expected to have relevant content
• Nodes with a high hub score are expected to contain links to relevant content
• The 10 highest-scoring authority nodes are then returned as pages related to the starting URL u
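Step 4 is essentially a weighted HITS iteration. A minimal sketch, assuming a tiny graph with uniform edge weights (the node names, graph and iteration count are invented for illustration, not taken from the paper):

```python
# Tiny vicinity graph: hubs h1 and h2 both point at a1; h2 also
# points at a2.  All edge weights are 1.0 here for simplicity.
NODES = ["h1", "h2", "a1", "a2"]
EDGES = [("h1", "a1"), ("h2", "a1"), ("h2", "a2")]
W = {e: 1.0 for e in EDGES}

def hub_authority(nodes, edges, w, iters=50):
    """Weighted HITS-style update: a node's authority score sums the
    hub scores of pages linking to it; its hub score sums the
    authority scores of pages it links to.  Scores are rescaled each
    round so the largest is 1."""
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        auth = {n: sum(hub[s] * w[(s, t)] for s, t in edges if t == n)
                for n in nodes}
        hub = {n: sum(auth[t] * w[(s, t)] for s, t in edges if s == n)
               for n in nodes}
        for scores in (auth, hub):
            norm = max(scores.values()) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

auth, hub = hub_authority(NODES, EDGES, W)
```

As expected, a1 (pointed at by both hubs) ends up with the top authority score and h2 (linking to both authorities) with the top hub score; Companion would return the highest-authority nodes as the related pages.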
Co-citation Algorithm
• Two sites are co-cited if they have a common parent, e.g. u3 and u1 are co-cited by p1
• The degree of co-citation (DoC) is the number of common parents two sites have, e.g. u3 and u1 have a DoC of 2
• The algorithm finds the siblings of a site, computes their DoC, and returns the top 10 sites with the highest DoC
• If u has fewer than 15 siblings and a DoC of less than 2, the algorithm restarts with a URL one level up from the original, e.g. if u = a.com/X/Y/Z then the new u = a.com/X/Y
[Figure: parents p1 and p2 linking to u1, u2 and u3; u2 and u3 are the siblings of u1.]
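The core of the Co-citation algorithm can be sketched in a few lines. The `PARENTS` table mirrors the slide's figure (p1 and p2 as parents of u1 and u3, p2 as the sole parent of u2) and is an illustrative stand-in for a real link database:

```python
from collections import Counter

# Parent sets from the slide's figure: p1 -> {u1, u3}, p2 -> {u1, u2, u3}.
PARENTS = {
    "u1": {"p1", "p2"},
    "u2": {"p2"},
    "u3": {"p1", "p2"},
}

def cocited(u, top=10):
    """For every sibling s of u (a page sharing at least one parent),
    count the parents s and u have in common (the degree of
    co-citation) and return the best-scoring siblings."""
    doc = Counter()
    for s, ps in PARENTS.items():
        if s != u:
            doc[s] = len(ps & PARENTS[u])
    return [s for s, d in doc.most_common(top) if d > 0]
```

For u1 this returns u3 first (DoC 2, co-cited by both p1 and p2) ahead of u2 (DoC 1), matching the example in the slide.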
Netscape’s Approach
• “What’s Related” function
• Not a lot of detail is given in the paper!
• Gets similar pages from web crawling, archiving, categorising and data mining (as opposed to just using the web graph like the previous algorithms)
• Also tries to learn from trends (i.e. what users click on after searching for a keyword)
Implementation
• Compaq’s Connectivity Server
– Provides 180 million URLs (nodes)
• A multi-threaded server takes in URLs
– Uses either the Companion or the Co-citation algorithm to find related pages
Evaluation
• Studies carried out to determine the performance of these algorithms.
• Benchmarked against Netscape’s approach.
• Re-visit the initial requirements:
– Speed
– Precision
– Little input data – already achieved
Evaluation
• Speed
– 109 milliseconds for Companion and 195 ms for Co-citation.
– The complexity of the Co-citation algorithm is of the order of O(n log n).
• Precision
Critique
• Faults within HITS are not investigated. Nomura, Satoshi and Hayamizu, ‘Analysis and Improvement of HITS Algorithm for Detecting Web Communities’, show some of the problems with the algorithm.
• Requires the user to have already found something relevant to what they are looking for, i.e. “I have found NYTimes; I want to look at what alternatives are available.”
• Can it handle the scale of the web today? It was tested with connectivity information for 180 million pages; the indexable web now stands at over 11 billion pages.
• Links to friends’ web pages that are not relevant to the input URL are still taken into account; given the size of the web today, this may lead to bad results.
• A small, specialised population was used in testing, so the evaluation lacks generality.
• The ‘two clicks away’ idea no longer holds today.
Critique
Looking at the positives:
• The algorithms did indeed outperform Netscape’s algorithm for finding related pages, and can be extended to handle more than one input URL*
• Easy to implement
• Many papers were consulted and used while writing and implementing the work
*at the time (1999)
Applications and Future Work
• Data mining – web structure mining
– Finding authoritative web pages
• Classifying web documents
– Exploring co-cited material: if two pages are linked they could be related, and if a page is pointed to, it could be important
• Extend the algorithm to improve the heuristic and look beyond the ‘two clicks away’ idea
• There has been little further work, because the underlying assumption is so unrealistic by today’s standards
Conclusion
• Suggested a solution to the problem of searching for a topic that cannot easily be expressed as a simple text query.
• The Companion and Co-citation algorithms are fast ways of searching that differ from traditional text queries.
• Obtained a solution that can easily be adapted and implemented in web servers.
Q & A
Any questions?
References
• G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, ‘Applications of a Web Query Language’, in: Proc. of the Sixth International World Wide Web Conference (on the hyperlink structure of the Web).
• S. Chakrabarti et al., ‘Enhanced Hypertext Categorisation Using Hyperlinks’, in which links and their order are used to categorise web pages.
• E. Spertus, ‘ParaSite: Mining Structural Information on the Web’, which also suggested using co-citation and other forms of connectivity to identify related web pages.
• J. Kleinberg, ‘Authoritative Sources in a Hyperlinked Environment’. The HITS algorithm is used as a starting point for the Companion algorithm, which extends and modifies it.
• P. Calado, M. Cristo, M.A. Gonçalves, E.S. de Moura, B. Ribeiro-Neto and N. Ziviani, ‘Linkage Similarity Measures for the Classification of Web Documents’.
• S.K. Madria, ‘Web Mining – A Bird’s Eye View’ (presentation).