Finding related pages in the World Wide Web
A review by:
Liang Pan, Nick Hitchcock, Rob Elwell, Kirtan Patel and Lian Michelson
Content
• Introduction
• Algorithms
– Companion
– Co-citation
– Netscape’s
• Evaluations
• Critique
• Conclusion
Introduction
Searching on the World Wide Web
• Common search tools include Google and Yahoo
Traditional Approach
• Keyword-query based
• You must specify your information need by supplying relevant keywords
• Prone to errors!
Question!
What do I do if I don’t know exactly what I am looking for?
Introduction
• Another way…
– Use a URL as the search input instead of a phrase of text,
e.g. www.nytimes.com
• What are the requirements?
– Fast
– High precision
– Little input data
Introduction
How does it work?
• Exploits the Web graph structure
• Two algorithms are proposed:
Companion
• Derived from the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg for ranking search queries.
• Makes use of edge weights and hub and authority scores.
Co-citation
• Finds pages that are frequently co-cited with an input URL u.
[Figure: sites A, B and C link to u and also to sites X, Y and Z, so X, Y and Z are found as related pages.]
Companion Algorithm
• Takes a starting URL u as input, e.g. www.awebsite.com
• Made up of 4 steps:
– Build the vicinity graph of u
– Contract duplicates and near-duplicates in the graph
– Compute edge weights based on host-to-host connections
– Compute a hub score and an authority score for each node in the graph, and return the top-ranked authority nodes
Companion Algorithm
• Uses 5 values* to help determine relevant pages:
• Back (B): how many parent sites of the page to consider, i.e. going from u1 to p1
• Back-Forward (BF): how many child sites of each parent to consider, i.e. going from u1 to p2 and then to u2 (or u1)
• Forward (F): how many children of the site (pages it links to) to consider, i.e. u1 to c1
• Forward-Back (FB): how many parent sites of each child to consider, i.e. u1 to c1 to u3
• STOP list: websites considered not to be relevant to the page’s content
[Figure: a web graph of websites p1, p2, u1, u2, u3, c1 and c2 connected by hyperlinks.]
STOP list:
• http://validator.w3.org/check?uri=referer
• www.microsoft.com/ie/dowload.html
• www.yahoo.com
*These values are determined before the algorithm is executed
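As a rough sketch, the vicinity-graph construction (step 1, using the B, BF, F and FB limits above) might look like the following. The toy `LINKS` graph, the helper names and the default limits are illustrative assumptions, not the paper's actual code; a real system would query a connectivity server rather than an in-memory dictionary.

```python
# Toy web graph: each page maps to the pages it links to.
# LINKS, STOP and the limit defaults are illustrative only.
LINKS = {
    "p1": ["u1", "u3"],
    "p2": ["u1", "u2"],
    "u1": ["c1"],
    "u2": ["c2"],
    "u3": [], "c1": [], "c2": [],
}
STOP = {"www.yahoo.com"}

def parents(url):
    return [p for p, outs in LINKS.items() if url in outs]

def children(url):
    return LINKS.get(url, [])

def vicinity(u, B=50, BF=50, F=8, FB=50):
    """Step 1: gather up to B parents of u, up to BF children of each
    parent, up to F children of u, and up to FB parents of each child,
    skipping anything on the STOP list."""
    nodes = {u}
    for p in parents(u)[:B]:
        if p in STOP:
            continue
        nodes.add(p)
        nodes.update(c for c in children(p)[:BF] if c not in STOP)
    for c in children(u)[:F]:
        if c in STOP:
            continue
        nodes.add(c)
        nodes.update(q for q in parents(c)[:FB] if q not in STOP)
    return nodes
```

On this toy graph, `vicinity("u1")` picks up the parents p1 and p2, their other children u2 and u3, and u1's child c1.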
Companion Algorithm
• Step 1 – Build the vicinity graph of u
• If u itself appears in the STOP list, its entry is ignored; otherwise every site on the STOP list is excluded from the graph
[Figure: the vicinity graph after step 1, containing p1, p2, u2, u3, c1 and c2.]
Companion Algorithm
• Step 2 – Eliminate duplication
– If one of the nodes (websites) in the graph has 10 or more links and shares 95% of its links with another node*
• Combine the links from both nodes (union) to create one node
– This removes sites that are likely to be the same (e.g. mirror sites, or the same site under different names)
• Step 3 – Assign edge weights
– If two nodes are on the same host, the edge between them is set to zero
– If k links go to one site (i.e. many-to-one), each edge’s authority weight is set to 1/k
– If there are L links from one site (i.e. one-to-many), each edge’s hub weight is set to 1/L
• The vicinity graph of u has now been constructed!
*This clearly has its problems!!!
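The step-3 edge weighting above can be sketched as follows. The toy edge list and the `host` helper are assumptions for illustration; the weighting rule itself (same-host edges zeroed, authority weight 1/k for k links from one host to a page, hub weight 1/L for L links from one page to a host) follows the slide.

```python
from collections import defaultdict

# Toy vicinity-graph edges (source page, target page); host() just
# takes the part of the URL before the first slash.  Illustrative only.
EDGES = [("a.com/1", "b.com/x"), ("a.com/2", "b.com/x"),
         ("a.com/1", "c.com/y"), ("b.com/x", "c.com/y"),
         ("c.com/y", "c.com/z")]  # last edge is same-host

def host(url):
    return url.split("/")[0]

def edge_weights(edges):
    """Step 3: zero weight for same-host edges; otherwise authority
    weight 1/k (k pages on one host link to the target) and hub
    weight 1/L (the source links to L pages on the target's host)."""
    in_count = defaultdict(int)   # (source host, target page) -> k
    out_count = defaultdict(int)  # (source page, target host) -> L
    for s, t in edges:
        if host(s) != host(t):
            in_count[(host(s), t)] += 1
            out_count[(s, host(t))] += 1
    auth_w, hub_w = {}, {}
    for s, t in edges:
        if host(s) == host(t):
            auth_w[(s, t)] = hub_w[(s, t)] = 0.0
        else:
            auth_w[(s, t)] = 1.0 / in_count[(host(s), t)]
            hub_w[(s, t)] = 1.0 / out_count[(s, host(t))]
    return auth_w, hub_w

auth_w, hub_w = edge_weights(EDGES)
```

Here the two links from host a.com to b.com/x each get authority weight 1/2, while the same-host edge inside c.com gets weight 0.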
Companion Algorithm
• Step 4 – Compute hub and authority scores
• Nodes (websites) with a high authority score are expected to have relevant content
• Nodes with a high hub score are expected to contain links to relevant content
• The 10 highest-scoring authority nodes are then returned as pages related to the starting URL u
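Step 4 is essentially a weighted HITS iteration. A minimal sketch, assuming a tiny graph with uniform edge weights (the node names, graph and iteration count are invented for illustration, not taken from the paper):

```python
# Tiny vicinity graph: hubs h1 and h2 both point at a1; h2 also
# points at a2.  All edge weights are 1.0 here for simplicity.
NODES = ["h1", "h2", "a1", "a2"]
EDGES = [("h1", "a1"), ("h2", "a1"), ("h2", "a2")]
W = {e: 1.0 for e in EDGES}

def hub_authority(nodes, edges, w, iters=50):
    """Weighted HITS-style update: a node's authority score sums the
    hub scores of pages linking to it; its hub score sums the
    authority scores of pages it links to.  Scores are rescaled each
    round so the largest is 1."""
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        auth = {n: sum(hub[s] * w[(s, t)] for s, t in edges if t == n)
                for n in nodes}
        hub = {n: sum(auth[t] * w[(s, t)] for s, t in edges if s == n)
               for n in nodes}
        for scores in (auth, hub):
            norm = max(scores.values()) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

auth, hub = hub_authority(NODES, EDGES, W)
```

As expected, a1 (pointed at by both hubs) ends up with the top authority score and h2 (linking to both authorities) with the top hub score; Companion would return the highest-authority nodes as the related pages.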
Co-citation Algorithm
• Two sites are co-cited if they have a common parent, e.g. u3 and u1 are co-cited by p1
• The degree of co-citation (DoC) is the number of common parents two sites have, e.g. u3 and u1 have a DoC of 2
• The algorithm finds the siblings of a site, computes their DoC, and returns the top 10 sites with the highest DoC
• If u has fewer than 15 siblings and a DoC of less than 2, the algorithm restarts with a URL one level up from the original, e.g. if u = a.com/X/Y/Z then the new u = a.com/X/Y
[Figure: parents p1 and p2 linking to u1, u2 and u3; u2 and u3 are the siblings of u1.]
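The core of the Co-citation algorithm can be sketched in a few lines. The `PARENTS` table mirrors the slide's figure (p1 and p2 as parents of u1 and u3, p2 as the sole parent of u2) and is an illustrative stand-in for a real link database:

```python
from collections import Counter

# Parent sets from the slide's figure: p1 -> {u1, u3}, p2 -> {u1, u2, u3}.
PARENTS = {
    "u1": {"p1", "p2"},
    "u2": {"p2"},
    "u3": {"p1", "p2"},
}

def cocited(u, top=10):
    """For every sibling s of u (a page sharing at least one parent),
    count the parents s and u have in common (the degree of
    co-citation) and return the best-scoring siblings."""
    doc = Counter()
    for s, ps in PARENTS.items():
        if s != u:
            doc[s] = len(ps & PARENTS[u])
    return [s for s, d in doc.most_common(top) if d > 0]
```

For u1 this returns u3 first (DoC 2, co-cited by both p1 and p2) ahead of u2 (DoC 1), matching the example in the slide.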
Netscape’s Approach
• “What’s Related” function
• Not a lot of detail is given in the paper!
• Gets similar pages from web crawling, archiving, categorising and data mining (as opposed to just using the web graph like the previous algorithms)
• Also tries to learn from trends (i.e. what users click on after searching for a keyword)
Implementation
• Compaq’s Connectivity Server
– Provides 180 million URLs (nodes)
• A multi-threaded server takes in URLs
– Uses either the Companion or the Co-citation algorithm to find related pages
Evaluation
• Studies carried out to determine the performance of these algorithms.
• Benchmarked against Netscape’s approach.
• Re-visit the initial requirements:
– Speed
– Precision
– Little input data – already achieved
Evaluation
• Speed
– 109 milliseconds for Companion and 195 ms for Co-citation.
– The complexity of the Co-citation algorithm is of the order of O(n log n).
• Precision
Critique
• Faults within HITS are not investigated. Nomura, Satoshi and Hayamizu, ‘Analysis and Improvement of HITS Algorithm for Detecting Web Communities’, show some of the problems with the algorithm.
• Requires the user to have already found something relevant to what they are looking for, i.e. “I have found NYTimes; I want to look at what alternatives are available.”
• Can it handle the scale of the web today? It was tested with connectivity information for 180 million pages; the indexable web now stands at over 11 billion pages.
• Links to friends’ web pages that are not relevant to the input URL are still taken into account; given the size of the web today, this may lead to bad results.
• A small, specialised population was used in testing, so the evaluation lacks generality.
• The ‘two clicks away’ idea no longer holds today.
Critique
Looking at the positives:
• The algorithms did indeed outperform Netscape’s algorithm for finding related pages, and can be extended to handle more than one input URL*
• Easy to implement
• Many papers were consulted and used while writing and implementing the work
*at the time (1999)
Applications and Future Work
• Data mining – web structure mining
– Finding authoritative web pages
• Classifying web documents
– Exploring co-cited material: if two pages are linked they could be related, and if a page is pointed to, it could be important
• Extend the algorithm to improve the heuristic and look beyond the ‘two clicks away’ idea
• There has been little further work, because the underlying assumption is so unrealistic by today’s standards
Conclusion
• Suggested a solution to the problem of searching for a topic that cannot easily be expressed as a simple text query.
• The Companion and Co-citation algorithms are fast ways of searching that differ from traditional text queries.
• Obtained a solution that can easily be adapted and implemented in web servers.
Q & A
Any questions?
References
• G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, ‘Applications of a Web Query Language’, in: Proc. of the Sixth International World Wide Web Conference (on the hyperlink structure of the Web).
• S. Chakrabarti et al., ‘Enhanced Hypertext Categorisation Using Hyperlinks’, in which links and their order are used to categorise web pages.
• E. Spertus, ‘ParaSite: Mining Structural Information on the Web’, which also suggested using co-citation and other forms of connectivity to identify related web pages.
• J. Kleinberg, ‘Authoritative Sources in a Hyperlinked Environment’. The HITS algorithm is used as a starting point for the Companion algorithm, which extends and modifies it.
• P. Calado, M. Cristo, M.A. Gonçalves, E.S. de Moura, B. Ribeiro-Neto and N. Ziviani, ‘Linkage Similarity Measures for the Classification of Web Documents’.
• S.K. Madria, ‘Web Mining – A Bird’s Eye View’ (presentation).