
Computer Networks 31 (1999) 1467–1479

Finding related pages in the World Wide Web

Jeffrey Dean *,1, Monika R. Henzinger 2

Compaq Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301, USA

Abstract

When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to Web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers.

We describe two algorithms to identify related Web pages. These algorithms use only the connectivity information in the Web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's 'What's Related' service (http://home.netscape.com/escapes/related/). Our study showed that the precision at 10 of our two algorithms is 73% better and 51% better, respectively, than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information. © 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Search engines; Related pages; Searching paradigms

1. Introduction

Traditional Web search engines take a query as input and produce a set of (hopefully) relevant pages that match the query terms. While useful in many circumstances, search engines have the disadvantage that users have to formulate queries that specify their information need, which is prone to errors. This paper discusses how to find related Web pages, a different approach to Web searching. In our approach

* Corresponding author. Present address: mySimon, Inc., Santa Clara, CA, USA. E-mail: [email protected]
1 This work was done while the author was at the Compaq Western Research Laboratory.
2 E-mail: [email protected]

the input to the search process is not a set of query terms, but the URL of a page, and the output is a set of related Web pages. A related Web page is one that addresses the same topic as the original page, but is not necessarily semantically identical. For example, given www.nytimes.com, the tool should find other newspapers and news organizations on the Web. Of course, in contrast to search engines, our approach requires that the user has already found a page of interest.

Recent work in information retrieval on the Web has recognized that the hyperlink structure can be very valuable for locating information [1,3,4,8,12,16,17,20,22,23]. This assumes that if there is a link from page v to page w, then the author of v recommends page w, and links often connect



Table 1
Example results produced by the Companion algorithm

Input: www.nytimes.com

www.usatoday.com           USA Today newspaper
www.washingtonpost.com     Washington Post newspaper
www.cnn.com                Cable News Network
www.latimes.com            Los Angeles Times newspaper
www.wsj.com                Wall Street Journal newspaper
www.msnbc.com              MSNBC cable news station
www.sjmercury.com          San Jose Mercury News newspaper
www.chicago.tribune.com    Chicago Tribune newspaper
www.nando.net              Nando Times on-line news service
www.the-times.co.uk        The Times newspaper

related pages. In this paper, we describe the Companion and Cocitation algorithms, two algorithms which use only the hyperlink structure of the Web to identify related Web pages. For example, Table 1 shows the output of the Companion algorithm when given www.nytimes.com as input (in this case, the results for the Cocitation algorithm are identical and the results for Netscape are very similar, although this is not always true).

One of our goals was to design algorithms with high precision that are very fast and that do not require a large number of different kinds of input data. Since we have a tool that gives us access to the hyperlink structure of the Web (the Connectivity Server [2]), we focused on algorithms that only use connectivity information to identify related pages. Our algorithms use only the information about the links that appear on each page and the order in which the links appear. They neither examine the content of pages, nor do they examine patterns of how users tend to navigate among pages.

Our Companion algorithm is derived from the HITS algorithm proposed by Kleinberg for ranking search engine queries [12]. Kleinberg suggested that the HITS algorithm could be used for finding related pages as well, and provided anecdotal evidence that it might work well. In this paper, we extend the algorithm to exploit not only links but also their order on a page (see Section 2.1.1) and present the results of a user study showing that the resulting algorithm works very well.

The Cocitation algorithm finds pages that are frequently cocited with the input URL u (that is, it finds other pages that are pointed to by many other pages that all also point to u).

Netscape Communicator Version 4.06 introduced a related pages service that is built into the browser [15] (see Section 2.3 for a more detailed discussion). On the browser screen, there is a 'What's Related' button, which presents a menu of related Web pages in some cases. The 'What's Related' algorithm in Netscape is based on technology developed by Alexa, Inc., and computes its answers based on connectivity information, content information, and usage information [14].

To compare the performance of our two algorithms and Netscape's algorithm, we performed a user study on 59 URLs chosen by 18 volunteers. Our study results show that the precision at 10, computed over all 59 URLs, of our two algorithms is 73% better and 51% better, respectively, than Netscape's. Not all algorithms gave answers for all URLs in our study. If we restrict the comparison to only the 37 URLs for which all three algorithms returned answers, then the precision at 10 of our two algorithms is 40% and 22% better than Netscape's algorithm. This is surprising since our algorithms are based only on connectivity information.

Netscape's algorithm gives answers for about 17 million URLs [14], while our algorithms can give answers for a much larger set of URLs (we have connectivity information on 180 million URLs). This is important because it means that we can give related URL information for more URLs. Our algorithms are also fast: in our environment, both average less than 200 msec of computation per input URL.

The example shown in Table 1 is for a URL with a very high level of connectivity (www.nytimes.com has 47,751 inlinks in our Connectivity Server), and all three algorithms generally perform quite well for well-connected URLs. Our algorithms can also work well when there is much less connectivity, as shown by the example in Table 2. This table shows the answers for the Companion and Netscape algorithms for linas.org/linux/corba.html, one of the input URLs chosen by a user as part of our user study. Alongside each answer is the user's rating for each answer, with a '1' meaning that the user considered the page related, '0' meaning that the


Table 2
Comparison of the results of the Companion and Netscape algorithms

Input: linas.org/linux/corba.html

Companion                                       Netscape
1  www.cs.wustl.edu/~schmidt/TAO.html           0  labserver.kuntrynet.com/~linux
1  dsg.harvard.edu/public/arachne               0  web.singnet.com.sg/~siuyin
1  jorba.castle.net.au/                         –  www.clark.net/pub/srokicki/linux
1  www-b0.fnal.gov:8000/ROBIN                   –  www.earth.demon.co.uk/linux/uk
1  www.paragon-software.com/products/oak        0  www.emry.net/webwatcher
1  www.tatanka.com/orb1.htm                     0  www.interl.net/~jasoneng/NLL/lwr
1  www.oot.co.uk/dome-index.html                0  www.jnpcs.com/mkb/linux
0  www.intellisoft.com/~mark                    1  www.linuxresources.com/
1  www.dstc.edu.au/AU/staff/mart...             0  www.liszt.com/
1  www.inf.fu-berlin.de/~brose/jacorb           0  www.local.net/~jgo/linuxhelp.html

A '1' means that the page was valuable, a '0' means that the page was not valuable, a '–' means that the page could not be accessed.

user considered the page unrelated, and '–' meaning that the user was unable to access the page at all. The original page was about CORBA implementations for Linux, and there were 123 pages pointing to this page in our Connectivity Server. Nine of the ten answers given by the Companion algorithm were deemed related by our user, while only one page from Netscape's set of answers was deemed related. Most of Netscape's answers were about the much broader topic of Linux, rather than specifically about CORBA implementations on Linux.

Section 2 presents our algorithms in detail and describes Netscape's service, while Section 3 discusses the implementation of our algorithms. Section 4 describes the user study we performed and presents its results, and also provides a brief performance evaluation of our algorithms. Section 5 discusses related work, and, finally, Section 6 presents our conclusions.

2. Related page algorithms

In this section we describe our two algorithms (the Companion algorithm and the Cocitation algorithm), as well as Netscape's algorithm. Unlike Netscape's algorithm, both of our algorithms exploit only the hyperlink structure (i.e. graph connectivity) of the Web and do not examine the information about the content or usage of pages. Netscape's algorithm uses all three kinds of information to arrive at its results.

In the sections below, we use the terms parent and child. If there is a hyperlink from page w to page v, we say that w is a parent of v and that v is a child of w.

2.1. Companion algorithm

The Companion algorithm takes as input a starting URL u and consists of four steps:
(1) Build a vicinity graph for u.
(2) Contract duplicates and near-duplicates in this graph.
(3) Compute edge weights based on host to host connections.
(4) Compute a hub score and an authority score for each node in the graph and return the top ranked authority nodes (our implementation returns the top 10). This phase of the algorithm uses a modified version of the HITS algorithm originally proposed by Kleinberg [12].

These steps are described in more detail in the subsections below. Only step 1 exploits the order of links on a page.

2.1.1. Step 1: building the vicinity graph

Given a query URL u we build a directed graph of nodes that are nearby to u in the Web graph. Graph nodes correspond to Web pages and graph edges correspond to hyperlinks. The graph consists of the following nodes (and the edges between these nodes):


(1) u,
(2) up to B parents of u, and for each parent up to BF of its children different from u, and
(3) up to F children of u, and for each child up to FB of its parents different from u.

Here is how we choose these nodes in detail:

There is a stoplist STOP of URLs that are unrelated to most queries and that have very high indegree. Our stoplist was developed by experimentation, and currently contains 21 URLs. Examples of nodes that appear on our stoplist are www.yahoo.com and www.microsoft.com/ie/download.html. If the starting URL u is not one of the nodes on our stoplist, then we ignore all the URLs on the stoplist when forming the vicinity graph. If u does appear on the stoplist, however, then we disable the stoplist (i.e. set STOP to the empty list) and freely include any nodes in the vicinity graph. We disable the stoplist when the input URL appears on the stoplist because many nodes on the stoplist are popular search engines and portals, and we want to permit these nodes to be considered when the input URL is another such popular site.

Go back (B), and back-forward (BF): If u has more than B parents, add B random parents not on STOP to the graph; otherwise add all of u's parents. If a parent x of u has more than BF + 1 children, add up to BF/2 children pointed to by the BF/2 links on x immediately preceding the link to u and up to BF/2 children pointed to by the BF/2 links on x immediately succeeding the link to u (ignoring duplicate links). If page x has fewer than BF children, we add all of its children to the graph. Note that this exploits the order of links on page x.

Go forward (F), and forward-back (FB): If u has more than F children, add the children pointed to by the first F links of u; otherwise, add all of u's children. If a child of u has more than FB parents, add the FB parents not on STOP with highest indegree; otherwise, add all of the child's parents.

If there is a hyperlink from a page represented by node v in the graph to a page represented by node w, and v and w do not belong to the same host, then there is a directed edge from v to w in the graph (we omit edges between nodes on the same host).

In our experience, we have found that using a large value for B (2000) and a small value for BF (8) works better in practice than using moderate values for each (say, 50 and 50). We have observed that links to pages on a similar topic tend to be clustered together, while links that are farther apart on a page are less likely to be on the same topic (for example, most hotlists are grouped into categories). This has also been observed by other researchers [6]. Using a larger value for B also means that the likelihood of the computation being dominated by a single parent page is reduced.
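To make the construction concrete, the following sketch (in Python rather than the authors' C implementation) outlines step 1 under stated assumptions: parents(x) and children(x) are hypothetical stand-ins for the Connectivity Server interface and return links in page order, host(x) extracts the host from a URL string, indegree(x) returns the number of inlinks, and the values of F and FB are illustrative, since the paper reports only B = 2000 and BF = 8.

    import random

    B, BF = 2000, 8        # values reported in the paper
    F, FB = 8, 8           # assumed values; the paper does not state them

    def surrounding_children(links, u, bf):
        # Children pointed to by the bf/2 links immediately before and after the link to u.
        if len(links) <= bf + 1:
            return [c for c in links if c != u]
        if u not in links:                       # defensive; a parent of u is expected to link to u
            return [c for c in links[:bf] if c != u]
        i = links.index(u)
        near = links[max(0, i - bf // 2):i] + links[i + 1:i + 1 + bf // 2]
        return [c for c in near if c != u]

    def build_vicinity_graph(u, parents, children, host, indegree, stoplist):
        stop = set() if u in stoplist else set(stoplist)   # disable the stoplist if u is on it
        nodes = {u}
        # Back (B) and back-forward (BF): random parents of u, plus the children
        # surrounding the link to u on each such parent.
        ps = [p for p in parents(u) if p not in stop]
        if len(ps) > B:
            ps = random.sample(ps, B)
        for p in ps:
            nodes.add(p)
            nodes.update(c for c in surrounding_children(children(p), u, BF) if c not in stop)
        # Forward (F) and forward-back (FB): first F children of u, plus each
        # child's highest-indegree parents not on the stoplist.
        for c in children(u)[:F]:
            if c in stop:
                continue
            nodes.add(c)
            gs = sorted((q for q in parents(c) if q not in stop and q != u),
                        key=indegree, reverse=True)
            nodes.update(gs[:FB])
        # Directed edges correspond to hyperlinks between nodes on different hosts.
        edges = [(v, w) for v in nodes for w in children(v) if w in nodes and host(v) != host(w)]
        return nodes, edges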

2.1.2. Step 2: duplicate elimination

After building this graph we combine near-duplicates. We say two nodes are near-duplicates if (a) they each have more than 10 links and (b) they have at least 95% of their links in common. To combine two near-duplicates we replace their two nodes by a node whose links are the union of the links of the two near-duplicates. This duplicate elimination phase is important because many pages are duplicated across hosts (e.g. mirror sites, different aliases for the same page), and we have observed that allowing them to remain separate can greatly distort the results.
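A minimal sketch of this contraction step follows. It assumes the vicinity graph is held as a dictionary mapping each node to its set of out-links, and it interprets "95% of their links in common" as the shared links amounting to at least 95% of each node's links, which is one possible reading of the criterion.

    def contract_near_duplicates(outlinks):
        # outlinks: dict mapping each node to the set of nodes it points to.
        merged_into = {}
        nodes = list(outlinks)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                la, lb = outlinks[a], outlinks[b]
                if len(la) > 10 and len(lb) > 10:
                    common = len(la & lb)
                    if common >= 0.95 * len(la) and common >= 0.95 * len(lb):
                        outlinks[a] = la | lb      # the merged node links to the union
                        merged_into[b] = a

        def resolve(x):                            # follow chains of merges
            while x in merged_into:
                x = merged_into[x]
            return x

        for b in merged_into:
            outlinks.pop(b, None)
        # Redirect links that pointed at a merged node, dropping self-loops.
        return {n: {y for y in (resolve(x) for x in ls) if y != n}
                for n, ls in outlinks.items()}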

2.1.3. Step 3: assign edge weights

In this stage, we assign a weight to each edge, using the edge weighting scheme of Bharat and Henzinger [3], which we repeat here for completeness. An edge between two nodes on the same host³ has weight 0. If there are k edges from documents on a first host to a single document on a second host, we give each edge an authority weight of 1/k. This weight is used when computing the authority score of the document on the second host. If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l. We perform this scaling to prevent a single host from having too much influence on the computation.

We call the resulting weighted graph the vicinity graph of u.
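The weighting can be computed in a single pass over the edges. The sketch below assumes the (v, w) edge list produced in step 1 (already restricted to cross-host links) and the same hypothetical host() helper.

    from collections import defaultdict

    def assign_edge_weights(edges, host):
        # k[(h, w)]: number of documents on host h that point to document w.
        # l[(v, h)]: number of documents on host h that document v points to.
        k = defaultdict(int)
        l = defaultdict(int)
        for v, w in edges:
            k[(host(v), w)] += 1
            l[(v, host(w))] += 1
        auth_weight = {(v, w): 1.0 / k[(host(v), w)] for v, w in edges}
        hub_weight = {(v, w): 1.0 / l[(v, host(w))] for v, w in edges}
        return auth_weight, hub_weight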

2.1.4. Step 4: compute hub and authority scores

In this step, we run the imp algorithm [3] on the graph to compute hub and authority scores. The imp algorithm is a straightforward extension of the HITS algorithm with edge weights.

³ We assume throughout the paper that the host can be determined from the URL-string.

The intuition behind the HITS algorithm is that a document that points to many others is a good hub, and a document that many documents point to is a good authority. Transitively, a document that points to many good authorities is an even better hub, and similarly a document pointed to by many good hubs is an even better authority. The HITS computation repeatedly updates hub and authority scores so that documents with high authority scores are expected to have relevant content, whereas documents with high hub scores are expected to contain links to relevant content. The computation of hub and authority scores is done as follows:

Initialize all elements of the hub vector H to 1.0.
Initialize all elements of the authority vector A to 1.0.
While the vectors H and A have not converged:
    For all nodes n in the vicinity graph N:
        A[n] := Σ_{(n',n) ∈ edges(N)} H[n'] × authority_weight(n', n)
    For all nodes n in N:
        H[n] := Σ_{(n,n') ∈ edges(N)} A[n'] × hub_weight(n, n')
    Normalize the H and A vectors.
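As a concrete reading of this loop, here is a sketch of the weighted iteration (the imp computation) using the edge weights from step 3; the convergence test and iteration cap are assumptions, since the paper does not state its stopping criterion.

    import math

    def hub_authority_scores(nodes, edges, auth_weight, hub_weight,
                             tol=1e-6, max_iter=100):
        H = {n: 1.0 for n in nodes}
        A = {n: 1.0 for n in nodes}
        for _ in range(max_iter):
            new_A = {n: 0.0 for n in nodes}
            for v, w in edges:              # authority of w: weighted hub scores of its parents
                new_A[w] += H[v] * auth_weight[(v, w)]
            new_H = {n: 0.0 for n in nodes}
            for v, w in edges:              # hub of v: weighted authority scores of its children
                new_H[v] += new_A[w] * hub_weight[(v, w)]
            # Normalize both vectors to unit length.
            norm_a = math.sqrt(sum(x * x for x in new_A.values())) or 1.0
            norm_h = math.sqrt(sum(x * x for x in new_H.values())) or 1.0
            new_A = {n: x / norm_a for n, x in new_A.items()}
            new_H = {n: x / norm_h for n, x in new_H.items()}
            done = max(abs(new_A[n] - A[n]) + abs(new_H[n] - H[n]) for n in nodes) < tol
            H, A = new_H, new_A
            if done:
                break
        return H, A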

Note that the algorithm does not claim to find all relevant pages, since there may be some that have good content but have not been linked to by many authors.

The Companion algorithm then returns the nodes with the ten highest authority scores (excluding u itself) as the pages that are most related to the start page u.

2.2. Cocitation algorithm

An alternative approach for finding related pages is to examine the siblings of a starting node u in the Web graph. Two nodes are cocited if they have a common parent. The number of common parents of two nodes is their degree of cocitation. As an alternative to the Companion algorithm, we have developed a very simple algorithm that determines related pages by looking for sibling nodes with the highest degree of cocitation. The Cocitation algorithm first chooses up to B arbitrary parents of u. For each of these parents p, it adds to a set S up to BF children of p that surround the link from p to u. The elements of S are siblings of u. For each node s in S, we determine the degree of cocitation of s with u. Finally, the algorithm returns the 10 most frequently cocited nodes in S as the related pages.

In some cases there is an insufficient level of cocitation with u to provide meaningful results. In our implementation, if there are not at least 15 nodes in S that are cocited with u at least twice, then we restart the algorithm using the node corresponding to u's URL with one path element removed. For example, if u's URL is a.com/X/Y/Z and an insufficient number of cocited nodes exist for this URL, then we restart the algorithm with the URL a.com/X/Y (if the resulting URL is invalid, we continue to chop elements until we are left with just a host name, or we find a valid URL).

In our implementation, we chose B to be 2000 and BF to be 8 (the same parameter values we used for our implementation of the Companion algorithm).

One way of looking at the Cocitation algorithm is that it finds 'maximal' n × 2 bipartite subgraphs in the vicinity graph.
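The whole algorithm fits in a few lines. The sketch below reuses the hypothetical parents/children helpers (links in page order) and assumes an is_valid_url() check for the URL-chopping fallback; the thresholds (at least 15 siblings cocited at least twice) follow the paper.

    import random
    from collections import Counter

    B, BF = 2000, 8

    def cocitation_related(u, parents, children, is_valid_url, k=10):
        while True:
            ps = parents(u)
            if len(ps) > B:
                ps = random.sample(ps, B)
            degree = Counter()                          # degree of cocitation with u
            for p in ps:
                links = children(p)                     # children of p in page order
                i = links.index(u) if u in links else 0
                near = links[max(0, i - BF // 2):i] + links[i + 1:i + 1 + BF // 2]
                for s in set(near):                     # ignore duplicate links on p
                    if s != u:
                        degree[s] += 1
            enough = sum(1 for d in degree.values() if d >= 2) >= 15
            if enough or '/' not in u:
                return [s for s, _ in degree.most_common(k)]
            # Insufficient cocitation: chop path elements until a valid URL
            # (or a bare host name) remains, then retry.
            candidate = u
            while '/' in candidate:
                candidate = candidate.rsplit('/', 1)[0]
                if is_valid_url(candidate) or '/' not in candidate:
                    break
            u = candidate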

2.3. Netscape’s approach

Netscape introduced a new 'What's Related?' feature in version 4.06 of the Netscape Communicator browser. Details about the approach used to identify related pages in their algorithm are sketchy. However, the What's Related FAQ page indicates that the algorithm uses connectivity information, usage information, and content analysis of the pages to determine relationships. We quote from the 'What's Related' FAQ page:

The What's Related data is created by Alexa Internet. Alexa uses crawling, archiving, categorizing and data mining techniques to build the related sites lists for millions of Web URLs. For example, Alexa uses links on the crawled pages to find related sites. The day-to-day use of What's Related also helps build and refine the data. As the service is used, the requested URLs are logged. By looking at high-level trends, Netscape and Alexa can deduce relationships between Web sites. For example, if thousands of users go directly from site A to site B, the two sites are likely to be related.


Next, Alexa checks all the URLs to make sure they are live links. This process removes links that would try to return pages that don't exist (404 errors), as well as any links to servers that are not available to the general Internet population, such as servers that are no longer active or are behind firewalls. Finally, once all of the relationships are established and the links are checked, the top ten related sites for each URL are chosen by looking at the strength of the relationship between the sites.

Each month, Alexa recrawls the Web and rebuilds the data to pull in new sites and to refine the relationships between the existing sites. New sites with strong relationships to a site will automatically appear in the What's Related list for that site by displacing any sites with weaker relationships.

Note that since the relationships between sites are based on strength, What's Related lists are not necessarily balanced. Site A may appear in the list for Site B, but Site B may not be in the list for Site A. Generally, this happens when the number of sites with strong relationships is greater than ten, or when sites do not have similar enough content.

3. Implementation

In experimenting with these algorithms, we were fortunate to have access to Compaq's Connectivity Server [2]. The Connectivity Server provides high speed access to the graph structure of 180 million URLs (nodes) and the links (edges) that connect them. The entire graph structure is kept in memory on a Compaq AlphaServer 4100 with 8 GB of main memory and dual 466 MHz Alpha processors. The random access patterns engendered by the connectivity-based algorithms described in this paper mean that it is important for most or all of the graph to fit in main memory to prevent high levels of paging activity.

We implemented a multi-threaded server that accepts a URL and uses either the Cocitation algorithm or the Companion algorithm to find pages related to the given URL. Our server implementation consists of approximately 5500 lines of commented C code, of which approximately 1000 lines implement the Companion algorithm, 400 lines implement the Cocitation algorithm, and the remainder are shared code to perform tasks such as parsing HTTP query requests, printing results, and logging status messages. We link our server code directly with the Connectivity Server library, and access the connectivity information by mmapping the graph information into the address space of our server.

Our implementation of the Companion algorithm has been subjected to a moderate amount of performance tuning, mostly in designing the neighborhood graph data structures to improve data-cache performance. The implementation of the Cocitation algorithm has not been tuned extensively, although it does share a fair amount of code with the Companion algorithm, and this shared code has been tuned somewhat.

4. Evaluation

In this section we describe the evaluation we performed for the algorithms. Section 4.1 describes our user study, while Section 4.2 discusses the results of the study. Section 4.3 evaluates the run time performance of our algorithms.

4.1. Experimental setup

To compare the different approaches, we performed a user study. We asked 18 volunteers to supply us with at least two URLs for which they wanted to find related pages. Our volunteers included 14 computer science professionals (mostly our colleagues at Compaq's research laboratories), as well as 4 people with other professional careers. We received 69 URLs and used each of the algorithms to determine the top 10 answers for each URL. We put the answers in random order and returned them to the volunteers for rating. The volunteers were instructed as follows:

Table 3
Summary of all answers for the algorithms

Algorithm     No. of URLs with answers    No. of answers    No. of dead links
Companion     50                          498               42
Cocitation    58                          580               62
Netscape      40                          364               29


We want to measure how well each algorithm performs. To measure performance we want to know the percentage of valuable URLs returned by each algorithm. To be valuable the URL must be both relevant to the topic you are interested in and a high quality page. For example, if your URL was www.audi.com and you get back a newsgroup posting where somebody talks about his new Audi car, then the page was on topic, but probably not high quality. On the other hand, if you get www.jaguar.com as an answer, then it is up to you to decide whether this answer is on topic or not.

Scoring:
0: Page was not valuable/useful
1: Page was valuable/useful
–: Page could not be accessed (i.e. did not exist, or server was down)

Please ignore the order in which the pages are returned. So if a later page contains similar content to an earlier page, please rate the latter page as if you had not seen the earlier page. This will imply that we do not measure how 'happy' you are with a set of answers returned by an algorithm. Instead we measure how many valuable answers each algorithm gives.

4.2. User study results

We received responses rating the answer URLs for 59 of the input URLs. These 59 input URLs form the basis of our study. Table 3 shows how many of these queries the algorithms answered and how many answer URLs they returned. In many cases, the algorithms returned links that our users rated as inaccessible. The column labeled No. of dead links shows the number of inaccessible pages among all the answers for each algorithm. For the purposes of our evaluation, we treat an inaccessible link as a score of '0', since inaccessible pages are not valuable/useful.

The Cocitation algorithm returned results for all but one of the URLs. The reason why it returned results for almost all input URLs is that when insufficient connectivity was found surrounding an input URL (e.g. a.com/X/Y), the Cocitation algorithm used a chopped URL as input (e.g. a.com/X). Although we did not include this chopping feature in our implementation of the Companion algorithm, it is directly applicable and would enable the Companion algorithm to return answers for more URLs. We have empirically observed that Netscape's algorithm also applies a similar chopping heuristic in some cases.

Table 4 contains a listing of the 59 URLs in our study. For each URL, the three columns labeled Cp, Ct, and N show the URLs for which the Companion, Cocitation, and Netscape algorithms returned results, respectively. The table also shows the number of hyperlinks pointing to the URL in the Connectivity Server (Inlinks). For the Companion algorithm, it shows the number of nodes and edges in the vicinity graph, as well as the wall clock time in milliseconds taken to compute the set of related pages (computed by surrounding the computation with gettimeofday system calls). For the Cocitation algorithm, it shows the number of siblings found, the number of siblings cocited at least twice (Cocited), and the wall clock time taken to compute the answers.

The three algorithms return answers for different subsets of our 59 URLs. To compare these algorithms, we can subdivide the URLs into several groups. The intersection group consists of those URLs where all algorithms returned at least one answer. There were 37 URLs in this group. The non-Netscape group consists of the URLs where Netscape's approach did not return any answers. It consists of 19 URLs.

To quantify the performance of the three algorithms, we now define two metrics. The precision at r for a given algorithm is the total number of answers receiving a '1' score within the first r answers, divided by r times the number of query URLs. Notice that when an algorithm does not give any answers for a URL, this is as if it gave all non-relevant answers for that URL.

For a given URL u, the average precision for u of an algorithm is the sum of the precision at each rank where the answer of the algorithm for u received a '1' score, divided by the total number of the answers of the algorithm for u receiving a '1' score. If the algorithm does not return any answers for u, its average precision for u is 0. The overall average precision for an algorithm is the sum of all the average precisions for all the query URLs divided by the total number of query URLs.
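These two definitions translate directly into code. The sketch below assumes ratings maps each query URL to the ordered list of 0/1 scores given to that algorithm's answers (an empty list when the algorithm returned nothing, with dead links already scored 0).

    def precision_at(r, ratings):
        # Relevant answers within the first r ranks, summed over all query URLs,
        # divided by r times the number of query URLs.
        hits = sum(sum(scores[:r]) for scores in ratings.values())
        return hits / (r * len(ratings))

    def overall_average_precision(ratings):
        total = 0.0
        for scores in ratings.values():
            relevant_ranks = [i + 1 for i, s in enumerate(scores) if s == 1]
            if relevant_ranks:
                # Precision at each rank holding a relevant answer, averaged
                # over the relevant answers for this URL.
                total += sum(sum(scores[:rank]) / rank
                             for rank in relevant_ranks) / len(relevant_ranks)
            # URLs with no answers, or no relevant answers, contribute 0.
        return total / len(ratings)

As a point of reference, a precision at 10 of 0.501 versus 0.357 under this definition is the 40% relative improvement reported for the intersection group.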

For each of the three groups of URLs (all, intersection, and non-Netscape), Table 5 shows the average precision and the precision at 10 for each algorithm.


Table 4
User-provided URLs for evaluation (and statistics)

                                                                       Companion algorithm              Cocitation algorithm
URL                                          Cp  Ct  N   Inlinks   Nodes   Edges  Time(ms)   Siblings  Cocited  Time(ms)
 1. babelfish.altavista.digital.com/cgi...   √   √   √     29284    6382   10305       269       6601     1079       573
 2. developer.intel.com/design/strong/t...   √   √   √        19     333     479        12         85       39        17
 3. english-server.hss.cmu.edu/              √   √   √      3774    4197    8263       215       6630     2063       584
 4. hack.box.sk/                             √   √   √      1666    7963   14090       488       6099     1493       515
 5. ieor.berkeley.edu/~hochbaum/html/bo...   √   √              5      64      73         4         49       15        17
 6. linas.org/linux/corba.html               √   √   √       123    2028    4222        77        444      138        48
 7. members.tripod.com/~src-fall-regatt...       √              0       0       0         0       6800     1523       736
 8. metroactive.com/music/                   √   √   √        13     201     302        67        168       58        31
 9. travelwithkids.miningco.com/             √   √             99     744    1051        27        459      128       471
10. www-db.stanford.edu/~wiener              √   √              2      17      18         3        816      301       100
11. www.acf.dhhs.gov/                        √   √   √      1822    5522   14967       394       3659     1401       424
12. www.adventurewest.com/pub/NASTC/         √   √   √        13      65      92         4         52       16        14
13. www.amazon.com/exec/obidos/cache/br...       √   √          0       0       0         0       1149      411       262
14. www.anglia.ac.uk/~systimk/Humour/Hi...       √              0       0       0         0        103       27         9
15. www.ayso26.org/                          √                  5     142     169         8          0        0         0
16. www.babynames.com/                       √   √   √       528    2695    4781       110       1826      496       167
17. www.braintumor.org/                      √   √   √       201     779    1763        35        551      212        68
18. www.bris.ac.uk/Depts/Union/BTS/Scri...   √   √   √       131     505    1084       162        379      152        72
19. www.carrier.com/                         √   √   √       964    2755    5672       123       1837      580       165
20. www.chesschampions.com/kasparov.htm...   √   √   √        11      57      90         3        535      220        55
21. www.cl.cam.ac.uk/users/rja14/tamper...   √   √   √       119     741    1256        27        452      155        36
22. www.davecentral.com/                     √   √   √      6909    4703   11391       263       4349     1148       477
23. www.duofold.com/stepout/ski-wsa.htm          √              0       0       0         0        185       66        25
24. www.ebay.com/                            √   √   √      4658    4389    8660       199       5951     1314       493
25. www.etoys.com/                           √   √   √      2579    2294    6153       111       3351     1005       427
26. www.fifa.com/                            √   √   √     12815    6105   12360       349       6452     1452       522
27. www.focus.de/                            √   √   √      7662    4881   14039       361       4208     1301       525
28. www.geocities.com/Paris/Metro/1324/      √   √             11      92     138         7         39       15        30
29. www.geocities.com/TheTropics/2442/d...   √   √             64     450     685        19        184       65        25
30. www.harappa.com/har/har0.html            √   √   √        38     235     318        86        202       37        23
31. www.harmony-central.com/MIDI/Doc/tu...   √   √   √        31     184     249         7        153       27        21
32. www.hotelres.com/                        √   √   √       708    3072    6036       148       2380      889       196
33. www.innovation.ch/java/HTTPClient/       √   √   √        73     341     639        13        215       68        21
34. www.inquizit.com/                        √   √             12      62      95         4         54       18        15
35. www.israelidance.com/                    √   √   √        40     210     323         9        173       40        16
36. www.jewishmusic.com/                     √   √   √       399    1617    3111        66       1232      397       134
37. www.joh.cam.ac.uk/                       √   √            162     557    1027        23        409      114        37
38. www.levenger.com/                        √   √   √       259    1367    2083        51       1116      274        87
39. www.mdl.sandia.gov/micromachine/gal...       √              0       0       0         0        143       61        15
40. www.midiweb.com/                         √   √   √      1967    5304   15549       340       3370     1141       408
41. www.minimed.com/                         √   √   √       217     778    1663        33        573      223        69
42. www.mit.edu/people/mkgray/net/           √   √   √       258    1138    2124        44        896      287        72
43. www.mot-sps.com/                         √   √   √       391     883    1896        38        499      206        54
44. www.movielink.com/                       √   √   √     12274    6205   11589       337       5622     1207       514
45. www.netusa1.net/~spost/bench.html        √   √              1       9       9         3        944      229        86
46. www.nsc.com/catalog/sg708.html               √   √          0       0       0         0       2774     1149       333
47. www.odci.gov/cia/publications/factb...   √   √   √      3765    4352    8322       221       6708     1962       613
48. www.paccc.com/                           √   √   √        18     409     625        15         64       21         7
49. www.perl.com/perl/index.html             √   √   √        14      53      92         4         39       18         5
50. www.pianospot.com/1700305.htm                √              0       0       0         0         39       23        85
51. www.rei-outlet.com/                      √   √             20     105     163         6         81       16        16
52. www.sddt.com/files/library/94headli...   √   √   √         9      31      66         3       3647     1260       386
53. www.sultry.arts.su.edu.au/links/lin...   √   √             54     258     461        11        210       74        21
54. www.telemarque.com/articles/andrnch...       √   √          0       0       0         0        607      241        61
55. www.trane.com/                           √   √   √       900    2965    6201       140       2096      684       181
56. www.traxxx.de/                           √   √           1219    4607   10560       304       3355     1184       478
57. www.us-soccer.com/                       √   √   √      1175    3458    8515       201       2344      865       230
58. www.wisdom.weizmann.ac.il/~naor/puz...   √   √              8      49      65         3         96       16        10
59. www.wwa.com/~android7/pilot/index.h...       √              0       0       0         0       1846      709       214

Fig. 1 shows the precision at r for each of these groups of URLs in graphs a, b and c. Fig. 1a and b illustrate that the Companion and Cocitation algorithms substantially outperform Netscape's algorithm at all ranks, and the Companion algorithm almost always outperforms the Cocitation algorithm.

The intersection group is the most interesting comparison, since it avoids penalizing an algorithm for not returning at least one answer. For the intersection group, Netscape's algorithm achieves a precision at 10 of 0.357, while the Companion algorithm achieves a precision at 10 of 0.501 (40% better), and the Cocitation algorithm achieves a precision at 10 of 0.435 (22% better). The average precision in the intersection group does not penalize an algorithm for returning fewer than 10 answers. Under this metric, the Companion algorithm is 32% better than Netscape's algorithm, while the Cocitation algorithm is 20% better than Netscape's algorithm.

In the group that includes all URLs, all three algorithms had drops in their precision at 10 values. There are two reasons for this. The first is that algorithms were given a precision of 0 for a given URL if they did not return any answers. This mostly affected the Netscape and Companion algorithms.

Table 5
Precision metrics for each algorithm for three groups of URLs

              All                                  Intersection                         Non-Netscape
Algorithm     Average precision  Precision at 10   Average precision  Precision at 10   Average precision  Precision at 10
Companion     0.541              0.417             0.666              0.501             0.540              0.401
Cocitation    0.518              0.363             0.605              0.435             0.434              0.325
Netscape      0.343              0.241             0.502              0.357             n/a                n/a

The second reason is that for the URLs in the non-Netscape set, both the Companion and Cocitation algorithms did not perform as well as they did for URLs in the intersection set. Despite these drops in absolute average precision, the average precision of the Companion algorithm is 57% better than that of Netscape, and the average precision of the Cocitation algorithm is 51% better than that of Netscape. Similar results hold when examining precision at 10 rather than average precision.

To evaluate the statistical significance of our results, we computed the sign test and the Wilcoxon sum of ranks test for each pair of algorithms [18]. These results are shown in Table 6 and show that the differences between the Companion and Netscape algorithms and between the Cocitation and Netscape algorithms are statistically significant.

We also wanted to evaluate whether or not the algorithms were generally returning the same results for a given URL or whether they were returning largely disjoint sets of URLs. Table 7 shows the amount of overlap in the answers returned by each pair of algorithms. The percentage in parentheses is the overlap divided by the total number of answers returned by the algorithm in that row.


[Fig. 1. Precision at r for the three groups of URLs: (a) All, (b) Intersection, (c) Non-Netscape. Each graph plots precision (y-axis) against r from 1 to 10 (x-axis), with curves for the Companion, Cocitation, and Netscape algorithms.]

Table 6
Sign test and Wilcoxon sum of ranks test for algorithm pairs

                                      All                  Intersection         Non-Netscape
Algorithm                             Sign      Rank sum   Sign      Rank sum   Sign      Rank sum
Companion better than Netscape        <0.0001   0.0026     0.0041    0.0340     n/a       n/a
Cocitation better than Netscape       0.0136    0.0164     0.1685    0.2340     n/a       n/a
Companion better than Cocitation      0.1922    0.3898     0.0793    0.2628     0.2643    0.4180

Table 7
Overlap between answers returned by algorithms

Algorithm     Companion    Cocitation   Netscape
Companion     –            253 (51%)    55 (11%)
Cocitation    253 (44%)    –            56 (10%)
Netscape      55 (15%)     56 (15%)     –

As the table shows, there is a large overlap between the answers returned by the Companion and Cocitation algorithms. This is not surprising, since the two algorithms are both based on connectivity information surrounding the input URL and since both use similar parameters to choose the surrounding nodes. There is relatively little overlap between the answers returned by Netscape and the other two algorithms.

4.3. Run-time performance

In this section, we present data about the run-time performance of the Companion and Cocitation algorithms. Since we do not have direct access to Netscape's algorithm and only access it through the public Web interface, we are unable to present performance information for Netscape's algorithm. All measurements were performed on a Compaq AlphaServer 4100 with 8 GB of main memory and dual 466 MHz Alpha processors. The measured running times are wall clock times from the time the input URL is given to the server until the time the answers are returned. These times do not include the time taken to format the results as an HTML page, since that was done by a server process running on another machine (and the time to do this was negligible).

The average running time for the Companion algorithm on the 50 URLs for which it returned answers was 109 msec, while the average running time for the Cocitation algorithm on the 58 URLs for which it provided answers was 195 msec. The performance of both these algorithms is sufficiently fast that either one could handle a large amount of traffic (at 109 msec per request, a single server could sustain roughly 86,400 seconds/day divided by 0.109 seconds/request, i.e. close to 800,000 requests per day for the Companion algorithm). Furthermore, the average performance could probably be improved by caching answers for frequently requested URLs.

Although we did not explicitly include this factor in our user study, we have informally observed that the subjective quality of answers returned for both the Companion and the Cocitation algorithms does not decrease when we somewhat decrease the parameter B (the number of inlinks considered) during the building of the vicinity graph.


[Fig. 2. Graph size versus running time of the Companion and Cocitation algorithms: (a) Companion time (msec) versus number of graph edges; (b) Cocitation time (msec) versus number of siblings.]

This is important for on-line services because it means that the graph size could be reduced during times of high load, thereby reducing the amount of time taken to service each request. Under conditions of low load, the graph size could be increased.

The Companion algorithm generally converges on its answers within a few iterations (typically less than 10 iterations), but the number of iterations increases with the graph size. Each iteration takes time that is linear in the number of edges in the vicinity graph. We plot the running time versus the number of graph edges in Fig. 2a.

The running time of the Cocitation algorithm is O(n log n), where n is the number of siblings examined for cocitation, since it sorts the siblings by the degree of cocitation. This effect is illustrated in Fig. 2b. In our experience, the running times for the Cocitation and Companion algorithms are generally correlated, since URLs which have a large number of siblings to consider in the Cocitation algorithm also generally produce a large neighborhood graph for processing in the Companion algorithm.

5. Related work

Many researchers have proposed schemes for using the hyperlink structure of the Web [1,3,4,8,12,16,17,20,22,23]. For the most part, this work does not discuss the finding of related pages, with four exceptions discussed below.

We know of only one previous work that exploits the order of links: Chakrabarti et al. [6] use the links and their order to categorize Web pages, and they show that the links that are near a given link in page order frequently point to pages on the same topic.

Previous authors have suggested using cocitation and other forms of connectivity to identify related Web pages. Spertus observed that cocitation can indicate that two pages are related [20]. That is, if page A points to both pages B and C, then B and C might be related. Various researchers in the field of bibliometrics have also observed this [9–11,19], and this observation forms the basis of our Cocitation algorithm. The notion of collaborative filtering, although it is based on users' recommendations rather than hyperlinks, also relies on this observation [21]. Pitkow and Pirolli [16] cluster Web pages based on cocitation analysis. Terveen and Hill [22] use the connectivity structure of the Web to find groups of related Web sites.

Our Companion algorithm is descended from the HITS algorithm developed by Kleinberg [12]. The HITS algorithm was originally proposed by Kleinberg as a way of using connectivity structure to identify the most authoritative sources of information on a particular topic, where the topic was defined by the combined link structure of a large number of starting nodes on the topic. Kleinberg also proposed that the


HITS algorithm could be used to find related pages when the topic was defined by just a single node. The Companion algorithm used the HITS algorithm as a starting point and extended and modified it in four main ways:

(1) Kleinberg suggested using the following graph to find related pages: Take a fixed number (say 200) of parents of the given URL and call the set consisting of the URL and these parents the start set. Now build the graph consisting of all nodes pointing to a node in the start set or pointed to by a node in the start set. This means that 'grandparents' of u are included in the graph, while nodes that share a child with u are not included in the graph. We believe that the latter nodes are more likely to be related to u than are the 'grandparent' nodes. Therefore our vicinity graph is structured to exclude grandparent nodes but to include nodes that share a child with u.

(2) We exploit the order of the links on a page to determine which 'siblings' of u to include. When we added this feature, the precision of our algorithm improved noticeably.

(3) The original HITS algorithm did not have edge weights. We use edge weights to reduce the influence of pages that all reside on one host, since Bharat and Henzinger have shown that edge weights improve the precision [3].

(4) We also merge nodes in our vicinity graph that have a large number of duplicate links. Duplicate nodes are not such a serious problem when using the HITS or imp algorithms to rank query results, since the start set consists of a large number of URLs. However, when forming the vicinity graph starting with just a single URL, the influence of duplicate nodes is increased because duplicate nodes with a large number of outlinks will quickly dominate the hub and authority computation.

Kleinberg also showed that HITS computes the principal eigenvector of the matrix AA^T, where A is the adjacency matrix of the above graph, and suggested using non-principal eigenvectors for finding related pages. Finally, he gave anecdotal evidence that HITS might work well.

Subsequently, a sequence of papers [5,7] presented improvements on HITS and used it to populate a given hierarchy of categories.

6. Conclusion

We have presented two different algorithms for finding related pages in the World Wide Web. They significantly outperform Netscape's algorithm for finding related pages. The algorithms can be implemented efficiently and are suitable for use within a large-scale Web service providing a related pages feature.

Our two algorithms can be extended to handle more than one input URL. In this case, the algorithms would compute pages that are related to all input URLs. We are currently exploring these extensions.

Acknowledgements

This work has benefited greatly from discussions we have had with Krishna Bharat, Andrei Broder, Puneet Kumar, and Hannes Marais. We are also indebted to Puneet for his work on the Connectivity Server. As some of the earliest users of the server, Puneet answered our many questions and implemented many improvements to the server in response to our suggestions. We are also grateful to Hannes Marais for developing WebL, a Web scripting language [13]. Using WebL, we were able to quickly develop a prototype user interface for our related pages server. Krishna Bharat, Allan Heydon, and Hannes Marais provided useful feedback on earlier drafts of this paper. Finally, we would like to thank all the participants in our user study.

References

[1] G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, Applications of a Web query language, in: Proc. of the Sixth International World Wide Web Conference, pp. 587–595, Santa Clara, CA, April 1997.

[2] K. Bharat, A.Z. Broder, M. Henzinger, P. Kumar and S. Venkatasubramanian, The Connectivity Server: fast access to linkage information on the Web, in: Proc. of the 7th International World Wide Web Conference, pp. 469–477, Brisbane, Qld., April 1998.

[3] K. Bharat and M. Henzinger, Improved algorithms for topic distillation in hyperlinked environments, in: Proc. of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pp. 104–111, 1998.

[4] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, in: Proc. of the 7th International World Wide Web Conference, pp. 107–117, Brisbane, Qld., April 1998.

[5] S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Experiments in topic distillation, in: ACM–SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web, 1998.

[6] S. Chakrabarti, B. Dom and P. Indyk, Enhanced hypertext categorization using hyperlinks, in: Proc. of the ACM SIGMOD International Conference on Management of Data, pp. 307–318, 1998.

[7] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson and J. Kleinberg, Automatic resource compilation by analyzing hyperlink structure and associated text, in: Proc. of the Sixth International World Wide Web Conference, pp. 65–74, Santa Clara, CA, April 1997.

[8] J. Carriere and R. Kazman, WebQuery: searching and visualizing the Web through connectivity, in: Proc. of the Sixth International World Wide Web Conference, pp. 701–711, Santa Clara, CA, April 1998.

[9] E. Garfield, Citation analysis as a tool in journal evaluation, Science 178 (1972).

[10] E. Garfield, Citation Indexing, ISI Press, Philadelphia, PA, 1979.

[11] M.M. Kessler, Bibliographic coupling between scientific papers, American Documentation 14 (1963).

[12] J. Kleinberg, Authoritative sources in a hyperlinked environment, in: Proc. of the 9th Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 668–677, January 1998.

[13] T. Kistler and H. Marais, WebL — a programming language for the Web, in: Proc. of the 7th International World Wide Web Conference, pp. 259–270, Brisbane, Qld., April 1998.

[14] Netscape Communications Corporation, 'What's Related FAQ' web page, http://home.netscape.com/escapes/related/faq.html

[15] Netscape Communications Corporation, 'What's Related' web page, http://home.netscape.com/escapes/related/

[16] J. Pitkow and P. Pirolli, Life, death, and lawfulness on the electronic frontier, in: Proc. of the Conference on Human Factors in Computing Systems (CHI 97), pp. 383–390, March 1997.

[17] P. Pirolli, J. Pitkow and R. Rao, Silk from a sow's ear: extracting usable structures from the Web, in: Proc. of the Conference on Human Factors in Computing Systems (CHI 96), pp. 118–125, April 1996.

[18] S.M. Ross, Introductory Statistics, McGraw-Hill, New York, 1996.

[19] H. Small, Co-citation in the scientific literature: a new measure of the relationship between two documents, Journal of the American Society for Information Science 24 (1973).

[20] E. Spertus, ParaSite: mining structural information on the Web, in: Proc. of the Sixth International World Wide Web Conference, pp. 587–595, Santa Clara, CA, April 1997.

[21] U. Shardanand and P. Maes, Social information filtering: algorithms for automating 'Word of Mouth', in: Proc. of the 1995 Conference on Human Factors in Computing Systems (CHI'95), 1995.

[22] L. Terveen and W. Hill, Finding and visualizing inter-site clan graphs, in: Proc. of the Conference on Human Factors in Computing Systems (CHI 98): Making the Impossible Possible, pp. 448–455, ACM Press, New York, April 18–23, 1998.

[23] L. Terveen and W. Hill, Evaluating emergent collaboration on the Web, in: Proc. of ACM CSCW'98 Conference on Computer-Supported Cooperative Work, pp. 355–362, Social Filtering, Social Influences, 1998.

Jeffrey Dean received his Ph.D. from the University of Washington in 1996, working on efficient implementation techniques for object-oriented languages under Professor Craig Chambers. He joined Compaq's Western Research Laboratory in 1996, where he worked on profiling techniques, performance monitoring hardware, compiler algorithms and information retrieval. In February 1999, he joined mySimon, Inc., where he is currently working on scalable comparison shopping systems for the World Wide Web. His current research interests include information retrieval and the development of scalable systems for the World Wide Web. He is two continents shy of his goal of playing basketball on every continent.

Monika R. Henzinger received her Ph.D. from Princeton University in 1993 under the supervision of Robert E. Tarjan. Afterwards, she was an assistant professor in Computer Science at Cornell University. She joined the Digital Systems Research Center (now Compaq Computer Corporation's Systems Research Center) in 1996. Her current research interests are information retrieval on the World Wide Web and algorithmic problems arising in this context.