Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg ACM-SIAM Symposium, 1998 Krishna...

Page 1:

Authoritative Sources in a Hyperlinked Environment

Jon M. Kleinberg, ACM-SIAM Symposium, 1998

Krishna Venkateswaran

Page 2:

Basic Idea

R is grown into a set S so that S contains many authoritative pages.

Include in S any page that is pointed to by a page in R.

R: Root set, containing t results.

S: Base set, generated from R by the algorithm.

S is used to determine the hubs and authorities.

Page 3:

Get a set of results for a query string from a text-based search engine.

Take the top t results and put them in a set R.

Initialize S to R. For every page in R:
◦ Add to S all the pages that the page points to.
◦ Add to S at most d of the pages that point to the page.

The resulting set is named S.

Result returned: the base set S, from which we compute the top authorities and hubs.
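The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_base_set`, the dictionary-based link tables, and the toy page names are assumptions made for the example, and "at most d pages" is implemented by simply taking the first d in-links listed.

```python
def build_base_set(root, out_links, in_links, d):
    """Grow the root set R into the base set S.

    root      -- iterable of pages (the top t search results, R)
    out_links -- dict: page -> list of pages it points to (hypothetical input)
    in_links  -- dict: page -> list of pages pointing to it (hypothetical input)
    d         -- maximum number of in-linking pages added per root page
    """
    base = set(root)  # S starts out equal to R
    for page in root:
        # Add every page that this root page points to.
        base.update(out_links.get(page, []))
        # Add at most d pages that point to this root page
        # (here: the first d listed, an arbitrary choice).
        base.update(in_links.get(page, [])[:d])
    return base


# Toy example: root set R = {"a"}, with three pages pointing to "a".
out_links = {"a": ["b", "c"], "x": ["a"], "y": ["a"], "z": ["a"]}
in_links = {"a": ["x", "y", "z"], "b": ["a"], "c": ["a"]}
S = build_base_set(["a"], out_links, in_links, d=2)
print(sorted(S))
```

With d=2, only two of the three pages pointing to "a" are admitted, which keeps a very popular root page from flooding S.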

Page 4:

Heuristics

To determine what pages to add to the set S.

Heuristic 1: Avoiding navigational links.
◦ Transverse links: links that are between pages with different domain names.
◦ Intrinsic links (navigational links): links that are between pages within a domain.
◦ Delete all intrinsic links.

Heuristic 2: Avoiding mass endorsements.
◦ Mass endorsements: a large number of pages in a domain pointing to a single page.
◦ Example: “This site is designed by …” and a link.
◦ Eliminate this by setting a parameter m and allowing only m pages from a single domain to point to a page.
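Both heuristics amount to a filter over the list of links. The sketch below is one possible reading, not the paper's implementation: the function names are hypothetical, a page's "domain" is assumed to be the host part of its URL as returned by `urllib.parse.urlparse`, and the example URLs are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    # Assumption: the "domain name" is the host part of the URL.
    return urlparse(url).netloc.lower()


def filter_links(links, m):
    """Apply both heuristics to a list of (source_url, target_url) links."""
    kept = []
    # Number of links already kept from a given domain to a given target.
    per_domain = defaultdict(int)
    for src, dst in links:
        src_dom = domain(src)
        # Heuristic 1: drop intrinsic links (same domain on both ends).
        if src_dom == domain(dst):
            continue
        # Heuristic 2: allow at most m pages of one domain to point to dst.
        if per_domain[(src_dom, dst)] >= m:
            continue
        per_domain[(src_dom, dst)] += 1
        kept.append((src, dst))
    return kept


links = [
    ("http://a.com/1", "http://a.com/2"),  # intrinsic, dropped
    ("http://a.com/1", "http://b.com/"),
    ("http://a.com/2", "http://b.com/"),
    ("http://a.com/3", "http://b.com/"),   # third link from a.com, dropped for m=2
    ("http://c.com/", "http://b.com/"),
]
filtered = filter_links(links, m=2)
print(filtered)
```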

Page 5:

Extracting authorities from the overall collection of pages through an analysis of the link structure of the graph G.

A good hub points to many good authorities, and a good authority is pointed to by many good hubs.

[Figure: hubs, authorities, and an unrelated page of large in-degree.]

Page 6:

Basic Idea

Each page p has a non-negative authority weight and a non-negative hub weight.

If p points to pages with large authority weight values then the page has a large hub weight value.

If p is pointed to by pages with large hub weight values then the page has a large authority weight value.

Pages with higher weights are better authorities and hubs.

Page 7:

I operation:
◦ Authority weight of a page = sum of the hub weights of all pages pointing to the page.

O operation:
◦ Hub weight of a page = sum of the authority weights of all pages this page points to.

I and O reinforce each other.

Normalization: The hub weights and the authority weights are each divided by a constant so that the sum of their squares equals 1.
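Written as formulas, the two operations and the normalization look as follows. Here x[p] denotes the authority weight and y[p] the hub weight of page p; E, the set of links (q, p) meaning "q points to p", is a name introduced only for this sketch.

```latex
\begin{align*}
\text{I operation:}\quad x[p] &\leftarrow \sum_{q \,:\, (q,p) \in E} y[q] \\
\text{O operation:}\quad y[p] &\leftarrow \sum_{q \,:\, (p,q) \in E} x[q] \\
\text{Normalization:}\quad \sum_{p} x[p]^2 &= 1, \qquad \sum_{p} y[p]^2 = 1
\end{align*}
```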

Page 8:

Contd...

Operation I: x[p] = sum of all y[q], over the pages q (q1, q2, q3) that point to page p.

Operation O: y[p] = sum of all x[q], over the pages q (q1, q2, q3) that page p points to.

Decision on when to stop the reinforcing process:

1) Apply the I and O operations alternately until a fixed point is reached.

2) Choose a parameter k and iterate the process only k times.

Page 9:

Given the set of pages in the form of a graph, set an integer value for the parameter k.

k is the number of iterations. Repeat the following process k times:

◦ Apply the I operation to every page and update its authority weight.

◦ Apply the O operation to every page and update its hub weight.

◦ Normalize both the authority weights and the hub weights.

Return the graph with the new authority weight and hub weight for each page.
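A minimal Python sketch of this loop is given below. It is an illustration under stated assumptions rather than the paper's code: the function names, the edge-list representation, the initial weight of 1 for every page, and the toy graph are all choices made for the example. The O operation uses the authority weights just produced by the I operation.

```python
import math


def normalize(weights):
    # Scale the weights so that the sum of their squares equals 1.
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def iterate(pages, links, k):
    """Run k rounds of the I and O operations.

    pages -- list of page identifiers
    links -- list of (source, target) pairs, "source points to target"
    Returns (x, y): authority weights and hub weights per page.
    """
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # I operation: authority weight = sum of hub weights of in-linking pages.
        new_x = {p: 0.0 for p in pages}
        for src, dst in links:
            new_x[dst] += y[src]
        # O operation: hub weight = sum of authority weights of linked-to pages.
        new_y = {p: 0.0 for p in pages}
        for src, dst in links:
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph: two hub-like pages pointing to two authority-like pages.
pages = ["h1", "h2", "a1", "a2"]
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, links, k=20)
print(max(x, key=x.get), max(y, key=y.get))
```

In this toy graph the page with the most in-links from hubs ("a1") ends up with the largest authority weight, and the page linking to both authorities ("h1") with the largest hub weight.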

Page 10:

Observations

The top authorities and hubs are determined by finding the pages with the top c values of x (authorities) and y (hubs) in the graph resulting from the Iterate algorithm.

The Iterate procedure converges to fixed points x* and y* as k increases arbitrarily.
◦ Proved using principal eigenvectors.

The Iterate algorithm results in a densely linked collection of pages, rich in relevant pages.
◦ The most relevant collection of pages is the densest graph.
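The eigenvector argument can be summarized in matrix form. In this sketch, A is the adjacency matrix of the graph (entry (p, q) is 1 if page p points to page q, and 0 otherwise) and z is the initial all-ones weight vector; the subscripted names x_k and y_k for the weights after k iterations are notation assumed here, and all proportionalities are up to normalization.

```latex
\begin{align*}
\text{I operation:}\quad x &\leftarrow A^{\mathsf T} y, \qquad
\text{O operation:}\quad y \leftarrow A x \\
x_k &\propto (A^{\mathsf T} A)^{k-1} A^{\mathsf T} z, \qquad
y_k \propto (A A^{\mathsf T})^{k} z \\
x^* &= \text{principal eigenvector of } A^{\mathsf T} A, \qquad
y^* = \text{principal eigenvector of } A A^{\mathsf T}
\end{align*}
```

Repeated multiplication by a symmetric matrix followed by normalization is the power method, which is why the weights settle on the principal eigenvectors as k grows.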

Page 11:

Results

(java) Authorities
.328 http://www.gamelan.com/ Gamelan
.251 http://java.sun.com/ JavaSoft Home Page
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html The Java Developer: HowDoI
.190 http://lightyear.ncsa.uiuc.edu/srp/java/javabooks.html The Java Book

("search engines") Authorities
.346 http://www.yahoo.com/ Yahoo!
.291 http://www.excite.com/ Excite
.231 http://www.lycos.com/ Lycos Home Page
.231 http://www.altavista.digital.com/ AltaVista: Main Page

(Gates) Authorities
.643 http://www.roadahead.com/ Bill Gates: The Road Ahead
.458 http://www.microsoft.com/ Welcome to Microsoft
.440 http://www.microsoft.com/corpinfo/bill-g.htm

It was observed that www.roadahead.com was the only one of these sites that was present in the root set R initially.

This supports the algorithm, since many of the authoritative pages do not contain the query string themselves.