Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based...
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
Introduction to Information Retrieval
http://informationretrieval.org
IIR 21: Link Analysis
Hinrich Schütze
Center for Information and Language Processing, University of Munich
2014-06-18
Schütze: Link analysis
Overview
1 Recap
2 Anchor text
3 Citation analysis
4 PageRank
5 HITS: Hubs & Authorities
Applications of clustering in IR
Application | What is clustered? | Benefit | Example
Search result clustering | search results | more effective information presentation to user |
Scatter-Gather | (subsets of) collection | alternative user interface: “search without typing” |
Collection clustering | collection | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com
Cluster-based retrieval | collection | higher efficiency: faster search | Salton 1971
K-means algorithm
K-means({x_1, ..., x_N}, K)   (x, s, μ denote vectors)
  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
  for k ← 1 to K
  do μ_k ← s_k
  while stopping criterion has not been met
  do for k ← 1 to K
     do ω_k ← {}
     for n ← 1 to N
     do j ← argmin_{j'} |μ_{j'} − x_n|
        ω_j ← ω_j ∪ {x_n}   (reassignment of vectors)
     for k ← 1 to K
     do μ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
  return {μ_1, ..., μ_K}
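The pseudocode above can be sketched in plain Python. This is a minimal illustration, not the lecture's reference implementation: the names `k_means`, `squared_distance` and `centroid` are made up for this example, the stopping criterion is assumed to be "centroids no longer change" (with a cap on iterations), and an empty cluster simply keeps its old centroid, which the slide does not specify.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def k_means(vectors, k, max_iterations=100, seed=0):
    """Return (centroids, clusters) for the given vectors."""
    vectors = [tuple(v) for v in vectors]
    rng = random.Random(seed)
    # SelectRandomSeeds: pick k distinct input vectors as initial centroids.
    centroids = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Reassignment: put every vector into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in vectors:
            j = min(range(k), key=lambda i: squared_distance(centroids[i], x))
            clusters[j].append(x)
        # Recomputation: each centroid becomes the mean of its cluster.
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [
            centroid(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Assumed stopping criterion: the centroids did not move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_clusters = k_means(points, k=2)
```

Using squared distance instead of distance does not change the argmin, so the assignment step matches the pseudocode.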
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better heuristics:
  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
  Use hierarchical clustering to find good seeds (next class)
  Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
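The last heuristic (several seed sets, keep the clustering with lowest RSS) can be sketched as follows. This is an illustrative sketch under stated assumptions: RSS is taken to be the sum of squared distances from each vector to its nearest centroid, and the helper names (`k_means_once`, `rss`, `best_of_restarts`) and the default `restarts=10` are choices made for this example only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means_once(vectors, k, rng, max_iterations=100):
    """One K-means run from random seeds; returns the final centroids."""
    centroids = rng.sample(vectors, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for x in vectors:
            j = min(range(k), key=lambda i: sq_dist(centroids[i], x))
            clusters[j].append(x)
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def rss(vectors, centroids):
    """Residual sum of squares: squared distance of each vector to its nearest centroid."""
    return sum(min(sq_dist(c, x) for c in centroids) for x in vectors)


def best_of_restarts(vectors, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the centroids with the lowest RSS."""
    vectors = [tuple(v) for v in vectors]
    rng = random.Random(seed)
    runs = [k_means_once(vectors, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda centroids: rss(vectors, centroids))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best = best_of_restarts(points, k=2)
best_rss = rss(points, best)
```

Restarts trade extra computation for robustness: each run costs the same, and only the cheapest-to-evaluate quantity (RSS) is compared at the end.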
Take-away today
Anchor text: What exactly are links on the web and why are they important for IR?
Citation analysis: the mathematical foundation of PageRank and link-based ranking
PageRank: the original algorithm that was used for link-based ranking on the web
Hubs & Authorities: an alternative link-based ranking algorithm
The web as a directed graph
[Figure: page d1 points to page d2 through a hyperlink; the hyperlink carries anchor text]
Assumption 1: A hyperlink is a quality signal.
  The hyperlink d1 → d2 indicates that d1’s author deems d2 high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
  We use anchor text somewhat loosely here for: the text surrounding the hyperlink.
  Example: “You can find cheap cars <a href=http://...>here</a>.”
  Anchor text: “You can find cheap cars here”
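To make the loose definition of anchor text concrete, here is a small sketch that extracts both the strict anchor text (the words inside the link) and a loose version that also includes a few preceding words. It uses Python's standard `html.parser`; the class name `AnchorTextExtractor`, the function `anchor_texts`, the `window` of five preceding words, and the `example.com` URL are all assumptions made for illustration, not something the slides prescribe.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collects all words of a page and remembers where each link starts and ends."""

    def __init__(self):
        super().__init__()
        self.words = []    # all words of the page, in order
        self.links = []    # (href, index of first link word, index after last link word)
        self._open = None  # (href, start index) of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._open = (href, len(self.words))

    def handle_data(self, data):
        self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None


def anchor_texts(html, window=5):
    """Return (href, strict anchor text, loose anchor text) for every link.

    The loose version prepends up to `window` words before the link
    (an illustrative choice of what "surrounding text" means).
    """
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    result = []
    for href, start, end in parser.links:
        strict = " ".join(parser.words[start:end])
        loose = " ".join(parser.words[max(0, start - window):end])
        result.append((href, strict, loose))
    return result


# Hypothetical URL; the slide elides the real one.
snippet = 'You can find cheap cars <a href="http://example.com/cars">here</a>.'
extracted = anchor_texts(snippet)
```

The strict text "here" says nothing about the target page, which is why the surrounding words are worth indexing as well.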
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often more effective than searching on [text of d2] only.
Example: Query IBM
  Matches IBM’s copyright page
  Matches many spam pages
  Matches IBM Wikipedia article
  May not match IBM home page!
  . . . if IBM home page is mostly graphics
Searching on [anchor text → d2] is better for the query IBM.
In this representation, the page with the most occurrences of IBM is www.ibm.com.
Anchor text containing IBM pointing to www.ibm.com
www.nytimes.com: “IBM acquires Webify”
www.slashdot.org: “New IBM optical chip”
www.stanford.edu: “IBM faculty award recipients”
www.ibm.com
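A toy comparison of the two representations, using the three anchor texts above. This is only a sketch: the page texts, the fourth link, and the `example` host names are invented for contrast, and counting raw term occurrences stands in for whatever scoring a real engine would use.

```python
from collections import Counter, defaultdict

# (source page, anchor text, target page).
# The first three come from the slide; the last one is made up for contrast.
links = [
    ("www.nytimes.com", "IBM acquires Webify", "www.ibm.com"),
    ("www.slashdot.org", "New IBM optical chip", "www.ibm.com"),
    ("www.stanford.edu", "IBM faculty award recipients", "www.ibm.com"),
    ("blog.example.org", "cheap laptops", "spam.example.com"),
]

# Hypothetical page texts: the home page is assumed to be mostly graphics.
page_text = {
    "www.ibm.com": "",
    "spam.example.com": "IBM IBM cheap laptops",
}


def anchor_representation(links):
    """Map each target page to the tokens of all anchor texts pointing at it."""
    representation = defaultdict(list)
    for _source, anchor, target in links:
        representation[target].extend(anchor.lower().split())
    return representation


def term_count(tokens, term):
    """Number of occurrences of `term` in a token list (case-insensitive)."""
    return Counter(token.lower() for token in tokens)[term.lower()]


anchors = anchor_representation(links)
text_counts = {page: term_count(text.split(), "IBM") for page, text in page_text.items()}
anchor_counts = {page: term_count(anchors[page], "IBM") for page in page_text}
best_by_text = max(text_counts, key=text_counts.get)
best_by_anchor = max(anchor_counts, key=anchor_counts.get)
```

With page text only, the made-up spam page wins for the query IBM; with the anchor-text representation, www.ibm.com does.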
Indexing anchor text
Thus: Anchor text is often a better description of a page’s content than the page itself.
Anchor text can be weighted more highly than document text. (based on Assumptions 1 & 2)
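One simple way to weight anchor text more highly is a weighted sum of term frequencies over the two "zones" of a page. The sketch below assumes this linear form; the function name `weighted_tf` and the weights `w_text=1.0` and `w_anchor=3.0` are arbitrary illustrative values, not figures from the lecture.

```python
def weighted_tf(term, text, anchor_texts, w_text=1.0, w_anchor=3.0):
    """Weighted term frequency over page text and incoming anchor texts.

    The weights are hypothetical; in practice they would be tuned.
    """
    term = term.lower()
    tf_text = text.lower().split().count(term)
    tf_anchor = sum(anchor.lower().split().count(term) for anchor in anchor_texts)
    return w_text * tf_text + w_anchor * tf_anchor


# A page with no matching text but three incoming anchors mentioning the term ...
home_page_score = weighted_tf(
    "IBM",
    "",
    ["IBM acquires Webify", "New IBM optical chip", "IBM faculty award recipients"],
)
# ... versus a made-up page that repeats the term in its own text only.
spam_page_score = weighted_tf("IBM", "IBM IBM cheap laptops", [])
```

Because anchor text is written by other authors, giving it a higher weight makes the score harder to inflate by editing one's own page, although it remains open to the manipulation discussed under Google bombs.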
Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal – the author of the link thinks that the linked-to page is high-quality.
Assumption 2: The anchor text describes the content of the linked-to page.
Is assumption 1 true in general?
Is assumption 2 true in general?
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.
Google introduced a new weighting function in 2007 that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo
  Coordinated link creation by those who dislike the Church of Scientology
Defused Google bombs: [dumb motherf....], [who is a failure?], [evil empire]
Origins of PageRank: Citation analysis (1)
Citation analysis: analysis of citations in the scientificliterature
Example citation: “Miller (2001) has shown that physicalactivity alters the metabolism of estrogens.”
We can view “Miller (2001)” as a hyperlink linking twoscientific articles.
One application of these “hyperlinks” in the scientificliterature:
Measure the similarity of two articles by the overlap of otherarticles citing them.
Schutze: Link analysis 16 / 80
![Page 54: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/54.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
Origins of PageRank: Citation analysis (1)
Citation analysis: analysis of citations in the scientificliterature
Example citation: “Miller (2001) has shown that physicalactivity alters the metabolism of estrogens.”
We can view “Miller (2001)” as a hyperlink linking twoscientific articles.
One application of these “hyperlinks” in the scientificliterature:
Measure the similarity of two articles by the overlap of otherarticles citing them.This is called cocitation similarity.
Schutze: Link analysis 16 / 80
![Page 55: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/55.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
Origins of PageRank: Citation analysis (1)
Citation analysis: analysis of citations in the scientific literature
Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”
We can view “Miller (2001)” as a hyperlink linking two scientific articles.
One application of these “hyperlinks” in the scientific literature:
Measure the similarity of two articles by the overlap of other articles citing them.
This is called cocitation similarity.
Cocitation similarity on the web: Google’s “related:” operator, e.g. [related:www.ford.com]
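Cocitation similarity can be sketched in a few lines. The citation data and the function names below are hypothetical, and the Jaccard normalization is one possible choice of overlap measure that the slides do not prescribe.

```python
# Hypothetical citation data: each key cites the articles in its set.
cites = {
    "A": {"X", "Y"},
    "B": {"X", "Y"},
    "C": {"X", "Z"},
    "D": {"Z"},
}

def citers(article, cites):
    """Return the set of articles that cite `article`."""
    return {src for src, targets in cites.items() if article in targets}

def cocitation_overlap(a, b, cites):
    """Number of articles that cite both a and b."""
    return len(citers(a, cites) & citers(b, cites))

def cocitation_jaccard(a, b, cites):
    """Overlap normalized by the number of articles citing a or b (assumed normalization)."""
    ca, cb = citers(a, cites), citers(b, cites)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

print(cocitation_overlap("X", "Y", cites))  # X and Y share the citers A and B
print(cocitation_overlap("Y", "Z", cites))  # Y and Z share no citers
```

In this toy data, X and Y are cited together by two articles, so they come out as more similar to each other than either is to Z.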
Origins of PageRank: Citation analysis (2)
Another application: Citation frequency can be used to measure the impact of a scientific article.
Simplest measure: Each citation gets one vote.
On the web: citation frequency = inlink count
However: A high inlink count does not necessarily mean high quality, mainly because of link spam.
Better measure: weighted citation frequency or citation rank
A citation’s vote is weighted according to the citation impact of the citing article.
Circular? No: can be formalized in a well-defined way.
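As a minimal sketch of the "one citation, one vote" measure, the snippet below counts inlinks in a made-up link graph. The page names and the `inlink_counts` helper are hypothetical; self-links and repeated links are ignored here as an assumption.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "spam1": ["s"],
    "spam2": ["s"],
    "spam3": ["s"],
}

def inlink_counts(links):
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter()
    for src, targets in links.items():
        for target in set(targets):
            if target != src:  # assumption: a page cannot vote for itself
                counts[target] += 1
    return counts

counts = inlink_counts(links)
print(counts.most_common())
```

The page "s" gets the highest count only because three otherwise unlinked pages point at it, which is the link-spam weakness that weighting the votes is meant to address.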
Origins of PageRank: Citation analysis (3)
Better measure: weighted citation frequency or citation rank
This is basically PageRank.
PageRank was invented in the context of citation analysis by Pinski and Narin in the 1970s.
Citation analysis is a big deal: The budget and salary of this lecturer are / will be determined by the impact of his publications!
Origins of PageRank: Summary
We can use the same formal representation for
citations in the scientific literature
hyperlinks on the web
Appropriately weighted citation frequency is an excellent measure of quality, both for web pages and for scientific publications.
Next: PageRank algorithm for computing weighted citation frequency on the web
Model behind PageRank: Random walk
Imagine a web surfer doing a random walk on the web
Start at a random page
At each step, go out of the current page along one of the links on that page, equiprobably
In the steady state, each page has a long-term visit rate.
This long-term visit rate is the page’s PageRank.
PageRank = long-term visit rate = steady state probability
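The random surfer can be simulated directly. The sketch below uses a hypothetical three-page graph without dead ends and a fixed random seed; `simulate_visit_rates` is an illustrative name, not part of the lecture.

```python
import random

def simulate_visit_rates(links, steps, seed=0):
    """Estimate long-term visit rates by following random outlinks."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)               # start at a random page
    for _ in range(steps):
        current = rng.choice(links[current])  # pick an outlink equiprobably
        visits[current] += 1
    return {page: visits[page] / steps for page in pages}

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rates = simulate_visit_rates(links, steps=100_000)
print(rates)
```

For this small graph the steady state can be worked out by hand as 0.4 for a, 0.2 for b, and 0.4 for c, and the simulated rates should land close to those values.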
Formalization of random walk: Markov chains
A Markov chain consists of N states, plus an N × N transition probability matrix P.
state = page
At each step, we are on exactly one of the pages.
For 1 ≤ i, j ≤ N, the matrix entry $P_{ij}$ tells us the probability of j being the next page, given we are currently on page i.
Clearly, for all i, $\sum_{j=1}^{N} P_{ij} = 1$.
[Figure: an arrow from page $d_i$ to page $d_j$, labeled $P_{ij}$]
Example web graph
[Figure: example web graph on seven pages d0 to d6. Anchor-text labels shown on the links: car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, jaguar, lion, cheetah, speed.]
Link matrix for example
     d0  d1  d2  d3  d4  d5  d6
d0    0   0   1   0   0   0   0
d1    0   1   1   0   0   0   0
d2    1   0   1   1   0   0   0
d3    0   0   0   1   1   0   0
d4    0   0   0   0   0   0   1
d5    0   0   0   0   0   1   1
d6    0   0   0   1   1   0   1
Transition probability matrix P for example
      d0    d1    d2    d3    d4    d5    d6
d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33
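The transition probability matrix follows from the link matrix by dividing each row by its number of outlinks. A minimal sketch, with the link matrix of the example typed in by hand and `transition_matrix` as an illustrative helper name:

```python
# Link matrix of the example (row = source page d0..d6, column = target page).
LINK = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]

def transition_matrix(link):
    """Row-normalize a 0/1 link matrix so that every row sums to 1."""
    P = []
    for row in link:
        outlinks = sum(row)
        if outlinks == 0:
            raise ValueError("page without outlinks; no row distribution defined")
        P.append([value / outlinks for value in row])
    return P

P = transition_matrix(LINK)
for row in P:
    print(" ".join(f"{p:.2f}" for p in row))
```

The printed values are rounded to two decimals, which is why a row such as d2 shows 0.33 three times although the exact entries are 1/3 each.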
Long-term visit rate
Recall: PageRank = long-term visit rate
Long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
Next: what properties must hold of the web graph for the long-term visit rate to be well defined?
The web graph must correspond to an ergodic Markov chain.
First a special case: The web graph must not contain dead ends.
Dead ends
The web is full of dead ends.
Random walk can get stuck in dead ends.
If there are dead ends, long-term visit rates are not well-defined (or nonsensical).
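One way to see the problem numerically: if a dead end is represented as an all-zero row (an assumption made for this sketch), each step of the walk loses probability mass. The three-page graph and the `step` helper are hypothetical.

```python
def step(x, P):
    """One step of the walk: new distribution x' = x P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical graph: page 0 -> page 1, page 1 -> page 2, page 2 has no outlinks.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],  # dead end: the row sums to 0 instead of 1
]

dead_ends = [i for i, row in enumerate(P) if sum(row) == 0]

x = [1 / 3, 1 / 3, 1 / 3]  # start at a random page
masses = []
for _ in range(3):
    x = step(x, P)
    masses.append(sum(x))  # total probability still "on the web"

print(dead_ends, masses)
```

After three steps no probability is left anywhere, so there is no meaningful steady-state distribution for this graph.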
Teleporting – to get us out of dead ends
At a dead end, jump to a random web page with prob. 1/N.
At a non-dead end, with probability 10%, jump to a randomweb page (to each with a probability of 0.1/N).
With remaining probability (90%), go out on a randomhyperlink.
For example, if the page has 4 outgoing links: randomlychoose one with probability (1-0.10)/4=0.225
10% is a parameter, the teleportation rate.
Schutze: Link analysis 28 / 80
![Page 107: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/107.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
Teleporting: to get us out of dead ends
At a dead end, jump to a random web page (to each with a probability of 1/N).
At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N).
With remaining probability (90%), go out on a random hyperlink.
For example, if the page has 4 outgoing links: each link is followed with probability (1-0.10)/4 = 0.225.
10% is a parameter, the teleportation rate.
Note: “jumping” from a dead end is independent of the teleportation rate.
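The rules above can be sketched as a small routine that builds the transition matrix. This is only an illustration under stated assumptions: the function name `teleport_matrix`, the 0/1 adjacency-list input, and the 5-page graph are hypothetical and do not come from the slides. Note that the full transition probability to a linked page is the link-following share (0.225 in the 4-link example) plus the teleport share 0.1/N.

```python
def teleport_matrix(adjacency, alpha=0.10):
    """Build the transition matrix of the random walk with teleporting.

    adjacency[i][j] is 1 if page i links to page j, else 0 (assumed input format).
    alpha is the teleportation rate (10% on the slide).
    """
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Dead end: jump to any page uniformly, independent of alpha.
            matrix.append([1.0 / n] * n)
        else:
            # Teleport share alpha/n plus link-following share (1 - alpha)/out_degree.
            matrix.append(
                [alpha / n + (1.0 - alpha) * link / out_degree for link in row]
            )
    return matrix


# Hypothetical 5-page graph: page 0 has 4 outgoing links, page 4 is a dead end.
links = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
P = teleport_matrix(links)
for row in P:
    print([round(p, 3) for p in row])
```

Every row of the result sums to 1, so the matrix describes a valid random walk.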
Result of teleporting
With teleporting, we cannot get stuck in a dead end.
But even without dead ends, a graph may not have well-defined long-term visit rates.
More generally, we require that the Markov chain be ergodic.
Ergodic Markov chains
A Markov chain is ergodic iff it is irreducible and aperiodic.
Irreducibility. Roughly: there is a path from any page to any other page.
Aperiodicity. Roughly: The pages cannot be partitioned such that the random walker visits the partitions sequentially.
A non-ergodic Markov chain:
[Figure: state diagram with two transitions, each labeled 1.0]
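A minimal sketch of why periodicity is a problem, assuming the figure depicts two pages that each link only to the other (so both transitions have probability 1.0). The helper `step` and the matrix `P_periodic` are illustrative names, not part of the slides.

```python
def step(x, P):
    """One step of the walk: multiply the row vector x by the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Assumed two-state chain: from page 1 always go to page 2, and vice versa.
P_periodic = [[0.0, 1.0],
              [1.0, 0.0]]

x = [1.0, 0.0]          # start on page 1 with certainty
history = [x]
for _ in range(4):
    x = step(x, P_periodic)
    history.append(x)

for t, dist in enumerate(history):
    print(t, dist)
```

The distribution flips between (1, 0) and (0, 1) forever instead of settling, so where the walker is depends on the starting page and on whether the step count is even or odd.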
Ergodic Markov chains
Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
This is the steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate.
It doesn’t matter where we start.
Teleporting makes the web graph ergodic.
⇒ Web-graph+teleporting has a steady-state probability distribution.
⇒ Each page in the web-graph+teleporting has a PageRank.
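The claim "it doesn’t matter where we start" can be checked numerically on a toy chain. The sketch below assumes a hypothetical graph of two pages that link only to each other, with a teleportation rate of 0.1; the names `step`, `iterate`, and `P_teleport` are illustrative.

```python
def step(x, P):
    """One step of the walk: multiply the row vector x by the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def iterate(x, P, steps):
    """Apply `steps` transitions to the starting distribution x."""
    for _ in range(steps):
        x = step(x, P)
    return x


# Two pages linking only to each other, teleportation rate 0.1 (assumed):
# every entry gets 0.1/2 = 0.05, and the single outgoing link gets 0.9 more.
P_teleport = [[0.05, 0.95],
              [0.95, 0.05]]

from_page_1 = iterate([1.0, 0.0], P_teleport, 300)
from_page_2 = iterate([0.0, 1.0], P_teleport, 300)
print(from_page_1, from_page_2)
```

Both starting points end up at essentially the same distribution, with each page visited about half of the time.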
Where we are
We now know what to do to make sure we have a well-defined PageRank for each page.
Next: how to compute PageRank
Formalization of “visit”: Probability vector
A probability (row) vector $\vec{x} = (x_1, \ldots, x_N)$ tells us where the random walk is at any point.
Example:
page: 1 2 3 ... i ... N-2 N-1 N
x: ( 0 0 0 ... 1 ... 0 0 0 )
More generally: the random walk is on page i with probability $x_i$.
Example:
page: 1 2 3 ... i ... N-2 N-1 N
x: ( 0.05 0.01 0.0 ... 0.2 ... 0.01 0.05 0.03 )
$\sum_{i=1}^{N} x_i = 1$
Change in probability vector
If the probability vector is $\vec{x} = (x_1, \ldots, x_N)$ at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from $\vec{x}$, our next state is distributed as $\vec{x}P$.
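Written out per component, the product of the row vector with P is the standard vector-matrix product; the index j for the target page is notation introduced here for illustration:

```latex
(\vec{x}P)_j \;=\; \sum_{i=1}^{N} x_i \, P_{ij}, \qquad j = 1, \ldots, N
```

In words: the probability of being on page j at the next step sums, over all pages i, the probability of being on i now times the probability of moving from i to j. If the walk is on page i with certainty, the sum reduces to row i of P.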
Steady state in vector notation
The steady state in vector notation is simply a vector $\vec{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)$ of probabilities.
(We use $\vec{\pi}$ to distinguish it from the notation for the probability vector $\vec{x}$.)
$\pi_i$ is the long-term visit rate (or PageRank) of page i.
So we can think of PageRank as a very long vector with one entry per page.
Steady-state distribution: Example
What is the PageRank / steady state in this example?
[Figure: two-page graph with pages d1 and d2; edge labels 0.75, 0.25, 0.25, 0.75]
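Steady-state distribution: Example
Transition probabilities: $P_{11} = 0.25$, $P_{12} = 0.75$, $P_{21} = 0.25$, $P_{22} = 0.75$

| step | $x_1 = P_t(d_1)$ | $x_2 = P_t(d_2)$ | $P_{t+1}(d_1)$ | $P_{t+1}(d_2)$ |
|---|---|---|---|---|
| $t_0$ | 0.25 | 0.75 | 0.25 | 0.75 |
| $t_1$ | 0.25 | 0.75 | (convergence) | |

PageRank vector = $\vec{\pi} = (\pi_1, \pi_2) = (0.25, 0.75)$
$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$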
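The two update equations can be checked directly for this example. The sketch below is only an illustration; the helper name `step` and the variable names are assumptions, while the numbers come from the example.

```python
def step(x, P):
    """Apply the update P_t(d_j) = sum_i P_{t-1}(d_i) * P_ij for every page j."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Transition probabilities of the two-page example.
P = [[0.25, 0.75],
     [0.25, 0.75]]

pi = [0.25, 0.75]
after_one_step = step(pi, P)      # starting at the steady state
from_d1 = step([1.0, 0.0], P)     # starting on d1 with certainty
print(after_one_step, from_d1)
```

Starting from (0.25, 0.75) leaves the distribution unchanged, which is what "steady state" means. Because both rows of this particular P are identical, any other start reaches the same vector after a single step.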
How do we compute the steady state vector?
In other words: how do we compute PageRank?
Recall: $\vec{\pi} = (\pi_1, \pi_2, \ldots, \pi_N)$ is the PageRank vector, the vector of steady-state probabilities ...
... and if the distribution in this step is $\vec{x}$, then the distribution in the next step is $\vec{x}P$.
But $\vec{\pi}$ is the steady state!
So: $\vec{\pi} = \vec{\pi}P$
Solving this matrix equation gives us $\vec{\pi}$.
$\vec{\pi}$ is the principal left eigenvector for $P$ ...
... that is, $\vec{\pi}$ is the left eigenvector with the largest eigenvalue.
All transition probability matrices have largest eigenvalue 1.
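A short sketch of why 1 is always an eigenvalue, and how that ties back to the steady-state equation. The all-ones column vector $\mathbf{1}$ is notation introduced here for illustration; it does not appear on the slides.

```latex
% Every row of a transition probability matrix sums to 1:
\sum_{j=1}^{N} P_{ij} = 1 \quad \text{for all } i
\qquad\Longrightarrow\qquad
P\,\mathbf{1} = 1 \cdot \mathbf{1}
% so 1 is an eigenvalue of P, with the all-ones vector as a right eigenvector.
% The steady-state equation is the matching left-eigenvector equation,
% normalized so that the entries form a probability distribution:
\vec{\pi}\,P = 1 \cdot \vec{\pi},
\qquad
\sum_{i=1}^{N} \pi_i = 1
```

Because each row sums to 1, no eigenvalue of $P$ can exceed 1 in absolute value, which is why the eigenvalue 1 is the largest one.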
One way of computing the PageRank $\vec{\pi}$
Start with any distribution $\vec{x}$, e.g., the uniform distribution.
After one step, we’re at $\vec{x}P$.
After two steps, we’re at $\vec{x}P^2$.
After k steps, we’re at $\vec{x}P^k$.
Algorithm: multiply $\vec{x}$ by increasing powers of $P$ until convergence.
This is called the power method.
Recall: regardless of where we start, we eventually reach the steady state $\vec{\pi}$.
Thus: the power method will eventually (in asymptotia) reach the steady state, whatever starting distribution we pick.
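The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the function name `power_method`, the L1 stopping rule, and the values of `tol` and `max_steps` are assumptions made for this sketch. The matrix is the two-page example used in this lecture.

```python
def power_method(P, x=None, tol=1e-10, max_steps=10_000):
    """Repeat x <- xP until two successive distributions are within tol (L1 distance).

    P is a row-stochastic matrix given as a list of rows.
    x is the starting distribution; defaults to uniform (an illustrative choice).
    Returns (approximate steady state, number of steps taken).
    """
    n = len(P)
    if x is None:
        x = [1.0 / n] * n
    for step in range(1, max_steps + 1):
        # Row vector times matrix: entry j sums over all source pages i.
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt, step
        x = nxt
    raise RuntimeError("power method did not converge within max_steps")


P = [[0.1, 0.9],
     [0.3, 0.7]]
pi, steps = power_method(P, x=[0.0, 1.0])
print([round(p, 4) for p in pi])  # approximately (0.25, 0.75)
```

Starting from the uniform distribution instead of `[0.0, 1.0]` should lead to the same vector, which is the point of the "regardless of where we start" remark.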
Power method: Example
What is the PageRank / steady state in this example?
[Figure: transition graph with two pages, d1 and d2. Edge probabilities: d1→d1 0.1, d1→d2 0.9, d2→d1 0.3, d2→d2 0.7.]
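Computing PageRank: Power method

Transition probabilities: $P_{11} = 0.1$, $P_{12} = 0.9$, $P_{21} = 0.3$, $P_{22} = 0.7$

| $t$ | $x_1 = P_t(d_1)$ | $x_2 = P_t(d_2)$ | $P_{t+1}(d_1)$ | $P_{t+1}(d_2)$ | |
|---|---|---|---|---|---|
| $t_0$ | 0 | 1 | 0.3 | 0.7 | $= \vec{x}P$ |
| $t_1$ | 0.3 | 0.7 | 0.24 | 0.76 | $= \vec{x}P^2$ |
| $t_2$ | 0.24 | 0.76 | 0.252 | 0.748 | $= \vec{x}P^3$ |
| $t_3$ | 0.252 | 0.748 | 0.2496 | 0.7504 | $= \vec{x}P^4$ |
| ... | | | ... | | |
| $t_\infty$ | 0.25 | 0.75 | 0.25 | 0.75 | $= \vec{x}P^\infty$ |

PageRank vector $= \vec{\pi} = (\pi_1, \pi_2) = (0.25, 0.75)$

$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$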
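As a cross-check on the iteration, the steady state of this two-page example can also be found by solving $\vec{\pi} = \vec{\pi}P$ directly, together with the normalization $\pi_1 + \pi_2 = 1$. The algebra below is a sketch added for illustration and is not part of the slides.

```latex
\begin{aligned}
\pi_1 &= 0.1\,\pi_1 + 0.3\,\pi_2
  && \text{(first component of } \vec{\pi} = \vec{\pi}P\text{)} \\
0.9\,\pi_1 &= 0.3\,\pi_2
  \;\Longrightarrow\; \pi_2 = 3\,\pi_1 \\
\pi_1 + \pi_2 &= 1
  \;\Longrightarrow\; 4\,\pi_1 = 1 \\
\vec{\pi} &= (0.25,\; 0.75)
\end{aligned}
```

Solving by hand like this only works for tiny examples; for a web-scale matrix the power method is the practical route.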
Power method: Example
What is the PageRank / steady state in this example?
[Figure: transition graph with two pages, d1 and d2. Edge probabilities: d1→d1 0.1, d1→d2 0.9, d2→d1 0.3, d2→d2 0.7.]
The steady state distribution (= the PageRanks) in this example is 0.25 for d1 and 0.75 for d2.
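Exercise: Compute PageRank using the power method
[Figure: transition graph with two pages, d1 and d2. Edge probabilities: d1→d1 0.7, d1→d2 0.3, d2→d1 0.2, d2→d2 0.8.]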
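Solution

Transition probabilities: $P_{11} = 0.7$, $P_{12} = 0.3$, $P_{21} = 0.2$, $P_{22} = 0.8$

| $t$ | $x_1 = P_t(d_1)$ | $x_2 = P_t(d_2)$ | $P_{t+1}(d_1)$ | $P_{t+1}(d_2)$ |
|---|---|---|---|---|
| $t_0$ | 0 | 1 | 0.2 | 0.8 |
| $t_1$ | 0.2 | 0.8 | 0.3 | 0.7 |
| $t_2$ | 0.3 | 0.7 | 0.35 | 0.65 |
| $t_3$ | 0.35 | 0.65 | 0.375 | 0.625 |
| ... | | | | |
| $t_\infty$ | 0.4 | 0.6 | 0.4 | 0.6 |

PageRank vector $= \vec{\pi} = (\pi_1, \pi_2) = (0.4, 0.6)$

$P_t(d_1) = P_{t-1}(d_1) \cdot P_{11} + P_{t-1}(d_2) \cdot P_{21}$
$P_t(d_2) = P_{t-1}(d_1) \cdot P_{12} + P_{t-1}(d_2) \cdot P_{22}$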
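For a two-page chain the steady state has a closed form, and the distance to it shrinks by a constant factor at every step. The sketch below checks the solution table against both facts. The function names are hypothetical, and the closed form follows from substituting $P_t(d_2) = 1 - P_t(d_1)$ into the recurrence above; none of this code is from the slides.

```python
def two_state_steady_state(p12, p21):
    """Steady state (pi1, pi2) of a 2-state chain with off-diagonal entries p12, p21."""
    total = p12 + p21
    if total <= 0:
        raise ValueError("p12 + p21 must be positive for a unique steady state")
    return (p21 / total, p12 / total)


def prob_d1_after(t, start_d1, p12, p21):
    """P_t(d1) after t steps, starting from P_0(d1) = start_d1.

    Each step multiplies the gap to pi1 by the second eigenvalue 1 - p12 - p21.
    """
    pi1, _ = two_state_steady_state(p12, p21)
    second_eigenvalue = 1.0 - p12 - p21
    return pi1 + (start_d1 - pi1) * second_eigenvalue ** t


# Exercise matrix: P12 = 0.3, P21 = 0.2, starting from x = (0, 1).
print(two_state_steady_state(0.3, 0.2))
print([round(prob_d1_after(t, 0.0, 0.3, 0.2), 4) for t in range(1, 5)])
```

In the exercise the gap to 0.4 halves at every step (0.2, 0.1, 0.05, 0.025), so the power method converges geometrically here.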
PageRank summary
Preprocessing
Given graph of links, build matrix $P$
Apply teleportation
From modified matrix, compute $\vec{\pi}$
$\pi_i$ is the PageRank of page $i$.
Query processing
Retrieve pages satisfying the query
Rank them by their PageRank
Return reranked list to the user
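The two phases above can be sketched end to end in plain Python. Everything named here is an assumption made for illustration: the four-page link graph is hypothetical, the teleport rate `alpha = 0.1` is an example value, and dead ends are handled by jumping uniformly to any page, following the usual random-surfer construction. Query matching is reduced to a given list of page ids.

```python
def build_transition_matrix(links, n, alpha=0.1):
    """Preprocessing: link graph -> transition matrix with teleportation.

    links maps a page id (0..n-1) to the list of pages it links to.
    alpha is the teleport probability (illustrative value, an assumption).
    """
    P = []
    for i in range(n):
        out = set(links.get(i, []))
        if not out:
            # Dead end: jump to a uniformly random page.
            P.append([1.0 / n] * n)
        else:
            # Teleport with probability alpha, otherwise follow a random outlink.
            row = [alpha / n] * n
            for j in out:
                row[j] += (1.0 - alpha) / len(out)
            P.append(row)
    return P


def pagerank(P, tol=1e-12, max_steps=10_000):
    """Preprocessing: power method on the modified matrix, uniform start."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    raise RuntimeError("power method did not converge within max_steps")


def rank_results(matching_pages, pi):
    """Query processing: order the matching pages by descending PageRank."""
    return sorted(matching_pages, key=lambda page: pi[page], reverse=True)


links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # hypothetical 4-page web
P = build_transition_matrix(links, n=4)
pi = pagerank(P)
print(rank_results([3, 1, 0], pi))  # pages matching some query, reranked
```

Note that $\vec{\pi}$ is computed once, offline; at query time only the sort is needed.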
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Schutze: Link analysis 46 / 80
![Page 212: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/212.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Consider the query [video service]
Schutze: Link analysis 46 / 80
![Page 213: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/213.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Consider the query [video service]The Yahoo home page (i) has a very high PageRank and (ii)contains both video and service.
Schutze: Link analysis 46 / 80
![Page 214: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/214.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Consider the query [video service]The Yahoo home page (i) has a very high PageRank and (ii)contains both video and service.If we rank all Boolean hits according to PageRank, then theYahoo home page would be top-ranked.
Schutze: Link analysis 46 / 80
![Page 215: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/215.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Consider the query [video service]The Yahoo home page (i) has a very high PageRank and (ii)contains both video and service.If we rank all Boolean hits according to PageRank, then theYahoo home page would be top-ranked.Clearly not desirable
Schutze: Link analysis 46 / 80
![Page 216: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/216.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.Examples of nonrandom surfing: back button, short vs. longpaths, bookmarks, directories – and search!→ Markov model is not a good model of surfing.But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide)produces bad results for many pages.
Consider the query [video service]The Yahoo home page (i) has a very high PageRank and (ii)contains both video and service.If we rank all Boolean hits according to PageRank, then theYahoo home page would be top-ranked.Clearly not desirable
In practice: rank according to weighted combination of rawtext match, anchor text match, PageRank & other factors
Schutze: Link analysis 46 / 80
![Page 217: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/217.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
PageRank issues
Real surfers are not random surfers.
Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search!
→ Markov model is not a good model of surfing.
But it’s good enough as a model for our purposes.
Simple PageRank ranking (as described on previous slide) produces bad results for many pages.
Consider the query [video service]
The Yahoo home page (i) has a very high PageRank and (ii) contains both video and service.
If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked.
Clearly not desirable
In practice: rank according to weighted combination of raw text match, anchor text match, PageRank & other factors
→ see lecture on Learning to Rank

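
The [video service] problem can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the three documents, their feature scores, and the weights are invented for illustration, and a real system would learn the weights rather than set them by hand. It contrasts ranking Boolean hits by PageRank alone with ranking by a weighted combination of features.

```python
# Hypothetical Boolean hits for the query [video service].
# All feature values below are invented for illustration only.
hits = {
    "yahoo_home": {"text": 0.05, "anchor": 0.02, "pagerank": 0.90},
    "video_site": {"text": 0.80, "anchor": 0.70, "pagerank": 0.30},
    "small_blog": {"text": 0.60, "anchor": 0.10, "pagerank": 0.05},
}

# Hypothetical hand-set weights; learning to rank would fit these from data.
WEIGHTS = {"text": 0.5, "anchor": 0.3, "pagerank": 0.2}


def rank_by_pagerank(hits):
    """Simple scheme: order the Boolean hits by PageRank alone."""
    return sorted(hits, key=lambda doc: hits[doc]["pagerank"], reverse=True)


def combined_score(features, weights=WEIGHTS):
    """Weighted sum of text match, anchor text match and PageRank."""
    return sum(weights[name] * features[name] for name in weights)


def rank_by_combination(hits):
    """Order the hits by the weighted combination of all features."""
    return sorted(hits, key=lambda doc: combined_score(hits[doc]), reverse=True)


simple = rank_by_pagerank(hits)
combined = rank_by_combination(hits)
print("PageRank only:", simple)
print("Combination:  ", combined)
```

With these invented numbers, the high-PageRank home page comes first under the simple scheme and last under the combination, because its text and anchor text match is weak.
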
Example web graph
[Figure: web graph of seven pages d0–d6; the labels in the figure are the terms car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, jaguar, lion, cheetah, speed]

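Transition (probability) matrix

|    | d0   | d1   | d2   | d3   | d4   | d5   | d6   |
|----|------|------|------|------|------|------|------|
| d0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| d1 | 0.00 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| d2 | 0.33 | 0.00 | 0.33 | 0.33 | 0.00 | 0.00 | 0.00 |
| d3 | 0.00 | 0.00 | 0.00 | 0.50 | 0.50 | 0.00 | 0.00 |
| d4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| d5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 0.50 |
| d6 | 0.00 | 0.00 | 0.00 | 0.33 | 0.33 | 0.00 | 0.33 |
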
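
A minimal sketch of how such a matrix can be built from out-link lists. The `links` dictionary is read off the nonzero entries of the matrix above, which is an inference from the table and not a transcription of the figure. The function name `transition_matrix` is hypothetical. The sketch assumes every page has at least one out-link, which holds for this example.

```python
# Out-links per page, inferred from the nonzero entries of the matrix above.
links = {
    0: [2],
    1: [1, 2],
    2: [0, 2, 3],
    3: [3, 4],
    4: [6],
    5: [5, 6],
    6: [3, 4, 6],
}


def transition_matrix(links, n):
    """Row i spreads probability uniformly over the out-links of page i.

    Assumes every page has at least one out-link (no dead ends).
    """
    P = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            P[source][target] = 1.0 / len(targets)
    return P


P = transition_matrix(links, 7)
for row in P:
    print(" ".join(f"{p:.2f}" for p in row))
```

Each row sums to 1, so row i is the probability distribution over the surfer's next page when the current page is d_i.
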
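Transition matrix with teleporting

|    | d0   | d1   | d2   | d3   | d4   | d5   | d6   |
|----|------|------|------|------|------|------|------|
| d0 | 0.02 | 0.02 | 0.88 | 0.02 | 0.02 | 0.02 | 0.02 |
| d1 | 0.02 | 0.45 | 0.45 | 0.02 | 0.02 | 0.02 | 0.02 |
| d2 | 0.31 | 0.02 | 0.31 | 0.31 | 0.02 | 0.02 | 0.02 |
| d3 | 0.02 | 0.02 | 0.02 | 0.45 | 0.45 | 0.02 | 0.02 |
| d4 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.88 |
| d5 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.45 | 0.45 |
| d6 | 0.02 | 0.02 | 0.02 | 0.31 | 0.31 | 0.02 | 0.31 |
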
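
The entries above are consistent with a teleportation probability of 0.14: for example, 0.86 · 1.00 + 0.14/7 = 0.88. That value is inferred from the table and is not stated on this slide, so the sketch below treats `ALPHA = 0.14` as an assumption. The helper name `with_teleport` is hypothetical.

```python
ALPHA = 0.14  # assumed teleportation probability, inferred from the table

# Link-only transition matrix of the example (rows and columns d0..d6).
P = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1 / 2, 1 / 2, 0, 0, 0, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0, 0, 0],
    [0, 0, 0, 1 / 2, 1 / 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 1 / 3, 1 / 3, 0, 1 / 3],
]


def with_teleport(P, alpha):
    """Follow a link with probability 1 - alpha, else jump to a uniformly random page."""
    n = len(P)
    return [[(1 - alpha) * p + alpha / n for p in row] for row in P]


G = with_teleport(P, ALPHA)
for row in G:
    print(" ".join(f"{g:.2f}" for g in row))
```

After teleportation every entry is positive, so the surfer can reach any page from any page in a single step.
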
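Power method vectors $\vec{x}P^k$

|    | $\vec{x}$ | $\vec{x}P^1$ | $\vec{x}P^2$ | $\vec{x}P^3$ | $\vec{x}P^4$ | $\vec{x}P^5$ | $\vec{x}P^6$ | $\vec{x}P^7$ | $\vec{x}P^8$ | $\vec{x}P^9$ | $\vec{x}P^{10}$ | $\vec{x}P^{11}$ | $\vec{x}P^{12}$ | $\vec{x}P^{13}$ |
|----|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| d0 | 0.14 | 0.06 | 0.09 | 0.07 | 0.07 | 0.06 | 0.06 | 0.06 | 0.06 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
| d1 | 0.14 | 0.08 | 0.06 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| d2 | 0.14 | 0.25 | 0.18 | 0.17 | 0.15 | 0.14 | 0.13 | 0.12 | 0.12 | 0.12 | 0.12 | 0.11 | 0.11 | 0.11 |
| d3 | 0.14 | 0.16 | 0.23 | 0.24 | 0.24 | 0.24 | 0.24 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 | 0.25 |
| d4 | 0.14 | 0.12 | 0.16 | 0.19 | 0.19 | 0.20 | 0.21 | 0.21 | 0.21 | 0.21 | 0.21 | 0.21 | 0.21 | 0.21 |
| d5 | 0.14 | 0.08 | 0.06 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| d6 | 0.14 | 0.25 | 0.23 | 0.25 | 0.27 | 0.28 | 0.29 | 0.29 | 0.30 | 0.30 | 0.30 | 0.30 | 0.31 | 0.31 |
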
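
The iteration behind this table can be sketched as follows. The sketch starts from the uniform vector and repeatedly multiplies by the teleporting matrix until the vector stops changing. The teleportation probability `ALPHA = 0.14` is again an assumption inferred from the matrix entries. The helper names and the stopping tolerance are hypothetical choices.

```python
ALPHA = 0.14  # assumed teleportation probability, inferred from the matrix entries

# Link-only transition matrix of the example (rows and columns d0..d6).
P = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1 / 2, 1 / 2, 0, 0, 0, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0, 0, 0],
    [0, 0, 0, 1 / 2, 1 / 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 1 / 3, 1 / 3, 0, 1 / 3],
]


def with_teleport(P, alpha):
    """Mix the link matrix with a uniform jump taken with probability alpha."""
    n = len(P)
    return [[(1 - alpha) * p + alpha / n for p in row] for row in P]


def step(x, G):
    """One power-method step: the row vector x times the matrix G."""
    n = len(G)
    return [sum(x[i] * G[i][j] for i in range(n)) for j in range(n)]


def power_method(G, tol=1e-10, max_iter=1000):
    """Iterate x <- xG from the uniform vector until x stops changing."""
    x = [1.0 / len(G)] * len(G)
    for _ in range(max_iter):
        nxt = step(x, G)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


G = with_teleport(P, ALPHA)
first = step([1.0 / 7] * 7, G)
pagerank = power_method(G)
print([round(v, 2) for v in first])
print([round(v, 2) for v in pagerank])
```

Under these assumptions, the first step should agree with the second column of the table and the converged vector with its last column, up to rounding.
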
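Example web graph
[Figure: web graph of pages d0–d6 with the term labels car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, jaguar, lion, cheetah, speed, shown next to the PageRank of each page]

| page | PageRank |
|------|----------|
| d0   | 0.05     |
| d1   | 0.04     |
| d2   | 0.11     |
| d3   | 0.25     |
| d4   | 0.21     |
| d5   | 0.04     |
| d6   | 0.31     |

PageRank(d2) < PageRank(d6): why?
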
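
One way to approach the question is to write out the steady-state equation for both pages. The worked equations below use the in-links read off the transition matrix and assume a teleportation probability of $\alpha = 0.14$ with $N = 7$ pages; both are inferences from the tables, not statements from the slides.

```latex
\begin{align*}
\pi_2 &= (1-\alpha)\left(\pi_0 + \tfrac{1}{2}\pi_1 + \tfrac{1}{3}\pi_2\right) + \tfrac{\alpha}{N}
       \approx 0.86\,(0.05 + 0.02 + 0.037) + 0.02 \approx 0.11 \\
\pi_6 &= (1-\alpha)\left(\pi_4 + \tfrac{1}{2}\pi_5 + \tfrac{1}{3}\pi_6\right) + \tfrac{\alpha}{N}
       \approx 0.86\,(0.21 + 0.02 + 0.103) + 0.02 \approx 0.31
\end{align*}
```

Both pages have three in-links, counting their self-links, but d6 receives all of the PageRank of d4 (0.21), whereas d2 is linked to only by the low-PageRank pages d0 and d1 and by itself.
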
How important is PageRank?
Frequent claim: PageRank is the most important component of web ranking.
The reality:
There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes . . .
Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking!
However, variants of a page’s PageRank are still an essential part of ranking.
Addressing link spam is difficult and crucial.

Outline
1 Recap
2 Anchor text
3 Citation analysis
4 PageRank
5 HITS: Hubs & Authorities
HITS – Hyperlink-Induced Topic Search
Premise: there are two different types of relevance on the web.
Relevance type 1: Hubs. A hub page is a good list of [links to pages answering the information need].
E.g., for query [chicago bulls]: Bob’s list of recommended resources on the Chicago Bulls sports team
Relevance type 2: Authorities. An authority page is a direct answer to the information need.
The home page of the Chicago Bulls sports team
By definition: Links to authority pages occur repeatedly on hub pages.
Most approaches to search (including PageRank ranking) don’t make the distinction between these two very different types of relevance.

Hubs and authorities: Definition
A good hub page for a topic links to many authority pages for that topic.
A good authority page for a topic is linked to by many hub pages for that topic.
Circular definition – we will turn this into an iterative computation.

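
A minimal sketch of how the circular definition can become an iteration. A page's authority score is recomputed from the hub scores of the pages linking to it, its hub score from the authority scores of the pages it links to, and both are rescaled each round. The toy graph, the page names, the fixed iteration count, and the choice of Euclidean normalization are assumptions made for illustration only.

```python
import math

# Hypothetical toy graph: page -> pages it links to.
links = {
    "hub1": ["auth1", "auth2", "auth3"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
    "auth1": [],
    "auth2": [],
    "auth3": [],
}


def normalize(scores):
    """Rescale scores to unit Euclidean length (leave all-zero scores alone)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Alternate authority and hub updates for a fixed number of rounds."""
    hub = {page: 1.0 for page in links}
    auth = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {page: 0.0 for page in links}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in links[page]) for page in links}
        hub, auth = normalize(hub), normalize(auth)
    return hub, auth


hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True)[:3])
print(sorted(hub, key=hub.get, reverse=True)[:3])
```

In this toy graph, the page linked to by all three hubs gets the highest authority score, and the page linking to all three authorities gets the highest hub score.
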
Example for hubs and authorities
[Figure: hub pages on the left linking to authority pages on the right]
hubs: www.bestfares.com, www.airlinesquality.com, blogs.usatoday.com/sky, aviationblog.dallasnews.com
authorities: www.aa.com, www.delta.com, www.united.com

How to compute hub and authority scores
Do a regular web search first
Call the search result the root set
Find all pages that are linked to or link to pages in the root set
Call this larger set the base set
Finally, compute hubs and authorities for the base set (which we’ll view as a small web graph)

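
The expansion from root set to base set can be sketched as below. The link graph, the page names, and the function name `base_set` are hypothetical. The sketch assumes the full link graph fits in a dictionary, which a web-scale system would replace with link-index lookups.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["r1"],        # links into the root set
    "r1": ["b", "r2"],  # root page
    "r2": [],           # root page
    "b": ["c"],         # linked to from the root set
    "c": [],            # two steps away, stays outside the base set
}

# Hypothetical search result for some query.
root_set = {"r1", "r2"}


def base_set(links, root_set):
    """Root set plus every page it links to and every page linking into it."""
    base = set(root_set)
    for source, targets in links.items():
        if source in root_set:
            base.update(targets)  # pages that root set pages link to
        if any(target in root_set for target in targets):
            base.add(source)      # pages that link to root set pages
    return base


base = base_set(links, root_set)
print(sorted(base))
```

Hub and authority scores would then be computed on the subgraph induced by `base`, treated as a small web graph.
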
Root set and base set (1)
[Figure, built up in four steps: the root set; nodes that root set nodes link to; nodes that link to root set nodes; the base set, which encloses the root set and both groups of neighboring nodes]

Root set and base set (2)
Root set typically has 200–1000 nodes.
Base set may have up to 5000 nodes.
Computation of base set, as shown on previous slide:
Follow outlinks by parsing the pages in the root set
Find d’s inlinks by searching for all pages containing a link to d
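The base set construction can be sketched in a few lines. This is only an illustration under stated assumptions: the link graph is assumed to be available as an in-memory dictionary `outlinks` (page to the set of pages it links to), and the names `build_base_set` and `toy_outlinks` are hypothetical. A real system would parse the fetched root pages for outlinks and query a link index for inlinks.

```python
def build_base_set(root_set, outlinks):
    """Return the root set plus all pages it links to or is linked from.

    `outlinks` maps a page to the set of pages it links to (an assumed
    in-memory stand-in for page parsing and a link index).
    """
    base = set(root_set)
    # Follow the outlinks of every page in the root set.
    for d in root_set:
        base.update(outlinks.get(d, ()))
    # Find inlinks: any page that contains a link to a root set page.
    for page, targets in outlinks.items():
        if any(t in root_set for t in targets):
            base.add(page)
    return base


# Hypothetical toy graph: r1 and r2 form the root set.
toy_outlinks = {
    "r1": {"x"},
    "r2": set(),
    "y": {"r2"},
    "z": {"x"},
}
base_set = build_base_set({"r1", "r2"}, toy_outlinks)
```

Page `z` stays outside the base set here because it neither links to a root page nor is linked from one.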
Hub and authority scores
Compute for each page d in the base set a hub score h(d) and an authority score a(d)
Initialization: for all d: h(d) = 1, a(d) = 1
Iteratively update all h(d), a(d)
After convergence:
Output pages with highest h scores as top hubs
Output pages with highest a scores as top authorities
So we output two ranked lists
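The final output step can be illustrated as follows. The score values below are made up for the example, and `top_k` is a hypothetical helper name; the point is only that the same set of pages yields two separate rankings.

```python
def top_k(scores, k):
    """Return the k pages with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical converged scores for three pages.
hub_scores = {"p1": 0.9, "p2": 0.1, "p3": 0.4}
authority_scores = {"p1": 0.2, "p2": 0.7, "p3": 0.6}

top_hubs = top_k(hub_scores, 2)
top_authorities = top_k(authority_scores, 2)
```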
Iterative update
For all d: $h(d) = \sum_{d \mapsto y} a(y)$
[Figure: page d with outlinks to y1, y2, y3]
For all d: $a(d) = \sum_{y \mapsto d} h(y)$
[Figure: page d with inlinks from y1, y2, y3]
Iterate these two steps until convergence
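A minimal sketch of the two update steps, assuming the base set graph is given as a dictionary from each page to the set of pages it links to. The function names, the toy graph, the tolerance, and the choice to rescale to unit Euclidean length after each step are assumptions for illustration. The authority update here uses the freshly computed hub scores, which is one reasonable reading of "iterate these two steps".

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (one possible scaling choice)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {d: v / norm for d, v in scores.items()}


def hits(outlinks, max_iter=100, tol=1e-9):
    """Iterate the hub and authority updates on {page: set of link targets}."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    h = {d: 1.0 for d in pages}
    a = {d: 1.0 for d in pages}
    for _ in range(max_iter):
        # h(d): sum of a(y) over all pages y that d links to.
        new_h = normalize(
            {d: sum(a[y] for y in outlinks.get(d, ())) for d in pages}
        )
        # a(d): sum of h(y) over all pages y that link to d (uses the new h).
        new_a = {d: 0.0 for d in pages}
        for y in pages:
            for d in outlinks.get(y, ()):
                new_a[d] += new_h[y]
        new_a = normalize(new_a)
        change = max(
            max((abs(new_h[d] - h[d]) for d in pages), default=0.0),
            max((abs(new_a[d] - a[d]) for d in pages), default=0.0),
        )
        h, a = new_h, new_a
        if change < tol:
            break
    return h, a


# Hypothetical toy graph: two hub-like pages pointing at two targets.
toy_graph = {"hub1": {"auth1", "auth2"}, "hub2": {"auth1"}}
hub, auth = hits(toy_graph)
```

On this toy graph `hub1` ends up with the highest hub score and `auth1` with the highest authority score, since `auth1` is linked from both hubs.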
Details
Scaling
To prevent the a() and h() values from getting too big, we can scale them down after each iteration
The scaling factor doesn’t really matter.
We care about the relative (as opposed to absolute) values of the scores.
In most cases, the algorithm converges after a few iterations.
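The claim that the scaling factor does not matter can be checked on a small example. The sketch below runs the same updates with two different scaling rules, dividing by the maximum and dividing by the sum, and compares the results. The graph, the function names, and the fixed iteration count are assumptions for illustration.

```python
def run_hits(outlinks, scale, iterations=20):
    """Run a fixed number of hub/authority updates with a given scaling rule."""
    pages = sorted(set(outlinks) | {t for ts in outlinks.values() for t in ts})
    h = {d: 1.0 for d in pages}
    a = {d: 1.0 for d in pages}
    for _ in range(iterations):
        h = scale({d: sum(a[y] for y in outlinks.get(d, ())) for d in pages})
        a = scale(
            {d: sum(h[y] for y in pages if d in outlinks.get(y, ())) for d in pages}
        )
    return h, a


def by_max(scores):
    """Scale so that the largest score is 1."""
    top = max(scores.values())
    if top == 0:
        return scores
    return {d: v / top for d, v in scores.items()}


def by_sum(scores):
    """Scale so that the scores sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return scores
    return {d: v / total for d, v in scores.items()}


def ranking(scores):
    """Pages ordered by descending score, ties broken by name."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy graph.
graph = {"hub1": {"auth1", "auth2"}, "hub2": {"auth1"}}
h_max, a_max = run_hits(graph, by_max)
h_sum, a_sum = run_hits(graph, by_sum)
```

The absolute values differ between the two runs, but the rankings and the ratios between scores agree, which is all the ranked output depends on.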
Authorities for query [Chicago Bulls]
0.85 www.nba.com/bulls
0.25 www.essex1.com/people/jmiller/bulls.htm “da Bulls”
0.20 www.nando.net/SportServer/basketball/nba/chi.html “The Chicago Bulls”
0.15 users.aol.com/rynocub/bulls.htm “The Chicago Bulls Home Page”
0.13 www.geocities.com/Colosseum/6095 “Chicago Bulls”
(Ben-Shaul et al, WWW8)
The authority page for [Chicago Bulls]
Hubs for query [Chicago Bulls]
1.62 www.geocities.com/Colosseum/1778 “Unbelieveabulls!!!!!”
1.24 www.webring.org/cgi-bin/webring?ring=chbulls “Erin’s Chicago Bulls Page”
0.74 www.geocities.com/Hollywood/Lot/3330/Bulls.html “Chicago Bulls”
0.52 www.nobull.net/web position/kw-search-15-M2.htm “Excite Search Results: bulls”
0.52 www.halcyon.com/wordsltd/bball/bulls.htm “Chicago Bulls Links”
(Ben-Shaul et al, WWW8)
A hub page for [Chicago Bulls]
Hubs & Authorities: Comments
HITS can pull together good pages regardless of page content.
Once the base set is assembled, we only do link analysis, no text matching.
Pages in the base set often do not contain any of the query words.
In theory, an English query can retrieve Japanese-language pages!
If supported by the link structure between English and Japanese pages
Danger: topic drift – the pages found by following links may not be related to the original query.
Proof of convergence
We define an N × N adjacency matrix A. (We called this the link matrix earlier.)
For 1 ≤ i, j ≤ N, the matrix entry $A_{ij}$ tells us whether there is a link from page i to page j ($A_{ij} = 1$) or not ($A_{ij} = 0$).
Example:
[Figure: link graph over the pages d1, d2, d3, with the links encoded in the matrix below]
      d1  d2  d3
d1     0   1   0
d2     1   1   1
d3     1   0   0
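As a small illustration, the example matrix can be built from a list of links. The edge list below is read off the matrix entries on this slide (row = source page, column = target page); the function name `adjacency_matrix` is a hypothetical helper.

```python
def adjacency_matrix(pages, links):
    """Build A with A[i][j] = 1 if pages[i] links to pages[j], else 0."""
    index = {p: i for i, p in enumerate(pages)}
    A = [[0] * len(pages) for _ in pages]
    for src, dst in links:
        A[index[src]][index[dst]] = 1
    return A


pages = ["d1", "d2", "d3"]
# Links corresponding to the 1-entries of the example matrix.
links = [("d1", "d2"), ("d2", "d1"), ("d2", "d2"), ("d2", "d3"), ("d3", "d1")]
A = adjacency_matrix(pages, links)
```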
Write update rules as matrix operations
Define the hub vector $\vec{h} = (h_1, \ldots, h_N)$ as the vector of hub scores. $h_i$ is the hub score of page $d_i$.
Similarly for $\vec{a}$, the vector of authority scores
Now we can write $h(d) = \sum_{d \mapsto y} a(y)$ as a matrix operation: $\vec{h} = A\vec{a}$ ...
... and we can write $a(d) = \sum_{y \mapsto d} h(y)$ as $\vec{a} = A^T\vec{h}$
HITS algorithm in matrix notation:
Compute $\vec{h} = A\vec{a}$
Compute $\vec{a} = A^T\vec{h}$
Iterate until convergence
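One round of the matrix form can be worked through on the three-page example matrix from the "Proof of convergence" slide. This is a plain-Python sketch; `mat_vec` and `transpose` are hypothetical helper names, and no scaling is applied so that the numbers stay easy to check by hand.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of M."""
    return [list(col) for col in zip(*M)]


# Example adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]

a = [1, 1, 1]                   # initial authority scores
h = mat_vec(A, a)               # hub update: h = A a
a = mat_vec(transpose(A), h)    # authority update: a = A^T h
```

After this round `h` is `[1, 3, 1]` (the number of outlinks of each page) and `a` is `[4, 4, 3]`; for instance, d1 is linked from d2 and d3, whose hub scores are 3 and 1.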
HITS as eigenvector problem
HITS algorithm in matrix notation. Iterate:
Compute ~h = A~aCompute ~a = AT~h
By substitution we get: ~h = AAT~h and ~a = ATA~a
Thus, ~h is an eigenvector of AAT and ~a is an eigenvector ofATA.
So the HITS algorithm is actually a special case of the powermethod and hub and authority scores are eigenvector values.
Schutze: Link analysis 70 / 80
![Page 315: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/21link.pdf · and link-based ranking PageRank: the original algorithm that was used for link-based ranking on the](https://reader030.fdocuments.us/reader030/viewer/2022041022/5ed3c344dc1b3f24ef09b3a3/html5/thumbnails/315.jpg)
Recap Anchor text Citation analysis PageRank HITS: Hubs & Authorities
HITS as eigenvector problem
HITS algorithm in matrix notation. Iterate:
Compute $\vec{h} = A\vec{a}$
Compute $\vec{a} = A^T\vec{h}$
By substitution we get: $\vec{h} = AA^T\vec{h}$ and $\vec{a} = A^TA\vec{a}$ (up to the scaling factor introduced by normalization)
Thus, $\vec{h}$ is an eigenvector of $AA^T$ and $\vec{a}$ is an eigenvector of $A^TA$.
So the HITS algorithm is actually a special case of the power method, and the hub and authority scores are the entries of these eigenvectors.
HITS and PageRank both formalize link analysis as eigenvector problems.
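One way to check the eigenvector claim numerically is to compare the result of the iteration with the principal eigenvector returned by a standard eigensolver. The sketch below does this for a small hypothetical graph. The graph, the helper name `principal_eigenvector`, and the sum-to-1 scaling are assumptions made for illustration.

```python
import numpy as np

def principal_eigenvector(M):
    """Eigenvector of symmetric M for its largest eigenvalue, scaled to sum to 1."""
    eigenvalues, eigenvectors = np.linalg.eigh(M)   # eigh returns ascending eigenvalues
    v = np.abs(eigenvectors[:, -1])                 # remove the arbitrary sign
    return v / v.sum()

# Hypothetical toy graph: page 0 -> 1, page 0 -> 2, page 1 -> 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

# Power method: repeatedly apply h <- A a and a <- A^T h, rescaling each time.
h = np.full(3, 1.0 / 3)
a = np.full(3, 1.0 / 3)
for _ in range(100):
    h = A @ a
    h /= h.sum()
    a = A.T @ h
    a /= a.sum()

h_eig = principal_eigenvector(A @ A.T)   # hub scores as an eigenvector of A A^T
a_eig = principal_eigenvector(A.T @ A)   # authority scores as an eigenvector of A^T A
print(np.allclose(h, h_eig), np.allclose(a, a_eig))
```

Taking the absolute value is only safe here because the largest eigenvalue of each matrix is simple, so the eigenvector is unique up to sign.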
Example web graph
[Figure: directed web graph over pages d0 to d6. Edge labels: car, benz, ford, gm, honda, jaguar, jag, cat, leopard, tiger, jaguar, lion, cheetah, speed.]
Raw matrix A for HITS
     d0  d1  d2  d3  d4  d5  d6
d0    0   0   1   0   0   0   0
d1    0   1   1   0   0   0   0
d2    1   0   1   2   0   0   0
d3    0   0   0   1   1   0   0
d4    0   0   0   0   0   0   1
d5    0   0   0   0   0   1   1
d6    0   0   0   2   1   0   1
Hub vectors $\vec{h}_0$, $\vec{h}_i = \frac{1}{d_i} A \cdot \vec{a}_i$, $i \geq 1$
       h0    h1    h2    h3    h4    h5
d0   0.14  0.06  0.04  0.04  0.03  0.03
d1   0.14  0.08  0.05  0.04  0.04  0.04
d2   0.14  0.28  0.32  0.33  0.33  0.33
d3   0.14  0.14  0.17  0.18  0.18  0.18
d4   0.14  0.06  0.04  0.04  0.04  0.04
d5   0.14  0.08  0.05  0.04  0.04  0.04
d6   0.14  0.30  0.33  0.34  0.35  0.35
Authority vectors $\vec{a}_i = \frac{1}{c_i} A^T \cdot \vec{h}_{i-1}$, $i \geq 1$
       a1    a2    a3    a4    a5    a6    a7
d0   0.06  0.09  0.10  0.10  0.10  0.10  0.10
d1   0.06  0.03  0.01  0.01  0.01  0.01  0.01
d2   0.19  0.14  0.13  0.12  0.12  0.12  0.12
d3   0.31  0.43  0.46  0.46  0.46  0.47  0.47
d4   0.13  0.14  0.16  0.16  0.16  0.16  0.16
d5   0.06  0.03  0.02  0.01  0.01  0.01  0.01
d6   0.19  0.14  0.13  0.13  0.13  0.13  0.13
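As a check on the hub and authority tables, the first iteration can be worked by hand from the raw matrix A. The derivation assumes that the normalizers $c_i$ and $d_i$ are simply the sums of the unnormalized entries. This is consistent with the tabulated values but is not stated explicitly on the slides.

```latex
\begin{aligned}
\vec{h}_0 &= \tfrac{1}{7}\,(1,1,1,1,1,1,1)^T \\
A^T \vec{h}_0 &= \tfrac{1}{7}\,(1,1,3,5,2,1,3)^T, \qquad c_1 = \tfrac{16}{7} \\
\vec{a}_1 &= \tfrac{1}{16}\,(1,1,3,5,2,1,3)^T \approx (0.06,\ 0.06,\ 0.19,\ 0.31,\ 0.13,\ 0.06,\ 0.19)^T \\
A \vec{a}_1 &= \tfrac{1}{16}\,(3,4,14,7,3,4,15)^T, \qquad d_1 = \tfrac{50}{16} \\
\vec{h}_1 &= \tfrac{1}{50}\,(3,4,14,7,3,4,15)^T \approx (0.06,\ 0.08,\ 0.28,\ 0.14,\ 0.06,\ 0.08,\ 0.30)^T
\end{aligned}
```

The entries of $A^T \vec{h}_0$ are the column sums of A scaled by 1/7, and the entries of $A \vec{a}_1$ are the rows of A applied to $\vec{a}_1$.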
Example web graph
[Figure: the example web graph over pages d0 to d6, shown together with the final authority (a) and hub (h) scores.]
        a     h
d0   0.10  0.03
d1   0.01  0.04
d2   0.12  0.33
d3   0.47  0.18
d4   0.16  0.04
d5   0.01  0.04
d6   0.13  0.35
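The final scores for the example graph can be reproduced with a short script. The sketch below assumes sum-to-1 normalization and a fixed 30 rounds, neither of which is specified on the slides. The matrix is copied from the "Raw matrix A for HITS" slide.

```python
import numpy as np

# Raw matrix A from the "Raw matrix A for HITS" slide (rows and columns d0..d6).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
], dtype=float)

h = np.full(7, 1.0 / 7)          # h_0: uniform hub scores
for i in range(1, 31):           # 30 rounds (assumed to be enough for this graph)
    a = A.T @ h                  # a_i = (1/c_i) A^T h_{i-1}
    a /= a.sum()
    h = A @ a                    # h_i = (1/d_i) A a_i
    h /= h.sum()

for page, (a_score, h_score) in enumerate(zip(a, h)):
    print(f"d{page}  a={a_score:.2f}  h={h_score:.2f}")
```

Up to rounding in the last digit, the printed values should agree with the a and h columns shown for the example graph.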
Example web graph
[Figure: the same example web graph over pages d0 to d6.]
Pages with highest in-degree: d2, d3, d6
Pages with highest out-degree: d2, d6
Pages with highest PageRank: d6
Pages with highest hub score: d6 (close: d2)
Pages with highest authority score: d3
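The degree-based entries in this summary can be recomputed from the raw matrix A. The sketch below assumes that each linked page is counted once, even where A contains a 2, and that self-links count toward both degrees. The helper name `top_pages` is hypothetical.

```python
import numpy as np

# Raw matrix A from the "Raw matrix A for HITS" slide (rows and columns d0..d6).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
])

links = (A > 0).astype(int)       # assumption: count each linked page once
out_degree = links.sum(axis=1)    # row sums: number of pages a page links to
in_degree = links.sum(axis=0)     # column sums: number of pages linking to a page

def top_pages(scores):
    """Return the labels of all pages that attain the maximum score."""
    return [f"d{i}" for i in np.flatnonzero(scores == scores.max())]

print("highest in-degree:", top_pages(in_degree))
print("highest out-degree:", top_pages(out_degree))
```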
PageRank vs. HITS: Discussion
PageRank can be precomputed, HITS has to be computed at query time.
HITS is too expensive in most application scenarios.
PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
These two are orthogonal.
We could also apply HITS to the entire web and PageRank to a small base set.
Claim: On the web, a good hub almost always is also a good authority.
The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.
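To illustrate that the two design choices are independent, the sketch below runs a PageRank-style power iteration on the small seven-page example graph rather than on the whole web. The teleportation probability `alpha = 0.14`, the binarized link matrix, and the function name `pagerank` are assumptions made for this illustration.

```python
import numpy as np

def pagerank(links, alpha=0.14, iterations=200):
    """Power iteration for PageRank on a 0/1 link matrix.

    alpha is the teleportation probability (an assumed value here).
    Returns the score vector and the transition matrix that was used.
    """
    n = links.shape[0]
    out = links.sum(axis=1, keepdims=True)
    # Row-normalize; a page without out-links jumps uniformly at random.
    P = np.where(out > 0, links / np.maximum(out, 1), 1.0 / n)
    P = (1 - alpha) * P + alpha / n      # mix in teleportation
    x = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iterations):
        x = x @ P                        # one step of the random walk
    return x, P

# Raw matrix A from the "Raw matrix A for HITS" slide, reduced to 0/1 links.
A = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
])
links = (A > 0).astype(float)
x, P = pagerank(links)
print(x.round(2))
```

Under these assumptions the highest-scoring page should be d6, in line with the summary of the example graph. The same link matrix can be fed to a HITS iteration, which is the sense in which the choice of eigenproblem and the choice of page set are independent.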
Exercise
Why is a good hub almost always also a good authority?
Take-away today
Anchor text: What exactly are links on the web and why are they important for IR?
Citation analysis: the mathematical foundation of PageRank and link-based ranking
PageRank: the original algorithm that was used for link-based ranking on the web
Hubs & Authorities: an alternative link-based ranking algorithm
Resources
Chapter 21 of IIR
Resources at http://cislmu.org
American Mathematical Society article on PageRank (popular science style)
Jon Kleinberg’s home page (main person behind HITS)
A Google bomb and its defusing
Google’s official description of PageRank: "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results."