The math behind PageRank
![Page 1: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/1.jpg)
The math behind PageRank
A detailed analysis of the mathematical aspects of PageRank
Computational Mathematics class presentation
Ravi S Sinha
LIT lab, UNT
![Page 2: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/2.jpg)
Partial citations of references
• The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page
• Inside PageRank, Monica Bianchini, Marco Gori, and Franco Scarselli
• Deeper Inside PageRank, Amy Langville and Carl Meyer
• Efficient Computation of PageRank, Taher Haveliwala
• Topic Sensitive PageRank, Taher Haveliwala
![Page 3: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/3.jpg)
Overview of the talk
• Why PageRank
• What is PageRank
• How PageRank is used
• Math
• More math
• Remaining math
![Page 4: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/4.jpg)
Why PageRank
• Need to build a better automatic search engine. Why?
  • Human-maintained lists are subjective and expensive to build (non-automatic)
  • Automatic engines based on keyword matching do a horrible job (page content alone is not enough; cleverly placed words in a page can mislead search engines)
  • Advertisers sometimes mislead search engines
• Solution: Google [modern day: much more than PageRank; getting smarter]
  • Exact technology: not in the public domain
  • Core technology: PageRank (utilizes link structure)
• Other uses
  • Any problem that can be visualized as a graph problem where the centrality of the vertices needs to be computed (NLP, etc.)
![Page 5: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/5.jpg)
What is PageRank
• A way to find the most ‘important’ vertices in a graph
• PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ]
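• Forms a probability distribution over the vertices [sum = 1] when the (1-d) term is divided by the number of pages N; as written above, the scores of a graph without dangling pages sum to N
• How does this relate to Web search?
  • Vertices = pages
  • Incoming edges = hyperlinks from other pages
  • Outgoing edges = hyperlinks to other pages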
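The PageRank formula on this slide can be tried out with a minimal sketch like the one below. It assumes C(T) is the number of outgoing links of page T and that every page has at least one outgoing link; the three-page graph, the function name `pagerank`, and the fixed iteration count are illustrative assumptions, not part of the talk.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # initial guess: every page scores 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the contributions of every page T that links to `page`;
            # C(T) is the number of outgoing links of T.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items()):
    print(page, round(score, 4))
```

In this toy graph page B, which receives only half of A's vote, ends up with the lowest score, and the scores add up to the number of pages because no page is dangling.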
![Page 6: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/6.jpg)
Simple visualization: the simplest variant of PageRank in use [user behavior]
Random surfer
Damping factor
Only one incoming link, yet high PageRank
![Page 7: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/7.jpg)
Lexical Substitution: A crash course
There are different types of managed care systems
Trivial for humans, not for machines
Math, statistics, linguistics wrapped within computer programs and algorithms
Information retrieval, machine translation, question answering, information security [information hiding in text]
![Page 8: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/8.jpg)
PageRank in use: Lexical Substitution
Weights: word similarity
Directed/undirected: whole other realm
![Page 9: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/9.jpg)
And now, the cool stuff
![Page 10: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/10.jpg)
The math behind PageRank
• Intuitive correctness
• Mathematical foundation
• Stability
• Complexity of computational scheme
• Critical role of the parameters involved
• The distribution of the page score
• Role of dangling pages
• How to promote certain vertices (Web pages)
![Page 11: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/11.jpg)
Intuitive correctness
• Concept of ‘voting’
  • Related to citation in scientific literature
  • More citations indicate a great/important piece of work
• Random surfer / random walk
• A page with many links to it must be important
• A very important page must point to something equally important
![Page 12: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/12.jpg)
Mathematical foundation
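• Most researchers: Markov chains
  • Caveat: only applicable in the absence of dangling nodes
• Basic idea: authority of a Web page is unrelated to its contents [comes from the link structure]
• Simple representation
$$x_p = d \sum_{q \in pa[p]} \frac{x_q}{h_q} + (1 - d)$$
where pa[p] is the set of pages that link to p and h_q is the number of outgoing links of q
• Vector representation
$$x = d\,W\,x + (1 - d)\,I_N$$
I_N = [1, 1, 1 … 1]'
Transition matrix W: ∑(each column) = 1 or 0 [0 for a dangling page]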
![Page 13: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/13.jpg)
Mathematical foundation (2)
Google’s iterative version (Jacobi algorithm): converges to a stationary solution
$$x(t) = d\,W\,x(t-1) + (1 - d)\,I_N$$
Alternative computation
$$x(t) = d\,W\,x(t-1) + \frac{\alpha(t-1)}{N}\,I_N$$
$$\alpha(t-1) = \|x(t-1)\|_1 - \|d\,W\,x(t-1)\|_1$$
$\|x(t)\|_1 = 1$; normalized
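The normalized alternative computation can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: the score mass lost at each step (to damping and to dangling pages) is taken to be the difference of the two 1-norms and is spread evenly over all N pages, and the four-page graph, the function name, and the tolerance are hypothetical.

```python
def pagerank_normalized(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x(t) = d W x(t-1) + (alpha / N) * 1 while keeping ||x(t)||_1 = 1."""
    n = len(out_links)
    x = [1.0 / n] * n  # start from the uniform distribution, so ||x||_1 = 1
    for _ in range(max_iter):
        # y = d * W * x: page q spreads d * x[q] evenly over its outgoing links.
        y = [0.0] * n
        for q, targets in enumerate(out_links):
            for p in targets:
                y[p] += d * x[q] / len(targets)
        # alpha = ||x||_1 - ||d W x||_1: mass lost to damping and dangling pages.
        alpha = sum(x) - sum(y)
        new_x = [y_p + alpha / n for y_p in y]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


# Hypothetical 4-page graph given as out-link lists; page 3 is dangling.
out_links = [[1, 2], [2], [0, 3], []]
x = pagerank_normalized(out_links)
print([round(v, 4) for v in x])
```

Because the lost mass is added back every step, the vector stays a probability distribution even though page 3 has no outgoing links.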
![Page 14: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/14.jpg)
Web communities: Energy balance [measure of authority]
$$E_I = \sum_{p \in I} x_p^*$$
$$E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$$
![Page 15: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/15.jpg)
More on energy
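$$E_I^{in} = \frac{d}{1-d} \sum_{i \in in(I)} f_i\,x_i^*$$
$$E_I^{out} = \frac{d}{1-d} \sum_{i \in out(I)} (1 - f_i)\,x_i^*$$
$$E_I^{dp} = \frac{d}{1-d} \sum_{i \in dp(I)} x_i^*$$
$$E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$$
where f_i is the fraction of page i's hyperlinks that point to pages in I
Migration of scores across the graph
Lessons: maximize energy
• Maximize E(in): references from others
• Minimize E(out): external links
• Minimize E(dp): dangling pages
$$E_I \le |I| + E_I^{in}$$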
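The energy balance can be checked numerically on a toy graph. The sketch below assumes the definitions from "Inside PageRank": the energy of a community I is the sum of its stationary scores, each of the three terms is scaled by d / (1 - d), and f_i is the fraction of page i's links that point into I. The five-page graph, the community {0, 1, 2}, and both function names are hypothetical.

```python
def stationary_scores(out_links, d=0.85, iterations=300):
    """Approximate the fixed point of x = d W x + (1 - d) * 1 by repeated iteration."""
    n = len(out_links)
    x = [1.0] * n
    for _ in range(iterations):
        new_x = [1.0 - d] * n
        for q, targets in enumerate(out_links):
            for p in targets:
                new_x[p] += d * x[q] / len(targets)
        x = new_x
    return x


def energy_terms(out_links, community, x, d=0.85):
    """Return (E_in, E_out, E_dp) for the given community of pages."""
    scale = d / (1.0 - d)
    e_in = e_out = e_dp = 0.0
    for i, targets in enumerate(out_links):
        if not targets:
            if i in community:
                e_dp += scale * x[i]  # dangling page inside the community
            continue
        # f_i: fraction of page i's links that point into the community.
        f_i = sum(1 for p in targets if p in community) / len(targets)
        if i in community:
            e_out += scale * (1.0 - f_i) * x[i]  # score leaking out of the community
        else:
            e_in += scale * f_i * x[i]  # score flowing into the community
    return e_in, e_out, e_dp


# Hypothetical graph: pages 0-2 form the community; page 2 is dangling,
# page 1 links outside (to 3), and page 3 links back into the community (to 0).
out_links = [[1], [2, 3], [], [0, 4], [3]]
community = {0, 1, 2}
x = stationary_scores(out_links)
e_in, e_out, e_dp = energy_terms(out_links, community, x)
energy = sum(x[p] for p in community)
balance = len(community) + e_in - e_out - e_dp
print(round(energy, 6), round(balance, 6))
```

The two printed numbers should agree up to rounding, which is the balance equation in action: removing the link from page 1 to page 3 or giving page 2 an internal link would raise the community's energy.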
![Page 16: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/16.jpg)
Even more on energy [community promotion]
$$E_I = |I| + E_I^{in} - E_I^{out} - E_I^{dp}$$
1. Split same content into smaller vertices
2. Avoid dangling pages
3. Avoid many outgoing links
![Page 17: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/17.jpg)
Page promotion
• Treat certain pages as communities
• Bias certain pages by using a non-uniform distribution in the vector I_N
• Tinker with the connectivity [PageRank is proved to be affected by the regularity of the connection pattern]
$$x(t) = d\,W\,x(t-1) + (1 - d)\,I_N$$
Original
• I_N = [1, 1, 1, …, 1]^T
Biased
• I_N = [1, 1.5, 1.25, …, 1]^T
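A small experiment makes the effect of a biased vector visible. The sketch below is only an illustration: it uses a hypothetical four-page ring in which every page is structurally identical, and a four-entry bias vector [1, 1.5, 1.25, 1] chosen to resemble the slide's example (the elided middle entries of the slide's vector are not known).

```python
def pagerank_biased(out_links, bias, d=0.85, iterations=200):
    """Iterate x(t) = d W x(t-1) + (1 - d) * bias, where bias plays the role of I_N."""
    n = len(out_links)
    x = list(bias)
    for _ in range(iterations):
        new_x = [(1.0 - d) * b for b in bias]
        for q, targets in enumerate(out_links):
            for p in targets:
                new_x[p] += d * x[q] / len(targets)
        x = new_x
    return x


# Hypothetical 4-page ring 0 -> 1 -> 2 -> 3 -> 0: without bias, all pages score the same.
ring = [[1], [2], [3], [0]]
uniform = pagerank_biased(ring, [1.0, 1.0, 1.0, 1.0])
biased = pagerank_biased(ring, [1.0, 1.5, 1.25, 1.0])
print([round(v, 4) for v in uniform])
print([round(v, 4) for v in biased])
```

With the uniform vector every page scores 1; with the biased vector the two favored pages end up with the highest scores, and some of the extra score also migrates to the pages they link to.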
![Page 18: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/18.jpg)
Computation of PageRank
• PageRank can be computed on a graph changing over time
  • Practical interest [Web is alive]
• An optimal algorithm exists for computing PageRank
  • Practical applications: search engines, PageRank on billions of pages, so efficiency matters!
  • O(|H| log 1/ε), where |H| is the number of hyperlinks and ε the target accuracy; NOT dependent on the connectivity or other dimensions
  • Ideal computation: stops when the ranking of vertices between two computations does not change [converge]
![Page 19: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/19.jpg)
The Markov model from the Web
• The PageRank vector is only guaranteed to exist and be unique if the Markov chain is irreducible
• By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain
• The Web hyperlink matrix is forced to be stochastic [non-negative entries, all columns sum up to 1]
• Remove dangling nodes, or replace the relevant rows/columns with a small value, usually (1/n) e^T
• Introduce personalization vector
Primitive
• Non-negative
• One positive element on the main diagonal
• Irreducible
![Page 20: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/20.jpg)
More on the Markov structure
• A convex combination of the original stochastic matrix and a stochastic perturbation matrix
  • Produces a stochastic, irreducible matrix
  • The PageRank vector is guaranteed to exist for this matrix
• Every node directly connected to every other node, all probabilities non-zero
  • Irreducible Markov chain, will converge
[Example matrices; only these fragments of the entries survive in the transcript: (0, 1/2), (0, 0); (0, 1/2), (1/2, 1/2); (1/60, 7/15), (1/2, 1/2)]
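The convex combination can be sketched with exact fractions. Everything concrete here is an assumption for illustration: the six-page matrix `H`, the weight alpha = 9/10, the uniform perturbation matrix (1/n) e e^T, and the row convention (row i holds the probabilities of leaving page i; transpose for a column convention). With these values a link weight of 1/2 becomes 7/15 and a zero entry becomes 1/60, the same fractions that appear among the matrix entries on this slide.

```python
from fractions import Fraction


def google_matrix(rows, alpha):
    """Return alpha * S + (1 - alpha) * (1/n) * e e^T for a row-normalized link matrix."""
    n = len(rows)
    uniform = Fraction(1, n)
    matrix = []
    for row in rows:
        # A dangling page (all-zero row) is replaced by the uniform row (1/n) e^T.
        stochastic_row = row if any(row) else [uniform] * n
        matrix.append([alpha * value + (1 - alpha) * uniform for value in stochastic_row])
    return matrix


half, third = Fraction(1, 2), Fraction(1, 3)
# Hypothetical 6-page hyperlink matrix; page 1 (second row) is dangling.
H = [
    [0, half, half, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [third, third, 0, 0, third, 0],
    [0, 0, 0, 0, half, half],
    [0, 0, 0, half, 0, half],
    [0, 0, 0, 1, 0, 0],
]
G = google_matrix(H, Fraction(9, 10))
for row in G:
    print([str(value) for value in row])
```

Every entry of `G` is strictly positive and every row sums to exactly 1, so the resulting chain is stochastic and irreducible.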
![Page 21: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/21.jpg)
There’s more to PageRank
• Computation
  • Power method
    • Notoriously slow in general
    • Method of choice
    • Requires no computation of intermediate matrices
    • Converges quickly in practice [rate governed by the damping factor]
  • Linear systems method
• The damping factor [usually 0.85]
  • Greater value: more iterations required, but ‘truer’ PageRanks
• Dangling pages
• Storage issues
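The link between the damping factor and the number of power-method iterations can be observed on a toy graph. This is a rough sketch under assumptions: the four-page graph (a 3-cycle plus one page linking into it, no dangling pages), the 1-norm stopping rule, and the tolerance are all hypothetical choices.

```python
def iterations_to_converge(out_links, d, tol=1e-8, max_iter=10000):
    """Count power-method steps until successive score vectors differ by less than tol."""
    n = len(out_links)
    x = [1.0 / n] * n
    for step in range(1, max_iter + 1):
        new_x = [(1.0 - d) / n] * n
        for q, targets in enumerate(out_links):
            for p in targets:
                new_x[p] += d * x[q] / len(targets)
        change = sum(abs(a - b) for a, b in zip(new_x, x))  # 1-norm of the update
        x = new_x
        if change < tol:
            return step
    return max_iter


# Hypothetical graph without dangling pages: cycle 0 -> 1 -> 2 -> 0, and page 3 -> 0.
graph = [[1], [2], [0], [0]]
counts = {d: iterations_to_converge(graph, d) for d in (0.5, 0.85, 0.99)}
for d, steps in counts.items():
    print(d, steps)
```

The step count grows sharply as d approaches 1, which is the trade-off noted above: a larger damping factor follows the link structure more faithfully but costs more iterations.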
![Page 22: The math behind PageRank](https://reader035.fdocuments.us/reader035/viewer/2022081507/568165a8550346895dd88f1d/html5/thumbnails/22.jpg)
The end [for today]
Thanks for listening!