Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to...
Transcript of Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to...
![Page 1: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/1.jpg)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found this our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
![Page 2: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/2.jpg)
High dim. data
High dim. data
Locality sensitive hashing
Clustering
Dimensionality
reduction
Graph data
Graph data
PageRank, SimRank
Community Detection
Spam Detection
Infinite data
Infinite data
Filtering data
streams
Web advertising
Queries on streams
Machine learningMachine learning
SVM
Decision Trees
Perceptron, kNN
AppsApps
Recommender systems
Association Rules
Duplicate document detection
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2
![Page 3: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/3.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3
Facebook social graph4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
![Page 4: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/4.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
Connections between political blogsPolarization of the network [Adamic-Glance, 2005]
![Page 5: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/5.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5
Citation networks and Maps of science[Börner et al., 2012]
![Page 6: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/6.jpg)
domain2
domain1
domain3
router
InternetJ. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
![Page 7: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/7.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7
Seven Bridges of Königsberg[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
![Page 8: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/8.jpg)
� Web as a directed graph:
� Nodes: Webpages
� Edges: Hyperlinks
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
I teach a
class on
Networks. CS224W:
Classes are
in the
Gates
building Computer
Science
Department
at Stanford
Stanford
University
![Page 9: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/9.jpg)
� Web as a directed graph:
� Nodes: Webpages
� Edges: Hyperlinks
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9
I teach a
class on
Networks. CS224W:
Classes are
in the
Gates
building Computer
Science
Department
at Stanford
Stanford
University
![Page 10: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/10.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
![Page 11: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/11.jpg)
� How to organize the Web?� First try: Human curated
Web directories
� Yahoo, DMOZ, LookSmart
� Second try: Web Search
� Information Retrieval investigates:
Find relevant docs in a small
and trusted set
� Newspaper articles, Patents, etc.
� But: Web is huge, full of untrusted documents,
random things, web spam, etc.J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11
![Page 12: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/12.jpg)
2 challenges of web search:
� (1) Web contains many sources of information
Who to “trust”?
� Trick: Trustworthy pages may point to each other!
� (2) What is the “best” answer to query
“newspaper”?
� No single right answer
� Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
![Page 13: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/13.jpg)
� All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
� There is large diversity
in the web-graph
node connectivity.
Let’s rank the pages by
the link structure!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
![Page 14: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/14.jpg)
� We will cover the following Link Analysis
approaches for computing importances
of nodes in a graph:
� Page Rank
� Topic-Specific (Personalized) Page Rank
� Web Spam Detection Algorithms
14J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 15: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/15.jpg)
![Page 16: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/16.jpg)
� Idea: Links as votes
� Page is more important if it has more links
� In-coming links? Out-going links?
� Think of in-links as votes:� www.stanford.edu has 23,400 in-links
� www.joe-schmoe.com has 1 in-link
� Are all in-links are equal?
� Links from important pages count more
� Recursive question!
16J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 17: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/17.jpg)
B38.4
C34.3
E8.1
F3.9
D3.9
A3.3
1.61.6 1.6 1.6 1.6
17J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 18: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/18.jpg)
� Each link’s vote is proportional to the
importance of its source page
� If page j with importance rj has n out-links,
each link gets rj / n votes
� Page j’s own importance is the sum of the
votes on its in-links
18J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
j
kkii
rj/3
rj/3rj/3rj = ri/3+rk/4
ri/3 rk/4
![Page 19: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/19.jpg)
� A “vote” from an important
page is worth more
� A page is important if it is
pointed to by other important
pages
� Define a “rank” rj for page j
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19
∑→
=ji
ij
rr
id
yy
mmaaa/2
y/2a/2
m
y/2
The web in 1839
“Flow” equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2�� … out-degree of node �
![Page 20: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/20.jpg)
� 3 equations, 3 unknowns, no constants
� No unique solution
� All solutions equivalent modulo the scale factor
� Additional constraint forces uniqueness:
� �� � �� ��� � Solution: �� �
� , �� �� , ��
�� Gaussian elimination method works for
small examples, but we need a better method for large web-size graphs
� We need a new formulation!20J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Flow equations:
![Page 21: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/21.jpg)
� Stochastic adjacency matrix �� Let page � has �� out-links
� If � → �, then ��� ���
else ��� 0� � is a column stochastic matrix
� Columns sum to 1
� Rank vector �: vector with an entry per page� �� is the importance score of page �� ∑ �� 1�
� The flow equations can be written
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21
∑→
=ji
ij
rr
id
![Page 22: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/22.jpg)
� Remember the flow equation:� Flow equation in the matrix form
� ⋅ � �� Suppose page i links to 3 pages, including j
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22
j
i
M r r
=rj
1/3
∑→
=ji
ij
rr
id
ri
.
. =
![Page 23: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/23.jpg)
� The flow equations can be written
� � · �� So the rank vector r is an eigenvector of the
stochastic web matrix M
� In fact, its first or principal eigenvector, with corresponding eigenvalue 1
� Largest eigenvalue of M is 1 since M iscolumn stochastic (with non-negative entries)
� We know r is unit length and each column of Msums to one, so �� �
� We can now efficiently solve for r!The method is called Power iteration
23J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
NOTE: x is an eigenvector with the corresponding eigenvalue λ if:
�� �
![Page 24: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/24.jpg)
r = M∙r
y ½ ½ 0 y
a = ½ 0 1 a
m 0 ½ 0 m
24J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
yy
aa mm
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 25: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/25.jpg)
� Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks
� Power iteration: a simple iterative scheme
� Suppose there are N web pages
� Initialize: r(0) = [1/N,….,1/N]T
� Iterate: r(t+1) = M ∙ r(t)
� Stop when |r(t+1) – r(t)|1 < ε
25J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
∑→
+=
ji
t
it
j
rr
i
)()1(
d
di …. out-degree of node i
|x|1 = ∑1≤i≤N|xi| is the L1 norm Can use any other vector norm, e.g., Euclidean
![Page 26: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/26.jpg)
� Power Iteration:
� Set �! 1/N
� 1: �′! ∑ #$�$�→!
� 2:� �′� Goto 1
� Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 … 6/15
rm 1/3 1/6 3/12 1/6 3/15
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
yy
aa mm
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
26
Iteration 0, 1, 2, …
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 27: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/27.jpg)
� Power Iteration:
� Set �! 1/N
� 1: �′! ∑ #$�$�→!
� 2:� �′� Goto 1
� Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 … 6/15
rm 1/3 1/6 3/12 1/6 3/15
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
yy
aa mm
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
27
Iteration 0, 1, 2, …
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
![Page 28: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/28.jpg)
� Power iteration:
A method for finding dominant eigenvector (the
vector corresponding to the largest eigenvalue)
� �%& � ⋅ �%'&� �%�& � ⋅ � � �� �� ⋅ � '
� �%(& � ⋅ � � � ��� ' �( ⋅ � '
� Claim:
Sequence � ⋅ � ' ,�� ⋅ � ' , …�* ⋅ � ' , …approaches the dominant eigenvector of �
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
Details!
![Page 29: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/29.jpg)
� Claim: Sequence � ⋅ � ' ,�� ⋅ � ' , …�* ⋅ � ' , …approaches the dominant eigenvector of �
� Proof:� Assume M has n linearly independent eigenvectors, +�, +,, … , +- with corresponding eigenvalues .�, .,, … , .-, where .� / ., / ⋯ / .-
� Vectors +�, +,, … , +- form a basis and thus we can write: �%1& 2�+� � 2,+, �⋯� 2-+-
� ��%'& � 3� � 3��� �⋯� 34�4 2�%�+�& � 2,%�+,& � ⋯� 2-%�+-& 2�%.�+�& � 2,%.,+,& � ⋯� 2-%.-+-&
� Repeated multiplication on both sides produces
�5�%1& 2�%.�5+�& � 2,%.,5+,& � ⋯� 2-%.-5+-&J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
Details!
![Page 30: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/30.jpg)
� Claim: Sequence � ⋅ � ' ,�� ⋅ � ' , …�* ⋅ � ' , …approaches the dominant eigenvector of �
� Proof (continued):� Repeated multiplication on both sides produces
�5�%1& 2�%.�5+�& � 2,%.,5+,& � ⋯� 2-%.-5+-&
� �5�%1& .�5 2�+� � 2, 6768
5+, �⋯� 2- 67
68
5+-
� Since .� / ., then fractions 6768
, 6968 … : 1and so
6$68
5 0 as ; → ∞ (for all � 2…>).
� Thus: �*�%'& ? 3 *�� Note if 2� 0 then the method won’t converge
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
Details!
![Page 31: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/31.jpg)
� Imagine a random web surfer:
� At any time @, surfer is on some page �� At time @ � , the surfer follows an
out-link from � uniformly at random
� Ends up on some page A linked from �� Process repeats indefinitely
� Let:� B%@& … vector whose �th coordinate is the
prob. that the surfer is at page � at time @� So, B%@& is a probability distribution over pages
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31
∑→
=ji
ij
rr
(i)dout
j
i1 i2 i3
![Page 32: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/32.jpg)
� Where is the surfer at time t+1?
� Follows a link uniformly at random
B @ � � ⋅ B%@&� Suppose the random walk reaches a state
B @ � � ⋅ B%@& B%@&then B%@& is stationary distribution of a random walk
� Our original rank vector � satisfies � � ⋅ �� So, � is a stationary distribution for
the random walk
)(M)1( tptp ⋅=+
j
i1 i2 i3
32J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 33: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/33.jpg)
� A central result from the theory of random
walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions,
the stationary distribution is unique and
eventually will be reached no matter what the
initial probability distribution at time t = 0
33J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 34: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/34.jpg)
![Page 35: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/35.jpg)
� Does this converge?
� Does it converge to what we want?
� Are results reasonable?
∑→
+=
ji
t
it
j
rr
i
)()1(
d Mrr =or
equivalently
35J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 36: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/36.jpg)
� Example:
ra 1 0 1 0
rb 0 1 0 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36
=
bbaa
Iteration 0, 1, 2, …
∑→
+=
ji
t
it
j
rr
i
)()1(
d
![Page 37: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/37.jpg)
� Example:
ra 1 0 0 0
rb 0 1 0 0
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
=
bbaa
Iteration 0, 1, 2, …
∑→
+=
ji
t
it
j
rr
i
)()1(
d
![Page 38: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/38.jpg)
2 problems:
� (1) Some pages are
dead ends (have no out-links)
� Random walk has “nowhere” to go to
� Such pages cause importance to “leak out”
� (2) Spider traps:
(all out-links are within the group)
� Random walked gets “stuck” in a trap
� And eventually spider traps absorb all importance
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38
Dead end
![Page 39: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/39.jpg)
� Power Iteration:
� Set �! 1� �! ∑ #$
�$�→!� And iterate
� Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 … 0
rm 1/3 3/6 7/12 16/24 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39
Iteration 0, 1, 2, …
yy
aa mm
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 1
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
m is a spider trap
All the PageRank score gets “trapped” in node m.
![Page 40: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/40.jpg)
� The Google solution for spider traps: At each
time step, the random surfer has two options
� With prob. ββββ, follow a link at random
� With prob. 1-ββββ, jump to some random page
� Common values for ββββ are in the range 0.8 to 0.9
� Surfer will teleport out of spider trap
within a few time steps
40J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
yy
aa mm
yy
aa mm
![Page 41: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/41.jpg)
� Power Iteration:
� Set �! 1� �! ∑ #$
�$�→!� And iterate
� Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 … 0
rm 1/3 1/6 1/12 2/24 0
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 41
Iteration 0, 1, 2, …
yy
aa mm
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 0
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Here the PageRank “leaks” out since the matrix is not stochastic.
![Page 42: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/42.jpg)
� Teleports: Follow random teleport links with
probability 1.0 from dead-ends
� Adjust matrix accordingly
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 42
yy
aa mm
y a m
y ½ ½ ⅓
a ½ 0 ⅓
m 0 ½ ⅓
y a m
y ½ ½ 0
a ½ 0 0
m 0 ½ 0
yy
aa mm
![Page 43: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/43.jpg)
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
� Spider-traps are not a problem, but with traps
PageRank scores are not what we want
� Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
� Dead-ends are a problem
� The matrix is not column stochastic so our initial
assumptions are not met
� Solution: Make matrix column stochastic by always
teleporting when there is nowhere else to goJ. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 43
![Page 44: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/44.jpg)
� Google’s solution that does it all:
At each step, random surfer has two options:
� With probability ββββ, follow a link at random
� With probability 1-ββββ, jump to some random page
� PageRank equation [Brin-Page, 98]
!���→!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 44
di … out-degree
of node i
This formulation assumes that � has no dead ends. We can either
preprocess matrix � to remove all dead ends or explicitly follow random
teleport links with probability 1.0 from dead-ends.
![Page 45: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/45.jpg)
� PageRank equation [Brin-Page, ‘98]
�! CD �����→!� %1 E D& 1F
� The Google Matrix A:
G D� � 1 E D 1F HIH
� We have a recursive problem: � � ⋅ �And the Power method still works!
� What is ββββ ?
� In practice β =0.8,0.9 (make 5 steps on avg., jump)J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 45
[1/N]NxN…N by N matrix
where all entries are 1/N
![Page 46: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/46.jpg)
y
a =
m
1/3
1/3
1/3
0.33
0.20
0.46
0.24
0.20
0.52
0.26
0.18
0.56
7/33
5/33
21/33
. . .
46J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
yy
aamm
13/15
7/15
1/2 1/2 0
1/2 0 0
0 1/2 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
y 7/15 7/15 1/15
a 7/15 1/15 1/15
m 1/15 7/15 13/15
0.8 + 0.2
M [1/N]NxN
A
![Page 47: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/47.jpg)
![Page 48: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/48.jpg)
� Key step is matrix-vector multiplication
� rnew = A ∙ rold
� Easy if we have enough main memory to hold A, rold, rnew
� Say N = 1 billion pages
� We need 4 bytes for each entry (say)
� 2 billion entries for vectors, approx 8GB
� Matrix A has N2 entries
� 1018 is a large number!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 48
½ ½ 0
½ 0 0
0 ½ 1
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
7/15 7/15 1/15
7/15 1/15 1/15
1/15 7/15 13/15
0.8 +0.2
A = β∙M + (1-β) [1/N]NxN
=
A =
![Page 49: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/49.jpg)
� Suppose there are N pages� Consider page i, with di out-links� We have Mji = 1/|di| when i→ j
and Mji = 0 otherwise� The random teleport is equivalent to:
� Adding a teleport link from i to every other page and setting transition probability to (1-ββββ)/N
� Reducing the probability of following each out-link from 1/|di| to ββββ/|di|
� Equivalent: Tax each page a fraction (1-ββββ) of its score and redistribute evenly
49J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 50: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/50.jpg)
� � � ⋅ �, where �A� J�A� � KJL
� �! ∑ G!� ⋅ ��HMN�� �! ∑ D�!� � �KO
H ⋅ ��H�N� ∑ D�!� ⋅ �� � �KO
HHMN� ∑ ��HMN�
∑ D�!� ⋅ �� � �KOH
HMN� since ∑�� 1� So we get: � J� ⋅ � � KJ
L L
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 50
[x]N … a vector of length N with all entries xNote: Here we assumed M
has no dead-ends
![Page 51: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/51.jpg)
� We just rearranged the PageRank equation
� J� ⋅ � � E JL L
� where [(1-ββββ)/N]N is a vector with all N entries (1-ββββ)/N
� M is a sparse matrix! (with no dead-ends)
� 10 links per node, approx 10N entries
� So in each iteration, we need to:
� Compute rnew = β M ∙ rold
� Add a constant value (1-ββββ)/N to each entry in rnew
� Note if M contains dead-ends then ∑ �A4PQA : and
we also have to renormalize rnew so that it sums to 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 51
![Page 52: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/52.jpg)
� Input: Graph R and parameter J� Directed graph R (can have spider traps and dead ends)
� Parameter J� Output: PageRank vector �4PQ
� Set: �!ST� �H
� repeat until convergence: ∑ �!-UV E �!ST� / W!� ∀�:�′A4PQ ∑ J ��Z[����→A
�′A4PQ ' if in-degree of A is 0
� Now re-insert the leaked PageRank:
∀A:�A4PQ �\A4PQ � K]L
� �Z[� �4PQ
52
where: ^ ∑ �′!-UV!
If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
![Page 53: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/53.jpg)
� Encode sparse matrix using only nonzero
entries
� Space proportional roughly to number of links
� Say 10N, or 4*10*1 billion = 40GB
� Still won’t fit in memory, but will fit on disk
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 53
0 3 1, 5, 7
1 5 17, 64, 113, 117, 245
2 2 13, 23
sourcenode degree destination nodes
![Page 54: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/54.jpg)
� Assume enough RAM to fit rnew into memory
� Store rold and matrix M on disk
� 1 step of power-iteration is:
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 54
0 3 1, 5, 6
1 4 17, 64, 113, 117
2 2 13, 23
source degree destination01
2
34
5
6
01
2
34
5
6
rnew rold
Initialize all entries of rnew = (1-ββββ) / NFor each page i (of out-degree di):
Read into memory: i, di, dest1, …, destdi, rold(i)
For j = 1…di
rnew(destj) += ββββ rold(i) / di
![Page 55: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/55.jpg)
� Assume enough RAM to fit rnew into memory
� Store rold and matrix M on disk
� In each iteration, we have to:
� Read rold and M
� Write rnew back to disk
� Cost per iteration of Power method:
= 2|r| + |M|
� Question:
� What if we could not even fit rnew in memory?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 55
![Page 56: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/56.jpg)
� Break rnew into k blocks that fit in memory
� Scan M and rold once for each block
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 56
0 4 0, 1, 3, 5
1 2 0, 5
2 2 3, 4
src degree destination
01
2
3
4
5
01
2
34
5
rnew rold
M
![Page 57: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/57.jpg)
� Similar to nested-loop join in databases
� Break rnew into k blocks that fit in memory
� Scan M and rold once for each block
� Total cost:
� k scans of M and rold
� Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
� Can we do better?
� Hint: M is much bigger than r (approx 10-20x), so
we must avoid reading it k times per iteration
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 57
![Page 58: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/58.jpg)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 58
0 4 0, 1
1 3 0
2 2 1
src degree destination
01
2
3
4
5
01
2
34
5
rnew
rold
0 4 5
1 3 5
2 2 4
0 4 3
2 2 3
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of rnew
![Page 59: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/59.jpg)
� Break M into stripes
� Each stripe contains only destination nodes
in the corresponding block of rnew
� Some additional overhead per stripe
� But it is usually worth it
� Cost per iteration of Power method:
=|M|(1+εεεε) + (k+1)|r|
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 59
![Page 60: Stanford University - Mining of Massive ...mmds.org/mmds/v2.1/ch05-linkanalysis1.pdf · How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second](https://reader035.fdocuments.us/reader035/viewer/2022070808/5f0779527e708231d41d28eb/html5/thumbnails/60.jpg)
� Measures generic popularity of a page
� Biased against topic-specific authorities
� Solution: Topic-Specific PageRank (next)
� Uses a single measure of importance
� Other models of importance
� Solution: Hubs-and-Authorities
� Susceptible to Link spam
� Artificial link topographies created in order to
boost page rank
� Solution: TrustRank
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 60