Connected(Components,(and(Pagerankusers.cecs.anu.edu.au/~xlx/teaching/comp4650-sma/files/sma3-connected.pdf ·...
Transcript of Connected(Components,(and(Pagerankusers.cecs.anu.edu.au/~xlx/teaching/comp4650-sma/files/sma3-connected.pdf ·...
Connected Components, and Pagerank
Lecture slides credit: Lada Adamic, U Michigan Jure Leskovec, Stanford, Andreas Haeberlen, UPenn
COMP4650
Lexing Xie Research School of Computer Science, ANU
Connected components • Strongly connected components
– Each node within the component can be reached from every other node in the component by following directed links
n Strongly connected components n B C D E n A n G H n F
n Weakly connected components: every node can be reached from every other node by following links in either direcRon
A
B
C
D E
F G
H
A
B
C
D E
F G
H
n Weakly connected components n A B C D E n G H F
n In undirected networks one talks simply about ‘connected components’
by Lada Adamic, U Michigan
Giant component • if the largest component encompasses a significant fracRon of the
graph, it is called the giant component
by Lada Adamic, U Michigan
Bow-‐Tie Structure of the Web
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the Web. Comput. Netw. 33, 1-‐6 (June 2000), 309-‐320.
Not Everyone Asks/Replies
• Core: A strongly connected component, in which everyone asks and answers • IN: Mostly askers. • OUT: Mostly Helpers
The Web is a bow Re The Java Forum network is an uneven bow Re
Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. ExperRse networks in online communiRes: structure and algorithms. In Proceedings of the 16th internaRonal conference on World Wide Web (WWW '07)., 221-‐230.
Minkyoung Kim, Lexing Xie, Peter Christen, Event Diffusion Paherns in Social Media (2012) Intl. Conf. on Weblogs and Social Media (ICWSM), 8 pages, Dublin, Ireland, June 2012
Modern InformaRon Retrieval: The Concepts and Technology behind Search Ricardo Baeza-‐Yates, Berthier Ribeiro-‐Neto Addison-‐Wesley Professional; 2 ediRon (February 10, 2011)
Google's PageRank (Brin/Page 98)
• A technique for esRmaRng page quality – Based on web link graph
• Results are combined with IR score – Think of it as: TotalScore = IR score * PageRank – In pracRce, search engines use many other factors (for example, Google says it uses more than 200)
11 A. Haeberlen , Z. Ives, U Penn
PageRank: IntuiRon
• Imagine a contest for The Web's Best Page – IniRally, each page has one vote – Each page votes for all the pages it has a link to – To ensure fairness, pages voRng for more than one page must split their vote equally between them
– VoRng proceeds in rounds; in each round, each page has the number of votes it received in the previous round
– In pracRce, it's a lihle more complicated -‐ but not much! 12
A
BE
C
DF
G
H
I
J
Shouldn't E's vote be worth more than F's?
How many levels should we consider?
Source: A. Haeberlen , Z. Ives, U Penn
PageRank • Each page i is given a rank xi • Goal: Assign the xi such that the rank of each page is governed by the ranks of the pages linking to it:
13
jBj j
i xN
xi
∑∈
=1
Rank of page j Rank of page i
Every page j that links to i
Number of links out
from page j How do we compute the rank values?
Source: A. Haeberlen , Z. Ives, U Penn
14
IteraRve PageRank (simplified)
)()1( 1 kj
Bj j
ki x
Nx
i
∑∈
+ =
nxi
1)0( =Initialize all ranks to be equal, e.g.:
Iterate until convergence
Source: A. Haeberlen , Z. Ives, U Penn
15
Example: Step 0
0.33
0.33
0.33
Initialize all ranks to be equal n
xi1)0( =
Source: A. Haeberlen , Z. Ives, U Penn
16
Example: Step 1
0.17
0.33
0.33
)()1( 1 kj
Bj j
ki x
Nx
i
∑∈
+ =
0.17
Propagate weights across out-edges
Source: A. Haeberlen , Z. Ives, U Penn
17
Example: Step 2
0.17
0.50
0.33
Compute weights based on in-edges )0()1( 1
jBj j
i xN
xi
∑∈
=
Source: A. Haeberlen , Z. Ives, U Penn
18
Example: Convergence
0.2
0.40
0.4
)()1( 1 kj
Bj j
ki x
Nx
i
∑∈
+ =
Source: A. Haeberlen , Z. Ives, U Penn
19
Naïve PageRank Algorithm Restated
• Let – N(p) = number outgoing links from page p – B(p) = number of back-‐links to page p
– Each page b distributes its importance to all of the pages it points to (so
we scale by 1/N(b)) – Page p’s importance is increased by the importance of its back set
)()(
1)( bPageRankbN
pPageRankpBb∑∈
=
Source: A. Haeberlen , Z. Ives, U Penn
20
In Linear Algebra formulaRon • Create an m x m matrix M to capture links:
– M(i, j) = 1 / nj if page i is pointed to by page j and page j has nj outgoing links
= 0 otherwise
– IniRalize all PageRanks to 1, mulRply by M repeatedly unRl all values converge:
– Computes principal eigenvector via power iteraRon
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢
⎣
⎡
)(...
)()(
)'(...
)'()'(
2
1
2
1
mm pPageRank
pPageRankpPageRank
M
pPageRank
pPageRankpPageRank
Source: A. Haeberlen , Z. Ives, U Penn
21
A Brief Example
Amazon Yahoo
0 0.5 0.5 0 0 0.5 1 0.5 0
g' y’ a’
g y a
= *
Total rank sums to number of pages
g y a
1 1 1
=
1 0.5 1.5
, 1
0.75 1.25
, 1
0.67 1.33
, …
Running for multiple iterations:
Source: A. Haeberlen , Z. Ives, U Penn
22
Oops #1 – PageRank Sinks
Amazon Yahoo
0 0 0.5 0.5 0 0.5 0.5 0 0
g' y’ a’
g y a
= *
g y a
1 1 1
=
0.5 1
0.5
, 0.25 0.5 0.25
, 0 0 0
, … ,
Running for multiple iterations:
'dead end' -‐ PageRank is lost auer each round
Source: A. Haeberlen , Z. Ives, U Penn
23
Oops #2 – PageRank hogs
Amazon Yahoo
0 0 0.5 0.5 1 0.5 0.5 0 0
g' y’ a’
g y a
= *
g y a
1 1 1
=
0.5 2
0.5
, 0.25 2.5 0.25
, 0 3 0
, … ,
Running for multiple iterations:
PageRank cannot flow out and accumulates
Source: A. Haeberlen , Z. Ives, U Penn
24
Improved PageRank
• Remove out-‐degree 0 nodes (or consider them to refer back to referrer)
• Add decay factor d to deal with sinks
• Typical value: d=0.85
)()(
1)1()( bPageRankbN
ddpPageRankpBb
∑∈
+−=
Source: A. Haeberlen , Z. Ives, U Penn
25
Stopping the Hog
0 0 0.5 0.5 1 0.5 0.5 0 0
g' y’ a’
g y a
= 0.85 *
g y a
=
0.26 2.48 0.26
,
0.15 0.15 0.15
+
Running for multiple iterations:
… though does this seem right?
Amazon Yahoo
0.57 1.85 0.57
0.39 2.21 0.39
0.32 2.36 0.32
, , , … ,
Source: A. Haeberlen , Z. Ives, U Penn
Random Surfer Model • PageRank has an intuiRve basis in random walks on graphs
• Imagine a random surfer, who starts on a random page and, in each step, – with probability d, klicks on a random link on the page – with probability 1-‐d, jumps to a random page (bored?)
• The PageRank of a page can be interpreted as the fracRon of steps the surfer spends on the corresponding page – TransiRon matrix can be interpreted as a Markov Chain
26 Source: A. Haeberlen , Z. Ives, U Penn
PageRank, random walk on graphs G W = (1-‐d)*G + d*U
Pagerank vector
COMP4650 L Xie, ANU Computer Science
+
– with probability d, klicks on a random link on the page – with probability 1-‐d, jumps to a random page (bored?) – TransiRon matrix W can be interpreted as a Markov Chain
Search Engine OpRmizaRon (SEO) • Has become a big business • White-‐hat techniques
– Google webmaster tools – Add meta tags to documents, etc.
• Black-‐hat techniques – Link farms – Keyword stuffing, hidden text, meta-‐tag stuffing, ... – Spamdexing
• IniRal soluRon: <a rel="nofollow" href="...">...</a> • Some people started to abuse this to improve their own rankings
– Doorway pages / cloaking • Special pages just for search engines • BMW Germany and Ricoh Germany banned in February 2006
– Link buying
29 Source: A. Haeberlen , Z. Ives, U Penn
30
Recap: PageRank • EsRmates absolute 'quality' or 'importance' of a given
page based on inbound links – Query-‐independent – Can be computed via fixpoint iteraRon – Can be interpreted as the fracRon of Rme a 'random surfer' would spend on the page
– Several refinements, e.g., to deal with sinks
• Considered relaRvely stable – But vulnerable to black-‐hat SEO
• An important factor, but not the only one – Overall ranking is based on many factors (Google: >200)
Source: A. Haeberlen , Z. Ives, U Penn
What could be the other 200 factors?
• Note: This is enRrely speculaRve! 31
Posi9ve Nega9ve
On-‐page
Off-‐page
Source: Web InformaRon Systems, Prof. Beat Signer, VU Brussels
Keyword in Rtle? URL? Keyword in domain name? Quality of HTML code Page freshness Rate of change ...
Links to 'bad neighborhood' Keyword stuffing Over-‐opRmizaRon Hidden content (text has same color as background) AutomaRc redirect/refresh ...
High PageRank Anchor text of inbound links Links from authority sites Links from well-‐known sites Domain expiraRon date ...
Fast increase in number of inbound links (link buying?) Link farming Different pages user/spider Content duplicaRon ...