L23: PageRankjeffp/teaching/cs5140/L23-notes.pdf · collection of websites. ... ↳ truthfulness of...

Post on 18-Jul-2020

2 views 0 download

Transcript of L23: PageRankjeffp/teaching/cs5140/L23-notes.pdf · collection of websites. ... ↳ truthfulness of...

L23: PageRank

Je↵ M. Phillips

April 13, 2020

Final Report

At most 4 pages/student. Don’t cram in too much!

I Succinct title (and names)

I Problem definition and motivation.

I Explain your Data.

I key idea

I What did you do (which techniques, an implementation, acomparison, an extension)

I What did you learn? Artifacts (charts, plots, examples, math)and Intuition (in words, did it work?)

I ← Same ten Posters

o

webpage-simia.it( Search ). InvertedIndex

year ;i¥¥ . .

bout→ page 't

, P ?,"

car"

→¥pQgirl)

-

Define most relevantwebpages-

D- d-'

ne'EE¥÷÷÷÷:÷:i:smarm :÷÷÷i÷"¥xE÷÷g÷÷:X

Crawlers : program ; that walksaround web : ① read page

update fentwiuecfor

② follow random

inverting hyperlinks

use hyperlink infoL a hired "

www.pie.com" )pie

- 3

Spammersbuild fleet pages i link to your

Page w/ hyperlink tag .

studies : Alternative to search Engine

Yahoo ! and Look Smart

Built an organized , curatedcollection of websites

pageR-ankfslpjiturml-fktytg.fi?ggignhspiddeieafe

1ttqitlinkedto

balanceboth pages .

page is important ifa

random MCMC' ' random surfer "

were

to fond it.

-

Web is a big graph G- L V,

E)

V = I set do all pages }

F- = l Ei ; = link pi → B- 3

Petone ML → g* ⇐ converged to vectordistribution

q*Lj ) says how important page ; is .

Compute q* of lwepgraph-

• Keep truck of crawlers : how frequentreturn .

• Buy bag computer : Compute eigcp )← probtra.ca)

• Precompute P#

=P - P - p .

. . ..

. p

← too big°

q*= go ←last night

a :÷÷÷o÷÷÷t÷÷¥power

method

Anatomy of Web

Strongly ConnectedComponent

IN

OUTTubes

tendrilsOUT

tendrilsIN

disconnected

ANATOMY of WEB

is this G ergodic

÷ . . Is•

i'④ yo

Anatomy of Web

0

O

canwemaheGergodi.ci#• Teleportation ,/ taxation

→ about once every I steps

→ jumpto randomnode

.

P prob trans (G)13--0-15 p

R= a - is ) Pepe Q¥fIf¥¥Z

↳ dense

Rq . - Ul - B) Ptp'

G)ginLt B) Pain + qt nxt vector- -

Spam Farms

÷ .

Google counter

^seafhs¥at

Trust ( noes ? )

only teleport to trustedpages .

r ← qx , pagerank

t ← get trusted teleport

rts ) - ft ; ) if l arse → spam

↳ truthfulness of webpage

Word CountConsider as input all of English Wikipedia stored in DFS. Goal is tocount how many times each word is used.

Inverted IndexConsider as input all of English Wikipedia stored in DFS. Goal is tobuild an index, so each word has a list of pages it is in.

PhrasesConsider as input all of English Wikipedia stored in DFS. Goal is tobuild an index, on 3-grams (sequence of 3 words) that appears onexactly one page, with link to page.

Label Propagation (Graph)Consider a large graph G = (V ,E ) (e.g., a social network), with asubset of notes V 0 ⇢ V with labels (e.g., {pos, neg}). Each nodestores its label (if any) and edges.Assign a vertex a label if (a) unlabled, (b) has � 5 labeledneighbors, (c) based on majority vote.

Label Propagation (Embedding)Consider a data set X ⇢ Rd , with a subset of points X 0 ⇢ X withlabels (e.g., {pos, neg}). Implicitly defines graph with V = X andE using k = 20 nearest neighbors.Assign a vertex a label if (a) unlabled, (b) has � 5 labeledneighbors, (c) based on majority vote.

Example PageRank

M =

2

664

0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0

3

775

Stripes:

M1 =

2

664

01/31/31/3

3

775 M2 =

2

664

1/2001/2

3

775 M3 =

2

664

0100

3

775 M4 =

2

664

01/21/20

3

775

These are stored as�1 : (1/3, 2), (1/3, 3), (1/3, 4)

�,�

2 : (1/2, 1)(1/2, 4)�,�3 : (1, 3)

�, and

�4 : (1/3, 1), (1/2, 2)

�.

Example PageRank

M =

2

664

0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0

3

775

Stripes:

M1 =

2

664

01/31/31/3

3

775 M2 =

2

664

1/2001/2

3

775 M3 =

2

664

0100

3

775 M4 =

2

664

01/21/20

3

775

These are stored as�1 : (1/3, 2), (1/3, 3), (1/3, 4)

�,�

2 : (1/2, 1)(1/2, 4)�,�3 : (1, 3)

�, and

�4 : (1/3, 1), (1/2, 2)

�.

Example PageRank

M =

2

664

0 1/2 0 01/3 0 1 1/21/3 0 0 1/21/3 1/2 0 0

3

775

Blocks:

M1,1 =

0 1/2

1/3 0

�M1,2 =

0 01 1/2

�M2,1 =

1/3 01/3 1/2

�M2,2 =

0 1/20 0

These are stored as�1 : (1/2, 2)

�,�2 : (1/3, 1)

�, as�

2 : (1, 3), (1/2, 4)�, as

�3 : (1/3, 1)

�,�4 : (1/3, 1), (1/2, 2)

�, and

as�3 : (1/2, 4)

�.