Diapo

77
1 Importance of Web Pages PageRank Topic-Specific PageRank Hubs and Authorities Combatting Link Spam

Transcript of Diapo

Page 1: Diapo

1

Importance of Web Pages

PageRank

Topic-Specific PageRank

Hubs and Authorities

Combatting Link Spam

Page 2: Diapo

2

PageRank

�Intuition: solve the recursive equation: “a page is important if important pages link to it.”

�In high-falutin’ terms: importance = the principal eigenvector of the transition matrix of the Web.

� A few fixups needed.

Page 3: Diapo

3

Transition Matrix of the Web

� Enumerate pages.

� Page i corresponds to row and column i.

� M [i, j ] = 1/n if page j links to n pages, including page i ; 0 if j does not link to i.

� M [i, j ] is the probability we’ll next be at page i if we are now at page j.

Page 4: Diapo

4

Example: Transition Matrix

i

j

Suppose page j links to 3 pages, including i but not x.

1/3x

0

Page 5: Diapo

5

Random Walks on the Web

�Suppose v is a vector whose i th

component is the probability that each random walker is at page i at a certain time.

�If each walker follows a link from i at random, the probability distribution for walkers is then given by the vector M v.

Page 6: Diapo

6

Random Walks – (2)

�Starting from any vector v, the limit M (M (…M (M v ) …)) is the long-term distribution of walkers.

�Intuition: pages are important in proportion to how likely a walker is to be there.

�The math: limiting distribution = principal eigenvector of M = PageRank.

Page 7: Diapo

7

Example: The Web in 1839

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 1m 0 1/2 0

y a m

Page 8: Diapo

8

Solving The Equations

�Because there are no constant terms, the equations v = M v do not have a unique solution.

�In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution).

�Can work if you start with a fixed v.

Page 9: Diapo

9

Simulating a Random Walk

�Start with the vector v = [1, 1,…, 1] representing the idea that each Web page is given one unit of importance.

�Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.

�About 50 iterations is sufficient to estimate the limiting solution.

Page 10: Diapo

10

Example: Iterating Equations

�Equations v = M v :

y = y /2 + a /2

a = y /2 + m

m = a /2

ya =m

111

13/21/2

5/41

3/4

9/811/81/2

6/56/53/5

. . .

Note: “=” isreally “assignment.”

Page 11: Diapo

11

The Walkers

Yahoo

M’softAmazon

Page 12: Diapo

12

The Walkers

Yahoo

M’softAmazon

Page 13: Diapo

13

The Walkers

Yahoo

M’softAmazon

Page 14: Diapo

14

The Walkers

Yahoo

M’softAmazon

Page 15: Diapo

15

In the Limit …

Yahoo

M’softAmazon

Page 16: Diapo

16

Real-World Problems

�Some pages are “dead ends” (have no links out).

� Such a page causes importance to leak out.

�Other groups of pages are spider traps(all out-links are within the group).

� Eventually spider traps absorb all importance.

Page 17: Diapo

17

Microsoft Becomes Dead End

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 0m 0 1/2 0

y a m

Page 18: Diapo

18

Example: Effect of Dead Ends

�Equations v = M v :

y = y /2 + a /2

a = y /2

m = a /2

ya =m

111

11/21/2

3/41/21/4

5/83/81/4

000

. . .

Page 19: Diapo

19

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 20: Diapo

20

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 21: Diapo

21

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 22: Diapo

22

Microsoft Becomes a Dead End

Yahoo

M’softAmazon

Page 23: Diapo

23

In the Limit …

Yahoo

M’softAmazon

Page 24: Diapo

24

M’soft Becomes Spider Trap

Yahoo

M’softAmazon

y 1/2 1/2 0a 1/2 0 0m 0 1/2 1

y a m

Page 25: Diapo

25

Example: Effect of Spider Trap

�Equations v = M v :

y = y /2 + a /2

a = y /2

m = a /2 + m

ya =m

111

11/23/2

3/41/27/4

5/83/82

003

. . .

Page 26: Diapo

26

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 27: Diapo

27

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 28: Diapo

28

Microsoft Becomes a Spider Trap

Yahoo

M’softAmazon

Page 29: Diapo

29

In the Limit …

Yahoo

M’softAmazon

Page 30: Diapo

30

PageRank Solution to Traps, Etc.

�“Tax” each page a fixed percentage at each interation.

�Add a fixed constant to all pages.

�Models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step.

Page 31: Diapo

31

Example: Microsoft is a Spider Trap; 20% Tax

�Equations v = 0.8(M v ) + 0.2:

y = 0.8(y /2 + a/2) + 0.2

a = 0.8(y /2) + 0.2

m = 0.8(a /2 + m) + 0.2

ya =m

111

1.000.601.40

0.840.601.56

0.7760.5361.688

7/115/11

21/11

. . .

Page 32: Diapo

32

General Case

�In this example, because there are no dead-ends, the total importance remains at 3.

�In examples with dead-ends, some importance leaks out, but total remains finite.

Page 33: Diapo

33

Solving the Equations

�Because there are constant terms, we can expect to solve small examples by Gaussian elimination.

�Web-sized examples still need to be solved by relaxation.

Page 34: Diapo

34

Finding a Good Starting Vector

1. Newton-like prediction of where components of the principal eigenvector are heading.

2. Take advantage of locality in the Web.

� Each technique can reduce the number of iterations by 50%.

� Important – PageRank takes time!

Page 35: Diapo

35

Predicting Component Values

�Three consecutive values for the importance of a page suggests where the limit might be.

1.0

0.70.6 0.55

Guess for the next round

Page 36: Diapo

36

Exploiting Substructure

�Pages from particular domains, hosts, or directories, like stanford.edu or infolab.stanford.edu/~ullman

tend to have many internal links.

�Initialize PageRank using ranks within your local cluster, then ranking the clusters themselves.

Page 37: Diapo

37

Strategy

�Compute local PageRanks (in parallel?).

�Use local weights to establish weights on edges between clusters.

�Compute PageRank on graph of clusters.

�Initial rank of a page is the product of its local rank and the rank of its cluster.

�“Clusters” are appropriately sized regions with common domain or lower-level detail.

Page 38: Diapo

38

In Pictures

2.0

0.1

Local ranks

2.05

0.05Intercluster weights

Ranks of clusters

1.5

Initial eigenvector

3.0

0.15

Page 39: Diapo

39

Topic-Specific Page Rank

�Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history.”

�Allows search queries to be answered based on interests of the user.

� Example: Query Maccabi wants different

pages depending on whether you are interested in sports or history.

Page 40: Diapo

40

Teleport Sets

� Assume each walker has a small probability of “teleporting” at any tick.

� Teleport can go to:

1. Any page with equal probability.

� As in the “taxation” scheme.

2. A topic-specific set of “relevant” pages (teleport set ).

� For topic-specific PageRank.

Page 41: Diapo

41

Example: Topic = Software

�Only Microsoft is in the teleport set.

�Assume 20% “tax.”

� I.e., probability of a teleport is 20%.

Page 42: Diapo

42

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Dr. Who’sphonebooth.

Page 43: Diapo

43

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 44: Diapo

44

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 45: Diapo

45

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 46: Diapo

46

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 47: Diapo

47

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 48: Diapo

48

Only Microsoft in Teleport Set

Yahoo

M’softAmazon

Page 49: Diapo

49

Picking the Teleport Set

1. Choose the pages belonging to the topic in Open Directory.

2. “Learn” from examples the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.

Page 50: Diapo

50

Application: Link Spam

�Spam farmers today create networks of millions of pages designed to focus PageRank on a few undeserving pages.

�To minimize their influence, use a teleport set consisting of trusted pages only.

� Example: home pages of universities.

Page 51: Diapo

51

Hubs and Authorities

�Mutually recursive definition:

� A hub links to many authorities;

� An authority is linked to by many hubs.

�Authorities turn out to be places where information can be found.

� Example: course home pages.

�Hubs tell where the authorities are.

� Example: CS Dept. course-listing page.

Page 52: Diapo

52

Transition Matrix A

�H&A uses a matrix A [i, j ] = 1 if page ilinks to page j, 0 if not.

�AT, the transpose of A, is similar to the PageRank matrix M, but AT has 1’s where M has fractions.

Page 53: Diapo

53

Example: H&A Transition Matrix

Yahoo

M’softAmazon

y 1 1 1a 1 0 1m 0 1 0

y a m

A =

Page 54: Diapo

54

Using Matrix A for H&A

�Powers of A and AT have elements of exponential size, so we need scale factors.

�Let h and a be vectors measuring the “hubbiness” and authority of each page.

�Equations: h = λAa; a = µAT h.

� Hubbiness = scaled sum of authorities of successor pages (out-links).

� Authority = scaled sum of hubbiness of predecessor pages (in-links).

Page 55: Diapo

55

Consequences of Basic Equations

�From h = λAa; a = µAT h we can derive:� h = λµAAT h

� a = λµATA a

�Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.� Pick an appropriate value of λµ.

Page 56: Diapo

56

Example: Iterating H&A

1 1 1A = 1 0 1

0 1 0

1 1 0AT = 1 0 1

1 1 0

3 2 1AAT= 2 2 0

1 0 1

2 1 2ATA= 1 2 1

2 1 2

a(yahoo)a(amazon)a(m’soft)

===

111

545

241824

11484

114

. . .

. . .

. . .

1+√321+√3

h(yahoo) = 1h(amazon) = 1h(m’soft) = 1

642

1329636

. . .

. . .

. . .

1.0000.7350.268

2820

8

Page 57: Diapo

57

Solving H&A in Practice

�Iterate as for PageRank; don’t try to solve equations.

�But keep components within bounds.

� Example: scale to keep the largest component of the vector at 1.

�Trick: start with h = [1,1,…,1]; multiply by AT to get first a; scale, then multiply by A to get next h,…

Page 58: Diapo

58

Solving H&A – (2)

�You may be tempted to compute AAT

and ATA first, then iterate these matrices as for PageRank.

�Bad, because these matrices are not nearly as sparse as A and AT.

Page 59: Diapo

59

H&A Versus PageRank

�If you talk to someone from IBM, they may tell you “IBM invented PageRank.”

� What they mean is that H&A was invented by Jon Kleinberg when he was at IBM.

�But these are not the same.

�H&A does not appear to be a substitute for PageRank.

�But may be used by Ask.com.

Page 60: Diapo

60

Spam on the Web

�Search has become the default gateway to the web.

�Very high premium to appear on the first page of search results.

Page 61: Diapo

61

What is Web Spam?

� Spamming = any action whose purpose is to boost a web page’s position in search engine results, without providing additional value.

� Spam = Web pages used for spamming.

�Approximately 10-15% of Web pages are spam.

Page 62: Diapo

62

Web-Spam Taxonomy

�Boosting techniques :

� Techniques for increasing the probability a Web page will be a highly ranked answer to a search query.

�Hiding techniques :

� Techniques to hide the use of boosting from humans and Web crawlers.

Page 63: Diapo

63

Hiding techniques

�Content hiding.

� Use same color for text and page background.

�Cloaking.

� Return different page to crawlers and browsers.

�Redirection.

� Redirects are followed by browsers but not crawlers.

Page 64: Diapo

64

Boosting Techniques

�Term spamming :

� Manipulating the text of Web pages in order to appear relevant to queries.

• Why? You can run ads that are relevant to the query.

�Link spamming :

� Creating link structures that boost PageRank.

Page 65: Diapo

65

Term Spamming – (1)

�Repetition : of one or a few specific terms e.g., “free,” “cheap,” “Viagra.”

�Dumping of a large number of unrelated terms.

� E.g., copy entire dictionaries.

Page 66: Diapo

66

Term Spamming – (2)

�Weaving :

� Copy legitimate pages and insert spam terms at random positions.

� Phrase Stitching :

� Glue together sentences and phrases from different sources.

• E.g., use the top-ranked pages on the topic you want to look like.

Page 67: Diapo

67

The Google Solution to Term Spamming

�In addition to PageRank, the original Google engine had another innovation: it trusted what people said about you in preference to what you said about yourself.

�Give more weight to words that appear in or near anchor text than to words that appear in the page itself.

Page 68: Diapo

68

The Google Solution – (2)

�Today, the Google formula for matching terms to documents involves over 250 factors.

� E.g., does the word appear in a header?

�As closely guarded as the formula for Coke.

Page 69: Diapo

69

Link Spam

� Three kinds of Web pages from a spammer’s point of view:

1. Own pages.

• Completely controlled by spammer.

2. Accessible pages.

• E.g., Web-log comment pages: spammer can post links to his pages.

3. Inaccessible pages.

Page 70: Diapo

70

Spam Farms – (1)

� Spammer’s goal:

� Maximize the PageRank of target page t.

� Technique:

1. Get as many links from accessible pages as possible to target page t.

2. Construct “link farm” to get PageRankmultiplier effect.

Page 71: Diapo

71

Spam Farms – (2)

InaccessibleInaccessible

t

Accessible Own

1

2

M

Goal: boost PageRank of page t.One of the most common and effectiveorganizations for a spam farm.

Page 72: Diapo

72

Analysis – (1)

Suppose rank from accessible pages = x.

PageRank of target page = y.

Taxation rate = 1-β.

Rank of each “farm” page = βy/M + (1-β)/N.

InaccessibleInaccessiblet

Accessible Own

12

M

From t

Share of“tax”

Size ofWeb

Page 73: Diapo

73

Analysis – (2)

y = x + βM[βy/M + (1-β)/N] + (1-β)/N

y = x + β2y + β(1-β)M/N

y = x/(1-β2) + cM/N where c = β/(1+β)

InaccessibleInaccessiblet

Accessible Own

12

M

Tax sharefor t.Very small;ignore.

PageRank ofeach “farm” page

Page 74: Diapo

74

Analysis – (3)

�y = x/(1-β2) + cM/N where c = β/(1+β).

�For β = 0.85, 1/(1-β2)= 3.6.

� Multiplier effect for “acquired” page rank.

�By making M large, we can make y as large as we want.

InaccessibleInaccessiblet

Accessible Own

12

M

Page 75: Diapo

75

Detecting Link-Spam

�Topic-specific PageRank, with a set of “trusted” pages as the teleport set is called TrustRank.

�Spam Mass = (PageRank – TrustRank)/PageRank.

� High spam mass means most of your PageRank comes from untrusted sources –you may be link-spam.

Page 76: Diapo

76

Picking the Trusted Set

�Two conflicting considerations:

� Human has to inspect each seed page, so seed set must be as small as possible.

� Must ensure every “good page” gets adequate TrustRank, so all good pages should be reachable from the trusted set by short paths.

Page 77: Diapo

77

Approaches to Picking the Trusted Set

1. Pick the top k pages by PageRank.

� It is almost impossible to get a spam page to the very top of the PageRank order.

2. Pick the home pages of universities.

� Domains like .edu are controlled.