(with an application of Web Spam detection) CS315-Web Search and Mining Power Laws and Rich-Get-Richer Phenomena

Page 1:

(with an application of Web Spam detection)

CS315-Web Search and Mining

Power Laws and Rich-Get-Richer Phenomena

Page 2:

What do these have in common?

The grades of students in a class. The weights of apples.

The high temperatures in Boston on July 4th. The heights of Dutch men. The speed of cars on I-90.

These measurements are well-characterized by the average and the standard deviation.

Most instances are typical. Seeing an outlier is very surprising.

Page 3:

City populations

1. New York 8,310,212

2. Los Angeles 3,834,340

3. Chicago 2,836,658

4. Houston 2,208,180

5. Phoenix 1,552,259

6. Philadelphia 1,449,634

7. San Antonio 1,328,984

8. San Diego 1,266,731

9. Dallas 1,266,372

10. San Jose 939,899

Page 4:

City populations

1. New York 8,310,212

2. Los Angeles 3,834,340

3. Chicago 2,836,658

21. Boston, MA 625,087

248. Cambridge, MA 106,038

25,375. Lost Springs, WY 1

A few cities with high population

Many cities with low population

Page 5:

City populations

Cities ordered on population range

Page 6:

Word Frequencies

Page 7:

Power Law: The number of cities with population > k is proportional to $k^{-c}$.

Page 8:

“fraction of items”

“popularity = k”

Page 9:

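Power Law: The fraction f(k) of items with popularity k is proportional to $k^{-c}$.

$f(k) \propto k^{-c}$, that is, $f(k) = a\,k^{-c}$ for some constant $a > 0$

$\log f(k) = \log a + \log k^{-c}$

$\log f(k) = \log a - c \log k$

$y = b - c\,x$, where $y = \log f(k)$, $x = \log k$, and $b = \log a$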

Page 10:

A power law is a straight line on a log-log plot.
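Since a power law is a straight line on a log-log plot, the exponent c can be read off as the negated slope of that line. The sketch below is only an illustration under stated assumptions: the function name `fit_power_law` and the synthetic histogram are made up for this example and do not come from the slides, and ordinary least squares on log-log data is a simple estimator that can be unreliable on noisy real-world tails.

```python
import math

def fit_power_law(counts):
    """Fit log f(k) = b - c*log k by ordinary least squares.

    counts maps a popularity value k (> 0) to the number of items with
    that popularity. Returns the pair (c, b).
    """
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]                    # x = log k
    ys = [math.log(n / total) for n in counts.values()]   # y = log f(k)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept  # the slope of the line is -c

# Synthetic histogram (hypothetical data): counts proportional to k^-2.
synthetic = {k: 1_000_000 * k ** -2.0 for k in range(1, 101)}
c, b = fit_power_law(synthetic)
print(round(c, 3))
```

On this noise-free synthetic input the recovered exponent equals 2 up to floating-point error; real data will scatter around the line, especially in the tail.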

Page 11:

Number of Web page in-links (Broder+)

Page 12:

Examples (some better than others)

frequency of words

protein-interaction degree distribution

Internet (AS) degree distribution

severity of inter-state wars

severity of terrorist attacks

frequency of bird sightings

size of blackouts

book sales

population of US cities

size of religions

number of citations

papers authored

popularity of surnames

number of web hits

number of web links, with cut-off

number of phone calls

size of email address book

number of species per genus

Page 13:

What is going on?

Nature seems to create bell curves (range around an average)

Human activity seems to create power laws (popularity skewing)

Page 14:

Network Science: Scale-Free Property 2012

“seems to”

Page 15:

How can we use this to… fight spam?

The main idea behind “Spam, Damn Spam and Statistics”:

Spammers manufacture pages and links to fool search engines.

In this process, they will overdo it.

Their actions would likely fall outside normal human activity.

Let’s look for outliers in the power laws!

Page 16:

Web page out-degrees

There are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected.

Page 17:

Web page in-degrees

There are 369,457 pages with in-degree 1001, while according to the trend only 2,000 such pages are expected.
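The degree slides compare an observed page count with the count predicted by the power-law trend. One way this comparison might be automated is sketched below, assuming a simple least-squares trend line on the log-log histogram and a ratio threshold. The helper names, the threshold of 10, and the synthetic histogram are all hypothetical; only the spike of 158,290 pages at out-degree 1301 (against roughly 1,700 expected) is taken from the slides, and this is not claimed to be the procedure the original study used.

```python
import math

def fit_loglog(counts):
    """Least-squares line through (log k, log count); returns (slope, intercept)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(v) for v in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def find_outliers(counts, ratio_threshold=10.0):
    """Return (k, observed, expected) wherever observed exceeds the trend
    by more than ratio_threshold times (the threshold is an assumption)."""
    slope, intercept = fit_loglog(counts)
    flagged = []
    for k, observed in counts.items():
        expected = math.exp(intercept + slope * math.log(k))
        if observed / expected > ratio_threshold:
            flagged.append((k, observed, expected))
    return flagged

# Synthetic trend (hypothetical): count(k) = C / k^2, with C chosen so that
# the trend predicts 1,700 pages at out-degree 1301.
histogram = {k: 2_877_421_700 / k ** 2 for k in range(1, 2001)}
histogram[1301] = 158_290  # the spike reported on the out-degree slide

flagged = find_outliers(histogram)
for k, observed, expected in flagged:
    print(k, observed, round(expected))
```

Because the spike itself takes part in the fit, it pulls the trend line slightly; with thousands of other points the effect is small here, but a more careful version would fit the trend on data with suspected outliers held out.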

Page 18:

Length of the URL’s host

Of the 100 longest hostnames, 80 belong to adult sites and 11 refer to financial and credit-related sites.

Page 19:

Number of host name resolutions to a single IP

Some IP addresses have hundreds of thousands of host names mapped to them. The record-breaking IP address is referred to by 8,967,154 host names.

Page 20:

Clusters of similar pages (shingling)

The blue group is mainly spam. 15 of the 20 largest clusters contain 2,080,112 spam pages.

The red group has duplicated content (not spam).
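The slide names shingling only in passing. As a reminder of the general idea, here is a minimal sketch of word-level w-shingling with Jaccard similarity, the generic textbook form of the technique. The function names, the shingle width of 3, and the sample strings are assumptions made for this illustration, and nothing here reflects the exact clustering pipeline behind the plot.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical pages: the first two differ in a single word.
page_a = "win a free prize today by clicking this link right now"
page_b = "win a free phone today by clicking this link right now"
page_c = "the heights of Dutch men follow a bell curve"

print(jaccard(shingles(page_a), shingles(page_b)))
print(jaccard(shingles(page_a), shingles(page_c)))
```

Pages whose shingle sets overlap heavily can be grouped into one cluster; a large cluster of near-identical, machine-generated pages is then a candidate for the kind of outlier the slide describes.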

Page 21:

Spammers are studious!

Page 22:

Why does data exhibit power laws?

Imitation → power law

Can imitation explain the size of the Web parts?

Page 23:

Constructing a model of the Web

1. Pages are created in order, named 1, 2, …, N.

2. When created, page j links to an earlier page chosen at random:

a) With probability p, pick a page i uniformly at random from pages 1, …, j-1 and link to it (randomness).

b) With probability (1-p), pick a page i uniformly at random from pages 1, …, j-1 and link to the page that i links to (imitation).

This is the well-studied “preferential attachment” model of Web generation.
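The construction above is short enough to simulate directly. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and because the slides do not say what happens when the imitation step picks page 1 (which has no out-link), the sketch assumes the new page then links to page 1 itself.

```python
import random
from collections import Counter

def simulate_web(n, p, seed=0):
    """Simulate the copying model; returns {page j: page that j links to}."""
    rng = random.Random(seed)
    target = {}
    for j in range(2, n + 1):
        i = rng.randint(1, j - 1)  # uniform over the earlier pages 1..j-1
        if rng.random() < p or i not in target:
            # Randomness step (or assumed fallback when i is page 1,
            # which has no out-link to imitate): link to i directly.
            target[j] = i
        else:
            # Imitation step: link to the page that i links to.
            target[j] = target[i]
    return target

def in_degree_histogram(target, n):
    """Map each in-degree value to the number of pages having it."""
    in_degree = Counter(target.values())
    return Counter(in_degree[page] for page in range(1, n + 1))

links = simulate_web(n=100_000, p=0.3, seed=42)
histogram = in_degree_histogram(links, 100_000)
for degree in sorted(histogram)[:10]:
    print(degree, histogram[degree])
```

Plotting the positive in-degrees of `histogram` on log-log axes should show the roughly straight line the model is meant to explain; smaller values of p put more weight on imitation and produce a heavier tail.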

Page 24:

The rich get richer

2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links to

[Figure labels: 1/4, 3/4]

Page 25:

The rich get richer

2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links to

Equivalently,

2 b) With prob. (1-p), pick a page with probability proportional to its in-degree and link to it
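The equivalence of the two versions of step 2 b) can be checked in one line. The symbol $d_\ell$ for the current in-degree of page $\ell$ is notation introduced here, not taken from the slides, and the sketch assumes every earlier page has exactly one out-link (ignoring the edge case of page 1).

```latex
\Pr[\text{imitation step links } j \to \ell]
  = \sum_{i=1}^{j-1} \Pr[\text{pick } i] \cdot \mathbf{1}[\, i \text{ links to } \ell \,]
  = \frac{\#\{\, i < j : i \text{ links to } \ell \,\}}{j-1}
  = \frac{d_\ell}{j-1}
  \;\propto\; d_\ell
```

Combining both branches under the same assumption, page $j$ links to page $\ell$ with total probability $\frac{p}{j-1} + (1-p)\,\frac{d_\ell}{j-1}$, so pages that already have many in-links are more likely to receive the next one.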

Page 26:

Information cascades and the rich

Information cascade = some people get a little bit richer by chance

and then rich-get-richer dynamics = the random rich people get a lot richer very fast

Page 27:

Is popularity predictable?

Why is Harry Potter popular?

If we could replay history, would we still read Harry Potter en masse, or would it be some other book?

(But then, why did J.K. Rowling have trouble publishing it at first?)

Page 28:

Is popularity… random?

Why are “hits” in cultural markets so much more successful than average (and yet so hard to predict)?

Can we study it with an experiment?

“Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market”

14,000 participants, randomly assigned to “social influence” and “independent” conditions, chose among 48 songs by unknown bands in 8+1 parallel worlds.

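[Figure: each subject is assigned either to one of the “social influence” worlds (World 1 through World 8), where they see what others downloaded, or to World 0, where they get no information.]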

Page 29:

Music download site – 8+1 worlds

1. “Let’s go driving,” Barzin

2. “Silence is sexy,” Einsturzende Neubauten

3. “Go it alone,” Noonday Underground

10. “Picadilly Lilly,” Tiger Lillies

1. “Let’s go driving,” Barzin

2. “Silence is sexy,” Einsturzende Neubauten

3. “Go it alone,” Noonday Underground

10. “Picadilly Lilly,” Tiger Lillies

18

3

47

2

The best songs never went to the bottom, and the worst never became popular. But their order changed a lot.