![Page 1: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/1.jpg)
Know your Neighbors: Web Spam Detection Using the Web Topology
Presented By,
SOUMO GORAI
Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa Murdock(1), Fabrizio Silvestri(2)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. ISTI-CNR – Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
![Page 2: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/2.jpg)
Soumo’s Biography
•4th Year CS Major
•Graduating May 2008
•Interesting About Me: Lived in India, Australia, and the U.S.
•CS Interests: Databases, HCI, Web Programming, Networking, Graphics, Gaming.
![Page 3: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/3.jpg)
Here’s all that you can find on the web….
![Page 4: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/4.jpg)
Here’s just some of what really is out there…
![Page 5: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/5.jpg)
And more….
![Page 6: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/6.jpg)
Why so many different things…?
There is a fierce competition for your attention!
Publishing is easy, for personal content as well as for commercial content, advertisements, and economic activity.
…and there’s lots lots lots lots…lots of spam!
![Page 7: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/7.jpg)
What’s Spam?!
![Page 8: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/8.jpg)
Hidden Text
![Page 9: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/9.jpg)
Only hidden text? Here’s a whole fake search engine!!!
![Page 10: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/10.jpg)
Why is Spam bad?
Costs:
• Costs for users: lower precision for some queries
•Costs for search engines: wasted storage space, network resources, and processing cycles
• Costs for the publishers: resources invested in cheating and not in improving their contents
Every undeserved gain in ranking for a spammer is a loss of search precision for the search engine.
![Page 11: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/11.jpg)
How Do We Detect Spam?
•Machine Learning/Training
•Link-based Detection
•Content-based Detection
•Using Links and Contents
•Using Web-based Topology
![Page 12: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/12.jpg)
Machine Learning/Training
![Page 13: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/13.jpg)
ML Challenges
Machine Learning Challenges:
•Instances are not really independent (graph)
•Training set is relatively small
Information Retrieval Challenges:
•It is hard to find out which features are relevant
•It is hard for search engines to provide labeled data
•Even if they do, it will not reflect a consensus on what is Web Spam
![Page 14: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/14.jpg)
Link-based Detection
Single-level link farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]
![Page 15: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/15.jpg)
Why use it?
• Degree-related measures
• PageRank
• TrustRank [Gyöngyi et al., 2004]
• Truncated PageRank [Becchetti et al., 2006]: similar to PageRank, but it limits the influence of a page on the PageRank score of its close neighbors. Thus, the Truncated PageRank score is a useful feature for spam detection because spam pages generally try to reinforce their PageRank scores by linking to each other.
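The Truncated PageRank idea can be sketched in a few lines. This is a minimal illustration, not the formulation from [Becchetti et al., 2006]: it assumes the score is a sum of damped path contributions in which paths of length `truncation` or shorter are ignored, it skips the normalizing constant, and it drops the mass of pages without out-links. The function name, parameters, and toy graph are hypothetical.

```python
def truncated_pagerank(out_links, alpha=0.85, truncation=2, max_len=50):
    """Sum damped path contributions, ignoring paths of length <= truncation.

    Simplifications (assumptions): scores are not normalized, and pages
    without out-links simply stop propagating their weight.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    # walk[v]: damped weight of all paths of the current length that end at v
    walk = {v: 1.0 / n for v in nodes}
    score = {v: 0.0 for v in nodes}
    for length in range(1, max_len + 1):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                nxt[v] += alpha * walk[u] / len(targets)
        walk = nxt
        if length > truncation:  # close neighbors contribute nothing
            for v in nodes:
                score[v] += walk[v]
    return score


# Toy graph: page "d" is supported only by a direct in-link from "c".
graph = {"a": ["b"], "b": ["a"], "c": ["d"]}
full = truncated_pagerank(graph, truncation=0)
truncated = truncated_pagerank(graph, truncation=2)
print(full["d"], truncated["d"])
```

In the toy graph, page "d" keeps a positive score when nothing is truncated but drops to zero once short paths are ignored, because all of its support comes from a direct neighbor.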
![Page 16: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/16.jpg)
Degree-based
Measures related to in-degree and out-degree
Edge-reciprocity (the number of links that are reciprocal)
Assortativity (the ratio between the degree of a particular page and the average degree of its neighbors)
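The degree-based measures above can be computed directly from an adjacency list. The sketch below is illustrative only: the function name is hypothetical, and it assumes that "degree" means in-degree plus out-degree and that a page's neighbors are all pages linked to or from it.

```python
def degree_features(out_links, page):
    """Degree-based features for one page of a directed graph."""
    # Build the reverse adjacency list once.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)

    def degree(p):
        # Assumption: degree = out-degree + in-degree.
        return len(set(out_links.get(p, []))) + len(in_links.get(p, set()))

    outs = set(out_links.get(page, []))
    ins = in_links.get(page, set())
    neighbors = outs | ins
    avg_neighbor_degree = (
        sum(degree(n) for n in neighbors) / len(neighbors) if neighbors else 0.0
    )
    return {
        "in_degree": len(ins),
        "out_degree": len(outs),
        # Out-links that are answered by a link back.
        "reciprocal_links": len(outs & ins),
        # Degree of the page divided by the average degree of its neighbors.
        "degree_ratio": degree(page) / avg_neighbor_degree
        if avg_neighbor_degree
        else 0.0,
    }


graph = {"a": ["b", "c"], "b": ["a"], "c": []}
print(degree_features(graph, "a"))
```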
![Page 17: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/17.jpg)
TrustRank / PageRank
TrustRank: an algorithm that starts from a set of trusted nodes (chosen with the help of PageRank-style scores) and measures how strongly a page is connected to those known trusted pages. The result is the page's TrustRank score.
Ratio between TrustRank and PageRank
Number of home pages
Cons: this alone is not sufficient as there are many false positives.
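As a rough illustration of the TrustRank / PageRank ratio, both scores can be computed with the same power iteration, differing only in where the random jump lands. This is a sketch under assumptions, not the procedure from [Gyöngyi et al., 2004]: the seed set is given by hand, the helper name `biased_pagerank` and the toy graph are hypothetical, and pages without out-links redistribute their score along the jump vector.

```python
def biased_pagerank(out_links, teleport, alpha=0.85, iterations=50):
    """PageRank whose random jump follows the `teleport` weights."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    total = sum(teleport.get(v, 0.0) for v in nodes)
    if total == 0:
        raise ValueError("teleport must give weight to at least one node")
    jump = {v: teleport.get(v, 0.0) / total for v in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        nxt = {v: (1 - alpha) * jump[v] for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                for v in targets:
                    nxt[v] += alpha * rank[u] / len(targets)
            else:
                # Assumption: dangling pages spread their score like a jump.
                for v in nodes:
                    nxt[v] += alpha * rank[u] * jump[v]
        rank = nxt
    return rank


graph = {
    "good": ["news"],
    "news": ["good"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "spam": ["farm1", "farm2"],
}
all_nodes = {"good", "news", "farm1", "farm2", "spam"}
pagerank = biased_pagerank(graph, {v: 1.0 for v in all_nodes})  # uniform jump
trustrank = biased_pagerank(graph, {"good": 1.0})  # jump only to trusted seeds
ratio = {v: trustrank[v] / pagerank[v] for v in all_nodes}
print(sorted(ratio.items()))
```

In this toy graph the link farm earns a healthy PageRank but no trust, so its ratio is zero. On real data the separation is far less clean, which is consistent with the false positives noted above.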
![Page 18: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/18.jpg)
Content-based Detection
Most of the features reported in [Ntoulas et al., 2006]:
•Number of words in the page and title
•Average word length
•Fraction of anchor text
•Fraction of visible text
•Compression rate
•Corpus precision and corpus recall
•Query precision and query recall
•Independent trigram likelihood
•Entropy of trigrams
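A few of these content features are easy to approximate from raw text. The sketch below is an assumption-laden illustration rather than the feature extraction used in the paper: it tokenizes on whitespace, takes compression rate to be uncompressed size divided by zlib-compressed size, and computes entropy over word trigrams. The function name is hypothetical.

```python
import math
import zlib
from collections import Counter


def content_features(text):
    """Approximate a handful of content-based features for one page."""
    words = text.lower().split()  # assumption: whitespace tokenization
    raw = text.encode("utf-8")
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in trigrams.values())
        if total
        else 0.0
    )
    return {
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Assumption: uncompressed bytes / compressed bytes.
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }


repetitive = "cheap pills " * 200
varied = "Researchers presented a study of link structure at a conference."
print(content_features(repetitive))
print(content_features(varied))
```

Highly repetitive, keyword-stuffed text compresses much better than ordinary prose, so it shows a high compression rate and a low trigram entropy.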
![Page 19: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/19.jpg)
Corpus and Query
F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page
Computation Techniques:
corpus precision: the fraction of words (except stopwords) in a page that appear in the set of popular terms of a data collection.
corpus recall: the fraction of popular terms of the data collection that appear in the page.
query precision: the fraction of words in a page that appear in the set of q most popular terms appearing in a query log.
query recall: the fraction of q most popular terms of the query log that appear in the page.
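With the sets F, Q, and P defined above, the four measures reduce to set overlaps. The following is a minimal sketch that assumes distinct terms are counted (not token occurrences) and that stopwords have already been removed; the function name and example terms are hypothetical.

```python
def precision_recall(page_terms, popular_terms):
    """Return (precision, recall) of a page against a set of popular terms.

    precision = |P ∩ F| / |P|, recall = |P ∩ F| / |F|. Pass the query-log
    terms Q instead of F to obtain query precision and query recall.
    """
    page = set(page_terms)
    popular = set(popular_terms)
    overlap = page & popular
    precision = len(overlap) / len(page) if page else 0.0
    recall = len(overlap) / len(popular) if popular else 0.0
    return precision, recall


P = {"cheap", "pills", "online", "zzz"}  # terms in a page
F = {"online", "news", "cheap"}          # most frequent terms in the collection
corpus_precision, corpus_recall = precision_recall(P, F)
print(corpus_precision, corpus_recall)
```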
![Page 20: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/20.jpg)
Visual Clues
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
![Page 21: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/21.jpg)
Links AND Contents Detection
Why Both?:
![Page 22: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/22.jpg)
Web Topology Detection
• Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
• Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
•Spam tends to be clustered on the Web (black on figure)
![Page 23: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/23.jpg)
Topological dependencies: out-links and in-links
Let SOUT(x) be the fraction of spam hosts linked by host x out of all labeled hosts linked by host x. This figure shows the histogram of SOUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts.
Let SIN(x) be the fraction of spam hosts that link to host x out of all labeled hosts that link to x. This figure shows the histograms of SIN for spam and non-spam hosts. In this case there is a clear separation between spam and non-spam hosts.
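The two definitions above translate directly into code. This is a small sketch with hypothetical function names and a made-up host graph; unlabeled neighbors are ignored, and the result is `None` when a host has no labeled neighbors (an assumption, since the slides do not say how that case is handled).

```python
def spam_fraction(neighbors, labels):
    """Fraction of spam among the labeled hosts in `neighbors`."""
    labeled = [labels[n] for n in neighbors if n in labels]
    if not labeled:
        return None  # assumption: undefined without labeled neighbors
    return sum(1 for label in labeled if label == "spam") / len(labeled)


def s_out(x, out_links, labels):
    """SOUT(x): spam fraction among labeled hosts that x links to."""
    return spam_fraction(out_links.get(x, []), labels)


def s_in(x, out_links, labels):
    """SIN(x): spam fraction among labeled hosts that link to x."""
    sources = [u for u, targets in out_links.items() if x in targets]
    return spam_fraction(sources, labels)


out_links = {
    "h1": ["h2", "h3", "h4"],
    "h2": ["h3"],
    "h3": [],
    "h4": ["h3"],
}
labels = {"h1": "nonspam", "h2": "spam", "h3": "spam"}  # h4 is unlabeled
print(s_out("h1", out_links, labels), s_in("h3", out_links, labels))
```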
![Page 24: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/24.jpg)
Clustering: if the majority of a cluster is predicted to be spam then we change the prediction for all hosts in the cluster to spam. The inverse holds true too.
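The clustering rule above amounts to a majority vote inside each cluster. The sketch below assumes the clusters are already given as lists of hosts and that a strict majority is required, with ties left unchanged; the function name and host names are hypothetical.

```python
def smooth_by_cluster(predictions, clusters):
    """Overwrite each host's prediction with its cluster's majority label."""
    smoothed = dict(predictions)
    for hosts in clusters:
        spam_votes = sum(1 for h in hosts if predictions[h] == "spam")
        if 2 * spam_votes > len(hosts):
            label = "spam"
        elif 2 * spam_votes < len(hosts):
            label = "nonspam"
        else:
            continue  # assumption: leave tied clusters untouched
        for h in hosts:
            smoothed[h] = label
    return smoothed


predictions = {"a": "spam", "b": "spam", "c": "nonspam", "d": "nonspam", "e": "spam"}
clusters = [["a", "b", "c"], ["d", "e"]]
print(smooth_by_cluster(predictions, clusters))
```

Here host "c" is relabeled as spam because two of the three hosts in its cluster were predicted to be spam, while the tied cluster keeps its original predictions.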
![Page 25: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/25.jpg)
Article Critique
Pros:
•Has detailed descriptions of various detection mechanisms.
•Integrates link and content attributes for building a system to detect Web spam
Cons:
•Lacks statistics and success rates for other content-based detection techniques.
•Some graphs had axis labels missing.
Extension:
Combine the regularization methods at hand (regularization: any method of preventing a model from overfitting the data) in order to improve the overall accuracy.
![Page 26: Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.](https://reader030.fdocuments.us/reader030/viewer/2022032606/56649e875503460f94b8abdf/html5/thumbnails/26.jpg)
Summary
How Do We Detect Spam?
•Machine Learning/Training
•Link-based Detection
•Content-based Detection
•Using Links and Contents
•Using Web-based Topology
Why is Spam bad?
Costs:
•Costs for users: lower precision for some queries
•Costs for search engines: wasted storage space, network resources, and processing cycles
•Costs for the publishers: resources invested in cheating and not in improving their contents
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.