Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)

Spam detection with a content-based random-walk algorithm
F. Javier Ortega ([email protected]), José A. Troyano ([email protected]), Craig Macdonald ([email protected]), Fermín Cruz ([email protected])

Description

Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote spam web pages in the ranking provided by a web search engine.

Cite as: F. Javier Ortega, Craig Macdonald, José A. Troyano, and Fermín L. Cruz. "Spam Detection with a Content-based Random-Walk Algorithm". In Proceedings of the Second International Workshop on Search and Mining User-Generated Contents (SMUC), at the International Conference on Information and Knowledge Management. Toronto, Canada, 2010.

Transcript of Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)

Page 2

Index

♦ Introduction
♦ Related work
♦ Content-based
♦ Link-based
♦ Our Approach
♦ Random-walk algorithm
♦ Content-based metrics
♦ Selection of seeds
♦ Experiments
♦ Future work
♦ References

Page 3

Introduction

♦ Web Spam: phenomenon where a number of web pages are created for the purpose of making a search engine deliver undesirable results for a given query.

Page 4

Introduction

♦ Self-Promotion: gaining high relevance for a search engine, mainly through the textual content.

e.g.: stuffing the web page with a number of popular keywords.

Page 5

Introduction

♦ Mutual-Promotion: gaining a high score by focusing the attention on the out-links and in-links of a web page.

e.g.: a web page with lots of in-links can be considered relevant by a search engine.

Page 6

Introduction

♦ Web Spam characteristics:

♦ Textual content: large amounts of invisible content, a set of words with high frequency, lots of hyperlinks with large anchor texts, very long words, etc.

♦ Link-farms: large numbers of pages pointing to one another, in order to improve their scores by increasing the number of in-links to them.

♦ Good pages usually point to good pages.

♦ Spam pages mainly point to other spam pages (link-farms); they rarely point to good pages.

Page 7

Related work: Content-based

♦ Content-based techniques classify web pages as spam or not-spam according to their textual content.

♦ Heuristics to determine the spam likelihood of a web page.
♦ Meta tag content, anchor texts, URL of the page, average length of the words, compression rate, etc. [10, 12]
♦ Inclusion of link-based scores and metrics into a classifier [3]

♦ Link-based techniques exploit the relations between web pages to obtain a ranking of pages, ordered according to their spam likelihood.

♦ Random-walk algorithms that penalize spam-like behaviours.
♦ Some do not take the nearest neighbours into account [1]
♦ Others take only the scores received from a specific set of good or bad pages [7, 11]

Page 8

Our Approach

♦ Our approach combines both techniques:

♦ A set of content-based metrics that obtains information from each single web page.

♦ A link-based algorithm that processes the relations between web pages.

♦ The goal is to obtain a ranking of web pages in which spam web pages are demoted according to their spam likelihood.

Page 9

Our Approach

Pipeline: Web pages → Content-based metrics → Selection of seeds → Random-walk algorithm (over the web graph).

Page 10

Our Approach: random-walk algorithm

♦ We propose a random-walk algorithm that computes two scores for each web page:

♦ PR⁺: relevance of a web page
♦ PR⁻: spam likelihood of a web page

♦ PR⁻(b) changes according to the relation of b with spam-like web pages; analogously for PR⁺.

♦ For a link a → b: the higher PR⁺(a), the higher PR⁺(b); the higher PR⁻(a), the higher PR⁻(b).

Page 11

Our Approach: random-walk algorithm

♦ Formula: (shown as an image on the slide)

♦ Intuition: a page pointed to by pages with high PR⁺ obtains a higher PR⁺; a page pointed to by pages with high PR⁻ obtains a higher PR⁻.

Page 12

Our Approach: content-based metrics

♦ Content-based metrics are intended to extract a-priori information from the textual content of the web pages.

♦ Content-based metrics must be:
♦ Easy to obtain: cheap to compute, to preserve performance.
♦ Accurate: precision is preferred over recall.

Page 13

Our Approach: content-based metrics

♦ Selected metrics:

♦ Compressibility: ratio between the sizes of a web page before and after compression.

♦ Fraction of globally popular words: a web page in which a high fraction of the words are among the most popular words of the entire corpus is likely to be spam.

♦ Average length of words: non-spam web pages show a bell-shaped distribution of average word length, while malicious pages show much higher values.
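The three metrics are simple enough to sketch directly; the exact tokenisation and normalisation used in the paper may differ, so treat these as illustrative definitions:

```python
# Illustrative implementations of the three content-based metrics.
import zlib

def compressibility(text):
    # Ratio of compressed size to raw size: pages built from repeated
    # keywords compress much better, i.e. yield a lower ratio.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def popular_word_fraction(text, popular_words):
    # Fraction of the page's words that belong to the corpus-wide
    # set of most popular words.
    words = text.lower().split()
    return sum(w in popular_words for w in words) / len(words)

def avg_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words)
```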

Page 14

Our Approach: selection of seeds

♦ Seeds: a set of relevant nodes, either spam-like (negative seeds) or trustworthy (positive seeds).

♦ The algorithm gives more relevance to the seeds.

♦ Spam-biased algorithm

Page 15

Our Approach: selection of seeds

♦ Unsupervised method: content-based metrics as features to choose the seeds.

♦ Pros:
♦ Human intervention is not needed.
♦ A larger number of seeds can be considered.
♦ Inclusion of textual content into a link-based method.

♦ Due to the lack of human intervention, some seeds may be "false positives".

Page 16

Our Approach: selection of seeds

♦ Obtaining an a-priori score for a node a: (formula shown as an image on the slide)

♦ Selecting seeds, three approaches:

♦ Pos/Neg Approach

♦ Pos/Neg Metrics Approach

♦ Metric-based Approach
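The scoring formula and the three selection rules are shown as images on the slide, so the following is only a plausible sketch: combine the content metrics into an a-priori spam score per node, then take the most spam-like nodes as negative seeds and the least spam-like as positive seeds. All names are illustrative:

```python
# Illustrative seed selection from a-priori spam scores in [0, 1].
# The slide's exact formulas are not reproduced here.

def apriori_score(metric_values):
    # Simple combination: average of metric values, each assumed to be
    # normalised so that higher means more spam-like.
    return sum(metric_values) / len(metric_values)

def select_seeds(scores, n_seeds):
    """scores: dict node -> a-priori spam score.
    Returns (positive_seeds, negative_seeds)."""
    ranked = sorted(scores, key=scores.get)
    positive = ranked[:n_seeds]    # least spam-like nodes
    negative = ranked[-n_seeds:]   # most spam-like nodes
    return positive, negative
```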

Page 17

Experiments

♦ Dataset: WEBSPAM-UK2006*

♦ ~98 million pages

♦ 11,402 hand-labeled hosts

♦ 7,423 labeled as spam

♦ ~10 million spam web pages

♦ Terrier IR Platform

♦ Random-walk algorithm parameters:

♦ Damping factor = 0.85

♦ Convergence threshold = 0.01

* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.

Page 18

Experiments

♦ Evaluation: PR-buckets. Pages are sorted by decreasing PageRank (relevance) and grouped into 20 buckets (PR-bucket 1, PR-bucket 2, ...):

Bucket | Total pages
1      | 14
2      | 54
3      | 144
4      | 437
5      | 1070
6      | 2130
7      | 2664
8      | 2778
...    | ...
17     | 16M
18     | 28M
19     | 28M
20     | 28M
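PR-buckets are typically built by sorting pages by decreasing PageRank and cutting the list so that each bucket holds roughly the same total PageRank mass, which is why early buckets contain a handful of pages and late buckets millions. A small illustrative sketch, assuming this equal-mass construction:

```python
# Illustrative PR-bucket construction: sort pages by decreasing
# PageRank and cut into buckets of roughly equal total PageRank.

def pr_buckets(pagerank, n_buckets=20):
    """pagerank: dict page -> score. Returns a list of buckets."""
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    target = sum(pagerank.values()) / n_buckets
    buckets, current, acc = [], [], 0.0
    for page in ranked:
        current.append(page)
        acc += pagerank[page]
        # close the bucket once it reaches its share of the total mass
        if acc >= target and len(buckets) < n_buckets - 1:
            buckets.append(current)
            current, acc = [], 0.0
    buckets.append(current)  # remaining pages form the last bucket
    return buckets
```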

Page 19

Experiments

♦ Baseline: TrustRank

♦ Link-based technique.

♦ Seeds chosen in a semi-supervised way:
• Hand-picked set of good pages.
• Top pages according to an inverse PageRank.

♦ Random-walk algorithm, biased according to the seeds.

Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.

Page 20

Experiments

(Per-bucket results comparing the TrustRank baseline with the Pos/Neg, Pos/Neg Metrics, and Metric-based approaches; shown as charts on the slide.)

Page 21

Experiments

(Log-scale chart, values 1 to 1000, over PR-buckets 1 to 10, comparing TrustRank, Pos/Neg, Pos/Neg Metrics, and Metric-based approaches; shown as an image on the slide.)

Page 22

Conclusions and future work

♦ Novel web spam detection technique that combines concepts from link-based and content-based methods.

♦ Content-based metrics as an unsupervised seed selection method.

♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood.

♦ Future work:

♦ Including new content-based heuristics.

♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node.

♦ Content-based metrics to characterize also the edges of the web graph.

Page 23

References

[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb '06: Adversarial Information Retrieval on the Web, 2006.

[2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.

[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.

[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.

[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.

[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.

[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004.

[8] T. H. Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.

[9] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002. ACM.

[10] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.

[11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.

[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.

[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999.

[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.