Detecting Web Spam Created with Markov Chains Text Generators
19.09.2009, Anton S. Pavlov, Boris V. Dobrov, Detecting Web Spam Created With Markov Chains Text Generator, RCDL'09
Overview
1. Web Spam
  – Markov chains text generators
2. Proposed Approach
  – Method overview
  – Constraints and features
  – Machine learning
3. Experiments
4. Conclusion
Web Spam (I)
• Web spam – deliberate actions aimed at unjustifiably raising the relevance of some pages in a search engine:
  – Decreases search quality and performance
  – Targets specific algorithms used by search engines (PageRank, BM25, etc.)
  – Efficient if mass-created → automatically generated
• Some types of web spam get into the search results page (doorways)
Web Spam (II)
• Spammers have to create many web pages that are indistinguishable from normal pages:
  – Creating texts manually → too expensive
  – Copying texts from other sources → duplicates can be detected
  – Generating texts automatically: keyword stuffing, Markov chains, text modification
Markov Chains Text Generators (I)
• Markov chain – a sequence of random variables in which each variable depends only on the previous one
• Can be used as a generative model for texts:
  – States – word n-grams
  – Transitions – probability of seeing a word after seeing n words
Markov property for a sequence of random variables $X_1, X_2, \ldots, X_N, \ldots$:
$$P(X_{N+1} = x \mid X_1 = x_1, \ldots, X_N = x_N) = P(X_{N+1} = x \mid X_N = x_N).$$
Markov Chains Text Generators (II)
• Markov chain training:
  – Select a collection of documents
  – Collect the probabilities of seeing a word x in every state
• Generation process:
  – Use the collected statistics to generate a text of predefined length
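The training and generation steps can be illustrated with a minimal Python sketch. It is not the generator used in the experiments; the function names `train` and `generate`, the whitespace tokenization, and the restart-on-dead-end rule are assumptions made for illustration. Successor words are stored with repetition, so sampling from the list follows the empirical transition probabilities.

```python
import random
from collections import defaultdict


def train(tokens, order=2):
    """Map each state (a tuple of `order` words) to the words observed after it."""
    transitions = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        # Duplicates are kept on purpose: they encode the observed frequencies.
        transitions[state].append(tokens[i + order])
    return transitions


def generate(transitions, length, seed=0):
    """Generate `length` words by repeatedly sampling a successor of the current state."""
    rng = random.Random(seed)
    states = sorted(transitions)
    state = rng.choice(states)
    output = list(state)
    while len(output) < length:
        successors = transitions.get(state)
        if not successors:
            # Dead end (state never seen in training): restart from a random state.
            state = rng.choice(states)
            output.extend(state)
            continue
        word = rng.choice(successors)
        output.append(word)
        state = state[1:] + (word,)
    return output[:length]
```

Usage: `generate(train(corpus_text.split(), order=2), length=50)` returns a list of 50 words, where `corpus_text` is any training text with more than two words.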
Markov Chains Text Generators (III)
• Resulting texts retain topical coherence and local coherence
• Example of generated text: "Again, the prevalence of spam within blog comments. Our work in this paper extend our previous work in this paper approximate what will eventually be perceived by users of search engines."
Proposed Method (I)
• Generated texts can simulate only some constraints of natural texts:
  – Local coherence
  – Topical coherence
• Presumably, generated texts violate other constraints:
  – Genre coherence
  – Readability
  – Diversity
  – Etc.
Proposed Method (II)
• Collect various features:
  – Style/Genre
  – Authorship
  – Readability
  – Diversity
• Use machine learning to create an automatic spam classifier
[Diagram: style/genre identification features, authorship identification features, readability metrics, and text diversity features → machine learning → automatic spam classifier]
Genre and Style
• Part-of-speech (POS) usage statistics
  – Especially rare POS statistics
• Punctuation statistics:
  – Expressive punctuation («!», «?»)
  – Smileys («:)»)
  – References ([23])
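The punctuation statistics could be computed roughly as follows. This is a sketch under assumptions: the regular expressions for smileys and bracketed references, the function names, and the normalization per 1000 characters are illustrative choices, not details taken from the slides.

```python
import re


def punctuation_counts(text):
    """Count expressive punctuation, simple smileys, and bracketed numeric references."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        # Assumed smiley shapes: ":)", ":(", ";)", ":-)", ":D", and similar.
        "smileys": len(re.findall(r"[:;]-?[)(D]", text)),
        # Assumed reference shape: a number in square brackets, e.g. "[23]".
        "references": len(re.findall(r"\[\d+\]", text)),
    }


def punctuation_features(text):
    """Normalize the counts per 1000 characters so documents of different length compare."""
    n_chars = max(len(text), 1)
    return {name: 1000.0 * count / n_chars
            for name, count in punctuation_counts(text).items()}
```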
Authorship
• Part-of-speech statistics
• Standard deviations of part-of-speech ratios
  – Helps identify texts with mixed authorship
• Sentences with several verbs
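One possible reading of these features is sketched below. The input is assumed to be already POS-tagged (a list of sentences, each a list of tag strings); the window size, the tag name `"VERB"`, and the function names are hypothetical. A text stitched together from several authors would be expected to show larger deviations between windows.

```python
from collections import Counter
from statistics import pstdev


def pos_ratio_stdev(tagged_sentences, window=5):
    """Standard deviation of each POS tag's share across windows of sentences."""
    chunks = [tagged_sentences[i:i + window]
              for i in range(0, len(tagged_sentences), window)]
    ratios = []
    for chunk in chunks:
        tags = [tag for sentence in chunk for tag in sentence]
        if not tags:
            continue
        counts = Counter(tags)
        ratios.append({tag: n / len(tags) for tag, n in counts.items()})
    all_tags = {tag for r in ratios for tag in r}
    return {tag: pstdev([r.get(tag, 0.0) for r in ratios]) for tag in all_tags}


def multi_verb_sentence_ratio(tagged_sentences, verb_tag="VERB", min_verbs=2):
    """Share of sentences that contain at least `min_verbs` verbs."""
    if not tagged_sentences:
        return 0.0
    hits = sum(1 for s in tagged_sentences if s.count(verb_tag) >= min_verbs)
    return hits / len(tagged_sentences)
```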
Readability
• Readability metrics:
  – Average word length
  – Ratio of long words
  – Average, minimum, maximum sentence length
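A minimal sketch of these metrics follows. The slides do not say what counts as a long word or how sentences are split, so the threshold `long_word_len=7` and the split on `.`, `!`, `?` are assumptions; sentence length is measured in words.

```python
import re


def readability_features(text, long_word_len=7):
    """Word-length and sentence-length statistics for one document."""
    words = re.findall(r"\w+", text)
    if not words:
        return {"avg_word_length": 0.0, "long_word_ratio": 0.0,
                "avg_sentence_length": 0.0, "min_sentence_length": 0,
                "max_sentence_length": 0}
    # Naive sentence split; pieces without any word characters are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if re.search(r"\w", s)]
    sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "long_word_ratio": sum(len(w) >= long_word_len for w in words) / len(words),
        "avg_sentence_length": sum(sentence_lengths) / len(sentence_lengths),
        "min_sentence_length": min(sentence_lengths),
        "max_sentence_length": max(sentence_lengths),
    }
```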
Diversity
• Zipf’s law:
  – For words
  – For stemmed nouns
• Compression rates:
  – gzip
  – bz2
• Same words occurring in neighboring sentences
Zipf’s law, with $i$ the rank of a word by frequency:
$$\mathrm{Freq}(i) \propto \frac{1}{i^{\alpha}}$$
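The diversity features can be sketched with the Python standard library. Several details are assumptions: the compression ratio is taken as compressed size divided by original size, the Zipf feature is the exponent fitted by least squares on the log-log rank/frequency curve, and the neighboring-sentence feature is the share of adjacent sentence pairs with at least one word in common. The slides do not state how these were actually defined.

```python
import bz2
import gzip
import math
import re
from collections import Counter


def compression_ratio(text, compress=gzip.compress):
    """Compressed size / original size; repetitive text gives a small ratio."""
    raw = text.encode("utf-8")
    return len(compress(raw)) / len(raw) if raw else 0.0


def zipf_exponent(words):
    """Least-squares slope of log(frequency) against log(rank), sign-flipped."""
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        return 0.0
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


def shared_word_ratio(sentences):
    """Share of neighboring sentence pairs that have at least one word in common."""
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    pairs = list(zip(bags, bags[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a & b) / len(pairs)
```

Usage: `compression_ratio(text, bz2.compress)` gives the bz2 variant of the same feature.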
Machine Learning
• Extracted 61 features for each document
• Compared two popular ML algorithms:
  – SVM
  – C4.5 + Bagging
Decision Trees
• Algorithm based on C4.5:
  – Each split minimizes information entropy
  – Use different subsets of the training set to build a tree and to select its weight → avoids overfitting
  – Use bagging to merge multiple weak classifiers
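The idea of entropy-minimizing splits combined with bagging can be shown with a simplified sketch. It is not C4.5 and not the authors' implementation: each weak classifier is a depth-one tree (a stump), the split criterion is plain weighted entropy rather than C4.5's gain ratio, and the vote is unweighted, whereas the slides mention per-tree weights selected on a separate subset. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def best_split(rows, labels):
    """Return (score, feature, threshold, left_label, right_label) with minimal weighted entropy."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold, majority(left), majority(right))
    return best


def train_stump(rows, labels):
    """Depth-one tree; falls back to a constant prediction if no split exists."""
    split = best_split(rows, labels)
    if split is None:
        label = majority(labels)
        return lambda row: label
    _, feature, threshold, left_label, right_label = split
    return lambda row: left_label if row[feature] <= threshold else right_label


def bagging(rows, labels, n_models=11, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, row):
    """Unweighted majority vote over all stumps."""
    return majority([model(row) for model in models])
```

In a usage such as `predict(bagging(rows, labels), row)`, each row would be a feature vector like `[gzip_ratio, max_sentence_length]`, and each label `"spam"` or `"ham"`.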
[Figure: example decision tree. Tests: "Gzip compr. ratio > 0.6" and "Max sentence length > 32"; branches labeled N and Y; leaves: Spam, Ham, Spam]
Experiments
• Experiments were conducted on the ROMIP By.Web collection
• Web spam sources:
  – Rusadult doorway generator (rusadult)
  – Doorway.su doorway generator (doorway_su)
  – Order-2 Markov chain text generator (markov2)
• Generated 3 training and 3 test sets:
  – Each contained 10000 By.Web documents (not spam)
  – Each contained 10000 documents from generators (spam)
• Measured precision, recall, and F-measure
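For reference, the three measures can be computed as in the sketch below, treating spam as the positive class. The slides do not specify which F-measure was used; the balanced F1 (harmonic mean of precision and recall) is assumed here, and the function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall, and balanced F-measure for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```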
Detecting Generated Web Spam
Real World Spam Examples
• Trained classifiers on texts generated by the markov2 generator
• Checked whether the same classifier can be used to detect real spam:
  – Real spam samples from the Yandex Blog Search
  – Manually collected 600 automatically generated spam texts
  – Measured recall on the real spam samples, and F-measure on the previously used test set
• Measured how adding samples of real spam affected classification
Real World Spam Examples
• Car sales. Price catalog! car audio Large selection, good prices The Daewoo Nexia model is the latest modification of the Opel Kadett E model. In 1986, licensed production of this car began in car video recorder Buying and selling cars in Tyumen Many private ads for the sale of cars in Tyumen on New and. cars.
Detecting Real World Spam
Conclusion
• Proposed a new approach to web spam detection
• Showed that web spam generated using Markov chains can be detected
• Detected spam not in the original training datasets
The End
• Questions?