
Page 1: Web spamming Detecting Spam Web Pages through Content Analysis Alexandros Ntoulas et al, 2006, International World Wide Web Conference.

Web spamming

Detecting Spam Web Pages through Content Analysis

Alexandros Ntoulas et al, 2006, International World Wide Web Conference

Page 2:

• link stuffing: to game link-based ranking, black-hat SEO techniques include creating extraneous pages that link to a target page

• keyword stuffing: the content of other pages may be “engineered” so as to appear relevant to popular searches

Page 3:

Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user

Page 4:

Web spam

• The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.

Page 5:

Why do spammers spam the web?

• First, for economic gain: by getting a search engine to rank a spam site highly, web searchers are drawn to the spam site and the spammer profits.

• Second, as an attack on the search engine itself: when a search engine exposes spam sites, users stop trusting the quality of its results.

• Finally, to make the search engine waste storage space, processing time, and network resources on useless spam pages. – 1/7 of English-language pages

Page 6:

Importance of detecting web spam

• Creating an effective spam detection method is a challenging problem.
– Given the size of the web, such a method has to be automated.

– However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.

– At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.

Page 7:

Web spamming techniques

Page 8:

• “Web Spam Taxonomy” by Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005

Page 9:

Term Spamming

• p: page, q: query words
• TF(t) = the number of occurrences of term t in the document
• IDF(t) = the inverse of the number of documents containing term t

• Term spamming targets search engines whose ranking algorithms are based on TF-IDF scores.
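The TF-IDF ranking that term spamming exploits can be sketched as follows. This is a minimal illustration only; the corpus, term lists, and function names are hypothetical, and real engines use far more elaborate scoring.

```python
import math

def tfidf_score(page_terms, query_terms, corpus):
    """Score a page for a query by summing TF(t) * IDF(t) over query terms
    (a minimal sketch of TF-IDF ranking, not any engine's actual scorer)."""
    n_docs = len(corpus)
    score = 0.0
    for t in set(query_terms):
        tf = page_terms.count(t)                   # occurrences in the page
        df = sum(1 for doc in corpus if t in doc)  # documents containing t
        if tf and df:
            score += tf * math.log(n_docs / df)    # rarer terms weigh more
    return score

# Hypothetical mini-corpus and pages:
corpus = [["cheap", "cameras"], ["star", "trek"], ["cheap", "deals", "lens"]]
stuffed = ["cheap"] * 10 + ["cameras"]   # keyword-stuffed page
normal = ["cheap", "cameras", "review"]  # ordinary page
# Repeating "cheap" inflates TF and hence the page's TF-IDF score.
```

This makes the spammer's incentive concrete: because TF grows with every repetition, stuffing a popular query term directly raises the page's score for that query.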

Page 10:

Term Spamming
• Body / title / meta tag / anchor text

<meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">

<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>

• URL spam: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com

Page 11:

How term spamming is done
• Repetition of one or a few specific terms
• Dumping of a large number of unrelated terms
• Weaving of spam terms into copied content

• Phrase stitching, gluing together sentences or phrases from many sources, is also used by spammers to create content quickly

Page 12:

Link Spamming

• A technique that exploits the workings of the PageRank algorithm by manipulating outgoing and incoming links

Page 13:

Outgoing links

• A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score.

• At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com), and spammers simply replicate some or all of such directory pages to acquire many outgoing links quickly.

Page 14:

Incoming links

• Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).

• Post links on blogs, unmoderated message boards, guest books, or wikis: spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.

Page 15:

Hiding Techniques: Content Hiding

Page 16:

Hiding Techniques: Cloaking

If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page) and, at the same time, send a spammed document to the search engine for indexing.
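The user-agent check at the heart of cloaking can be sketched as follows. The crawler signature list and file names are hypothetical; real spammers also match crawler IP ranges.

```python
# Hypothetical crawler User-Agent substrings (real spammers also match IPs).
CRAWLER_SIGNATURES = ("googlebot", "bingbot", "msnbot", "slurp")

def select_document(user_agent):
    """Cloaking: return the keyword-stuffed page to a crawler for indexing,
    and the clean page (no traces of spam) to a regular browser."""
    ua = user_agent.lower()
    if any(sig in ua for sig in CRAWLER_SIGNATURES):
        return "spammed_for_index.html"   # hypothetical file name
    return "clean_for_users.html"         # hypothetical file name
```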

Page 17:

Hiding Techniques: Redirection

Page 18:

Spam occurrence per top-level domain

• 105,484,446 web pages, collected by the MSN Search crawler during August 2004.

Page 19:

Spam occurrence per language in our data set.

Page 20:

Prevalence of spam - number of words on page

Page 21:

Prevalence of spam - number of words in title

Page 22:

Prevalence of spam - average word-length of page

Page 23:

Prevalence of spam - visible content on page

Page 24:

Prevalence of spam - compressibility of page
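The compressibility feature can be computed as the ratio of a page's raw size to its zlib-compressed size: pages that repeat the same phrases compress unusually well. The sketch below is illustrative only; the example texts are made up, and the paper derives a feature from this ratio rather than a fixed threshold.

```python
import zlib

def compression_ratio(text):
    """Ratio of uncompressed to zlib-compressed size; higher means the
    page is more repetitive (a signal usable as a spam feature)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

spam_text = "free cheap cameras great deals " * 200   # repetitive, spam-like
varied_text = ("The quick brown fox jumps over the lazy dog while "
               "seventeen geese wander past an old stone bridge.")
# The repetitive page compresses far better than the varied one.
```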

Page 25:

Classification model to detect spam

Page 26:

• Given the training set DS, we generate N training sets by sampling n random items with replacement.

• For each of the N training sets, we now create a classifier, thus obtaining N classifiers.

• In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class.

• The eventual class of the page is the class with the majority of the votes.
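The bagging procedure described above can be sketched as follows. This is an illustration only: the one-feature threshold "stump" base learner and the toy feature values are assumptions, not the classifiers or features used in the paper.

```python
import random
from collections import Counter

def train_stump(samples):
    """Toy one-feature threshold classifier (a hypothetical base learner)."""
    spam = [x for x, y in samples if y == "spam"]
    ham = [x for x, y in samples if y == "nonspam"]
    if not spam or not ham:          # degenerate bootstrap sample
        return lambda x: "spam" if x > 0.5 else "nonspam"
    thr = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: "spam" if x > thr else "nonspam"

def bagging(train_set, n_classifiers, rng):
    """Draw N bootstrap samples of size n (with replacement) from the
    training set DS and train one classifier per sample."""
    n = len(train_set)
    return [train_stump([rng.choice(train_set) for _ in range(n)])
            for _ in range(n_classifiers)]

def classify(classifiers, x):
    """Each of the N classifiers casts a vote; the majority class wins."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical feature: fraction of the page occupied by popular keywords.
data = [(0.90, "spam"), (0.80, "spam"), (0.85, "spam"),
        (0.10, "nonspam"), (0.20, "nonspam"), (0.15, "nonspam")]
ensemble = bagging(data, n_classifiers=5, rng=rng)
```

Bootstrap resampling gives each classifier a slightly different view of the data, so the majority vote is more stable than any single classifier.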

Page 27:

Bagging & Boosting

Evaluation is summarized with a confusion matrix (rows = actual class, columns = predicted class):

                  predicted spam   predicted non-spam
actual spam             A                  B
actual non-spam         C                  D
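From the A/B/C/D cells of such a confusion matrix, the usual evaluation metrics follow directly. A sketch; the cell counts in the example are made up.

```python
def metrics(A, B, C, D):
    """Standard measures from a confusion matrix, where rows are the
    actual class and columns the predicted class:
      A: spam classified as spam        B: spam classified as non-spam
      C: non-spam classified as spam    D: non-spam classified as non-spam"""
    recall = A / (A + B)                  # fraction of spam that is caught
    precision = A / (A + C)               # fraction of flagged pages truly spam
    accuracy = (A + D) / (A + B + C + D)  # fraction classified correctly
    return precision, recall, accuracy

# Hypothetical counts: 100 spam and 900 non-spam pages in the test set.
p, r, acc = metrics(A=80, B=20, C=10, D=890)
# precision = 80/90 ≈ 0.889, recall = 0.80, accuracy = 0.97
```

Precision matters especially here: as the slides stress, a detector must not mistakenly label legitimate pages as spam.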

Page 28:

Challenges in Web Information Retrieval

Mehran Sahami, Vibhu Mittal, Shumeet Baluja, Henry Rowley

Google Inc.

Page 29:

Information Retrieval on the Web

• Goal: identify which pages are of high quality and relevance to a user’s query.
– PageRank, HITS

• Two challenges
– Adversarial classification: detecting web spamming
– Evaluating search results

Page 30:

PageRank

• Assume four web pages: A, B, C, and D.
• The initial values of PageRank:
– PR(A) = PR(B) = PR(C) = PR(D) = 0.25
• PageRank for any page u:
PR(u) = Σ_{v ∈ B_u} PR(v) / N_v
• B_u = {v | v links to page u}
• N_v = the number of links from page v

Page 31:

PR(A) = PR(C)/1

PR(B) = PR(A)/2

PR(C) = PR(A)/2 + PR(B)/1 + PR(D)/1

PR(D) = 0
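The four equations above are one step of power iteration on the link graph A→{B, C}, B→{C}, C→{A}, D→{C}. Iterating them to a fixed point can be sketched as follows (simplified PageRank without a damping factor, matching the slide's example):

```python
def pagerank(links, iterations=50):
    """Simplified PageRank (no damping factor, as in the slide's example):
    PR(u) = sum of PR(v)/N_v over all pages v in B_u that link to u."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}   # PR = 0.25 each for 4 pages
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += pr[v] / len(outs)     # v spreads its rank evenly
        pr = new
    return pr

# Link graph consistent with the slide's equations:
# A links to B and C; B links to C; C links to A; D links to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
# Fixed point: PR(A) = PR(C) = 0.4, PR(B) = 0.2, PR(D) = 0
```

Note that PR(D) goes to 0 because nothing links to D, which is exactly why link spammers manufacture incoming links.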

Page 32:

Determining the relatedness of fragments of text

• e.g.:
– “Captain Kirk” & “Star Trek” are more similar than
– “Captain Kirk” & “Fried Chicken”

• How do we measure the closeness between two phrases?

• K(x,y) =
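The definition of K(x, y) is cut off in the transcript. One common way to instantiate such a relatedness measure (an assumption here, not necessarily the authors' kernel, which expands each phrase with web search results before comparing) is cosine similarity between term-frequency vectors:

```python
import math
from collections import Counter

def cosine_kernel(x_terms, y_terms):
    """Cosine similarity of term-frequency vectors: 1.0 for identical
    term distributions, 0.0 when no terms are shared.
    (Illustrative stand-in for the slide's elided K(x, y).)"""
    vx, vy = Counter(x_terms), Counter(y_terms)
    dot = sum(vx[t] * vy[t] for t in vx)
    nx = math.sqrt(sum(c * c for c in vx.values()))
    ny = math.sqrt(sum(c * c for c in vy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

trek = cosine_kernel(["captain", "kirk", "star", "trek"],
                     ["star", "trek", "enterprise"])
chicken = cosine_kernel(["captain", "kirk", "star", "trek"],
                        ["fried", "chicken", "recipe"])
# "Captain Kirk" scores closer to "Star Trek" than to "Fried Chicken".
```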

Page 33:
Page 34:

Retrieval of UseNet Articles

• at least 800 million documents

Page 35:

Retrieval of Images and Sounds

• non-textual “documents”
– from digital still and video cameras, camera phones, audio recording devices, and mp3 music