Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel;...
Web Spam
Yonatan Ariel
SDBI 2005
Based on the work of
Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan
Stanford University
The Hebrew University of Jerusalem
Contents
• What is web spam
• Combating web spam – TrustRank
• Combating web spam – Mass Estimation
• Conclusion
Web Spam
• Actions intended to mislead search engines into ranking some pages higher than they deserve.
• Search engines are the entryways to the web
Financial gains
Consequences
• Decreased search results quality
  - “Kaiser pharmacy” returns techdictionary.com
• Increased cost of each processed query
  - Search engine indexes are inflated with useless pages
The first step in combating spam is understanding it
Search Engines
• High quality results, i.e. pages that are
  - Relevant for a specific query
    • Textual similarity
  - Important
    • Popularity
• Search engines combine relevance and importance in order to compute ranking
Definition revised
• Any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page’s true value
Search Engine Optimizers
• Engage in spamming (according to our definition)
• Ethical methods
  - Finding relevant directories to which a site can be submitted
  - Using a reasonably sized description meta tag
  - Using a short and relevant page title to name each page
Spamming Techniques
• Boosting techniques
  - Achieving high relevance / importance
• Hiding techniques
  - Hiding the boosting techniques
We’ll cover them both
Techniques
• Boosting Techniques
  - Term Spamming
  - Link Spamming
• Hiding Techniques
TF IDF
• TF (term frequency): a measure of the importance of the term (in a specific page)
$$TF_p(t) = \frac{n_t}{\sum_k n_k}$$
  - $n_t$: number of occurrences of the considered term
  - $\sum_k n_k$: number of occurrences of all terms
TF IDF
• IDF (inverse document frequency): a measure of the general importance of the term in a collection of pages
$$IDF_D(t) = \frac{|D|}{|\{d_j \in D : t \in d_j\}|}$$
  - $|D|$: total number of documents in the corpus
  - $|\{d_j \in D : t \in d_j\}|$: total number of documents where t appears
TF-IDF
• A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
$$TFIDF(p, q) = \sum_{t \in p \text{ and } t \in q} TF_p(t) \cdot IDF_D(t)$$
• Spammers:
  - Make a page relevant for a large number of queries
  - Make a page very relevant for a specific query
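To make the effect of term repetition concrete, here is a minimal sketch of the TF-IDF score as defined above. The toy corpus, the query, and the helper names (`tf`, `idf`, `tfidf`) are illustrative assumptions, not part of the original slides, and no logarithm is applied to IDF because the slide formula shows none.

```python
from collections import Counter


def tf(term, page_terms):
    """TF_p(t): occurrences of `term` divided by the occurrences of all terms in the page."""
    return Counter(page_terms)[term] / len(page_terms)


def idf(term, corpus):
    """IDF_D(t): |D| divided by the number of documents that contain `term`."""
    containing = sum(1 for doc in corpus if term in doc)
    return len(corpus) / containing


def tfidf(page_terms, query_terms, corpus):
    """Sum TF * IDF over the terms shared by the page and the query."""
    shared = set(page_terms) & set(query_terms)
    return sum(tf(t, page_terms) * idf(t, corpus) for t in shared)


# Hypothetical toy corpus: every document is a list of terms.
honest = ["cheap", "flights", "to", "paris"]
stuffed = ["cheap", "cheap", "cheap", "flights"]  # repetition spam
other = ["history", "of", "paris"]
corpus = [honest, stuffed, other]
query = ["cheap", "flights"]

honest_score = tfidf(honest, query, corpus)    # (1/4 + 1/4) * 3/2
stuffed_score = tfidf(stuffed, query, corpus)  # (3/4 + 1/4) * 3/2
```

In this toy setting the page that repeats the query terms scores twice as high as the honest page, which is the behaviour the spammer is after.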
Term Spamming Techniques
• Body Spam
  - Simplest, oldest, most popular.
• Title Spam
  - Higher weights.
• Meta tag spam
  - Low priority
  <META NAME="keywords" CONTENT="jew,jews,jew watch,jews and communism,jews and banking,jews and banks,jews in government..history,diversity,Red Revolution,USSR,jews in government , holocaust, atrocities, defamation, diversity, civil rights, plurali, bible, Bible, murder, crime, Trotsky, genocide, NKVD, Russia, New York, mafia, spy, spies,Rosenberg">
Term Spamming Techniques (cont’d)
• Anchor text spam
  <a href="target.html">free, great deals, cheap, cheap, free</a>
• URL
  Buy-canon-rebel-20d-lens-case.camerasx.com
Grouping Term Spamming Techniques
• Repetition
  - Increased relevance for a few specific queries
• Dumping of a large number of unrelated terms
  - Effective against queries with rare, obscure terms
• Weaving of spam terms into copied content
  - Rare (original) topic
  - Dilution: conceal some spam terms within the text
• Phrase stitching
  - Create content quickly
Example of weaving: “Remember not only airfare to say the right plane tickets thing in the right place, but far cheap travel more difficult still, to leave hotel rooms unsaid the wrong thing at vacation the tempting moment.”
Techniques
• Boosting Techniques
  - Term Spamming
  - Link Spamming
• Hiding Techniques
Three Types Of Pages On The Web
• Inaccessible
  - Spammers cannot modify
• Accessible
  - Can be modified in a limited way
• Own pages
  - We call a group of own pages a spam farm
First Algorithm - HITS
• Assigns global hub and authority scores to each page
• Circular definition:
  - Important hub pages are those that point to many important authority pages
  - Important authority pages are those pointed to by many hubs
• Hub scores can be easily spammed
  - Adding outgoing links to a large number of well-known, reputable pages.
• Authority score is more complicated
  - The more incoming links, the better
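The circular hub/authority definition can be sketched as an iterative computation. This is a minimal illustration, not the exact HITS formulation: the graph, the page names, and the choice to rescale scores so they sum to 1 in every round are assumptions made for the example.

```python
def hits(nodes, edges, num_iters=50):
    """Iterate hub/authority updates; scores are rescaled to sum to 1 each round."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(num_iters):
        # Authority of p: sum of the hub scores of the pages pointing to p.
        auth = {p: sum(hub[q] for q, target in edges if target == p) for p in nodes}
        total = sum(auth.values()) or 1.0
        auth = {p: a / total for p, a in auth.items()}
        # Hub score of q: sum of the authority scores of the pages q points to.
        hub = {q: sum(auth[target] for source, target in edges if source == q) for q in nodes}
        total = sum(hub.values()) or 1.0
        hub = {q: h / total for q, h in hub.items()}
    return hub, auth


# Hypothetical graph: "portal" is an honest hub, "spam" simply copies its outgoing links.
nodes = ["portal", "spam", "news", "wiki"]
edges = [("portal", "news"), ("portal", "wiki"), ("spam", "news"), ("spam", "wiki")]
hub, auth = hits(nodes, edges)
```

Here the spam page ends up with the same hub score as the honest portal just by linking to the same reputable pages, while its authority score stays at zero because nothing links to it.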
Second Algorithm - Page Rank
• A family of algorithms for assigning numerical weightings to hyperlinked documents
• The PageRank value of a page reflects the frequency of hits on that page by a random surfer
  - It is the probability of being at that page after lots of clicks
  - We continue at random from a sink page
Page rank
$$PR(M) = PR_{static}(M) + PR_{in}(M) - PR_{out}(M) - PR_{loss}(M)$$
• $PR_{static}$: all n own pages are part of the farm
• $PR_{in}$: all m accessible pages point to the spam farm
• $PR_{out}$: links pointing outside the spam farm are suppressed
• $PR_{loss}$: no vote gets lost (each page has an outgoing link)
• All accessible and own pages point to t
• All pages within the farm are reachable
Figure: inaccessible, accessible, and own pages, with the target page t
Techniques – Outgoing links
• Manually adding outgoing links to well-known hosts; increased hub score
  - Directory sites
    • dmoz.org
    • Yahoo! Directory
  - Creating a massive outgoing link structure quickly
Techniques – Incoming Links
• Honey-pot – useful resource
• Infiltrate a web directory
• Links on blogs, guest books, wikis
  - Google’s tag: <a href="http://www.example.com/" rel="nofollow">discount</a>
• Link exchange
• Buy expired domains
• Create own spam farm
Techniques
• Boosting Techniques
  - Term Spamming
  - Link Spamming
• Hiding Techniques
Content Hiding
• Color scheme
  - Font color same as the background color
• Tiny anchor images as links (1x1 pixel)
• Using scripts
  - Setting the visible HTML style attribute to FALSE.
Cloaking
• Spam web servers can return a different document to a web crawler
• Identification of web crawlers:
  - A list of IP addresses
  - The ‘user-agent’ field in the HTTP request
• Allows webmasters to block some content
• Legitimate optimizations (remove ads)
• Delivering content that search engines can’t read (such as Flash)
Redirection
• Automatically redirecting the browser to another URL
• Refresh meta tag in the header of an HTML document
  <meta http-equiv="refresh" content="0;url=target.html">
  - Simple to identify
• Scripts
  <script language="javascript"> location.replace("target.html") </script>
How can we fight it?
• IDENTIFY instances of spam
  - Stop crawling / indexing such pages
• PREVENT spamming
  - Avoid cloaking: crawlers identify themselves as regular web browsers
• COUNTERBALANCE the effect of spamming
  - Use variations of the ranking methods
Some Statistics
The results of a single breadth-first search at the Yahoo! home page
A complete set of pages crawled and indexed by AltaVista
Some More Statistics
Sophisticated spammers
Average spammers
Contents
• What is web spam
• Combating web spam – TrustRank
• Combating web spam – Mass Estimation
• Conclusion
Motivation
• The spam detection process is very expensive and slow, but is critical to the success of search engines
• We’d like to assist the human experts who detect web spam
Getting dirty
• G = (V,E)
  - V = set of N pages (vertices)
  - E = set of directed links (edges) that connect pages
• We collapse multiple hyperlinks into a single link
• We remove self hyperlinks
• i(p) – number of in-links to a page p
• w(p) – number of out-links from a page p
Our Example
V = { 1, 2, 3, 4}
E = { (1,2),(2,3),(3,2),(3,4)}
N = 4
i(2) = 2; w(2) = 1
Graph: 1 → 2 ⇄ 3 → 4
A Transition Matrix
$$T(p, q) = \begin{cases} 0 & \text{if } (q,p) \notin E \\ \dfrac{1}{w(q)} & \text{otherwise} \end{cases}$$
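In our example
Graph: 1 → 2 ⇄ 3 → 4
$$T = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & \tfrac{1}{2} & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \tfrac{1}{2} & 0 \end{pmatrix}$$
• Column 3: the out edges of ‘3’
• Row 4: the in edges of ‘4’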
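The transition matrix of the four-page example can be rebuilt directly from the definition. This is a small sketch; the use of `fractions.Fraction` (to keep the entries exact) and the helper name `out_degree` are choices made for the illustration.

```python
from fractions import Fraction

V = [1, 2, 3, 4]
E = {(1, 2), (2, 3), (3, 2), (3, 4)}


def out_degree(q):
    """w(q): number of out-links from page q."""
    return sum(1 for source, _ in E if source == q)


# T[p][q] = 1 / w(q) if there is a link q -> p, and 0 otherwise.
# Rows and columns are indexed by position in V, so T[1][2] is T(2, 3).
T = [
    [Fraction(1, out_degree(q)) if (q, p) in E else Fraction(0) for q in V]
    for p in V
]
```

Column 3 holds the two out-links of page 3, each weighted 1/2, and row 4 holds the single in-link of page 4.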
An Inverse transition matrix
$$U(p, q) = \begin{cases} 0 & \text{if } (p,q) \notin E \\ \dfrac{1}{i(q)} & \text{otherwise} \end{cases}$$
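In Our Example
Graph: 1 → 2 ⇄ 3 → 4
$$U = \begin{pmatrix} 0 & \tfrac{1}{2} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \tfrac{1}{2} & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
• Column 2: the in edges of ‘2’
• Row 2: the out edges of ‘2’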
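The inverse transition matrix of the same four-page example follows the same pattern, with in-degrees in place of out-degrees. Again a sketch: `fractions.Fraction` and the helper name `in_degree` are illustrative choices.

```python
from fractions import Fraction

V = [1, 2, 3, 4]
E = {(1, 2), (2, 3), (3, 2), (3, 4)}


def in_degree(q):
    """i(q): number of in-links to page q."""
    return sum(1 for _, target in E if target == q)


# U[p][q] = 1 / i(q) if there is a link p -> q, and 0 otherwise.
U = [
    [Fraction(1, in_degree(q)) if (p, q) in E else Fraction(0) for q in V]
    for p in V
]

# Column 2 collects the in-links of page 2 (from pages 1 and 3), each weighted 1/2.
column_2_sum = sum(U[row][1] for row in range(len(V)))
```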
Page Rank
• Mutual reinforcement between pages
  - The importance of a certain page influences and is influenced by the importance of some other pages.
$$r(p) = \alpha \cdot \sum_{q:(q,p) \in E} \frac{r(q)}{w(q)} + (1 - \alpha) \cdot \frac{1}{N}$$
  - $\alpha$: decay factor
  - $\sum_{q:(q,p) \in E} r(q)/w(q)$: in-link votes
  - $1/N$: start-off authority
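Equivalent Matrix Equation
$$r = \alpha \cdot T \cdot r + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$$
• $\alpha$ and $(1 - \alpha) \cdot \frac{1}{N}$ are scalars; $r$ and $\mathbf{1}_N$ are N-vectors
• Dynamic part: $\alpha \cdot T \cdot r$
• Static part: $(1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$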
A Biased PageRank
$$r = \alpha \cdot T \cdot r + (1 - \alpha) \cdot d$$
• d: a static score distribution (summing up to one)
• Only pages that are reachable from some page i with d[i] > 0 will have a positive PageRank
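The reachability remark can be checked on the four-page example graph (1 → 2 ⇄ 3 → 4) with a short iterative sketch of the biased PageRank equation. The decay factor 0.85, the iteration count, and the function name `biased_pagerank` are assumptions for the illustration; the sketch also does not redistribute the score of sink pages, so the scores need not sum to one.

```python
def biased_pagerank(nodes, edges, d, alpha=0.85, num_iters=50):
    """Iterate r = alpha * T * r + (1 - alpha) * d, starting from r = d."""
    out_deg = {q: sum(1 for source, _ in edges if source == q) for q in nodes}
    r = dict(d)
    for _ in range(num_iters):
        r = {
            p: alpha * sum(r[q] / out_deg[q] for q, target in edges if target == p)
            + (1 - alpha) * d[p]
            for p in nodes
        }
    return r


nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 2), (3, 4)]

uniform = {p: 1 / len(nodes) for p in nodes}   # regular PageRank
biased = {1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0}      # all static score on page 3

r_uniform = biased_pagerank(nodes, edges, uniform)
r_biased = biased_pagerank(nodes, edges, biased)
```

With the uniform vector every page gets a positive score. With all static score on page 3, page 1 ends up at exactly zero because it cannot be reached from page 3.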
Oracle Function
• A binary oracle function O over all pages p in V:
$$O(p) = \begin{cases} 0 & \text{if } p \text{ is spam} \\ 1 & \text{otherwise} \end{cases}$$
Figure: a web of seven pages; pages 1-4 are good, pages 5-7 are bad
O(3) = 1
O(6) = 0
Oracle Functions
• Oracle invocations are expensive and time consuming
  - We CAN’T call the function for all pages
• Approximate isolation of the good set
  - Good pages seldom point to bad ones
  - As we’ve seen, good pages *can* point to bad ones
  - Bad pages often point to bad ones
Trust Function
• We need to evaluate pages without relying on O.
• We define, for any page p, a trust function T
• Ideal Trust Property (for any page p)
  T(p) = Pr[ O(p) = 1 ]
  - Very hard to come up with such a function
  - Useful in ordering search results
Ordered Trust Property
$$T(p) = T(q) \Leftrightarrow \Pr[O(p) = 1] = \Pr[O(q) = 1]$$
$$T(p) < T(q) \Leftrightarrow \Pr[O(p) = 1] < \Pr[O(q) = 1]$$
First Evaluation Metric - Pairwise Orderedness
$$I(T, O, p, q) = \begin{cases} 1 & \text{if } T(p) \ge T(q) \text{ and } O(p) < O(q) \\ 1 & \text{if } T(p) \le T(q) \text{ and } O(p) > O(q) \\ 0 & \text{otherwise} \end{cases}$$
• I = 1 marks a violation of the ordered trust property, for trust function T, oracle function O, pages p, q
$$pairord(T, O, P) = \frac{|P| - \sum_{(p,q) \in P} I(T, O, p, q)}{|P|}$$
• The fraction of the pairs for which T did not make a mistake
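Pairwise orderedness is simple to compute once T and O are available as lookups. The sketch below uses the oracle values and ignorant trust values from the seven-page example that appears later in the deck, and takes P to be all ordered pairs of distinct pages; the function names and the dictionary representation are assumptions for the illustration.

```python
from fractions import Fraction
from itertools import permutations


def violation(T, O, p, q):
    """I(T, O, p, q): 1 if the pair violates the ordered trust property, else 0."""
    if T[p] >= T[q] and O[p] < O[q]:
        return 1
    if T[p] <= T[q] and O[p] > O[q]:
        return 1
    return 0


def pairord(T, O, pairs):
    """Fraction of the pairs for which T did not make a mistake."""
    pairs = list(pairs)
    mistakes = sum(violation(T, O, p, q) for p, q in pairs)
    return Fraction(len(pairs) - mistakes, len(pairs))


half = Fraction(1, 2)
O = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}           # oracle values
T = {1: 1, 2: half, 3: 1, 4: half, 5: half, 6: 0, 7: half}  # ignorant trust values
P = list(permutations(O, 2))  # all 42 ordered pairs of distinct pages

score = pairord(T, O, P)
```

Eight of the 42 pairs are violations (each pairs an unchecked good page with an unchecked bad page that has the same trust value 1/2), which gives the 34/42 quoted in the deck's example.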
Threshold Trust Property
$$T(p) > \delta \Leftrightarrow O(p) = 1$$
• Doesn’t necessarily provide an ordering of pages based on their likelihood of being good
• We’ll describe two evaluation metrics
  - Precision
  - Recall
Threshold Evaluation Metrics
$$prec(T, O) = \frac{|\{p \in X \mid T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X \mid T(q) > \delta\}|}$$
• Numerator: total number of correct ‘good’ estimations
• Denominator: total number of ‘good’ estimations
$$rec(T, O) = \frac{|\{p \in X \mid T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X \mid O(q) = 1\}|}$$
• Numerator: total number of correct ‘good’ estimations
• Denominator: total number of good pages in X
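Both threshold metrics reduce to counting pages. A minimal sketch, assuming T and O are dictionaries keyed by page and using the values from the seven-page example that follows in the deck, with δ = 1/2:

```python
def precision(T, O, X, delta):
    """Correct 'good' estimations divided by all 'good' estimations."""
    predicted_good = [p for p in X if T[p] > delta]
    correct = [p for p in predicted_good if O[p] == 1]
    return len(correct) / len(predicted_good)


def recall(T, O, X, delta):
    """Correct 'good' estimations divided by all good pages in X."""
    correct = [p for p in X if T[p] > delta and O[p] == 1]
    good = [p for p in X if O[p] == 1]
    return len(correct) / len(good)


O = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
T = {1: 1.0, 2: 0.5, 3: 1.0, 4: 0.5, 5: 0.5, 6: 0.0, 7: 0.5}
X = list(O)

prec = precision(T, O, X, delta=0.5)
rec = recall(T, O, X, delta=0.5)
```

Only pages 1 and 3 exceed the threshold and both are good, so precision is 1, but they cover just two of the four good pages, so recall is 0.5. Note that `precision` would divide by zero if no page exceeded δ; a real implementation would need to handle that case.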
Computing Trust
• Limited budget L of O-invocations
• We select at random a seed set S of L pages and call the oracle on its elements
• Ignorant Trust Function:
$$T_0(p) = \begin{cases} O(p) & \text{if } p \in S \\ \tfrac{1}{2} & \text{otherwise (not checked by human experts)} \end{cases}$$
For example
• L = 3; S = {1,3,6}
• Oracle actual values: O = [1,1,1,1,0,0,0]
• Ignorant function values: t_0 = [1, 1/2, 1, 1/2, 1/2, 0, 1/2]
• We choose X to be all 7 pages: pairwise orderedness = 34/42
• For δ = 1/2: precision = 1; recall = 0.5
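Trust Propagation
• Remember approximate isolation?
• We generalize the ignorant function
• M-Step Trust Function:
$$T_M(p) = \begin{cases} O(p) & \text{if } p \in S \\ 1 & \text{if } p \notin S \text{ and } \exists q \in S^+ : q \rightsquigarrow_M p \\ \tfrac{1}{2} & \text{otherwise} \end{cases}$$
  - S: the original set, on which we called O; S⁺: its good pages
  - $p \rightsquigarrow_M q$: there exists a path of a maximum length of M from page p to page q that doesn’t include bad seed pages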
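The M-step trust function amounts to a bounded breadth-first search from the good seeds. The sketch below is an illustration under one loud assumption: the figure of the seven-page graph is not in this transcript, so the edge list is a reconstruction chosen to be consistent with the trust vectors and scores shown on the slides. The function name `m_step_trust` is also hypothetical.

```python
def m_step_trust(nodes, edges, seeds, oracle, M):
    """Seeds get their oracle value; pages within M links of a good seed get 1;
    everything else gets 1/2. Paths may not pass through bad seed pages."""
    good_seeds = {s for s in seeds if oracle[s] == 1}
    bad_seeds = set(seeds) - good_seeds
    reached = set(good_seeds)
    frontier = set(good_seeds)
    for _ in range(M):
        frontier = {
            q for p, q in edges
            if p in frontier and q not in reached and q not in bad_seeds
        }
        reached |= frontier
    trust = []
    for p in nodes:
        if p in seeds:
            trust.append(float(oracle[p]))  # the oracle is only consulted for seeds
        elif p in reached:
            trust.append(1.0)
        else:
            trust.append(0.5)
    return trust


NODES = [1, 2, 3, 4, 5, 6, 7]
# Assumed edge list (reconstructed, not taken from the original figure).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 2), (4, 5), (5, 6), (5, 7), (6, 3)]
ORACLE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
SEEDS = {1, 3, 6}

t0, t1, t2, t3 = (m_step_trust(NODES, EDGES, SEEDS, ORACLE, M) for M in range(4))
```

With M = 0 this is the ignorant trust function. With M = 3 the bad page 5 is reached through the good page 4 and wrongly receives a trust score of 1.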
Example
t_0 = [1, 1/2, 1, 1/2, 1/2, 0, 1/2]
Example
t_1 = [1, 1, 1, 1/2, 1/2, 0, 1/2]
Example
t_2 = [1, 1, 1, 1, 1/2, 0, 1/2]
Example
t_3 = [1, 1, 1, 1, 1, 0, 1/2]
A mistake: page 5 is bad (O(5) = 0) but receives a trust score of 1
Results
A drop in performance
The further away we are from good seed pages, the less certain we are that a page is good!
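Trust Attenuation
• Trust Dampening
  - Trust is dampened by a factor β < 1 with every link away from a good seed: β after one link, β·β after two
  - A page reachable both ways could be assigned maximum(β, β·β) or average(β, β·β)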
Trust Attenuation
• Trust Splitting
  - The care with which people add links to their pages is often inversely proportional to the number of links on the page
Trust Rank Algorithm
1. (Partially) Evaluate seed-desirability of pages
2. Invoke the oracle function on the L most desirable seed pages, normalize the result (a vector d)
3. Evaluate TrustRank scores using a biased PageRank computation with d replacing the uniform distribution
For Example
• Desirability vector
[0.08,0.13,0.08,0.10,0.09,0.06,0.02]
• Order the vertices accordingly:
[2, 4, 5, 1, 3, 6, 7]
For Example (cont’d)
• Compute good seeds vector (other seeds are considered bad)
[0, 1, 0 , 1, 0, 0, 0]
• Normalize the result
d = [0, 1/2, 0 , 1/2, 0, 0, 0]
Will be used as the biased PageRank vector
For Example (cont’d)
• Compute TrustRank Scores
[0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
  - p1 is unreferenced
  - p2 and p4, the good seeds, have the highest scores
  - p2 is higher than p4, due to p3
  - p5 is high due to a direct link from p4
t = d
for i = 1 to M do
    t = α · T · t + (1 − α) · d
return t
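The three steps of the algorithm can be sketched end to end on the seven-page example. Several things here are assumptions, not statements from the slides: the edge list is a reconstruction (the original figure is not in this transcript) chosen so that the computation reproduces the numbers shown, the decay factor is taken as α = 0.85, and 50 iterations are used so that the scores settle. The function names are hypothetical.

```python
NODES = [1, 2, 3, 4, 5, 6, 7]
# Assumed edge list (reconstructed, not taken from the original figure).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 2), (4, 5), (5, 6), (5, 7), (6, 3)]
ORACLE = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}


def seed_vector(order, oracle, L):
    """Step 2: call the oracle on the L most desirable pages and normalize."""
    seeds = order[:L]
    good = [p for p in seeds if oracle[p] == 1]
    return {p: (1 / len(good) if p in good else 0.0) for p in NODES}


def trustrank(d, alpha=0.85, num_iters=50):
    """Step 3: iterate t = alpha * T * t + (1 - alpha) * d, starting from t = d."""
    out_deg = {q: sum(1 for source, _ in EDGES if source == q) for q in NODES}
    t = dict(d)
    for _ in range(num_iters):
        t = {
            p: alpha * sum(t[q] / out_deg[q] for q, target in EDGES if target == p)
            + (1 - alpha) * d[p]
            for p in NODES
        }
    return t


order = [2, 4, 5, 1, 3, 6, 7]           # step 1: desirability order from the slide
d = seed_vector(order, ORACLE, L=3)     # pages 2 and 4 are good, page 5 is bad
scores = trustrank(d)
rounded = [round(scores[p], 2) for p in NODES]
```

Under these assumptions the rounded scores agree with the vector on the slide, including the relatively high score of the bad page 5 that it inherits from the good seed 4.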
Selecting Seeds
• We want to choose pages that are useful in identifying additional good pages
• We want to keep the seed set small
• Two strategies
  - Inverse PageRank
  - High PageRank
I. Inverse PageRank
• Preference to pages from which we can reach many other pages
  • We can select seed pages based on the number of outlinks
• We’ll choose the pages that point to many pages that point to many pages that point to many pages …
  • This is actually PageRank, where the importance of a page depends on its outlinks
• Perform PageRank on the graph G’ = (V, E’), where E’ contains the links of E reversed
• Use the inverse transition matrix U (instead of T)
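A minimal sketch of inverse PageRank, assuming a uniform random jump and a damping factor of 0.85. The function names and the four-page star graph are hypothetical. Pages without outlinks simply leak score in this simplified version.

```python
def pagerank(out_links, alpha=0.85, iterations=50):
    """Plain PageRank with a uniform random jump, by power iteration."""
    n = len(out_links)
    p = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1 - alpha) / n] * n
        for q, targets in enumerate(out_links):
            for target in targets:
                nxt[target] += alpha * p[q] / len(targets)
        p = nxt
    return p

def reverse_links(out_links):
    """Build E' by flipping the direction of every link in E."""
    reversed_links = [[] for _ in out_links]
    for q, targets in enumerate(out_links):
        for target in targets:
            reversed_links[target].append(q)
    return reversed_links

def inverse_pagerank(out_links, **kwargs):
    """Run PageRank on the reversed graph, so outlinks confer importance."""
    return pagerank(reverse_links(out_links), **kwargs)

# Hypothetical graph: page 0 links to pages 1, 2 and 3.
out_links = [[1, 2, 3], [], [], []]
scores = inverse_pagerank(out_links)
best_seed = max(range(len(scores)), key=lambda page: scores[page])
print(best_seed, scores)
```

Page 0 has no inlinks, so ordinary PageRank ranks it last, but it reaches every other page and therefore gets the top inverse PageRank score.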
![Page 68: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/68.jpg)
II. High PageRank
• We’re interested in high PageRank pages
• Obtain accurate trust scores for high PageRank pages
• Preference to pages with high PageRank
  • Likely to point to other high PageRank pages
  • May identify the goodness of fewer pages, but they may be more important pages
![Page 69: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/69.jpg)
Statistics
• |Seed set S| = 1250 (given by inverse PageRank)
• Only 178 sites were selected to be used as good seeds (due to extremely rigorous selection criteria)
![Page 70: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/70.jpg)
Statistics (cont’d)
[Figure: bad sites in PageRank and TrustRank buckets]
![Page 71: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/71.jpg)
Statistics (cont’d)
[Figure: bucket-level demotion in TrustRank]
• Demotion: a site from a higher PageRank bucket appears in a lower TrustRank bucket
• Spam sites in PageRank bucket 2 were demoted 7 buckets on average
![Page 72: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/72.jpg)
Contents
• What is web spam
• Combating web spam – TrustRank
• Combating web spam – Mass Estimation
  • Turn the spammers’ ingenuity against themselves
• Conclusion
![Page 73: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/73.jpg)
Spam Mass – Naïve Approach
• Given a page x, we’d like to know if it got most of its PageRank from spam pages or from reputable pages
• Suppose that we have a partition of the web into 2 sets
  • V(S) = Spam pages
  • V(R) = Reputable pages
![Page 74: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/74.jpg)
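First Labeling Scheme
• Look at the number of direct inlinks
  • If most of them come from spam pages, then declare that x is a spam page
[Figure: good nodes g0 and g1 and spam node s0 link to x; spam nodes s1, …, sk link to s0]
$$p_x = (1 + 3c + kc^2)(1 - c)/n$$
out of which $(c + kc^2)(1 - c)/n$ is due to spamming.
For $c = 0.85$, as long as $k \ge 2$, the spam part is larger than the part due to the good nodes, $2c(1 - c)/n$, even though most of x's direct inlinks come from good nodes.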
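The arithmetic behind this counterexample can be checked with a short sketch. It assumes the link structure implied by the formula (g0, g1 and s0 link to x, and s1..sk link to s0); the function name and the value of `n` are hypothetical.

```python
def first_scheme_example(c, k, n):
    """PageRank parts of x when g0, g1 and s0 link to x and s1..sk link to s0."""
    scale = (1 - c) / n
    own = 1 * scale                  # x's own random-jump share
    good = 2 * c * scale             # direct links from g0 and g1
    spam = (c + k * c ** 2) * scale  # s0 directly, s1..sk through s0
    return own, good, spam

c = 0.85
n = 1000  # hypothetical web size; it cancels out in the comparison
for k in (1, 2, 3):
    own, good, spam = first_scheme_example(c, k, n)
    total = own + good + spam
    print(k, spam > good, round(spam / total, 3))
```

With k = 2 the spam part already exceeds the good part, although it is still less than half of x's total PageRank because x's own random-jump share is counted separately.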
![Page 75: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/75.jpg)
Second Labeling Scheme
• If the largest part of x’s PageRank comes from spam nodes, we label x as spam
[Figure: target node x with good nodes g0 to g3 and spam nodes s0 to s6]
• $g_0$ and $g_2$ contribute $(2c + 4c^2)(1 - c)/n$
• $s_0$ contributes $(c + 4c^2)(1 - c)/n$
• The good in-neighbors contribute more, so this scheme labels x as good
• We can compute $q_x^{\{x_1, \ldots, x_m\}}$, x's PageRank due to $x_1, \ldots, x_m$
![Page 76: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/76.jpg)
Improved Labeling Scheme
[Figure: target node x with good nodes g0 to g3 and spam nodes s0 to s6]
$$q_x^{\{g_0, \ldots, g_3\}} = (2c + 2c^2)(1 - c)/n$$
$$q_x^{\{s_0, \ldots, s_6\}} = (c + 6c^2)(1 - c)/n$$
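One way to compute such contributions is to use the linearity of PageRank: run it with a random-jump vector that is 1/n on the chosen node set and 0 elsewhere. The sketch below does this for a twelve-node graph. The edge list is an assumption chosen to agree with the formulas on this slide, since the figure's edges are not recoverable from the transcript, and the function name `contribution` is hypothetical.

```python
def contribution(out_links, sources, target, c=0.85, iterations=50):
    """PageRank that `target` receives from the nodes in `sources`.

    Uses linearity of PageRank: run it with a random-jump vector that is
    1/n on `sources` and 0 elsewhere.
    """
    n = len(out_links)
    jump = {v: ((1 - c) / n if v in sources else 0.0) for v in out_links}
    p = dict(jump)
    for _ in range(iterations):
        nxt = dict(jump)
        for q, targets in out_links.items():
            for t in targets:
                nxt[t] += c * p[q] / len(targets)
        p = nxt
    return p[target]

# Assumed edges, chosen to agree with the slide's formulas.
out_links = {
    "g0": ["x"], "g2": ["x"], "s0": ["x"],
    "g1": ["g0"], "s5": ["g0"],
    "g3": ["g2"], "s6": ["g2"],
    "s1": ["s0"], "s2": ["s0"], "s3": ["s0"], "s4": ["s0"],
    "x": [],
}
good = {"g0", "g1", "g2", "g3"}
spam = {"s0", "s1", "s2", "s3", "s4", "s5", "s6"}
q_good = contribution(out_links, good, "x")
q_spam = contribution(out_links, spam, "x")
print(q_good, q_spam)
```

Under these assumed edges the spam nodes contribute more to x than the good nodes do, even though x's good in-neighbors have the larger combined PageRank.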
![Page 77: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/77.jpg)
Spam Mass Definition
• The absolute spam mass of x, denoted by $M_x$, is the PageRank contribution that x receives from spam nodes
• The relative spam mass of x, denoted by $m_x$, is the fraction of x's PageRank due to contributing spam nodes
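Written as formulas, and assuming that $q_x^{V(S)}$ denotes the PageRank contribution x receives from the spam set V(S) and that $p_x$ is x's total PageRank, the two definitions read:

$$M_x = q_x^{V(S)}, \qquad m_x = \frac{M_x}{p_x}$$

Contributions are non-negative and the contributions of V(S) and V(R) add up to $p_x$, so $0 \le m_x \le 1$.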
![Page 78: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/78.jpg)
Estimating
• We assumed that we have a priori knowledge of whether nodes are good or bad – not realistic!
• What we’ll have is a subset of the good nodes, the good core
  • Not hard to construct
  • Bad pages are often abandoned
![Page 79: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/79.jpg)
Estimating (cont’d)
• We compute two sets of PageRank scores:
  • p = PR(v), based on the uniform random jump distribution v (v[i] = 1/n, for i = 1..n)
  • p' = PR(v'), based on the random jump distribution v'
$$v'[i] = \begin{cases} 1/n & \text{if } i \text{ is in the good core} \\ 0 & \text{otherwise} \end{cases}$$
![Page 80: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/80.jpg)
Spam Mass Definition (cont’d)
Given PageRank scores $p_x$ and $p'_x$, the estimated absolute spam mass of node x is
$$M_x = p_x - p'_x$$
and the estimated relative spam mass of x is
$$m_x = \frac{p_x - p'_x}{p_x}$$
![Page 81: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/81.jpg)
[Figure: example graph with good nodes g0 to g3, spam nodes s0 to s6, and target node x]
![Page 82: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/82.jpg)
Spam Detection Algorithm
• Compute the PageRank scores p
• Compute the (biased) PageRank scores p'
• Compute the relative spam mass vector
• For each node x with high enough PageRank, if its relative spam mass is bigger than a given threshold, declare that x is spam
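The four steps can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the damping factor 0.85, both thresholds, the eight-node graph and the function names are all hypothetical.

```python
def pagerank(out_links, jump, c=0.85, iterations=50):
    """Power iteration for p = c * T * p + (1 - c) * jump."""
    p = dict(jump)
    for _ in range(iterations):
        nxt = {v: (1 - c) * jump[v] for v in out_links}
        for q, targets in out_links.items():
            for t in targets:
                nxt[t] += c * p[q] / len(targets)
        p = nxt
    return p

def detect_spam(out_links, good_core, mass_threshold, rank_threshold):
    """Flag nodes whose estimated relative spam mass exceeds the threshold."""
    n = len(out_links)
    uniform = {v: 1.0 / n for v in out_links}
    core = {v: (1.0 / n if v in good_core else 0.0) for v in out_links}
    p = pagerank(out_links, uniform)   # regular PageRank
    p_core = pagerank(out_links, core) # PageRank biased to the good core
    flagged = set()
    for x in out_links:
        if p[x] < rank_threshold:      # skip low-PageRank nodes
            continue
        relative_mass = (p[x] - p_core[x]) / p[x]
        if relative_mass > mass_threshold:
            flagged.add(x)
    return flagged

# Hypothetical graph: two good-core pages link to a, four spam pages link to b.
out_links = {
    "g0": ["a"], "g1": ["a"],
    "s0": ["b"], "s1": ["b"], "s2": ["b"], "s3": ["b"],
    "a": [], "b": [],
}
flagged = detect_spam(out_links, good_core={"g0", "g1"},
                      mass_threshold=0.5, rank_threshold=0.03)
print(flagged)
```

The PageRank threshold keeps the low-scoring spam leaves out of the result, so only the page they boost is reported.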
![Page 83: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/83.jpg)
Statistics
![Page 84: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/84.jpg)
Contents
• What is web spam
• Combating web spam – TrustRank
• Combating web spam – Mass Estimation
• Conclusion
![Page 85: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/85.jpg)
Conclusion
• We introduced ‘web spam’
• We presented two ways to combat spammers
  • TrustRank (spam demotion)
  • Spam mass estimation (spam detection)
![Page 86: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/86.jpg)
Questions?
Thank you
![Page 87: Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.](https://reader030.fdocuments.us/reader030/viewer/2022032605/56649e755503460f94b76203/html5/thumbnails/87.jpg)
Bibliography
• Web Spam Taxonomy (2004) - Gyongyi, Zoltan; Garcia-Molina, Hector, Stanford University
• Combating Web Spam with TrustRank (2005) - Gyongyi, Zoltan; Garcia-Molina, Hector; Pedersen, Jan
• Link Spam Detection Based on Mass Estimation (2005) - Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan
• http://www.firstmonday.org/issues/issue10_10/tatum/
• http://en.wikipedia.org/wiki/TFIDF
• http://en.wikipedia.org/wiki/Pagerank