Web Spam, Propaganda and Trust
![Page 1: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/1.jpg)
Web Spam, Propaganda and Trust
P. Takis Metaxas
Computer Science Department
Wellesley College
Joint work with Joe DeStefano
![Page 2: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/2.jpg)
Outline of the Talk
The Web and its Spam
A Short History of the Search Engines
Web Spam as Propaganda
Propaganda Primer
Anti-propagandistic techniques on Spam
Experimental Results
Conclusions and Next Steps
![Page 3: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/3.jpg)
The Web …
Has changed the way we get informed
Has changed the way we make decisions (financial, medical, political, …)
Is huge
  2-10 billion static pages publicly available, doubling every year
  Three times this, if you count the “deep web”
  Infinite, if you count dynamically created pages
Will be omnipresent
  Computers, cell phones, PDAs, thermostats, toasters ...
Can be unreliable
![Page 4: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/4.jpg)
… and its Spam
![Page 5: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/5.jpg)
… and its Spam
![Page 6: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/6.jpg)
What is Web Spam?
The practice of manipulating web pages in order to cause search engines to rank them higher than they would without manipulation
  “… than they deserve”
  “… unjustifiably favorable [ranking wrt] the page’s true value”
  “… unethical web page positioning”
It is a problem, not only for search engines
  Primarily for users
  As well as for content providers
It is first a social problem, then a technical one
![Page 7: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/7.jpg)
Who is Spamming and Why?
Companies
  Big companies
  Small businesses
Advertisers and Promoters
  Search Engine Optimizers
Special interest groups
  Religious interests
  Financial interests
  Medical interests
  Political interests
  etc.
Everybody could/would
  My doctor
  You (?), Me (!)
85% of searchers do not go beyond the top 10
People (still) trust the written word
People trust the search engines
![Page 8: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/8.jpg)
A Short History of Search Engines
1st Generation (ca 1994): AltaVista, Excite, Infoseek…
  Ranking based on Content: Pure Information Retrieval
2nd Generation (ca 1996): Lycos
  Ranking based on Content + Structure: Site Popularity
3rd Generation (ca 1998): Google, Teoma
  Ranking based on Content + Structure + Value: Page Reputation
In the Works
  Ranking based on “the need behind the query”: ??
![Page 9: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/9.jpg)
1st Generation: Content Similarity
Boolean operations on query terms did not go very far
Content Similarity Ranking: The more rare words two documents share, the more similar they are
Similarity is measured by vector angles
Query results are ranked by sorting the angles between query and documents
How To Spam?
[Figure: documents d1 and d2 drawn as vectors in a term space with axes t1, t2, t3]
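The content-similarity ranking described on this slide can be sketched in a few lines. This is only an illustration, not the ranking function of any particular engine: the TF-IDF weighting (used here so that rare words count more) and the function names `tf_idf_vectors`, `cosine` and `rank` are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} vectors.

    Assumption: rare terms are up-weighted with a smoothed inverse
    document frequency, so sharing a rare word counts for more.
    """
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document indices, smallest angle (largest cosine) first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Sorting by descending cosine is the same as sorting by ascending angle, which is how the slide phrases it.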
![Page 10: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/10.jpg)
1st Generation: How to Spam
Add keywords so as to confuse page relevance
Hide them from human eyes
Searching for Jennifer Aniston?
SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK
[The same keyword list is repeated three more times on the slide]
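A small sketch of why keyword stuffing works against a purely content-based ranker. The pages below are made up for illustration, and raw term counts with cosine similarity stand in for whatever weighting a first-generation engine actually used.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "jennifer aniston".split()

# Hypothetical pages, invented for this example.
honest_page = "interview with jennifer aniston about her new film".split()
spam_page = "buy cheap pills online today".split()

# The spammer appends the query terms many times (hidden from human eyes).
stuffed_page = spam_page + "jennifer aniston".split() * 20

score_honest = cosine(query, honest_page)
score_spam = cosine(query, spam_page)
score_stuffed = cosine(query, stuffed_page)
```

In this toy setup the unstuffed spam page shares no terms with the query, while the stuffed version ends up closer to the query than the honest page does.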
![Page 11: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/11.jpg)
2nd Generation: Site Popularity
A link from a page in site A to some page in site B is considered a popularity vote from A to B
Rank similar pages according to popularity
Related implementation of Popularity: DirectHit’s Click-throughs
Rich get richer: users will always try first few links returned
How To Spam?
[Figure: link graph of sites labeled with popularity counts: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
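The popularity vote can be sketched as in-link counting at the site level. The details here are assumptions for illustration: each linking site counts once no matter how many of its pages link, links within a site are ignored, and the function name `site_popularity` is hypothetical.

```python
from urllib.parse import urlparse

def site_popularity(page_links):
    """Count, for each site, how many *other* sites link to it.

    page_links: iterable of (source_page_url, target_page_url) pairs.
    Assumption: one vote per distinct linking site; self-links do not count.
    """
    voters = {}
    for src, dst in page_links:
        site_a = urlparse(src).netloc
        site_b = urlparse(dst).netloc
        # Make sure every site appears in the result, even with zero votes.
        voters.setdefault(site_a, set())
        voters.setdefault(site_b, set())
        if site_a != site_b:
            voters[site_b].add(site_a)
    return {site: len(v) for site, v in voters.items()}
```

Because a vote is just a link, anyone who controls several sites can manufacture votes, which is what the next slide exploits.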
![Page 12: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/12.jpg)
2nd Generation: How to Spam
Heavily interconnected “link farms” spam popularity
Clicking robots spam click-throughs
![Page 13: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/13.jpg)
3rd Generation: Page Reputation
A link from a page Px to page Py is considered a confidence vote from Px to Py
  Confidence builds reputation (as in academic co-citations)
The reputation “PageRank” of a page Pi = the sum of a fraction of the reputations of all pages Pj that point to Pi
Beautiful Math behind it
  PR = principal eigenvector of the web’s link matrix
  PR equivalent to the chance of randomly surfing to the page
HITS algorithm tries to recognize “authorities” and “hubs”
How To Spam?
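The reputation rule above can be approximated by power iteration. This is a minimal sketch, not Google's implementation: the damping factor of 0.85, the uniform treatment of pages without out-links, and the fixed iteration count are assumptions that the slide does not state.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: {page: [pages it links to]}.
    Each page passes a fraction of its reputation, split evenly,
    to the pages it points to.
    """
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-surfer jump: every page gets a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank
```

The resulting values sum to 1 and can be read as the chance that a random surfer is on each page, matching the slide's interpretation.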
![Page 14: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/14.jpg)
3rd Generation: How to Spam
Organize “mutual admiration societies” of irrelevant reputable sites
![Page 15: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/15.jpg)
An Industry is Born
“SE Optimizer” Companies
Advertisement Consultants
Conferences
![Page 16: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/16.jpg)
Web Spam as a major force behind Search Engine Evolution

| Search Engine’s Action | Web Spammers’ Response |
| --- | --- |
| 1st Generation: Pure IR (Content) | Add keywords so as to confuse page relevance |
| 2nd Generation: Popularity (Content + Structure) | Create “link farms” of heavily interconnected sites |
| 3rd Generation: Reputation (Content + Structure + Value) | Organize “mutual admiration societies” of irrelevant sites |
| In the Works: Ranking based on “the need behind the query” | ?? |

Is there a pattern on how to spam?
Can you guess what they will do?
They will try to modify the Web Graph for their benefit
![Page 17: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/17.jpg)
And Now For Something Completely Different(?)
Propaganda: Attempt to modify human behavior, and thus influence people’s actions, in ways beneficial to propagandists
Theory of Propaganda
  Developed by the Institute for Propaganda Analysis, 1938-1942
Propagandistic Techniques (and ways of detecting propaganda)
  Word games
    Name Calling
    Glittering Generalities
  Transfer
  Testimonial
  Bandwagon
![Page 18: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/18.jpg)
Societal Trust is a Network
A Simplified Description of Societal Trust:
Weighted Directed Graph of Nodes and Weighted Arcs
  Nodes = Societal Entities (People, Ideas, …)
  Arcs = Recommendation from an entity to another
  Arc weight = Degree of entrustment
Then what is Propaganda?
  Attempt to modify the Trust Social Network in ways beneficial to propagandist
And what is Web Spam?
  Attempt to modify the Web Graph in ways beneficial to spammer
![Page 19: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/19.jpg)
Web Spam as Propaganda
| SE’s | Ranking | Spamming | Propaganda |
| --- | --- | --- | --- |
| 1st Gen | Doc Similarity | Keyword stuffing | Glittering generalities |
| 2nd Gen | + Site popularity | + link farms | + Bandwagon |
| 3rd Gen | + Page reputation | + mutual admiration societies | + Testimonials |

Web Spam is a major force behind Search Engine evolution
So what?
Can this understanding help us defend against web spam?
![Page 20: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/20.jpg)
Anti-Propagandistic Lessons for Web
How do you deal with propaganda in real life?
Backward propagation of distrust
  The recommender of an untrustworthy message becomes untrustworthy
Can you transfer this technique to the web?
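Backward propagation of distrust can be sketched as a breadth-first walk against the direction of recommendations. The graph representation, the `max_depth` cut-off and the function name `propagate_distrust` are assumptions for illustration; the slide only states the principle.

```python
from collections import deque

def propagate_distrust(recommends, untrustworthy, max_depth=1):
    """Mark recommenders of an untrustworthy entity as distrusted.

    recommends: {entity: set of entities it recommends}.
    Returns {entity: number of recommendation steps back from the source}.
    """
    # Invert the graph: who recommends each entity?
    recommenders = {}
    for src, targets in recommends.items():
        for dst in targets:
            recommenders.setdefault(dst, set()).add(src)

    distrusted = {untrustworthy: 0}
    queue = deque([untrustworthy])
    while queue:
        node = queue.popleft()
        depth = distrusted[node]
        if depth == max_depth:
            continue  # do not walk further back than max_depth
        for r in recommenders.get(node, ()):
            if r not in distrusted:
                distrusted[r] = depth + 1
                queue.append(r)
    return distrusted
```

On the web, "recommends" becomes "links to", so the recommenders of a page are its backlinks.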
![Page 21: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/21.jpg)
An Anti-Propagandistic Algorithm
Start from untrustworthy site s
S = {s}
Using BFS for depth D do:
  Find the set U of sites linking to sites in S (using the Google API for up to B b-links/site)
  Ignore blogs, directories, edu’s
  S = S + U
Find the bi-connected component BCC of U that includes s
BCC shows multiple paths to boost the reputation of s
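The steps above can be sketched as follows. This is a rough sketch under stated assumptions, not the authors' code: `get_backlinks` and `is_ignored` are hypothetical callables supplied by the caller (the slide uses the Google API for backlinks), the graph is treated as undirected when computing bi-connected components, and if s lies in several components the largest one is returned.

```python
def explore_backlinks(start, get_backlinks, depth, max_backlinks,
                      is_ignored=lambda site: False):
    """BFS over backlinks; returns an undirected graph {site: set(neighbors)}."""
    graph = {start: set()}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for site in frontier:
            # Look at no more than max_backlinks (B) backlinks per site.
            for linker in list(get_backlinks(site))[:max_backlinks]:
                if linker == site or is_ignored(linker):
                    continue  # e.g. blogs, directories, .edu sites
                if linker not in graph:
                    graph[linker] = set()
                    next_frontier.append(linker)
                graph[site].add(linker)
                graph[linker].add(site)
        frontier = next_frontier
    return graph

def biconnected_components(graph):
    """Vertex sets of the bi-connected components (Hopcroft-Tarjan, recursive).

    Recursion depth grows with the graph, so this suits small neighborhoods.
    """
    index, low = {}, {}
    edge_stack, components = [], []

    def dfs(u, parent):
        index[u] = low[u] = len(index)
        for v in graph[u]:
            if v == parent:
                continue
            if v not in index:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:
                    # u separates v's subtree: pop one whole component.
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif index[v] < index[u]:
                edge_stack.append((u, v))  # back edge to an ancestor
                low[u] = min(low[u], index[v])

    for node in graph:
        if node not in index:
            dfs(node, None)
    return components

def bcc_containing(graph, site):
    """Largest bi-connected component that includes `site` (an assumption)."""
    candidates = [c for c in biconnected_components(graph) if site in c]
    return max(candidates, key=len, default={site})
```

Every pair of sites in a bi-connected component is joined by at least two vertex-disjoint paths, which is the "multiple paths" property the slide relies on.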
![Page 23: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/23.jpg)
Explored neighborhoods
![Page 24: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/24.jpg)
Evaluated Experimental Results
| Target | \|G\| | \|BCC\| | Trustworthy | Untrustworthy | Directory |
| --- | --- | --- | --- | --- | --- |
| renuva.net | 1307 | 228 | 2% = 1/46 | 74% = 34/46 | 13% |
| coral-calcium-benefits.com | 1380 | 266 | 4% = 2/54 | 78% = 42/54 | 7% |
| vespro.com | 875 | 97 | 0% = 0/20 | 80% = 16/20 | 15% |
| hardcorebodybuilding.com | 457 | 63 | 0% = 0/13 | 69% = 9/13 | 15% |
| maxsportsmag.com | 716 | 105 | 0% = 0/22 | 64% = 14/22 | 27% |
| coral1.com | 312 | 228 | 9% = 4/47 | 60% = 28/47 | 13% |
| genf20.com | 81 | 32 | 0% = 0/32 | 100% = 32/32 | 0% |
| 1stHGH.com | 1547 | 200 | 5% = 2/40 | 70% = 28/40 | 10% |
| hgfound.org | 1429 | 164 | 56% = 19/34 | 14% = 1/34 | 26% |
| advice-hgh.com | 241 | 13 | 77% = 10/13 | 15% = 2/13 | 8% |
![Page 25: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/25.jpg)
Evaluated Experimental Results
![Page 26: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/26.jpg)
Conclusions and Next Steps
Web Spam / Cyberworld = Propaganda / Society
Particular spamming techniques can be uncovered - then what?
Spam becomes a necessity as the web grows
  “I spent all my life searching for the meaning of life…”
  “If you cannot find it on eBay or Google, it does not exist”
Spam to you, treasure to me
“Who do you trust?” is the right question to ask (and provide tools for managing the trusted and the distrusted)
  Personalization of search: a search engine (component) per browser
  Or: specialized search engines
Education, critical thinking
  What we believe, why we believe it
Cyber-social structures and networks
  I inherit the trusted/distrusted networks of the societies I join
![Page 27: Web Spam, Propaganda and Trust](https://reader033.fdocuments.us/reader033/viewer/2022051318/586765591a28ab36408b9715/html5/thumbnails/27.jpg)
How (not) To Solve The Problem