BlogVox: Separating Blog Wheat from Blog Chaff
Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC)
James Mayfield (JHU/APL)
Motivation: Cleaning the Harvest
• BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
• Spam blogs (splogs) and extraneous content water down the quality of the index.
• Narrowing down to the content of the post is essential in the absence of clearly demarcated opinion sentences (as found on Epinions, IMDB, Amazon, etc.).
• Noisy and unstructured text on the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).
BlogVox Opinion Extraction System
• TREC 06: Finding opinionated posts, either positive or negative, about a query
• 2006 TREC Blog corpus:
  • 80K blogs
  • 300K posts
  • 50 test queries
• BlogVox opinion extraction system
  • Document and sentence level scorers
  • Combined scores using an SVM meta-learner
  • Data cleaning: splogs and post identification
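To make the meta-learner bullet concrete, here is a minimal sketch of how scorer outputs could be combined at prediction time. It assumes a linear kernel, which the slides state only for the SVM cleaner and not for the meta-learner. The weights, bias, and function names are hypothetical placeholders, not values from BlogVox; in practice they would be learned from labeled posts.

```python
from typing import Sequence

# Hypothetical learned parameters: one weight per scorer plus a bias.
# Order of scores: (document-level score, sentence-level score).
WEIGHTS = (0.7, 0.3)
BIAS = -0.5


def decision_value(scores: Sequence[float],
                   weights: Sequence[float] = WEIGHTS,
                   bias: float = BIAS) -> float:
    """Linear SVM decision function: w . x + b over the scorer outputs."""
    if len(scores) != len(weights):
        raise ValueError("expected one score per scorer")
    return sum(w * s for w, s in zip(weights, scores)) + bias


def is_opinionated(scores: Sequence[float]) -> bool:
    """Label a post opinionated when the decision value is positive."""
    return decision_value(scores) > 0.0


print(is_opinionated((0.9, 0.8)))  # both scorers confident
print(is_opinionated((0.2, 0.4)))  # both scorers weak
```

The point of the design is that the individual scorers never make the final call; only their combined, weighted evidence does.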
BlogVox challenges
• Data cleaning and splog removal
• Slang
• Semantic orientation of words
• Contradictions, sarcasm, ungrammatical text
Separating Blog Wheat from Blog Chaff
Data cleaning for
• Splog removal
• Post content identification
Pre Indexing Steps
1. Collection Parsing
2. Non-English Blog Removal
3. Splog Detection
4. Title and Content Extraction
Spam in the Blogosphere
• Types: comment spam, ping spam, splogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 56% of blogs are spam (ebiquity 2005)
• 20% of blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads
• “Spings, or ping spam, are pings that are sent from spam blogs”
Motivation: host ads
Motivation: index affiliates, promote PageRank
Data Cleaning: Splogs
• Splog detection using SVM
• 700 blogs, 700 splogs used for training
• Model based on blog homepage and local blog features
[Figure labels: Host ads; Index affiliates, promote PageRank; Plagiarized content]
Splog Detection Performance
Nature of Splogs in TREC 2006
• Around 83K identifiable blog home pages in the collection, with 3.2M permalinks
• 81K blogs could be processed
• We use splog detection models developed on blog home pages; 87% accuracy
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs
  • ~16% of the entire collection
  • ~17% splog posts injected into TREC dataset [1]
[1] C. Macdonald, I. Ounis, The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection
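The percentages on this slide can be sanity-checked from the counts it quotes. This is a minimal sketch: the counts are the rounded figures from the slide, so the ratios are approximate, and the constant and function names are illustrative.

```python
# Rounded counts quoted on the slide.
TOTAL_PERMALINKS = 3_200_000
BLACKLISTED_PERMALINKS = 543_000
PROCESSED_BLOGS = 81_000
DETECTED_SPLOGS = 13_542


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


permalink_share = percent(BLACKLISTED_PERMALINKS, TOTAL_PERMALINKS)
blog_share = percent(DETECTED_SPLOGS, PROCESSED_BLOGS)

print(f"blacklisted permalinks: {permalink_share}% of all permalinks")
print(f"detected splogs: {blog_share}% of processed blogs")
```

Both ratios fall in the 16 to 17% range the slide reports; the exact values depend on the unrounded counts.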
Impact of Splogs in TREC Queries
[Figure: Distribution of splogs that appear in TREC queries. x-axis: top 100 search results ranked using TFIDF scoring (cutoffs 5 to 100); y-axis: number of splogs (0 to 120); legend: queries 851 to 898; callouts: American Idol, Cholesterol, Hybrid Cars.]
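The measurement behind this plot can be sketched as follows: for one query, take its ranked result list and count how many of the top k results come from blacklisted splogs, for k = 5, 10, ..., 100. The sketch assumes the counts are cumulative at each cutoff, which the slide does not state. The function name and the toy blog identifiers are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set


def splogs_at_cutoffs(ranked_blogs: Sequence[str],
                      splog_blacklist: Set[str],
                      cutoffs: Iterable[int] = range(5, 101, 5)) -> Dict[int, int]:
    """Count blacklisted entries among the top-k results for each cutoff k."""
    return {
        k: sum(1 for blog in ranked_blogs[:k] if blog in splog_blacklist)
        for k in cutoffs
    }


# Toy ranked list for a single query (identifiers are made up).
ranked = ["b1", "s1", "b2", "s2", "s3", "b3", "b4", "s4", "b5", "b6"]
blacklist = {"s1", "s2", "s3", "s4"}

print(splogs_at_cutoffs(ranked, blacklist, cutoffs=(5, 10)))
```

Running this per query and plotting count against cutoff would give one curve per query.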
Higher in Spam Prone Contexts
[Figure: Splog distribution for 'spam terms'. x-axis: search result rank (5 to 100); y-axis: number of splogs (0 to 120); legend: series 1 to 28; callouts: Card, Interest, Mortgage.]
Spam query terms based on analysis by Macdonald et al. 2006.
Data Cleaning: Content Identification
[Figure labels: Navigation, Post content, Ads, Recent Posts]
Data cleaning: Baseline heuristic
Eliminate link a if there exists a link b such that
• b is within θ distance of a
• There are no Title tags between the links
• The average length of text-bearing nodes is less than a threshold
• b is the nearest link to a
[Figure: an example DOM tree, with regions labeled Navigational Links, Ads, Post Content, Sidebar]
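The baseline heuristic can be sketched in a few lines. This is one possible reading, not the authors' implementation. It assumes the DOM has been flattened into a document-order list of nodes, that "distance" means the gap in positions within that list, and that the average text length is taken over the text nodes between the two links. The `Node` type, the function name, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    kind: str       # "link", "text", or "title"
    text: str = ""


def extraneous_links(nodes: List[Node], theta: int = 3,
                     max_avg_text_len: float = 20.0) -> List[int]:
    """Return positions of links the heuristic would eliminate."""
    link_idx = [i for i, n in enumerate(nodes) if n.kind == "link"]
    eliminated = []
    for pos, a in enumerate(link_idx):
        # The nearest link b is the previous or the next link in document order.
        neighbors = []
        if pos > 0:
            neighbors.append(link_idx[pos - 1])
        if pos + 1 < len(link_idx):
            neighbors.append(link_idx[pos + 1])
        if not neighbors:
            continue
        b = min(neighbors, key=lambda j: abs(j - a))
        lo, hi = sorted((a, b))
        if hi - lo > theta:
            continue  # b is not within theta distance
        between = nodes[lo + 1:hi]
        if any(n.kind == "title" for n in between):
            continue  # a title separates the two links
        lengths = [len(n.text.strip()) for n in between if n.kind == "text"]
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        if avg_len < max_avg_text_len:
            eliminated.append(a)
    return eliminated


page = [
    Node("link", "Home"), Node("link", "About"), Node("link", "Archive"),
    Node("title", "My post"), Node("text", "x" * 80),
    Node("link", "a cited source"), Node("text", "y" * 80),
    Node("link", "Ad 1"), Node("link", "Ad 2"),
]
print(extraneous_links(page, theta=2))
```

In this toy page the navigation and ad links sit in dense clusters with little text between them and are dropped, while the link embedded in the post body survives because long text separates it from its nearest neighbor.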
Data cleaning: SVM cleaner
• Random collection of 150 blog posts
• Human evaluation of 400 links tagged as content or extraneous links
• We trained an SVM using a linear kernel in this analysis
[Figure labels: DOM Features, Evaluation, Tag Features, Position Features, Word Features]
Data Cleaning: Effect of sidebar content
Related Work
• Web Spam Detection
  • Coverage: Blog analytics engines don’t look beyond the Blogosphere
  • Speed of detection is important, 150K posts/hour
  • RSS feeds present new opportunities and challenges
• Email Spam Detection
  • Nature of spamming: links, RSS feeds, web graph, metadata
  • Users targeted indirectly through search engines, e.g. “N1ST” not relevant for “NIST” query
• Template Detection
  • Repeated structural components detected via sampling
  • Customization and use of JavaScript and AJAX are increasing
  • Simple heuristics using DOM traversal work well in general cases
• Sentiment Analysis
  • Open domain opinion extraction is complex
  • Opinions are part of a narrative
  • Subject for which the opinion is being expressed is not easy to detect
Conclusions
• Noisy content on the Blogosphere presents a major challenge to the quality of blog analytics tools.
• A combination of heuristics and ML can be used to effectively clean the data.
Ongoing Work
• DOM subtree elimination
• Identifying the subject of the opinion
• Slang
• More training examples!
http://ebiquity.umbc.edu/
Thank you!
Backup Slides
Opinions in Social Media
“I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board……..” [1]
[1] http://annamay13x.livejournal.com/7061.html
Expressed Opinions
Narrative
Reader’s Perspective
“Starbucks Sandwiches are bad!”
Opinions can influence buying decisions of customers
Keyword Stuffed Blog
• ‘coupon codes’, ‘casino’
Post Stitching
• Excerpts scraped from other sources
Post Weaving
• Spam links contextually placed in post
Link-roll spam
• With fully plagiarized text
Difficulty
• We have been experimenting with multiple approaches starting mid 2005
• Data: http://ebiquity.umbc.edu/resource/html/id/212
Difficulty
• Evolving spamming techniques and splog creation genres
• Most basic spam techniques
  • Generate content by stuffing key dictionary words
  • Generate links to affiliates, through link dumps on blogrolls, linkrolls or after post content
• Evolving spam techniques
  • Scrape contextually similar content to generate posts
  • RSS hijacking
  • Aggregation software, e.g. Planet X
  • Intersperse links randomly
  • Make link placement meaningful
  • Add spam comments and then ping. Repeat.
TREC Submissions (Topic Relevance)
TREC Submissions (Opinion Extraction)