Post on 19-Jan-2016
© 2006 Nielsen BuzzMetrics, A VNU business affiliate
Natalie Glance
Senior Research Scientist
Nielsen BuzzMetrics
Background
Nielsen BuzzMetrics aggregates consumer opinion
expressed in message boards, weblogs, Usenet and
other online discussions
Parent company behind BlogPulse, a blog search and
analytics website
What drives weblog spam?
Same goal as any other website spam: SEO
Weblog hosts provide:
Free hosting for link farms to promote affiliate sites
Free hosting for web pages with sponsored ads
Types of weblog spam
spam blogs (pollute ping servers)
spam comments on legitimate blogs
spam trackback pings to legitimate blogs
Collateral damage: blog search result contamination
[Screenshot: search results for ‘mortgage’]
Collateral damage: trend graphs
Explain the peaks: are they real or artifacts of spam?
Collateral damage: real-time monitoring
Spikes in keyword clusters
2006/07/28 10:39 a.m. {deleted myspace account}
2006/07/27 10:55 a.m. {landis tested yesterday}
2006/08/07 3:22 a.m. {investing debt directory}
2006/08/07 6:54 a.m. {adsense cents makers}
2006/08/07 1:11 p.m. {wwdc keynote}
Breaking news or spam attack?
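A minimal sketch of how a spike in one keyword cluster might be flagged. The rule used here (count above the trailing mean plus k standard deviations), the function name find_spikes, and the sample counts are all assumptions for illustration, not the method behind the slide. Note that the rule only detects a spike; it cannot say whether the cause is breaking news or a spam attack.

```python
from statistics import mean, stdev

def find_spikes(counts, window=24, k=3.0):
    """Return indices whose count exceeds mean + k * stdev of the preceding window.

    Assumed rule for illustration only; not taken from the slides.
    """
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if counts[i] > threshold:
            spikes.append(i)
    return spikes

# Hypothetical hourly mention counts for one keyword cluster.
hourly = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4, 3, 5, 40, 5, 4]
print(find_spikes(hourly, window=8))  # prints [12]
```

Each flagged index would still need a second step (a spam classifier or a human reviewer) to decide which of the two explanations applies.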
Spam filtering challenges
Different analytics, different trade-offs
weblog search: high coverage, clean results, minimal
false positives
trend search: high precision to eliminate spurious artifacts
real-time monitoring: high coverage with human oversight
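One way to read these trade-offs is as different decision thresholds applied to the same spam score. The sketch below assumes a classifier that outputs a score per post (higher means more spam-like); the scores, labels, thresholds, and the function name filter_stats are hypothetical.

```python
def filter_stats(scores, is_spam, threshold):
    """Treat score >= threshold as spam; return (spam_caught, legit_lost) as fractions."""
    flagged = [s >= threshold for s in scores]
    spam_total = sum(is_spam)
    legit_total = len(is_spam) - spam_total
    spam_caught = sum(f and y for f, y in zip(flagged, is_spam)) / spam_total
    legit_lost = sum(f and not y for f, y in zip(flagged, is_spam)) / legit_total
    return spam_caught, legit_lost

# Hypothetical classifier scores and true labels, for illustration only.
scores  = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
is_spam = [True, True, True, False, True, False, False, False]

# A strict threshold keeps coverage high; a loose one removes more spam.
for name, threshold in [("weblog search", 0.8), ("trend search", 0.35)]:
    caught, lost = filter_stats(scores, is_spam, threshold)
    print(f"{name}: threshold={threshold} spam_caught={caught:.2f} legit_lost={lost:.2f}")
```

With this toy data the strict threshold loses no legitimate posts but catches only half the spam, while the loose threshold catches all the spam at the cost of a quarter of the legitimate posts.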
Different timeframes, different approaches
real-time search: highly efficient classification algorithms; automated
identification of spam attacks
historic search: offline spam identification can use a combination of
approaches; sandbox for new weblogs