Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a...

4
Understanding Botnet- driven Blog Spam: Motivations and Methods Brandon Bevans [email protected] California Polytechnic State University United States of America Bruce DeBruhl [email protected] California Polytechnic State University United States of America Foaad Khosmood [email protected] California Polytechnic State University United States of America Introduction Spam, or unsolicited commercial communication, has evolved from telemarketing schemes to a highly sophisticated and profitable black-market business. Although many users are aware that email spam is prominent, they are less aware of blog spam (Thom- ason, 2007). Blog spam, also known as forum spam, is spam that is posted to a public or outward facing web- site. Blog spam can be to accomplish many tasks that email spam is used for, such as posting links to a mali- cious executable. Blog spam can also serve some unique purposes. First, blog spam can influence purchasing decisions by featuring illegitimate advertisements or reviews. Se- cond, blog spam can include content with target key- words designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link spam, which spams a URL on a victim page to increase the inserted URLs search engine rank- ing. Overall, blog spam weakens search engines’ model of the Internet popularity distribution. Much academic and industrial effort has been spent to detect, filter, and deter spam (Dinh, 2013), (Spirin and Han, 2012). Less effort has been placed in understanding the underlying distribution mechanisms of spambots and botnets. One foundational study in characterizing blog spam (Niu et al., 2007) provided a quantitative analy- sis of blog spam in 2007. This study showed that blogs in 2007 included incredible amounts of spam but does not try to identify linked behavior that would imply botnet behavior. A later study on blog spam (Stringhini, 2015) explores using IPs and usernames to detect botnets but does not characterize the behav- ior of these botnets. In 2011, a research team (Stone- Gross et al., 2011) infiltrated a botnet, which allowed for observations of the logistics around botnet spam campaigns. Overall, our understanding of blog spam generated by botnets is still limited. Related Work Various projects have attempted to identify the me- chanics, characteristics, and behavior of botnets that control spam. In one important study (Shin et al., 2011), researchers fully evaluated how one of the most popular spam automation programs, XRumer, operates. Another study explored the behavior of bot- nets across multiple spam campaigns (Thonnard and Dacier, 2011). Others (Pitsillidis et al., 2012) examined the impact that spam datasets had on characterization results. (Lumezanu et al., 2012) explored the similari- ties between email spam and blog spam on Twitter. They show that over 50% of spam links from emails also appeared on Twitter. Figure 1: Browser rendering of the ggjx honeypot The underground ecosystem build around the bot- net community has been explored (Stone-Gross et al., 2011). In a surprising result, over 95% of pharmaceu- ticals advertised in spam were handled by a small group of banks (Levchenko et al., 2011). Our work is similar in that we are trying to characterize the botnet ecosystem, focusing on the distribution and classifica- tion of certain spam producing botnets. Experimental Design

Transcript of Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a...

Page 1: Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link

Understanding Botnet-driven Blog Spam: Motivations and Methods BrandonBevansbrandonbevans@gmail.comCaliforniaPolytechnicStateUniversityUnitedStatesofAmericaBruceDeBruhlbrandonbevans@gmail.comCaliforniaPolytechnicStateUniversityUnitedStatesofAmericaFoaadKhosmoodbrandonbevans@gmail.comCaliforniaPolytechnicStateUniversityUnitedStatesofAmerica

Introduction Spam, or unsolicited commercial communication,

has evolved from telemarketing schemes to a highlysophisticated and profitable black-market business.Although many users are aware that email spam isprominent, theyare lessawareofblogspam(Thom-ason,2007).Blogspam,alsoknownasforumspam,isspamthatispostedtoapublicoroutwardfacingweb-site.Blogspamcanbetoaccomplishmanytasksthatemailspamisusedfor,suchaspostinglinkstoamali-ciousexecutable.

Blog spam can also serve someunique purposes.First,blogspamcaninfluencepurchasingdecisionsbyfeaturing illegitimate advertisements or reviews. Se-cond,blogspamcanincludecontentwithtargetkey-words designed to change the way a search engineidentifies pages (Geerthik, 2013). Lastly, blog spamcancontainlinkspam,whichspamsaURLonavictimpagetoincreasetheinsertedURLssearchenginerank-ing.Overall,blogspamweakenssearchengines’modeloftheInternetpopularitydistribution.Muchacademicand industrial effort has been spent to detect, filter,anddeterspam(Dinh,2013),(SpirinandHan,2012).

Less effort has been placed in understanding theunderlyingdistributionmechanismsofspambotsandbotnets.Onefoundationalstudyincharacterizingblog

spam(Niuetal.,2007)providedaquantitativeanaly-sisofblogspamin2007.Thisstudyshowedthatblogsin2007includedincredibleamountsofspambutdoesnot try to identify linked behavior thatwould implybotnet behavior. A later study on blog spam(Stringhini,2015)exploresusing IPsandusernamestodetectbotnetsbutdoesnotcharacterizethebehav-iorofthesebotnets.In2011,aresearchteam(Stone-Grossetal.,2011)infiltratedabotnet,whichallowedforobservationsof the logisticsaroundbotnet spamcampaigns. Overall, our understanding of blog spamgeneratedbybotnetsisstilllimited.

Related Work Variousprojectshaveattemptedtoidentifytheme-

chanics, characteristics,andbehaviorofbotnets thatcontrol spam. In one important study (Shin et al.,2011), researchers fully evaluated how one of themost popular spam automation programs, XRumer,operates.Anotherstudyexploredthebehaviorofbot-netsacrossmultiplespamcampaigns(ThonnardandDacier,2011).Others(Pitsillidisetal.,2012)examinedtheimpactthatspamdatasetshadoncharacterizationresults.(Lumezanuetal.,2012)exploredthesimilari-ties between email spamand blog spamonTwitter.Theyshowthatover50%ofspamlinks fromemailsalsoappearedonTwitter.

Figure 1: Browser rendering of the ggjx honeypot

Theundergroundecosystembuildaroundthebot-netcommunityhasbeenexplored(Stone-Grossetal.,2011).Inasurprisingresult,over95%ofpharmaceu-ticals advertised in spam were handled by a smallgroupofbanks(Levchenkoetal.,2011).Ourworkissimilarinthatwearetryingtocharacterizethebotnetecosystem,focusingonthedistributionandclassifica-tionofcertainspamproducingbotnets.

Experimental Design

Page 2: Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link

Inordertoclassifylinguisticsimilarityanddiffer-encesinbotnets,weimplement3honeypotstogathersamples of blog spam. We configure our honeypotsidenticallyusingtheDrupalcontentmanagementsys-tems(CMS)asshowninFigure1.Ourhoneypotsareidenticalexceptforthecontentoftheirfirstpostandtheir domain name. Ggjx.org is fashion themed,npcagent.com is sports themed, and gjams.com ispharmaceutical themed. We combine the data col-lected from Drupal with the Apache server logs(Apache, 2016) to allow for content analysis of datacollectedover42days.Toallowbotnets timetodis-cover the honeypots, we activate the honeypots atleast6-weeksbeforedatacollection.

Wegeneratethreetablesofcontentforeachhoney-pot(BevansandKhosmood,2016).Intheusertable,werecordthe informationthespambotenterswhileregisteringanduserloginstatisticsthatwesummarizeinTable1.Thisincludestheuserid,username,pass-word,dateofregistration,registrationIP,andnumberoflogins.Inthecontenttable,werecordthecontentofspampostsandcommentswhichwesummarizeinTa-ble 2. This includes the blog node id, the author’suniqueid,thedateposted,thenumberofhits,typeofpost,titleofthepost,textofthepost,linksinthepost,languageofthepost,andataxonomyofthepostfromIBM’sAlchemyAPI.

Table 1: User table characteristics for three honeypots

Table 2: Characteristics for the content tables

Table 3: Characteristics of entities

Lastly, in the access table, we include data andmeta-datafromtheApachelogs.Thisincludestheuserid,theaccessIP,theURL,theHTTPrequesttype,thenodeID,andanactionkeyworddescribingthetypeofaccess.

Our honeypots received a total of 1.1million re-questsforggjx,481thousandrequestsforgjams,and591thousandrequestsfornpcagent.

Entity Reduction It is widely accepted that spambot networks, or

botnets,areresponsibleformostspam.Therefore,wealgorithmicallyreducespaminstancesintouniqueen-titiesrepresentingbotnets.Foreachentity,wedefine4attributes:entityid,associatedIPs,usernames,andassociated user ids. To construct entities we scanthroughtheusersandassigneachonetoanentityasfollows.

1. Forauser,ifanentityexistswhichcontainsitsusernameorIP,theuserisaddedtotheentity.

2. Ifmore than one entitymatches the abovecriteria,allmatchingentitiesaremerged.

3. Ifnoentitymatchestheabovecriteria,anewentityiscreated.

WesummarizetheentitycharacteristicsinTable3.Themaximumnumberofusersinoneentityisalmost38 thousand for ggjx with over 100 unique IP ad-dresses.Theseresultsconfirmwhatisexpected-thevastmajorityofbots interactingwithourhoneypotsarepartof largebotnets. Thisalsoallowsustoper-formcontentanalysisexploringwhatlinguisticquali-tiesdifferentiatebotnets.

Table 4: NLP feature sets we consider for our content

analysis and their effectiveness at differentiating botnets

Content Analysis Tobetterunderstandbotnets,weusenaturallan-

guageprocessing(CollobertandWeston,2008)foran-alyzingthelinguisticcontentofentities.Forouranal-ysis,we consider various feature sets as proxies forlinguisticcharacteristicsassummarizedinTable4.WeuseaMaximumEntropyclassifier(MegaM,2016)totestwhich featuresdifferentiatebotnets. Inorder totestafeature,wetraintheclassifierwith70%oftheposts, randomly selected, from theN largest entities

Page 3: Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link

andtest itwiththeremaining30%of theposts.Ourfinalresultsaretheaverageofthreeruns.

ThefirstfeaturesetwetestisBagOfWords(BoW)whichmodelsthelexicalcontentofposts.Putsimply,eachwordinadocumentisputintoa‘bag’andthesyn-tacticstructureisdiscarded.Forimplementationde-tails,seeourtechnicalreport(Bevans,2016).InFigure2,weshowouranalysisoftheBoWfeatureset.

Whenconsidering the top5 contributingentities,theclassificationaccuracyislessthan95%whichim-pliesthatthelexicalcontentofbotnetsvariesgreatly.Thesecondfeatureweconsideristhetaxonomypro-videdbyIBMWatson’sAlchemyAPI.Alchemy’soutputisalistoftaxonomylabelsandassociatedconfidences.Forthepurposeofouranalysis,wediscardanylowornon-confidentlabels.InFigure3,weshowouranalysisoftheAlchemyTaxonomyfeaturesetwhichhighlightstheaccuracyofAlchemy’staxonomy.WenotethattheAlchemyTaxonomyfeaturesetisdramaticallysmallerinsizethantheBoWfeaturesetwhilestillprovidinghighperformance.Thisindicatesafulllexicalanalysisis not necessary but a taxonomic approach is suffi-cient. Our third feature is based on the links in theposts.Tocreatethefeature,weparseeachpostforanyHTTPlinksandstripthelinktoitscoredomainname.

Theclassifierwith the link featuresethadvariedresults,asshowninTable5,whereitwasreliableindifferentiating ggjx entities but less reliable for theother twohoneypots.TheseresultscorrelatewithlinkscarcityfromTable2.

Figure 2

Figure 3

Wetestthenormalizedvocabularysizeofapostasa feature.Wederivethis fromthenumberofuniquewords divided by the total number of words in thepost.AsshowninTable5,thevocabularysizedoesnotdifferentiatebotnets.

We also form a feature set based on the part-of-speech(PoS)makeupofapostusingtheStanfordPoSTagger.TheStanfordPoStaggerreturnsapairforeachwordinthetext,theoriginalwordandcorrespondingPoS.WecreateaBoWfromthisresponsethatcreatesanabstractrepresentationof thedocument’ssyntax.As shown in Table 5, the PoS does not differentiatebotnets.

Table 5: Accuracies for various features when identifying 10

and 60 entities using the maximum entropy classifier

Conclusions Inthispaper,weexamineinterestingcharacteris-

tics of spam-generating botnets and release a novelcorpus to the community.We find that hundreds ofthousandsof fakeusersarecreatedbyasmallsetofbotnets and much fewer numbers of them actuallypostspam.Thespamthatispostedishighlycorrelatedbysubjectlanguagetothepointwherebotnetslabeled

Page 4: Understanding Botnet- driven Blog Spam: Motivations and ... · words designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link

bytheirnetworkbehavioraretoalargedegreere-dis-coverableusingcontentclassification(Figure3).

Whilelinkandvocabularyanalysiscanbegooddif-ferentiatorsofthesebotnets,itisthecontentlabeling(providedbyAlchemy)thatisthebestindicator.Ourexperimentonlyspans42days,thusit’spossiblethesubjectspecializationisafeatureofthecampaignra-therthanthebotnetitself.

Bibliography

Apache virtual host. (2016).http://httpd.apache.org/docs/current/vhosts Ac-cessed:2016-08-10.

Bevans,B.,andKhosmood,F.(2016).ForumSpamCorpus.

http://users.csc.calpoly.edu/~foaad/bfbevans Ac-cessed:2017-04-01.

Bevans, B. (2016). “Categorizing Forum Spam.” Master's

ThesesatCalPolyDigitalCommons.http://digitalcom-mons.calpoly.edu/theses/1623Accessed:2017-04-01.

Collobert,R., andWeston, J. (2008). “Aunified architec-

ture fornatural languageprocessing:Deepneuralnet-workswithmultitasklearning.”Proceedingsofthe25thInternational Conference on Machine Learning, ACM:160–67.

Dinh,S.etal.(2015).“Spamcampaigndetection,analysis,

andinvestigation.”DigitalInvestigation,(12)S12–S21.Geerthik, S. (2013). “Survey on internet spam: Classifica-

tion and analysis.” International Journal of ComputerTechnologyandApplications,4(3):384.

Levchenko,K.etal.(2011).“Clicktrajectories:End-to-end

analysisofthespamvaluechain.”SymposiumonSecu-rityandPrivacy,IEEE.431–446.

Lumezanu,C.andFeamster,N. (2012). “Observingcom-

monspamintwitterandemail.”Proceedingsofthe2012ACM conference on Internet measurement, ACM. 461–466.

Mega M. (2016). “Mega model optimization package.”

https://www.umiacs.umd.edu/~hal/megam/, Ac-cessed:2016-08-10.

Niu,Y.etal.(2007).“Aquantitativestudyofforumspam-

mingusingcontext-basedanalysis.”NDSS.Pitsillidis,A.etal.(2012).“Taster’schoice:Acomparative

analysisofspamfeeds.”Proceedingsofthe2012ACMconferenceonInternetmeasure-

ment,ACM.427–440.

Shin,Y.,Gupta,M.,andMyers,S.A.(2011).“Thenutsand

boltsofaforumspamautomator.”LEET.Spirin,N.,andHan,J.(2012).“Surveyonwebspamdetec-

tion:Principlesandalgorithms.”ACMSIGKDDExplorationsNewsletter,13(2):50-64.Stone-Gross,B.,etal.“Theundergroundeconomyofspam:

A botmaster’s perspective of coordinating large-scalespamcampaigns.”LEET,11:4.

Stringhini, G. (2015). “Evilcohort: Detecting communities

ofmaliciousaccountsononlineservices.”24thUSENIXSecuritySymposium(USENIXSecurity15),563–578.

Thomason, A. (2007). “Blog spam: A review.” CEAS,

Citeseer.ThonnardO.andDacier,M.(2011).“Astrategicanalysisof

spambotnetsoperations.”Proceedingsofthe8thAnnualCollaboration, Electronic messaging, Anti-Abuse andSpamConference,ACM,162–171.