Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin...

Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song

IEEE Symposium on Security and Privacy 2011

Outline
Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion

Spam URL
Advertisement
Harmful content
◦ Phishing, malware, and scams
Use of compromised and fraudulent accounts
◦ Email, web services

Monarch
Spam URL Filtering as a Service
Tens of millions of features

Related Work
“Detecting spammers on Twitter” (2010)
◦ Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
◦ Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
◦ Parse HTML content

System Design
Monarch’s cloud infrastructure
URL Aggregation
◦ Email providers and Twitter’s streaming API
Feature Collection
◦ Visits a URL with web browsers to collect page content

System Design (cont.)
Monarch’s cloud infrastructure
Feature Extraction
◦ Transform the raw data into a sparse feature vector
Classification
◦ Training and testing by distributed logistic regression

Collect Raw Features – Web Browser
“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
◦ Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
◦ JavaScript enabled
◦ Flash and Java installed
◦ Visits a URL and monitors a number of details

Raw Features
Web Browser
◦ Initial URL and Landing URL, Redirects, Sources and Frames
◦ HTML Content, Page Links
◦ JavaScript Events, Pop-up Windows, Plugins
◦ HTTP Headers
DNS Resolver
◦ Initial, final, and redirect URLs
IP Address Analysis
◦ City, country, ASN
Proxy and Whitelist (200 domains)

Feature Vector
Raw Features => sparse feature vector
◦ Canonicalize URLs
◦ Remove obfuscation
Tokenize the text corpus
◦ Splitting on non-alphanumeric characters
Example: http://adl.tw/~dada/dada2.php?a=1&b=3
=> domain feature [adl,tw]
   path feature [dada,dada2,php]
   query parameters feature [a,1,b,3]
=> (1,3,a,adl,b,dada,dada2,php,tw)
=> (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
Total: 49,960,691 features (dimensions)
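The tokenization example can be sketched in a few lines of Python. This is a minimal illustration, not Monarch's extraction code: the function names `tokenize`, `url_features`, and `sparse_vector` are hypothetical, and merging all tokens into one sorted set simply mirrors the slide's example.

```python
import re
from urllib.parse import urlsplit


def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty tokens."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def url_features(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }


def sparse_vector(url):
    """Return only the tokens that are present (the 'true' dimensions).

    Every token that is not listed is implicitly false, which is what
    keeps a vector with tens of millions of dimensions small in memory.
    """
    groups = url_features(url)
    return sorted({t for tokens in groups.values() for t in tokens})


if __name__ == "__main__":
    url = "http://adl.tw/~dada/dada2.php?a=1&b=3"
    print(url_features(url))
    print(sparse_vector(url))
```

Storing only the present tokens is one simple way to represent a sparse binary vector; a production system would more likely map each token to an integer index.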

Distributed Classifier Design
Linear classification
◦ $x$: feature vector
◦ Determine a weight vector $w$
A parallel online learner
◦ With regularization to yield a sparse weight vector
Labeled data $(x_i, y_i)$, Testing => $\operatorname{sign}(x \cdot w)$
-1 => non-spam site, 1 => spam site
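A linear classifier over a sparse binary feature vector reduces to summing the weights of the features that are present and taking the sign. The sketch below assumes that reading; the weights are made-up values for illustration, `predict` is a hypothetical name, and mapping a score of exactly zero to non-spam is an arbitrary choice.

```python
def predict(weights, active_features):
    """Classify a sparse binary feature vector with a linear model.

    weights: dict mapping feature name -> weight (missing means 0.0).
    active_features: iterable of the features that are present (true).
    Returns 1 for spam and -1 for non-spam.
    """
    score = sum(weights.get(f, 0.0) for f in active_features)
    # Tie-breaking at score == 0 is an assumption of this sketch.
    return 1 if score > 0 else -1


if __name__ == "__main__":
    # Hypothetical weights, not learned from any real data.
    weights = {"pills": 1.5, "free": 0.7, "news": -1.2}
    print(predict(weights, ["free", "pills"]))
    print(predict(weights, ["news", "today"]))
```

Because most weights are zero after regularization, a dictionary lookup per active feature is enough; the full weight vector never has to be materialized.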

Training the weight vector
Logistic Regression
◦ With subgradient L1-Regularization
Loss term in $f(w)$: $\log\left(1 + e^{-y_i (x_i \cdot w)}\right)$
$y_i (x_i \cdot w)$ larger => $f(w)$ smaller
(Classification margin, hyperplane)
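To make the training idea concrete, here is a single-machine sketch of logistic regression with an L1 penalty, using the loss log(1 + exp(-y_i (x_i · w))) for each example. It is not the distributed algorithm from the slides. The learning rate, penalty strength `lam`, and epoch count are arbitrary assumptions, and clipping a weight to zero when the L1 step would flip its sign is a common variant chosen here so that the result is actually sparse.

```python
import math


def inv_one_plus_exp(m):
    """Compute 1 / (1 + exp(m)) without overflowing for large |m|."""
    if m > 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def train(examples, epochs=50, rate=0.1, lam=0.01):
    """Online logistic regression with an L1 subgradient step.

    examples: list of (active_features, label), label in {-1, +1}.
    Returns a dict of non-zero weights.
    """
    w = {}
    for _ in range(epochs):
        for feats, y in examples:
            margin = y * sum(w.get(f, 0.0) for f in feats)
            # Gradient of log(1 + exp(-margin)) w.r.t. each active weight.
            g = -y * inv_one_plus_exp(margin)
            for f in feats:
                w[f] = w.get(f, 0.0) - rate * g
            # L1 step: move each weight toward zero; drop it if it
            # reaches or crosses zero (assumed clipping variant).
            for f in list(w):
                step = rate * lam
                if abs(w[f]) <= step:
                    del w[f]
                else:
                    w[f] -= step if w[f] > 0 else -step
    return w


if __name__ == "__main__":
    data = [
        ({"buy", "pills"}, 1),
        ({"news", "weather"}, -1),
        ({"buy", "cheap"}, 1),
        ({"weather", "today"}, -1),
    ]
    print(train(data))
```

A larger margin y_i (x_i · w) makes the gradient term shrink toward zero, which is the same observation the slide makes about the loss getting smaller.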

Distributed Classifier Algorithm

Data Set and Assumption
1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
◦ Google Safe Browsing, SURBL, URIBL, APWG, PhishTank
◦ Any of its source URLs becomes blacklisted

Data Set and Assumption (cont.)
On Twitter:

◦ 36% scams, 60% phishing, 4% malware

After regularization

Implementation
Amazon Web Services (AWS) infrastructure
URL Aggregation
◦ A queue, keeps 300,000 URLs
Feature Collection
◦ 20x6 Firefox (4.0b4) on Ubuntu 10.04, with a custom extension
◦ Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
◦ Hadoop Distributed File System
◦ On the 50-node cluster

Evaluation – Overall Accuracy
5-fold cross-validation
500,000 spam and non-spam each
Training set size: 400,000 examples
◦ 1:1, 4:1, 10:1
Testing set size: 200,000 examples
◦ 1:1
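For readers unfamiliar with k-fold cross-validation, the fold partitioning can be sketched as follows. This shows only how examples are divided into five folds and rotated between training and testing. It does not reproduce the subsampling to the training and testing sizes or the class ratios listed above, and the function name and seed are assumptions.

```python
import random


def k_fold_splits(n_examples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once, dealt round-robin into k folds, and each
    fold serves as the test set exactly once.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_splits(10, k=5):
        print(len(train_idx), len(test_idx))
```

Reported accuracy would then be averaged over the five rounds, so every example is tested exactly once.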

Evaluation – Single Feature

Evaluation – Accuracy Over Time
Training once only <-> Retraining every four

Evaluation – Comparing Email and Tweet Spam
Log odds ratio: $\left|\log\left(p_1 q_2 / p_2 q_1\right)\right|$, $q_i = 1 - p_i$
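The log odds ratio |log(p1 q2 / p2 q1)| with q_i = 1 - p_i is easy to compute directly. The sketch below assumes p_1 and p_2 are the fractions of two sets (for example, email spam and tweet spam) in which a given feature appears, which the slide does not spell out. Both must lie strictly between 0 and 1.

```python
import math


def log_odds_ratio(p1, p2):
    """Absolute log odds ratio |log(p1*q2 / (p2*q1))| with q_i = 1 - p_i.

    A value of 0 means the feature is equally likely in both sets;
    larger values mean it is more characteristic of one of them.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("p1 and p2 must be strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(log_odds_ratio(0.5, 0.5))
    print(log_odds_ratio(0.8, 0.2))
```

Taking the absolute value makes the measure symmetric, so swapping the two sets gives the same score.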

Evaluation – The Cost

For Twitter, $22,751 per month

Discussion and Conclusion
Evasion
◦ Feature Evasion
◦ Time-based Evasion
◦ Crawler Evasion
Monarch
◦ Real-time system
◦ Spam URL Filtering as a Service
◦ $22,751 a month
