Design and Evaluation of a Real-Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin...

Post on 17-Dec-2015


Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song

IEEE Symposium on Security and Privacy 2011

Outline

Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion

Spam URL

Advertisement
Harmful content
  ◦ Phishing, malware, and scams
Use of compromised and fraudulent accounts
  ◦ Email, web services

Monarch

Spam URL Filtering as a Service
Tens of millions of features

Related Work

“Detecting spammers on Twitter” (2010)
  ◦ Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
  ◦ Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
  ◦ Parse HTML content

System Design

Monarch’s cloud infrastructure
URL Aggregation
  ◦ Email providers and Twitter’s streaming API
Feature Collection
  ◦ Visits a URL with web browsers to collect page content

System Design (cont.)

Monarch’s cloud infrastructure
Feature Extraction
  ◦ Transform the raw data into a sparse feature vector
Classification
  ◦ Training and testing by distributed logistic regression

Collect Raw Features – Web Browser

“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
  ◦ Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
  ◦ JavaScript enabled
  ◦ Flash and Java installed
  ◦ Visits a URL and monitors a number of details

Raw Features

Web Browser
  ◦ Initial URL and Landing URL, Redirects, Sources and Frames
  ◦ HTML Content, Page Links
  ◦ JavaScript Events, Pop-up Windows, Plugins
  ◦ HTTP Headers
DNS Resolver
  ◦ Initial, final, and redirect URLs
IP Address Analysis
  ◦ City, country, ASN
Proxy and Whitelist (200 domains)

Feature Vector

Raw Features => sparse feature vector
  ◦ Canonicalize URLs
  ◦ Remove obfuscation
Tokenize the text corpus
  ◦ Splitting on non-alphanumeric characters

http://adl.tw/~dada/dada2.php?a=1&b=3

=> domain feature [adl,tw]
   path feature [dada,dada2,php]
   query parameters feature [a,1,b,3]

=> (1,3,a,adl,b,dada,dada2,php,tw)

=> (…,adl:true,adm:false,…,dada:true,…,tw:true,…)

Total: 49,960,691 features (dimensions)
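The tokenization step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Monarch's extractor: the function names `tokenize`, `url_features`, and `active_features` are hypothetical, and the sketch skips the canonicalization and de-obfuscation steps the slide lists.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

def url_features(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }

def active_features(url):
    """Sorted set of tokens present in the URL.

    In a sparse binary vector only these entries are true; every other
    dimension is implicitly false and is never stored.
    """
    feats = url_features(url)
    return sorted({tok for toks in feats.values() for tok in toks})

features = url_features("http://adl.tw/~dada/dada2.php?a=1&b=3")
tokens = active_features("http://adl.tw/~dada/dada2.php?a=1&b=3")
```

Storing only the active tokens is what keeps a vector with tens of millions of dimensions manageable: each URL touches only a handful of them.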

Distributed Classifier Design

Linear classification
  ◦ x: feature vector
  ◦ Determine a weight vector w
A parallel online learner
  ◦ With regularization to yield a sparse weight vector
Labeled data (x_i, y_i); testing => sign(x · w)
  ◦ -1 => non-spam site
  ◦ 1 => spam site
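A linear classifier over a sparse binary feature vector reduces to summing the weights of the features that are present and checking the sign. The sketch below assumes the feature vector is given as a set of active tokens; the weight values and the choice to map a score of exactly zero to non-spam are illustrative assumptions, not taken from the slides.

```python
def score(active, weights):
    """Dot product x · w when x is binary and given as its set of active features."""
    return sum(weights.get(feature, 0.0) for feature in active)

def classify(active, weights):
    """Return 1 (spam site) for a positive score, otherwise -1 (non-spam site)."""
    return 1 if score(active, weights) > 0 else -1

# Hypothetical weights, for illustration only.
weights = {"dada": 1.5, "php": -0.2, "tw": 0.1}
label = classify({"adl", "dada", "php", "tw"}, weights)
```

Features missing from `weights` contribute nothing, which is why a sparse weight vector makes testing cheap.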

Training the weight vector

Logistic Regression
  ◦ With subgradient L1-Regularization

\[ f(w) = \sum_i \log\left(1 + e^{-y_i (x_i \cdot w)}\right) \]

y_i (x_i · w) larger => f(w) smaller
(Classification margin, hyperplane)
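To make the training rule concrete, here is a minimal single-machine sketch of online logistic regression with an L1 penalty. It takes a gradient step on the loss log(1 + e^{-y (x · w)}) for one example, then moves every weight toward zero. The names `eta` (step size) and `lam` (penalty strength) are assumptions, as is clipping the L1 step at zero so that small weights become exactly zero; the toy data is hypothetical.

```python
import math

def margin(x, w):
    """x · w for a binary sparse vector x given as a set of active features."""
    return sum(w.get(f, 0.0) for f in x)

def inv_one_plus_exp(z):
    """1 / (1 + e^z), arranged so that math.exp never overflows."""
    if z >= 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))

def log_loss(data, w):
    """Sum over examples of log(1 + e^{-y (x · w)})."""
    return sum(math.log1p(math.exp(-y * margin(x, w))) for x, y in data)

def sgd_step(w, x, y, eta, lam):
    """One online update on example (x, y), modifying w in place."""
    # The gradient of the loss with respect to each active weight
    # is -y / (1 + e^{y (x · w)}).
    g = -y * inv_one_plus_exp(y * margin(x, w))
    for f in x:
        w[f] = w.get(f, 0.0) - eta * g
    # L1 step: shrink every weight toward zero by eta * lam, clipping at zero.
    for f in list(w):
        magnitude = abs(w[f]) - eta * lam
        if magnitude > 0:
            w[f] = math.copysign(magnitude, w[f])
        else:
            del w[f]  # the weight reached zero, so drop it from the sparse map

# Hypothetical toy data: (active features, label), 1 = spam and -1 = non-spam.
data = [({"adl", "tw", "dada"}, 1), ({"example", "com"}, -1)]
w = {}
for _ in range(50):
    for x, y in data:
        sgd_step(w, x, y, eta=0.5, lam=0.001)
```

After training, the margin y (x · w) is positive for both toy examples, which is the sense in which a larger margin corresponds to a smaller loss.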

Distributed Classifier Algorithm

[Algorithm listing not recovered; surviving labels: m, 10, 1, 100, I]
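The algorithm listing on this slide did not survive extraction, so the following is only one plausible shape for a distributed logistic-regression learner, not the authors' algorithm. It uses parameter averaging: each shard runs online updates from the current weights, the per-shard results are averaged, and the average is shrunk toward zero to keep it sparse. Reading the surviving labels `m` and `I` as the number of shards and the number of iterations is an assumption, and `eta`, `lam`, and all function names are hypothetical.

```python
import math

def train_shard(w, shard, eta):
    """Run one pass of online logistic-regression updates over a shard."""
    w = dict(w)  # work on a copy so shards stay independent
    for x, y in shard:
        z = y * sum(w.get(f, 0.0) for f in x)
        # Gradient factor -y / (1 + e^z); z is capped so math.exp cannot overflow.
        g = -y / (1.0 + math.exp(min(z, 700.0)))
        for f in x:
            w[f] = w.get(f, 0.0) - eta * g
    return w

def average(models):
    """Element-wise mean of several sparse weight maps."""
    keys = set().union(*models)
    return {k: sum(m.get(k, 0.0) for m in models) / len(models) for k in keys}

def shrink(w, amount):
    """Move every weight toward zero by `amount`, dropping those that reach zero."""
    out = {}
    for k, v in w.items():
        magnitude = abs(v) - amount
        if magnitude > 0:
            out[k] = math.copysign(magnitude, v)
    return out

def distributed_train(shards, iterations, eta, lam):
    """Parameter averaging over m = len(shards) shards for I = iterations rounds."""
    w = {}
    for _ in range(iterations):
        # In a real deployment each shard would be processed on its own worker.
        local_models = [train_shard(w, shard, eta) for shard in shards]
        w = shrink(average(local_models), lam)
    return w

# Hypothetical toy shards: lists of (active features, label).
shards = [[({"a"}, 1)], [({"b"}, -1)]]
w = distributed_train(shards, iterations=20, eta=0.5, lam=0.001)
```

The list comprehension over shards is the part that would be parallelized; only the weight maps need to be exchanged between rounds.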

Data Set and Assumptions

1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
  ◦ Google Safebrowsing, SURBL, URIBL, APWG, Phishtank
  ◦ Spam if any of its source URLs becomes blacklisted
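The labeling rule for Twitter URLs can be sketched as a membership check over every source URL recorded for a crawl. This is a simplified illustration: real blacklists differ in format and are queried through their own services, so the set of blacklisted hosts, the example URLs, and the function name `label_url` are all hypothetical.

```python
from urllib.parse import urlsplit

def label_url(source_urls, blacklisted_hosts):
    """Return 1 (spam) if any source URL's host is blacklisted, otherwise -1."""
    for url in source_urls:
        host = urlsplit(url).hostname
        if host in blacklisted_hosts:
            return 1
    return -1

# Hypothetical blacklist contents and crawl results.
blacklisted_hosts = {"bad.example"}
spam_chain = ["http://short.example/x1", "http://bad.example/landing"]
clean_chain = ["http://short.example/x2", "http://news.example/story"]
labels = (label_url(spam_chain, blacklisted_hosts),
          label_url(clean_chain, blacklisted_hosts))
```

Checking every source URL rather than only the initial one matters because the tweeted link is often a shortener or redirector that is never blacklisted itself.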

Data Set and Assumptions (cont.)

On Twitter:
  ◦ 36% scams, 60% phishing, 4% malware

After regularization

Implementation

Amazon Web Services (AWS) infrastructure
URL Aggregation
  ◦ A queue, keeps 300,000 URLs
Feature Collection
  ◦ 20x6 Firefox (4.0b4) on Ubuntu 10.04, with a custom extension
  ◦ Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
  ◦ Hadoop Distributed File System
  ◦ On the 50-node cluster

Evaluation – Overall Accuracy

5-fold cross-validation
500,000 spam and non-spam each
Training set size: 400,000 examples
  ◦ 1:1, 4:1, 10:1
Testing set size: 200,000 examples
  ◦ 1:1
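As a worked example of the training ratios, the sketch below computes how many examples of each class a fixed-size training set contains and draws a random sample at that ratio. It assumes the ratios are written as non-spam to spam, which the slide does not state; the function names and the seed parameter are hypothetical.

```python
import random

def class_counts(total, nonspam_parts, spam_parts):
    """Split `total` examples into (non-spam, spam) counts at the given ratio."""
    spam = round(total * spam_parts / (nonspam_parts + spam_parts))
    return total - spam, spam

def build_training_set(nonspam_pool, spam_pool, total, nonspam_parts, spam_parts, seed=0):
    """Sample a labeled training set of `total` examples at the requested ratio."""
    n_nonspam, n_spam = class_counts(total, nonspam_parts, spam_parts)
    rng = random.Random(seed)
    examples = [(url, -1) for url in rng.sample(nonspam_pool, n_nonspam)]
    examples += [(url, 1) for url in rng.sample(spam_pool, n_spam)]
    rng.shuffle(examples)
    return examples

for parts in [(1, 1), (4, 1), (10, 1)]:
    print(parts, class_counts(400_000, *parts))
```

Under that assumption, a 400,000-example training set at 4:1 holds 320,000 non-spam and 80,000 spam examples, and at 10:1 it holds 363,636 non-spam and 36,364 spam examples.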

Evaluation – Single Feature

Evaluation – Accuracy Over Time

Training once only <-> Retraining every four days

Evaluation – Comparing Email and Tweet Spam

Log odds ratio:

\[ \left| \log \frac{p_1 q_2}{p_2 q_1} \right|, \qquad q_i = 1 - p_i \]
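The log odds ratio |log(p1 q2 / (p2 q1))| with q_i = 1 - p_i is straightforward to compute. The sketch below assumes p1 and p2 are the fractions of email spam and tweet spam samples in which a given feature appears; that reading of p_i is an assumption, since the slide does not define it, and the input values are made up.

```python
import math

def log_odds_ratio(p1, p2):
    """|log(p1 * q2 / (p2 * q1))| with q_i = 1 - p_i; requires 0 < p1, p2 < 1."""
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("p1 and p2 must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))

same = log_odds_ratio(0.5, 0.5)       # feature equally common in both sets
different = log_odds_ratio(0.8, 0.2)  # feature far more common in the first set
```

A value of 0 means the feature is equally likely in both sets. Larger values mean it is characteristic of one set, and the absolute value makes the measure symmetric in the two sets.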

Evaluation – The Cost

For Twitter, $22,751 per month

Discussion and Conclusion

Evasion
  ◦ Feature Evasion
  ◦ Time-based Evasion
  ◦ Crawler Evasion
Monarch
  ◦ Real-time system
  ◦ Spam URL Filtering as a Service
  ◦ $22,751 a month