Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin...
![Page 1: Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security.](https://reader030.fdocuments.us/reader030/viewer/2022032722/56649ced5503460f949b9cb7/html5/thumbnails/1.jpg)
Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song
IEEE Symposium on Security and Privacy 2011
Outline
Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion
Spam URL
Advertisement
Harmful content
◦ Phishing, malware, and scams
Use of compromised and fraudulent accounts
◦ Email, web services
Monarch
Spam URL Filtering as a Service
Tens of millions of features
Related Work
“Detecting spammers on Twitter” (2010)
◦ Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
◦ Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
◦ Parse HTML content
System Design
Monarch’s cloud infrastructure
URL Aggregation
◦ Email providers and Twitter’s streaming API
Feature Collection
◦ Visits a URL with web browsers to collect page content
System Design (cont.)
Monarch’s cloud infrastructure
Feature Extraction
◦ Transform the raw data into a sparse feature vector
Classification
◦ Training and testing by distributed logistic regression
Collect Raw Features – Web Browser
“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
◦ Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
◦ JavaScript enabled
◦ Flash and Java installed
◦ Visits a URL and monitors a number of details
Raw Features
Web Browser
◦ Initial URL and Landing URL, Redirects, Sources and Frames
◦ HTML Content, Page Links
◦ JavaScript Events, Pop-up Windows, Plugins
◦ HTTP Headers
DNS Resolver
◦ Initial, final, and redirect URLs
IP Address Analysis
◦ City, country, ASN
Proxy and Whitelist (200 domains)
Feature Vector
Raw Features => sparse feature vector
◦ Canonicalize URLs
◦ Remove obfuscation
Tokenize the text corpus
◦ Splitting on non-alphanumeric characters
http://adl.tw/~dada/dada2.php?a=1&b=3
=> domain feature [adl,tw]
path feature [dada,dada2,php]
query parameters feature [a,1,b,3]
=> (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
total 49,960,691 features (dimensions)…
=> (1,3,a,adl,b,dada,dada2,php,tw)
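The tokenization on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the URL is split into parts with the standard library's `urlsplit` and that tokens from all parts are merged into one set, as the slide's final tuple suggests. The function names `tokenize`, `url_tokens`, and `sparse_features` are hypothetical, and URL canonicalization and de-obfuscation are not modeled.

```python
import re
from urllib.parse import urlsplit


def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def url_tokens(url):
    """Return tokens per URL part (hypothetical grouping that mirrors the slide)."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }


def sparse_features(url):
    """Keep only the features that are present: a sparse form of the binary vector."""
    groups = url_tokens(url)
    return sorted({tok for toks in groups.values() for tok in toks})


example = "http://adl.tw/~dada/dada2.php?a=1&b=3"
print(url_tokens(example))
print(sparse_features(example))
```

Storing only the present tokens avoids materializing a vector with tens of millions of mostly false entries.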
Distributed Classifier Design
Linear classification
◦ x: feature vector
◦ Determine a weight vector w
A parallel online learner
◦ With regularization to yield a sparse weight vector
Labeled data (x_i, y_i); testing => sign(x · w)
-1 => non-spam site, 1 => spam site
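As a rough illustration of linear classification over a sparse binary feature vector, the sketch below sums the weights of the features that are present and maps the sign of the score to the slide's labels. The `predict` function, the dictionary representation, and the weight values are assumptions made for the example; treating a score of exactly zero as non-spam is also an arbitrary choice.

```python
def predict(weights, features):
    """Return 1 (spam) or -1 (non-spam) for a sparse binary feature vector.

    weights  -- dict mapping feature name to weight (absent features weigh 0)
    features -- iterable of feature names present in the sample
    """
    score = sum(weights.get(name, 0.0) for name in features)
    # Assumption: a score of exactly zero falls on the non-spam side.
    return 1 if score > 0 else -1


# Illustrative weights only; these values are made up, not learned.
weights = {"adl": 1.5, "php": 0.2, "tw": -0.4}
print(predict(weights, ["adl", "tw", "dada"]))
```

A sparse weight vector keeps this lookup cheap, because features with zero weight never need to be stored.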
Training the weight vector
Logistic Regression
◦ With subgradient L1-Regularization
$f(w) = \log\left(1 + e^{-y_i (x_i \cdot w_i)}\right)$
y_i (x_i · w_i) larger => f(w) smaller
(Classification margin, hyperplane)
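The relationship between the margin and the logistic loss, and one subgradient update with an L1 penalty, can be sketched as follows. This is a single-machine illustration on dense lists, not the distributed learner from the talk. The step size `lr` and penalty strength `lam` are made-up values, and choosing 0 as the subgradient of |w_j| at w_j = 0 is one valid option among several.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def logistic_loss(w, x, y):
    """log(1 + exp(-y * (x . w))); a larger margin gives a smaller loss."""
    return math.log(1.0 + math.exp(-y * dot(w, x)))


def subgradient_step(w, x, y, lr=0.1, lam=0.01):
    """One update on the loss plus lam * ||w||_1 (lr and lam are illustrative)."""
    margin = y * dot(w, x)
    # Derivative of log(1 + exp(-m)) with respect to w is -y * x / (1 + exp(m)).
    scale = -y / (1.0 + math.exp(margin))
    updated = []
    for wj, xj in zip(w, x):
        sign = (wj > 0) - (wj < 0)  # subgradient of |wj|, taking 0 at wj == 0
        updated.append(wj - lr * (scale * xj + lam * sign))
    return updated


w = [0.0, 0.0]
x, y = [1.0, 1.0], 1
print(logistic_loss(w, x, y))
w = subgradient_step(w, x, y)
print(w, logistic_loss(w, x, y))
```

The L1 term pulls every non-zero weight toward zero on each step, which is what encourages a sparse weight vector.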
Distributed Classifier Algorithm
[Algorithm figure not preserved in this transcript; surviving labels: m, 10, 1, 100, I]
Data Set and Assumption
1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
◦ Google Safebrowsing, SURBL, URIBL, APWG, Phishtank
◦ Labeled spam if any of its source URLs becomes blacklisted
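The labeling rule for Twitter URLs can be sketched as a membership test over every source URL recorded for a page. The `label_url` function, the merged `blacklist` set, and the exact-string matching are assumptions for illustration; real blacklist feeds typically match on domains or URL patterns and are queried through their own interfaces.

```python
def label_url(source_urls, blacklist):
    """Return 1 (spam) if any source URL is blacklisted, otherwise -1 (non-spam)."""
    return 1 if any(url in blacklist for url in source_urls) else -1


# Hypothetical data: one merged set standing in for all blacklist feeds.
blacklist = {"http://bad.example/landing"}
chain = ["http://short.example/abc", "http://bad.example/landing"]
print(label_url(chain, blacklist))
```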
Data Set and Assumption (cont.)
On Twitter:
◦ 36% scams, 60% phishing, 4% malware
After regularization
Implementation
Amazon Web Services (AWS) infrastructure
URL Aggregation
◦ A queue, keeps 300,000 URLs
Feature Collection
◦ 20x6 Firefox (4.0b4) on Ubuntu 10.04, with a custom extension
◦ Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
◦ Hadoop Distributed File System
◦ On the 50-node cluster
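A bounded queue like the one used for URL aggregation can be sketched with Python's `collections.deque`. Only the 300,000 capacity comes from the slide. The helper name `make_url_queue` is hypothetical, and discarding the oldest entry when the queue is full is an assumption, since the slide does not say which URLs are dropped.

```python
from collections import deque

QUEUE_CAPACITY = 300_000  # capacity stated on the slide


def make_url_queue(capacity=QUEUE_CAPACITY):
    """Create a bounded FIFO queue; a full deque silently drops its oldest item."""
    return deque(maxlen=capacity)


# Small capacity so the overflow behavior is easy to see.
queue = make_url_queue(capacity=3)
for n in range(5):
    queue.append(f"http://example.test/{n}")
print(list(queue))
print(queue.popleft())  # crawlers would take URLs from the front
```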
Evaluation – Overall Accuracy
5-fold cross-validation
500,000 spam and 500,000 non-spam samples
Training set sized to 400,000 examples
◦ 1:1, 4:1, 10:1
Testing set sized to 200,000 examples
◦ 1:1
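For readers unfamiliar with k-fold cross-validation, the sketch below splits sample indices into five folds and uses each fold once as the test set. With 1,000,000 samples in total, each fold would hold 200,000 samples, which is consistent with the test-set size on the slide. The function `k_fold_indices` is hypothetical, shuffling is omitted, and the way the 400,000-example training set is subsampled at each ratio is not stated on the slide, so it is not modeled here.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Tiny example: 10 samples, 5 folds of 2 test samples each.
for train, test in k_fold_indices(10, k=5):
    print(test, len(train))
```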
Evaluation – Single Feature
Evaluation – Accuracy Over Time
Training once only <-> Retraining every four days
Evaluation – Comparing Email and Tweet Spam
Log odds ratio:
$\left|\log\left(p_1 q_2 / p_2 q_1\right)\right|, \quad q_i = 1 - p_i$
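The log odds ratio can be computed directly from the two probabilities. The sketch below assumes p1 and p2 are the fractions of samples in each set (for example email spam and tweet spam) that contain a given feature, and that both lie strictly between 0 and 1. The function name `log_odds_ratio` is hypothetical.

```python
import math


def log_odds_ratio(p1, p2):
    """|log(p1 * q2 / (p2 * q1))| with q_i = 1 - p_i; requires 0 < p_i < 1."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))


print(log_odds_ratio(0.5, 0.5))  # equally common in both sets
print(log_odds_ratio(0.9, 0.1))  # far more common in the first set
```

A value near zero means the feature is about equally likely in both sets, while a large value means it is much more characteristic of one of them. The absolute value makes the measure symmetric in the two sets.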
Evaluation – The Cost
For Twitter, $22,751 per month
Discussion and Conclusion
Evasion
◦ Feature Evasion
◦ Time-based Evasion
◦ Crawler Evasion
Monarch
◦ Real-time system
◦ Spam URL Filtering as a Service
◦ $22,751 a month