Countering Spam Using Classification Techniques
Steve Webb
webb@cc.gatech.edu
Data Mining Guest Lecture
February 21, 2008
Overview
Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions
Introduction
The Internet has spawned numerous information-rich environments
  Email Systems
  World Wide Web
  Social Networking Communities
Openness facilitates information sharing, but it also makes these environments vulnerable…
Denial of Information (DoI) Attacks
Deliberate insertion of low quality information (or noise) into information-rich environments
Information analog to Denial of Service (DoS) attacks
Two goals
  Promotion of ideals by means of deception
  Denial of access to high quality information
Spam is currently the most prominent example of a DoI attack
Overview
Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions
Countering Email Spam
Close to 200 billion (yes, billion) emails are sent each day
Spam accounts for around 90% of that email traffic
~2 million spam messages every second
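The "~2 million per second" figure follows from the two numbers above it. A quick check, assuming exactly 200 billion emails per day and exactly 90% spam (both are rounded figures on the slide):

```python
# Rounded inputs taken from the slide; the exact values are assumptions.
EMAILS_PER_DAY = 200_000_000_000   # "close to 200 billion"
SPAM_PERCENT = 90                  # "around 90%"
SECONDS_PER_DAY = 24 * 60 * 60

spam_per_day = EMAILS_PER_DAY * SPAM_PERCENT // 100
spam_per_second = spam_per_day // SECONDS_PER_DAY

print(f"{spam_per_day:,} spam messages per day")
print(f"{spam_per_second:,} spam messages per second")
```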
Old Email Spam Examples
Problem Description
Email spam detection can be modeled as a binary text classification problem
Two classes: spam and legitimate (non-spam)
Example of supervised learning
  Build a model (classifier) based on training data to approximate the target function
  Construct a function φ: M → {spam, legitimate} over the set of messages M such that it agrees with the target function Φ: M → {spam, legitimate} as much as possible
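As a rough illustration of building φ from training data, the sketch below trains a tiny multinomial Naïve Bayes classifier with add-one smoothing. The four training messages and the helper names `train_naive_bayes` and `classify` are made up for this example; the labels play the role of samples from the target function Φ.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """messages: list of (text, label) pairs. Returns the counts that define the model."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["legitimate"])
    return word_counts, class_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data standing in for a labeled corpus.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lecture notes for data mining", "legitimate"),
]
model = train_naive_bayes(training)
print(classify(model, "buy cheap pills"))
print(classify(model, "data mining lecture"))
```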
Problem Description (cont.)
How do we represent a message?
How do we generate features?
How do we process features?
How do we evaluate performance?
How do we represent a message?
Classification algorithms require a consistent format
Salton’s vector space model (“bag of words”) is the most popular representation
Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
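A minimal sketch of the bag-of-words representation, assuming whitespace tokenization and raw term counts as feature values (binary or TF-IDF weights are equally common choices). The function names are illustrative only.

```python
from collections import Counter

def build_vocabulary(messages):
    """One feature per distinct token, in a fixed (sorted) order."""
    return sorted({token for m in messages for token in m.lower().split()})

def to_feature_vector(message, vocabulary):
    """Map a message to <f1, ..., fn>, where fi is the count of the i-th vocabulary term."""
    counts = Counter(message.lower().split())
    return [counts[term] for term in vocabulary]

messages = ["buy cheap pills now", "meeting moved to monday", "buy now buy later"]
vocabulary = build_vocabulary(messages)
vectors = [to_feature_vector(m, vocabulary) for m in messages]

print(vocabulary)
for message, vector in zip(messages, vectors):
    print(vector, message)
```

Every message ends up with the same length n (the vocabulary size), which is the consistent format classification algorithms need.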
How do we generate features?
Sources of information
  SMTP connections
    Network properties
  Email headers
    Social networks
  Email body
    Textual parts
    URLs
    Attachments
How do we process features?
Feature Tokenization
  Alphanumeric tokens
  N-grams
  Phrases
Feature Scrubbing
  Stemming
  Stop word removal
Feature Selection
  Simple feature removal
  Information-theoretic algorithms
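The three processing stages can be sketched as follows. This assumes a regex tokenizer, a tiny hand-picked stop list, and information gain as the information-theoretic selection criterion; stemming is left out because a real stemmer (e.g., Porter's) is too long for a sketch. The toy documents are invented.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "to", "for", "and", "of", "is"}  # tiny illustrative list

def tokenize(text):
    """Feature tokenization: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Feature tokenization: word n-grams built from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def scrub(tokens):
    """Feature scrubbing: stop word removal."""
    return [t for t in tokens if t not in STOP_WORDS]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(term, docs):
    """Feature selection: reduction in label entropy from knowing whether `term` occurs."""
    labels = Counter(label for _, label in docs)
    with_term = Counter(label for tokens, label in docs if term in tokens)
    without_term = labels - with_term
    gain = entropy(labels.values())
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(part.values())
    return gain

raw = [
    ("Buy cheap pills for the win", "spam"),
    ("Cheap pills, cheap prices", "spam"),
    ("Notes for the meeting", "legitimate"),
    ("Agenda of the meeting", "legitimate"),
]
docs = [(set(scrub(tokenize(text))), label) for text, label in raw]
terms = sorted(set().union(*(tokens for tokens, _ in docs)))
ranked = sorted(terms, key=lambda t: information_gain(t, docs), reverse=True)
print(ranked[:3])  # the terms kept if only the top three features are retained
```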
How do we evaluate performance?
Traditional IR metrics
  Precision vs. Recall
False positives vs. False negatives
Imbalanced error costs
ROC curves
$$P = \frac{d}{b + d}, \qquad R = \frac{d}{c + d}, \qquad FP = \frac{b}{a + b}, \qquad FN = \frac{c}{c + d}$$
(a, b, c, and d are taken to be the confusion-matrix counts: legitimate classified as legitimate, legitimate classified as spam, spam classified as legitimate, and spam classified as spam, respectively)
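A small sketch of these metrics, assuming the usual confusion-matrix convention: a = legitimate classified as legitimate, b = legitimate classified as spam, c = spam classified as legitimate, d = spam classified as spam. The weighted accuracy term is one common way to express imbalanced error costs (each legitimate message counts λ times); that formula and the example counts are assumptions, not taken from the slide.

```python
def spam_metrics(a, b, c, d, lam=1):
    """a: legit->legit, b: legit->spam, c: spam->legit, d: spam->spam (message counts)."""
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),
        "false_positive_rate": b / (a + b),
        "false_negative_rate": c / (c + d),
        # Cost-sensitive accuracy: a legitimate message weighs lam times a spam message.
        "weighted_accuracy": (lam * a + d) / (lam * (a + b) + (c + d)),
    }

# Hypothetical counts: 100 legitimate and 100 spam test messages.
metrics = spam_metrics(a=95, b=5, c=10, d=90, lam=9)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With λ = 9, the five misclassified legitimate messages cost more than the ten missed spam messages, which is why weighted accuracy here differs from plain accuracy.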
Classification History
Sahami et al. (1998)
  Used a Naïve Bayes classifier
  Were the first to apply text classification research to the spam problem
Pantel and Lin (1998)
  Also used a Naïve Bayes classifier
  Found that Naïve Bayes outperforms RIPPER
Classification History (cont.)
Drucker et al. (1999)
  Evaluated Support Vector Machines as a solution to spam
  Found that SVM is more effective than RIPPER and Rocchio
Hidalgo and Lopez (2000)
  Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
Classification History (cont.)
Up to this point, private corpora were used exclusively in email spam research
Androutsopoulos et al. (2000a)
  Created the first publicly available email spam corpus (Ling-spam)
  Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
Classification History (cont.)
Androutsopoulos et al. (2000b)
  Created another publicly available email spam corpus (PU1)
  Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
Carreras and Marquez (2001)
  Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
Classification History (cont.)
Androutsopoulos et al. (2004)
  Created 3 more publicly available corpora (PU2, PU3, and PUA)
  Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB
Zhang et al. (2004)
  Used Ling-spam, PU1, and the SpamAssassin corpora
  Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB
Classification History (cont.)
CEAS (2004 – present)
  Focuses solely on email and anti-spam research
  Generates a significant amount of academic and industry anti-spam research
Klimt and Yang (2004)
  Published the Enron Corpus – the first large-scale corpus of legitimate email messages
TREC Spam Track (2005 – present)
  Produces new corpora every year
  Provides a standardized platform to evaluate classification algorithms
Ongoing Research
Concept Drift
New Classification Approaches
Adversarial Classification
Image Spam
Concept Drift
Spam content is extremely dynamic
  Topic drift (e.g., specific scams)
  Technique drift (e.g., obfuscations)
How do we keep up with the Joneses?
Batch vs. Online Learning
[Figure: percentage of spam messages (y-axis, 0 to 0.4) by month (x-axis, 01/03 to 01/06) for four series: OBFUSCATING_COMMENT, INTERRUPTUS, HTML_FONT_LOW_CONTRAST, and HTML_TINY_FONT]
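To make the batch vs. online distinction concrete, here is a minimal sketch in which the same token-count model is either frozen after an initial training batch or kept up to date as newly labeled messages arrive. The `TokenModel` class, its smoothing constants, and the obfuscated tokens are illustrative assumptions.

```python
import math
from collections import Counter

class TokenModel:
    """Per-class token counts that can be updated one message at a time."""

    def __init__(self):
        self.counts = {"spam": Counter(), "legitimate": Counter()}
        self.totals = Counter()

    def update(self, tokens, label):
        self.counts[label].update(tokens)
        self.totals[label] += len(tokens)

    def spam_log_odds(self, tokens):
        """Positive means the tokens look more like spam than legitimate mail."""
        score = 0.0
        for t in tokens:
            p_spam = (self.counts["spam"][t] + 1) / (self.totals["spam"] + 2)
            p_legit = (self.counts["legitimate"][t] + 1) / (self.totals["legitimate"] + 2)
            score += math.log(p_spam / p_legit)
        return score

old_stream = [
    (["cheap", "pills"], "spam"),
    (["buy", "pills"], "spam"),
    (["meeting", "notes"], "legitimate"),
    (["lecture", "notes"], "legitimate"),
]
new_stream = [(["ch3ap", "p1lls"], "spam")]  # technique drift: obfuscated tokens

batch_model, online_model = TokenModel(), TokenModel()
for tokens, label in old_stream:
    batch_model.update(tokens, label)
    online_model.update(tokens, label)
for tokens, label in new_stream:      # only the online model sees the new data
    online_model.update(tokens, label)

probe = ["ch3ap", "p1lls"]            # a later message reusing the obfuscation
batch_score = batch_model.spam_log_odds(probe)
online_score = online_model.spam_log_odds(probe)
print(batch_score, online_score)
```

The frozen batch model has no evidence about the new tokens, while the online model has already shifted toward spam for them.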
New Classification Approaches
Filter Fusion
Compression-based Filtering
Network behavioral clustering
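One way to picture compression-based filtering: a message is assigned to the class whose corpus "explains" it best, measured by how few extra bytes the message adds when compressed together with that corpus. The sketch below uses `zlib` as a stand-in compressor and two invented one-line corpora; published compression-based filters typically use adaptive models such as PPM or DMC rather than zlib.

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify_by_compression(message, corpora):
    """Pick the label whose corpus grows least (in compressed bytes) when the message is appended."""
    costs = {}
    for label, corpus in corpora.items():
        costs[label] = compressed_size(corpus + " " + message) - compressed_size(corpus)
    return min(costs, key=costs.get)

# Hypothetical miniature corpora; real filters would use thousands of messages.
corpora = {
    "spam": "buy cheap pills now limited offer click here to win money fast "
            "no prescription needed lowest prices guaranteed",
    "legitimate": "the data mining lecture is moved to monday please bring your notes "
                  "and the homework is due next week in class",
}

print(classify_by_compression("limited offer click here to win money fast", corpora))
print(classify_by_compression("moved to monday please bring your notes", corpora))
```

No tokenization or feature selection is needed, which is part of the appeal of this family of filters.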
Adversarial Classification
Classifiers assume a clear distinction between spam and legitimate features
Camouflaged messages
  Mask spam content with legitimate content
  Disrupt decision boundaries for classifiers
Camouflage Attacks
Baseline performance
  Accuracies consistently higher than 98%
Classifiers under attack
  Accuracies degrade to between 50% and 70%
Retrained classifiers
  Accuracies climb back to between 91% and 99%
[Figure: three plots of weighted accuracy (λ = 9) versus number of retained features (640, 320, 160, 80, 40, 20, 10) for Naive Bayes, SVM, and LogitBoost]
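The attack and retraining cycle can be reproduced in miniature. The sketch below assumes a token-level log-likelihood-ratio score with add-one smoothing and equal class priors; the training messages are invented, and the numbers it produces are unrelated to the accuracies reported above.

```python
import math
from collections import Counter

def train(messages):
    """Count token occurrences per class from (tokens, label) pairs."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    for tokens, label in messages:
        counts[label].update(tokens)
    return counts

def spam_log_odds(counts, tokens):
    """Sum of per-token log-likelihood ratios; positive means classified as spam."""
    vocab = set(counts["spam"]) | set(counts["legitimate"])
    n_spam = sum(counts["spam"].values()) + len(vocab)
    n_legit = sum(counts["legitimate"].values()) + len(vocab)
    return sum(
        math.log((counts["spam"][t] + 1) / n_spam)
        - math.log((counts["legitimate"][t] + 1) / n_legit)
        for t in tokens
        if t in vocab
    )

training = [
    (["cheap", "pills", "buy"], "spam"),
    (["cheap", "pills", "now"], "spam"),
    (["meeting", "agenda", "notes"], "legitimate"),
    (["lecture", "notes", "agenda"], "legitimate"),
]
spam_message = ["cheap", "pills"]
# Camouflage: pad the spam payload with tokens typical of legitimate mail.
camouflaged = spam_message + ["meeting", "agenda", "notes", "lecture"]

baseline = train(training)
# Retraining: add labeled examples of the camouflaged spam to the training set.
retrained = train(training + [(camouflaged, "spam")] * 2)

print("plain spam caught by baseline:", spam_log_odds(baseline, spam_message) > 0)
print("camouflaged spam caught by baseline:", spam_log_odds(baseline, camouflaged) > 0)
print("camouflaged spam caught after retraining:", spam_log_odds(retrained, camouflaged) > 0)
```

On this toy data the retrained model catches the camouflaged message only by a small margin, and a spammer could respond by padding with different legitimate tokens.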
Camouflage Attacks (cont.)
Retraining postpones the problem, but it doesn’t solve it
We can identify features that are less susceptible to attack, but that’s simply another stalling technique
[Figure: fraction of false negatives (y-axis, 0 to 1) by round number (0, 0(A), 1, 1(A), 2, 2(A), 3, 3(A), 4, 4(A); A denotes attack) for Naive Bayes, SVM, and LogitBoost]
Image Spam
What happens when an email does not contain textual features?
OCR is easily defeated
Classification using image properties
Overview
Introduction
Countering Email Spam
  Problem Description
  Classification History
  Ongoing Research
Countering Web Spam
  Problem Description
  Classification History
  Ongoing Research
Conclusions
Countering Web Spam
What is web spam?
  Traditional definition
  Our definition
Web spam is estimated to make up between 13.8% and 22.1% of all web pages
Ad Farms
Only contain advertising links (usually ad listings)
Elaborate entry pages used to deceive visitors
Ad Farms (cont.)
Clicking on an entry page link leads to an ad listing
Ad syndicators provide the content
Web spammers create the HTML structures
Parked Domains
Domain parking services
  Provide place holders for newly registered domains
  Allow ad listings to be used as place holders to monetize a domain
Inevitably, web spammers abused these services
Parked Domains (cont.)
Functionally equivalent to Ad Farms
  Both rely on ad syndicators for content
  Both provide little to no value to their visitors
Unique Characteristics
  Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
  Typically for sale by owner (“Offer To Buy This Domain”)
Parked Domains (cont.)
Advertisements
Pages advertising specific products or services
Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
Problem Description
Web spam detection can also be modeled as a binary text classification problem
Salton’s vector space model is quite common
Feature processing and performance evaluation are also quite similar
But what about feature generation…
How do we generate features?
Sources of information
  HTTP connections
    Hosting IP addresses
    Session headers
  HTML content
    Textual properties
    Structural properties
  URL linkage structure
    PageRank scores
    Neighbor properties
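As an illustration of the first source, the sketch below turns one HTTP response's hosting IP address and session headers into a sparse feature dictionary. The feature naming scheme, the /24 prefix choice, and the example values are all assumptions made for this sketch, not a description of any published feature set.

```python
def http_session_features(ip_address, headers):
    """Flatten one HTTP session's metadata into binary features (names are illustrative)."""
    features = {}
    # Hosting IP address: keep only the first three octets so neighbors share a feature.
    features["ip_prefix=" + ".".join(ip_address.split(".")[:3])] = 1
    for name, value in headers.items():
        key = name.strip().lower()
        features["header=" + key] = 1                       # header is present
        features[key + "=" + value.strip().lower()] = 1     # header has this value
    return features

# Hypothetical response metadata (192.0.2.0/24 is a documentation-only address range).
example = http_session_features(
    "192.0.2.10",
    {"Server": "Apache", "Content-Type": "text/html"},
)
for name in sorted(example):
    print(name)
```

Features like these are available before the page body is downloaded, so a classifier built on them can run earlier in the crawl than a content-based one.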
Classification History
Davison (2000)
  Was the first to investigate link-based web spam
  Built decision trees to successfully identify “nepotistic links”
Becchetti et al. (2005)
  Revisited the use of decision trees to identify link-based web spam
  Used link-based features such as PageRank and TrustRank scores
Classification History
Drost and Scheffer (2005)
  Used Support Vector Machines to classify web spam pages
  Relied on content-based features as well as link-based features
Ntoulas et al. (2006)
  Built decision trees to classify web spam
  Used content-based features (e.g., fraction of visible content, compressibility, etc.)
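Two of the content-based features named above can be approximated in a few lines. This sketch assumes "fraction of visible content" means visible text length divided by total page length, and "compressibility" means uncompressed size divided by zlib-compressed size; the exact definitions in the cited work may differ.

```python
import zlib
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.parts.append(data)

def content_features(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible = "".join(parser.parts).strip()
    raw = html.encode("utf-8")
    return {
        "visible_fraction": len(visible) / len(html) if html else 0.0,
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
    }

# Hypothetical pages: one stuffed with a repeated phrase, one with ordinary prose.
stuffed = "<html><body>" + "cheap pills " * 200 + "</body></html>"
ordinary = "<html><body><p>The lecture on spam classification is on Thursday.</p></body></html>"
print(content_features(stuffed))
print(content_features(ordinary))
```

Keyword-stuffed pages repeat the same phrases, so they compress far better than ordinary prose.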
Classification History
Up to this point, web spam research was limited to small (on the order of a few thousand), private data sets
Webb et al. (2006)
  Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
  http://www.webbspamcorpus.org
Castillo et al. (2006)
  Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)
Classification History
Castillo et al. (2007)
  Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
  Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
Webb et al. (2008)
  Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
  Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
  Found that these classifiers are comparable to (and in many cases, better than) existing approaches
Ongoing Research
Redirection
Phishing
Social Spam
Redirection
144,801 unique redirect chains (1.54 average HTTP redirects)
43.9% of web spam pages use some form of HTML or JavaScript redirection
[Figure: pie chart of redirection techniques. Legend: 302 HTTP redirect, frame redirect, 301 HTTP redirect, iframe redirect, meta refresh and location.replace(), meta refresh, meta refresh and location, location*, Other. Slice values: 49%, 14%, 11%, 8%, 7%, 5%, 3%, 2%, 1%]
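A crude sketch of how the redirection techniques named in the chart legend might be flagged for a single fetched page. The regular expressions are simplifications chosen for this example: they can misfire on benign pages (an iframe is not always a redirect), and they miss obfuscated JavaScript.

```python
import re

META_REFRESH = re.compile(r"<meta[^>]+http-equiv\s*=\s*[\"']?refresh", re.IGNORECASE)
JS_LOCATION = re.compile(
    r"\b(?:window\.|document\.)?location(?:\.href|\.replace|\.assign)?\s*(?:=|\()",
    re.IGNORECASE,
)
FRAME_TAG = re.compile(r"<(iframe|frame)\b", re.IGNORECASE)

def redirect_techniques(status_code, html):
    """Return the redirection techniques suggested by one HTTP response."""
    found = []
    if status_code in (301, 302):
        found.append(f"{status_code} HTTP redirect")
    if META_REFRESH.search(html):
        found.append("meta refresh")
    if JS_LOCATION.search(html):
        found.append("location")
    for tag in sorted({t.lower() for t in FRAME_TAG.findall(html)}):
        found.append(f"{tag} redirect")
    return found

print(redirect_techniques(302, ""))
print(redirect_techniques(200, '<meta http-equiv="refresh" content="0;url=http://example.com/">'))
print(redirect_techniques(200, '<script>location.replace("http://example.com/")</script>'))
```

Following a full redirect chain would mean repeating this check on each page the previous one points to.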
Phishing
Interesting form of deception that affects email and web users
Another form of adversarial classification
Social Spam
Comment spam
Bulletin spam
Message spam
Conclusions
Email and web spam are currently two of the largest information security problems
Classification techniques offer an effective way to filter this low quality information
Spammers are extremely dynamic, generating various areas of important future research…
Questions