A Selective Learning Model for Spam...
Outline: Motivations of this work; The selective learning model; Experiments and results; Online application; Conclusion
A Selective Learning Model for Spam Filtering
Didier Colin, Catherine Roucairol, Ider Tseveendorj
Prism Laboratory
University of Versailles Saint-Quentin en Yvelines
France
March 26, 2009
Didier Colin, Catherine Roucairol, Ider Tseveendorj, Prism Laboratory, University of Versailles Saint-Quentin en Yvelines, France. A Selective Learning Model for Spam Filtering
Motivations of this work
The selective learning model
Experiments and results
Online application
Conclusion
The spam filtering problem
- Two approaches to spam filtering:
  - Knowledge engineering
  - Machine learning: text classification
- Spam filtering is not a typical text classification problem:
  - Adversarial classification: classifying against an opponent who will try to deceive or break the filter
  - Need for autonomy: maintaining accuracy over time with minimal human intervention
  - False-positive issue: no false-positive rate is acceptable
Idea
- Learning every message is generally a bad idea
- Assumption: some knowledge is harmful to the filter
- Basic idea: identify the messages that carry it and do not learn them
- Formulate the learning process as an optimization problem, and introduce a decision variable
- Purposes:
  - Protect the filter against deception strategies
  - Provide better behaviour over time by preventing natural degeneration of the filter
  - Give the filter better generalization capability
Why a selective approach ?
- Human communications are inherently redundant
- Human languages often contain misleading information
- This is especially true of spam (repetitive commercial strategies, deceptive messages)
- These characteristics may be difficult to capture in a feature selection scheme
Problem formulation
- Problem formulation: finding a training subcorpus such that training on it maximizes the resulting filter’s accuracy on the evaluation corpus
- A typical corpus: $10^3$ to $10^6$ learning messages
- A typical classifier learns in polynomial time
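→ Exhaustive search over all $2^{|C|}$ candidate subcorpora, each needing its own training run, is out of reach, so we opt for a meta-heuristic implementation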
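The problem statement above can be written compactly as follows. This is only a sketch: it borrows the notation the slides introduce on the implementation slide ($C$ for the message set, $f$ for the classifier, $A$ for the weighted accuracy), and the symbol $E$ for the evaluation corpus is an assumption that does not appear in the slides.

```latex
\max_{X \in \{0,1\}^{|C|}} \; A\bigl(f_{C(X)},\, E\bigr),
\qquad
C(X) = \{\, c_i \in C \mid X_i = 1 \,\}
```

For $|C| = 10^3$, the smallest corpus size quoted above, the search space already holds $2^{1000} \approx 10^{301}$ candidate vectors $X$.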
Implementation
- Genetic implementation
- Data: a set of messages $C$, a classifier $f$
- Representations
  - Solution: boolean vector $X$ of dimension $|C|$, with $X_i = 1$ if message $i$ is selected
  - Fitness: $A(f_{C(X)}, C)$, the weighted accuracy of the resulting filter on the set $C$, where $C(X) = \{c_i \in C \mid X_i = 1\}$
- Operations
  - Selection: elitist
  - Cross-over: one point
  - Mutation: random bit inversion
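The representation and operators listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. Assumptions of the sketch: `evolve`, `mutate`, `one_point_crossover` and `toy_fitness` are hypothetical names; "elitist" is read as "the best individuals survive unchanged"; "random bit inversion" is read as "each bit flips independently with a small probability". In the slides the fitness is the weighted accuracy of the filter trained on $C(X)$, whereas here a toy function that penalises two "harmful" positions stands in for it.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(x, rate, rng):
    """Invert each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in x]

def evolve(fitness, n_bits, pop_size=25, generations=50,
           rate=0.05, elite=2, seed=0):
    """Evolve boolean selection vectors X and return the fittest one."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        # Elitist selection: the best individuals are copied unchanged.
        next_gen = [list(x) for x in population[:elite]]
        parents = population[:max(2, pop_size // 2)]
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            for child in one_point_crossover(a, b, rng):
                next_gen.append(mutate(child, rate, rng))
        population = next_gen[:pop_size]
    return max(population, key=fitness)

# Toy stand-in for A(f_{C(X)}, C): selecting message 3 or 7 hurts the score.
HARMFUL = {3, 7}

def toy_fitness(x):
    return sum(-2 * bit if i in HARMFUL else bit for i, bit in enumerate(x))

best = evolve(toy_fitness, n_bits=12)
print(best, toy_fitness(best))
```

Because the elite individuals are carried over unchanged, the best fitness in the population never decreases from one generation to the next.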
Genetic operations
Experiments protocol
- Data sets: lingspam corpus¹ (481 spam, 2412 legitimate messages), SpamAssassin (1897 spam, 4150 legitimate messages)
- Classifier: Bernoulli naive Bayes, 60-word vocabulary
- Parameters:
  - population size: 10 to 100
  - mutation rate: 5 to 75
  - initial solutions: random selection of 10% of legitimate messages and 50% of spam
- Metric: Total Cost Ratio, $\mathrm{TCR} = \dfrac{A(f_{C(X)}, C)}{A(f_{\emptyset}, C)}$

¹ Ion Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras and C. D. Spyropoulos, "An evaluation of Naive Bayesian anti-spam filtering", Computing Research Repository, 2000.
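For readers unfamiliar with the metric, here is a small sketch of weighted accuracy and Total Cost Ratio as defined in the paper cited in the footnote: TCR is the weighted error of the "no filter" baseline divided by the weighted error of the filter, which reduces to $N_{spam} / (\lambda \cdot n_{L \to S} + n_{S \to L})$. The slide writes TCR as a ratio of $A$ values, so treating the two as the same quantity is an assumption of this sketch. The function names, the weight `lam` and the confusion counts below are made up for illustration and are not results from the talk.

```python
def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam=1.0):
    """Weighted accuracy: each legitimate message counts `lam` times.

    n_ll: legitimate kept        n_ls: legitimate flagged as spam
    n_ss: spam caught            n_sl: spam let through
    """
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)

def total_cost_ratio(n_ls, n_ss, n_sl, lam=1.0):
    """TCR = N_spam / (lam * n_ls + n_sl); values above 1 beat 'no filter'."""
    cost = lam * n_ls + n_sl
    n_spam = n_ss + n_sl
    return float("inf") if cost == 0 else n_spam / cost

# Hypothetical confusion counts on a corpus the size of lingspam
# (481 spam, 2412 legitimate messages).
print(total_cost_ratio(n_ls=14, n_ss=425, n_sl=56))
print(total_cost_ratio(n_ls=14, n_ss=425, n_sl=56, lam=9))
```

A larger `lam` makes each blocked legitimate message more expensive, so the same filter gets a lower TCR.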
Results: TCR evolution for various population sizes
Results: TCR evolution for a population of 25 individuals
Results : Overview
Table: Comparison of spam precision and spam recall for the exhaustive and selective learning algorithms

| | Exhaustive learning | Selective learning (initial) | Selective learning (best) |
|---|---|---|---|
| Precision | 96.82 % | 96.85 % | 98.72 % |
| Recall | 88.33 % | 89.60 % | 96.47 % |

- Better solutions found at the first iteration
- TCR improved by a factor of 4
- Best solutions contain only 1/3 of the lingspam corpus
Results on SpamAssassin
- Bernoulli naive Bayes performs badly (TCR < 1)
- Initial solutions must be almost exhaustive
- Selective learning does not bring much improvement
Online selective learning
- Initial learning is only half of the job
- Is online selective learning possible?
- Assuming no user feedback
- Corpus → flow
- For each incoming message, a decision problem: shall we learn it?
- Idea: for each incoming message, test whether learning it improves the filter’s precision over the N previous messages (the learning window)
Online selective learning algorithm
Input: W_i, the i-th message in the mail flow; f, a classifier; N, an integer
begin
    f′ ← copy(f)
    if f(W_i) ≥ λ then learn(f′, W_i, spam)
    else learn(f′, W_i, ham)
    C ← {W_j, i − N ≤ j ≤ i}
    if A(f, C) ≥ A(f′, C) then return false
    else return true
end
Algorithm 1: Online selective learning
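Algorithm 1 can be tried out with a few lines of Python. The sketch below makes several assumptions that are not in the slides: `ToyFilter` is a hypothetical token-counting stand-in for the classifier $f$, `accuracy` is plain (unweighted) accuracy standing in for $A$, the threshold λ defaults to 0.5, and the window stores each past message with the label it was given when it arrived, since no user feedback is available.

```python
import copy
from collections import Counter, deque

class ToyFilter:
    """Hypothetical classifier: the score is the share of a message's
    tokens that were seen more often in spam than in ham."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def score(self, message):
        tokens = message.split()
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if self.spam[t] > self.ham[t]) / len(tokens)

    def learn(self, message, label):
        (self.spam if label == "spam" else self.ham).update(message.split())

def accuracy(f, pairs, threshold):
    """Share of (message, label) pairs on which f agrees with the label."""
    hits = sum(1 for msg, label in pairs
               if ("spam" if f.score(msg) >= threshold else "ham") == label)
    return hits / len(pairs)

def should_learn(f, message, window, threshold=0.5):
    """True only if learning `message` strictly improves accuracy on the
    window plus the message itself (the set C of Algorithm 1)."""
    candidate = copy.deepcopy(f)                 # f' <- copy(f)
    label = "spam" if f.score(message) >= threshold else "ham"
    candidate.learn(message, label)              # learn(f', W_i, label)
    pairs = list(window) + [(message, label)]    # C: N previous + current
    return accuracy(candidate, pairs, threshold) > accuracy(f, pairs, threshold)

f = ToyFilter()
f.learn("cheap pills", "spam")
f.learn("meeting notes", "ham")
window = deque([("cheap pills now", "spam"),
                ("meeting notes today", "ham"),
                ("free offer inside", "spam")], maxlen=3)

print(should_learn(f, "cheap pills offer offer offer", window))
print(should_learn(f, "cheap pills free offer", window))
```

The first message is padded so that the filter labels it ham. Learning it would stop the filter from recognising "cheap pills now" as spam, so it is rejected. The second message is labelled spam and teaches the tokens needed to catch "free offer inside", so it is accepted. The function only returns the decision, and `f` itself is left untouched.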
TCR evolution, regular lingspam
- Little to no improvement
- Slight loss for window = 50, 25
- Slight gain for window = 500
- But the overall evolution is even
- Easy mail flow → conservative learning strategies
TCR evolution, noisy lingspam (5%)
Conclusions
- A learning model specifically designed to address the issues of spam filtering
- Easy to implement...
- Good synergy with existing techniques
- Not tied to a specific classification model
Perspectives and future work
- Efficient heuristics for initial solutions?
- Make use of non-learned data
- Dynamic variation of the online selective window
Thank you!