Copyright 2004, David D. Lewis
(Naive) Bayesian Text Classification for Spam Filtering
David D. Lewis, Ph.D.
Ornarose, Inc.
& David D. Lewis Consulting
www.daviddlewis.com
Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.
Menu
• Spam
• Spam Filtering
• Classification for Spam Filtering
• Classification
• Bayesian Classification
• Naive Bayesian Classification
• Naive Bayesian Text Classification
• Naive Bayesian Text Classification for Spam Filtering
• (Feature Extraction for) Spam Filtering
• Text Classification (for Marketing)
• (Better) Bayesian Classification
Spam
• Unsolicited bulk email
  – or, in practice, whatever email you don’t want
• Large fraction of all email sent
  – Brightmail est. 64%, Postini est. 77%
  – Still growing
• Est. cost to US businesses exceeded $30 billion in 2003
Approaches to Spam Control
• Economic (email pricing, ...)
• Legal (CAN-SPAM, ...)
• Societal pressure (trade groups, ...)
• Securing infrastructure (email servers, ...)
• Authentication (challenge/response, ...)
• Filtering
Spam Filtering
• Intensional (feature-based) vs. Extensional (white/blacklist-based)
• Applied at sender vs. receiver
• Applied at email client vs. mail server vs. ISP
Statistical Classification
1. Define classes of objects
2. Specify probability distribution model connecting classes to observable features
3. Fit parameters of model to data
4. Observe features on inputs and compute probability of class membership
5. Assign object to a class
[Figure: classification pipeline with Feature Extraction, Classifier, and Classifier Interpreter components]
Classification for Spam Filtering
• Define classes: [classes were contrasted with images on the slide; not recoverable]
• Extract features from header, content
• Train classifier
• Classify message and process:
  – Block message, insert tag, put in folder, etc.
Two Classes of Classifier
• Generative: Naive Bayes, LDA, ...
  – Model joint distribution of class and features
  – Derive class probability by Bayes rule
• Discriminative: logistic regression, CART, ...
  – Model conditional distribution of class given known feature values
  – Model directly estimates class probability
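A worked form of the two routes to a class probability may help here. The notation is an assumption added for illustration, not taken from the slides: $c$ is a class, $x$ a feature vector, and $\beta$ the weights of a logistic regression in the binary case.

```latex
% Generative: model the joint P(c, x) = P(c) P(x | c), then apply Bayes' rule
P(c \mid x) = \frac{P(c)\, P(x \mid c)}{\sum_{c'} P(c')\, P(x \mid c')}

% Discriminative (binary logistic regression): model P(c | x) directly
P(c = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_j \beta_j x_j)\bigr)}
```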
Bayesian Classification (1)
1. Define classes
2. Specify probability model
2b. And prior distribution over parameters
3. Find posterior distribution of model parameters, given data
4. Compute class probabilities using posterior distribution (or element of it)
5. Classify object
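The posterior and class-probability steps of this recipe can be written out as follows. The symbols are assumptions introduced for illustration: $\theta$ for the model parameters, $D$ for the training data, $x$ for the features of a new object. Reading "element of it" as a single point such as the posterior mode is one common interpretation, not something the slide specifies.

```latex
% Posterior over parameters: prior times likelihood, normalized
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}

% Class probability averaged over the full posterior
P(c \mid x, D) = \int P(c \mid x, \theta)\, p(\theta \mid D)\, d\theta

% Or using a single element of the posterior, e.g. its mode
\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid D), \qquad
P(c \mid x, D) \approx P(c \mid x, \hat{\theta})
```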
Bayesian Classification (2)
• = “Naive”/“Idiot”/“Simple” Bayes
• A particular generative model
  – Assumes independence of observable features within each class of messages
  – Bayes rule used to compute class probability
• Might or might not use a prior on model parameters
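A minimal sketch of such a classifier is given below. It assumes a multinomial (word-count) event model with add-one smoothing, which is only one of several naive Bayes variants; the function names, the `alpha` parameter, and the toy training data are hypothetical and not from the talk.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, alpha=1.0):
    """Fit class priors and per-class word probabilities.

    docs: list of (tokens, label) pairs. alpha is the smoothing constant
    (an assumption of this sketch, acting like a simple prior on parameters).
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like, vocab

def class_probabilities(tokens, model):
    """Apply Bayes rule, treating words as independent within each class."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    top = max(scores.values())  # subtract the max before exponentiating
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

# Toy data, invented for illustration only.
TRAIN = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills cheap offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train_naive_bayes(TRAIN)
probs = class_probabilities("cheap pills".split(), model)
```

Words not seen in training are ignored at classification time here; other treatments are possible.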
Naive Bayes for Text Classification - History
• Maron (JACM, 1961) – automated indexing
• Mosteller and Wallace (1964) – author identification
• Van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s) – search engines
• Sahami, Dumais, Heckerman, Horvitz (1998) – spam filtering
Bayesian Classification (3)
• Graham’s A Plan for Spam
  – And its mutant offspring...
• Naive Bayes-like classifier with weird parameter estimation
• Widely used in spam filters
  – Classic Naive Bayes superior when appropriately used
NB & Friends: Advantages
• Simple to implement
  – No numerical optimization, matrix algebra, etc.
• Efficient to train and use
  – Fitting = computing means of feature values
  – Easy to update with new data
  – Equivalent to linear classifier, so fast to apply
• Binary or polytomous
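To see why naive Bayes is equivalent to a linear classifier, consider the binary case under a multinomial event model (an assumption of this illustration). The notation is introduced here, not taken from the slides: $x_j$ is the count of word $j$ in a message and $\theta_{j,c}$ is the estimated probability of word $j$ in class $c$.

```latex
\log \frac{P(\text{spam} \mid x)}{P(\text{nonspam} \mid x)}
  = \underbrace{\log \frac{P(\text{spam})}{P(\text{nonspam})}}_{\text{intercept}}
  + \sum_j x_j \,
    \underbrace{\log \frac{\theta_{j,\text{spam}}}{\theta_{j,\text{nonspam}}}}_{\text{weight of word } j}
```

The log-odds are a weighted sum of the feature values, so applying the classifier costs one pass over the words present in the message.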
NB & Friends: Advantages
• Independence allows parameters to be estimated on different data sets, e.g.
  – Estimate content features from messages with headers omitted
  – Estimate header features from messages with content missing
NB & Friends: Advantages
• Generative model
  – Comparatively good effectiveness with small training sets
  – Unlabeled data can be used in parameter estimation (in theory)
NB & Friends: Disadvantages
• Independence assumption wrong
  – Absurd estimates of class probabilities
  – Threshold must be tuned, not set analytically
• Generative model
  – Generally lower effectiveness than discriminative techniques (e.g. logistic regression)
  – Improving parameter estimates can hurt classification effectiveness
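One way to tune the threshold empirically is sketched below. The criterion (capping the false positive rate on held-out messages) is an assumption of this example and is not prescribed by the slides; the function name and sample scores are hypothetical.

```python
def tune_threshold(scores, is_spam, max_false_positive_rate=0.01):
    """Pick the lowest spam-score threshold that keeps the false positive
    rate on held-out data within the given limit.

    scores: classifier outputs (higher means more spam-like).
    is_spam: matching list of booleans, True for spam.
    A message is blocked when its score is >= the returned threshold.
    """
    n_legit = sum(1 for y in is_spam if not y)
    for t in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, is_spam) if s >= t and not y)
        if n_legit == 0 or false_pos / n_legit <= max_false_positive_rate:
            return t
    return float("inf")  # no threshold qualifies: block nothing

# Invented held-out scores; note one legitimate message scoring 0.97.
held_out_scores = [0.999, 0.98, 0.97, 0.6, 0.2, 0.01]
held_out_is_spam = [True, True, False, True, False, False]
strict = tune_threshold(held_out_scores, held_out_is_spam, 0.0)
```

Because the scores are not calibrated probabilities, a cutoff such as 0.5 has no special meaning; the threshold is chosen from data instead.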
Feature Extraction
• Convert message to feature vector
• Header: sender, recipient, routing, …
  – Possibly break up domain names
• Text
  – Words, phrases, character strings
  – Become binary or numeric features
• URLs, HTML tags, images,…
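The conversion from message to feature vector might look like the sketch below, which uses Python's standard `email` module. The feature naming scheme, the regular expressions, and the sample message are all assumptions made for illustration; it produces binary (present/absent) features and handles only single-part plain-text messages.

```python
import re
from email import message_from_string

WORD_RE = re.compile(r"[a-z0-9']+")

def extract_features(raw_message):
    """Return a set of binary features drawn from header, text, and URLs."""
    msg = message_from_string(raw_message)
    features = set()

    # Header: sender domain, also broken up into its parts.
    match = re.search(r"@([\w.-]+)", msg.get("From", ""))
    if match:
        domain = match.group(1).lower()
        features.add("from_domain:" + domain)
        for part in domain.split("."):
            features.add("from_domain_part:" + part)

    # Text: words in the subject and body.
    for word in WORD_RE.findall(msg.get("Subject", "").lower()):
        features.add("subject:" + word)
    body = "" if msg.is_multipart() else msg.get_payload()
    for word in WORD_RE.findall(body.lower()):
        features.add("body:" + word)

    # URLs: host names of links in the body.
    for host in re.findall(r"https?://([\w.-]+)", body, flags=re.IGNORECASE):
        features.add("url_host:" + host.lower())
    return features

# Invented sample message using reserved example domains.
RAW = (
    "From: Sam <sam@mail.example.com>\n"
    "To: you@example.org\n"
    "Subject: Buy now\n"
    "\n"
    "Visit http://shop.example.net/deal today\n"
)
features = extract_features(RAW)
```

Each feature string would then be mapped to a column of the vector passed to the classifier.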
From: Sam Elegy <[email protected]>
To: [email protected]
Subject: you can buy V!@gra
[Annotations on the example spam message:]
• Spamlike content in image form
• Irrelevant legit content; doubles as hash buster
• Typographic variations
• Randomly generated name and email
Defeating Feature Extraction
• Misspellings, character set choice, HTML games: mislead extraction of words
• Put content in images
• Forge headers (to avoid identification, but also interferes with classification)
• Innocuous content to mimic distribution in nonspam
• Hashbusters (zyArh73Gf) clog dictionaries
Survival of the Fittest
• Filter designers get to see spam
• Spammers use spam filters
• Unprecedented arms race for a statistical field
• Countermeasures mostly target feature extraction, not modeling assumptions
Miscellany
1. Getting legitimate bulk mail past spam filters
2. Other uses of text classification in marketing
3. Frontiers in Bayesian classification
Getting Legit Bulk Email Past Filters
• Test email against several filters
  – Send to accounts on multiple ISPs
  – Multiple client-based filters if particularly concerned
• Coherent content, correctly spelled
• Non-tricky headers and markup
• Avoid spam keywords where possible
• Don’t use spammer tricks
Text Classification in Marketing
• Routing incoming email
  – Responses to promotions
  – Detect opportunities for selling
  – (Automated response sometimes possible)
• Analysis of text/mixed data on customers
  – e.g. customer or CSR comments
• Content analysis
  – Focus groups, email, chat, blogs, news, …
Better Bayesian Classification
• Discriminative
  – Logistic regression with informative priors
  – Sharing strength across related problems
  – Calibration and confidence of predictions
• Generative
  – Bayesian networks/graphical models
  – Use of unlabeled and partially labeled data
• Hybrid
BBR
• Logistic regression w/ informative priors
  – Gaussian = ridge logistic regression
  – Laplace = lasso logistic regression
• Sparse data structures & fast optimizer
  – 10^4 cases, 10^5 predictors, few seconds!
• Accuracy competitive with SVMs
• Free for research use
  – www.stat.rutgers.edu/~madigan/BBR/
• Joint work w/ Madigan & Genkin (Rutgers)
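The correspondence between priors and penalties can be written out as follows. This is a sketch in generic notation ($\beta$ for the coefficients, $\ell(\beta)$ for the logistic log-likelihood, $\sigma^2$ and $\lambda$ for the prior hyperparameters), all introduced here for illustration; it is not claimed to match BBR's own parameterization.

```latex
% Maximum a posteriori (MAP) estimate under a prior p(beta)
\hat{\beta} = \arg\max_{\beta} \Bigl[ \ell(\beta) + \sum_j \log p(\beta_j) \Bigr]

% Gaussian prior, beta_j ~ N(0, sigma^2): squared (ridge) penalty
\hat{\beta}_{\text{ridge}} = \arg\max_{\beta}
  \Bigl[ \ell(\beta) - \frac{1}{2\sigma^2} \sum_j \beta_j^2 \Bigr]

% Laplace prior, p(beta_j) = (lambda / 2) exp(-lambda |beta_j|): absolute-value (lasso) penalty
\hat{\beta}_{\text{lasso}} = \arg\max_{\beta}
  \Bigl[ \ell(\beta) - \lambda \sum_j \lvert \beta_j \rvert \Bigr]
```

The Laplace prior's sharp peak at zero is what drives many coefficients exactly to zero, giving sparse models.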
Gaussian vs. Laplace Prior
[Figure: two density plots, labeled Gaussian and Laplace]
Future of Spam Filtering
• More attention to training data selection, personalization
• Image processing
• Robustness against word variations
• More linguistic sophistication
• Replacing naive Bayes with better learners
• Keep hoping for economic cure
Summary
• By volume, spam filtering is easily the biggest application of text classification
  – Possibly of supervised learning
• Filters have helped a lot
  – Naive Bayes is just a starting point
• Other interesting applications of Bayesian classification