Copyright 2004, David D. Lewis (Naive) Bayesian Text Classification for Spam Filtering David D....

Copyright 2004, David D. Lewis

(Naive) Bayesian Text Classification for Spam Filtering

David D. Lewis, Ph.D.

Ornarose, Inc.

& David D. Lewis Consulting

www.daviddlewis.com

Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.



Menu

• Spam

• Spam Filtering

• Classification for Spam Filtering

• Classification

• Bayesian Classification

• Naive Bayesian Classification

• Naive Bayesian Text Classification

• Naive Bayesian Text Classification for Spam Filtering

• (Feature Extraction for) Spam Filtering

• Text Classification (for Marketing)

• (Better) Bayesian Classification


Spam

• Unsolicited bulk email
  – or, in practice, whatever email you don’t want

• Large fraction of all email sent
  – Brightmail est. 64%, Postini est. 77%
  – Still growing

• Est. cost to US businesses exceeded $30 billion in Y2003


Approaches to Spam Control

• Economic (email pricing, ...)

• Legal (CANSPAM, ...)

• Societal pressure (trade groups, ...)

• Securing infrastructure (email servers, ...)

• Authentication (challenge/response,...)

• Filtering


Spam Filtering

• Intensional (feature-based) vs. Extensional (white/blacklist-based)

• Applied at sender vs. receiver

• Applied at email client vs. mail server vs. ISP


Statistical Classification

1. Define classes of objects

2. Specify probability distribution model connecting classes to observable features

3. Fit parameters of model to data

4. Observe features on inputs and compute probability of class membership

5. Assign object to a class


Classification for Spam Filtering

[Diagram labels: Feature Extraction, Classifier, Classifier Interpreter]

• Define classes (shown as images on the slide, separated by "vs.")

• Extract features from header, content

• Train classifier

• Classify message and process:
  – Block message, insert tag, put in folder, etc.


Two Classes of Classifier

• Generative: Naive Bayes, LDA, ...
  – Model joint distribution of class and features
  – Derive class probability by Bayes rule

• Discriminative: logistic regression, CART, ...
  – Model conditional distribution of class given known feature values
  – Model directly estimates class probability
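As a compact restatement of the two families above, the following sketch writes out what each one models. The symbols (c for the class, x for the feature vector, β for logistic regression coefficients) are notation assumed here, not taken from the slides.

```latex
% Generative: model the joint distribution, then derive the class
% probability by Bayes rule.
\[
  p(c, \mathbf{x}) = p(c)\, p(\mathbf{x} \mid c),
  \qquad
  p(c \mid \mathbf{x}) =
    \frac{p(c)\, p(\mathbf{x} \mid c)}{\sum_{c'} p(c')\, p(\mathbf{x} \mid c')}
\]

% Discriminative: model the conditional distribution directly
% (binary logistic regression shown as one example).
\[
  p(c = 1 \mid \mathbf{x}) =
    \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})\bigr)}
\]
```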


Bayesian Classification (1)

1. Define classes

2. Specify probability model

2b. And prior distribution over parameters

3. Find posterior distribution of model parameters, given data

4. Compute class probabilities using posterior distribution (or element of it)

5. Classify object
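The Bayesian steps on this slide can be written out as follows. This is a sketch in assumed notation: θ stands for the model parameters, D for the training data, and the posterior mode is used as one example of "an element" of the posterior.

```latex
% Posterior over parameters: prior times likelihood, normalised.
\[
  p(\theta \mid D) =
    \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}
\]

% Class probability using the full posterior distribution.
\[
  p(c \mid \mathbf{x}, D) = \int p(c \mid \mathbf{x}, \theta)\, p(\theta \mid D)\, d\theta
\]

% Class probability using a single element of the posterior,
% here its mode (the MAP estimate).
\[
  p(c \mid \mathbf{x}, D) \approx p(c \mid \mathbf{x}, \hat{\theta}),
  \qquad
  \hat{\theta} = \arg\max_{\theta}\, p(\theta \mid D)
\]
```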



• = “Naive”/“Idiot”/“Simple” Bayes

• A particular generative model
  – Assumes independence of observable features within each class of messages
  – Bayes rule used to compute class probability

• Might or might not use a prior on model parameters
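A minimal sketch of the model described above, assuming binary (present/absent) features and a small pseudo-count standing in for a prior on the parameters. The function names, the toy messages, and the pseudo-count of 1 are all illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels, vocabulary, pseudo_count=1.0):
    """Fit a Bernoulli naive Bayes model.

    messages: list of sets, each holding the features present in one message
    labels: list of class names, one per message
    pseudo_count: smoothing constant (an assumption of this sketch)
    """
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    present = {c: Counter() for c in classes}
    for feats, c in zip(messages, labels):
        present[c].update(feats & vocabulary)

    model = {"classes": classes, "vocabulary": vocabulary,
             "log_prior": {}, "theta": {}}
    for c in classes:
        model["log_prior"][c] = math.log(n_docs[c] / len(labels))
        # Smoothed per-class mean of each binary feature.
        model["theta"][c] = {
            f: (present[c][f] + pseudo_count) / (n_docs[c] + 2 * pseudo_count)
            for f in vocabulary
        }
    return model


def class_probabilities(model, feats):
    """Apply Bayes rule under the within-class independence assumption."""
    log_joint = {}
    for c in model["classes"]:
        total = model["log_prior"][c]
        for f in model["vocabulary"]:
            p = model["theta"][c][f]
            total += math.log(p if f in feats else 1.0 - p)
        log_joint[c] = total
    # Normalise in log space for numerical stability.
    top = max(log_joint.values())
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Toy training data (invented for illustration).
train = [({"cheap", "pills"}, "spam"), ({"cheap", "offer"}, "spam"),
         ({"meeting", "agenda"}, "ham"), ({"agenda", "offer"}, "ham")]
messages = [m for m, _ in train]
labels = [c for _, c in train]
vocab = set().union(*messages)

model = train_naive_bayes(messages, labels, vocab)
probs = class_probabilities(model, {"cheap", "pills"})
print(probs)
```

Training only counts feature occurrences per class, so new labeled messages can be folded in by updating the counts.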


Naive Bayes for Text Classification - History

• Maron (JACM, 1961) – automated indexing

• Mosteller and Wallace (1964) – author identification

• Van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970’s) – search engines

• Sahami, Dumais, Heckerman, Horvitz (1998) – spam filtering


• Graham’s A Plan for Spam
  – And its mutant offspring...

• Naive Bayes-like classifier with weird parameter estimation

• Widely used in spam filters
  – Classic Naive Bayes superior when appropriately used



NB & Friends: Advantages

• Simple to implement
  – No numerical optimization, matrix algebra, etc.

• Efficient to train and use
  – Fitting = computing means of feature values
  – Easy to update with new data
  – Equivalent to linear classifier, so fast to apply

• Binary or polytomous
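The "equivalent to linear classifier" point can be made concrete for the two-class case with binary features. In this sketch the notation is assumed: π_c is the class prior, θ_{jc} is the probability that feature j is present in class c, and s and h denote spam and nonspam.

```latex
% Log-odds of spam under naive Bayes with binary features x_j in {0, 1}.
\[
  \log \frac{p(s \mid \mathbf{x})}{p(h \mid \mathbf{x})}
  = \log \frac{\pi_s}{\pi_h}
  + \sum_j \left[ x_j \log \frac{\theta_{js}}{\theta_{jh}}
  + (1 - x_j) \log \frac{1 - \theta_{js}}{1 - \theta_{jh}} \right]
\]

% Collecting the terms in x_j gives a linear function of the features.
\[
  \log \frac{p(s \mid \mathbf{x})}{p(h \mid \mathbf{x})} = b + \sum_j w_j x_j,
  \qquad
  w_j = \log \frac{\theta_{js}\,(1 - \theta_{jh})}{\theta_{jh}\,(1 - \theta_{js})},
  \qquad
  b = \log \frac{\pi_s}{\pi_h} + \sum_j \log \frac{1 - \theta_{js}}{1 - \theta_{jh}}
\]
```

Classifying a message then costs one weighted sum over the features it contains, plus a comparison against a threshold.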



• Independence allows parameters to be estimated on different data sets, e.g.
  – Estimate content features from messages with headers omitted
  – Estimate header features from messages with content missing



• Generative model
  – Comparatively good effectiveness with small training sets
  – Unlabeled data can be used in parameter estimation (in theory)


NB & Friends: Disadvantages

• Independence assumption wrong
  – Absurd estimates of class probabilities
  – Threshold must be tuned, not set analytically

• Generative model
  – Generally lower effectiveness than discriminative techniques (e.g. log. regress.)
  – Improving parameter estimates can hurt classification effectiveness
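A small sketch of why a wrong independence assumption yields extreme probability estimates. It assumes a single piece of evidence with likelihood ratio 2 in favor of spam that shows up as several perfectly correlated features, each of which naive Bayes counts as independent. The function name and the numbers are invented for illustration.

```python
def nb_posterior(likelihood_ratio, copies, prior_odds=1.0):
    """Spam probability when one piece of evidence is counted `copies` times."""
    odds = prior_odds * likelihood_ratio ** copies
    return odds / (1.0 + odds)


# The evidence itself only supports a probability of 2/3 (copies=1).
# Each redundant copy pushes the estimate closer to 1.
for k in (1, 2, 5, 10):
    print(k, nb_posterior(2.0, k))
```

The ranking of messages can still be useful even when the probabilities are this distorted, which is why the decision threshold is tuned on data instead of being set at a nominal value such as 0.5.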


Feature Extraction

• Convert message to feature vector

• Header: sender, recipient, routing, …
  – Possibly break up domain names

• Text
  – Words, phrases, character strings
  – Become binary or numeric features

• URLs, HTML tags, images,…
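A minimal sketch of turning a message into binary features, using Python's standard `email` module. The feature naming scheme (`from_domain:`, `subject:`, `body:`, `has_url`) and the simple tokenizer are assumptions made for illustration. The sketch handles only plain single-part messages and ignores HTML and images.

```python
import email
import re
from email.utils import parseaddr

TOKEN = re.compile(r"[a-z0-9]+")


def extract_features(raw_message):
    """Return the set of binary features present in one raw message."""
    msg = email.message_from_string(raw_message)
    features = set()

    # Header features: sender domain, also broken up into its parts.
    _, sender = parseaddr(msg.get("From", ""))
    if "@" in sender:
        domain = sender.rsplit("@", 1)[1].lower()
        features.add("from_domain:" + domain)
        for part in domain.split("."):
            features.add("from_domain_part:" + part)

    # Text features: words in the subject line and body.
    for word in TOKEN.findall(msg.get("Subject", "").lower()):
        features.add("subject:" + word)
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are skipped in this sketch
        for word in TOKEN.findall(body.lower()):
            features.add("body:" + word)
        if re.search(r"https?://", body):
            features.add("has_url")
    return features


raw = ("From: Sam Elegy <sam@example.com>\n"
       "Subject: Buy now\n"
       "\n"
       "Cheap pills at http://example.com/x\n")
example_features = extract_features(raw)
print(sorted(example_features))
```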


From: Sam Elegy <[email protected]>
To: [email protected]
Subject: you can buy V!@gra

Spamlike content in image form

Irrelevant legit content; doubles as hash buster

Typographic variations

Randomly generated name and email



Defeating Feature Extraction

• Misspellings, character set choice, HTML games: mislead extraction of words

• Put content in images

• Forge headers (to avoid identification, but also interferes with classification)

• Innocuous content to mimic distribution in nonspam

• Hashbusters (zyArh73Gf) clog dictionaries


Survival of the Fittest

• Filter designers get to see spam

• Spammers use spam filters

• Unprecedented arms race for a statistical field

• Countermeasures mostly target feature extraction, not modeling assumptions


Miscellany

1. Getting legitimate bulk mail past spam filters

2. Other uses of text classification in marketing

3. Frontiers in Bayesian classification


Getting Legit Bulk Email Past Filters

• Test email against several filters
  – Send to accounts on multiple ISPs
  – Multiple client-based filters if particularly concerned

• Coherent content, correctly spelled

• Non-tricky headers and markup

• Avoid spam keywords where possible

• Don’t use spammer tricks


Text Classification in Marketing

• Routing incoming email
  – Responses to promotions
  – Detect opportunities for selling
  – (Automated response sometimes possible)

• Analysis of text/mixed data on customers
  – e.g. customer or CSR comments

• Content analysis
  – Focus groups, email, chat, blogs, news, …


Better Bayesian Classification

• Discriminative
  – Logistic regression with informative priors
  – Sharing strength across related problems
  – Calibration and confidence of predictions

• Generative
  – Bayesian networks/graphical models
  – Use of unlabeled and partially labeled data

• Hybrid


BBR

• Logistic regression w/ informative priors
  – Gaussian = ridge logistic regression
  – Laplace = lasso logistic regression

• Sparse data structures & fast optimizer
  – 10^4 cases, 10^5 predictors, few seconds!

• Accuracy competitive with SVMs

• Free for research use
  – www.stat.rutgers.edu/~madigan/BBR/

• Joint work w/ Madigan & Genkin (Rutgers)



Gaussian vs. Laplace Prior

[Figure labels: Gaussian, Laplace]
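The correspondence between these two priors and ridge/lasso logistic regression can be sketched as follows. The notation is assumed here: labels y_i in {-1, +1}, coefficients β_j with independent zero-mean priors, τ² the Gaussian variance and λ the Laplace rate. The exact parameterization used in BBR may differ.

```latex
% MAP estimate: minimise negative log-likelihood plus negative log-prior.
\[
  \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
    \sum_i \log\!\bigl(1 + \exp(-y_i\, \boldsymbol{\beta}^{\top}\mathbf{x}_i)\bigr)
    - \log p(\boldsymbol{\beta})
\]

% Gaussian prior gives a squared (ridge, L2) penalty.
\[
  \beta_j \sim \mathcal{N}(0, \tau^2)
  \;\Rightarrow\;
  -\log p(\boldsymbol{\beta}) = \frac{1}{2\tau^2} \sum_j \beta_j^2 + \text{const}
\]

% Laplace prior gives an absolute-value (lasso, L1) penalty.
\[
  p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda\, |\beta_j|)
  \;\Rightarrow\;
  -\log p(\boldsymbol{\beta}) = \lambda \sum_j |\beta_j| + \text{const}
\]
```

The L1 penalty from the Laplace prior tends to set many coefficients exactly to zero, which suits problems with far more predictors than cases.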


Future of Spam Filtering

• More attention to training data selection, personalization

• Image processing

• Robustness against word variations

• More linguistic sophistication

• Replacing naive Bayes with better learners

• Keep hoping for economic cure


Summary

• By volume, spam filtering is easily the biggest application of text classification
  – Possibly of supervised learning

• Filters have helped a lot
  – Naive Bayes is just a starting point

• Other interesting applications of Bayesian classification
