Analyzing Behavioral Features for Email Classification.
Steve Martin, Anil Sewani, Blaine Nelson, Karl Chen, and
Anthony Joseph
{steve0, anil, nelsonb, quarl, adj}@cs.berkeley.edu
University of California at Berkeley
The Problem: Email Abuse
• Email has become globally ubiquitous.
  – By 2006, email traffic is expected to surge to 60 billion messages daily.
• However, spam accounts for roughly half of all email sent daily worldwide.
• Nearly all of the most virulent worms of 2004 spread by email.
• Email system abuse results in enormous damage costs.
Current Email Analysis
• Many current methods for detecting email abuse examine characteristics of incoming email.
• Example: spam detection
  – Calculate statistical features on received mail and classify each message separately.
• Example: virus scanning
  – Generate a hash value for each incoming message and compare it with a stored database of values.
  – Signatures must be predetermined by a human analyst.
• These methods can be effective, but there is room for improvement.
Our Approach
• A huge corpus of ignored data: outgoing email!
  – Incoming email cannot profile a user's email behavior.
  – Outgoing email contains this information.
• Calculate features on outgoing email.
  – Observe a wide variety of statistics.
• Build a statistical understanding of user behavior.
  – Use it to classify email sent by individual users.
  – Can detect sudden changes in behavior, such as worm/spam activity.
Example Outgoing Email Features

Per-Email Features
• Email contains HTML?
• Email contains scripts?
• Email contains images?
• Email contains links?
• MIME types of attachments
• Number of attachments
• Number of words in body
• Number of words in subject
• Number of characters in subject
• …

Per-User Features (calculated over a window of email)
• Frequency of email sending
• Number of unique 'To' addresses
• Number of unique 'From' addresses
• Ratio of emails with attachments
• Average word length
• Average number of words per body
• Average number of words per subject
• Variance in word length
• Variance in number of words per body
• …
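A few of the per-email features above could be computed with a sketch like the following. This is a hypothetical extractor — the poster does not specify its parsing rules, so the regular expressions and field names here are assumptions:

```python
import re

def per_email_features(subject: str, body: str) -> dict:
    # Hypothetical feature extractor for a subset of the per-email
    # features listed above; the original system's exact parsing
    # rules are not specified on the poster.
    return {
        "contains_html": bool(re.search(r"<\s*html", body, re.I)),
        "contains_script": bool(re.search(r"<\s*script", body, re.I)),
        "contains_image": bool(re.search(r"<\s*img", body, re.I)),
        "contains_link": "http://" in body or "https://" in body,
        "num_words_body": len(body.split()),
        "num_words_subject": len(subject.split()),
        "num_chars_subject": len(subject),
    }

print(per_email_features("Hi there", "click http://x.com now"))
```

The per-user features would then be aggregates (means, variances, ratios) of these values over a sliding window of sent mail.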
1. Histogram Analysis
• Histograms of separate users over specific features allow similarity estimation.
• Example below: on the left, two users over the same feature; on the right, the difference between their values.
  – Shows how these users differ over this feature.
  – Can be used to detect differences in behavior between the two users.
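The left/right comparison described above can be sketched with normalized histograms over shared bins; the per-bin absolute difference summarizes how far apart the two users are on that feature. The data below is illustrative, not from the poster:

```python
import numpy as np

# Illustrative per-email values of one feature (e.g. words per body)
# for two hypothetical users.
user_a = np.array([12, 15, 14, 90, 13, 16])
user_b = np.array([40, 42, 38, 45, 41, 39])

bins = np.linspace(0, 100, 11)          # shared bin edges for both users
hist_a, _ = np.histogram(user_a, bins=bins)
hist_b, _ = np.histogram(user_b, bins=bins)
pa = hist_a / hist_a.sum()              # normalize to probability mass
pb = hist_b / hist_b.sum()

difference = np.abs(pa - pb)            # per-bin difference ("right" plot)
print(difference.sum() / 2)             # total variation distance in [0, 1]
```

Summing half the absolute differences gives the total variation distance, one simple scalar measure of how dissimilar the two behavior profiles are.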
2. Covariance Analysis
• Goal: identify the features that vary most significantly with the labels.
• Method 1: Principal Component Analysis (PCA)
  – Determines a linear combination of relevant features that maximizes variance.
  – Does not take labels or redundancy into account.
• Method 2: Directions of Maximum Covariance
  – Determines directions in feature space that maximize the covariance between data and labels.
  – Modified to take potential feature redundancy into account.
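A minimal sketch of Method 2, assuming a single label column: the leading left singular vector of the cross-covariance matrix between centered features and labels is the direction in feature space that maximizes covariance with the labels. All data and dimensions below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # synthetic features
y = X[:, 2] * 2.0 + rng.normal(scale=0.1, size=200)   # label driven by feature 2

Xc = X - X.mean(axis=0)                 # center features
yc = y - y.mean()                       # center labels
C = Xc.T @ yc[:, None] / len(y)         # 5x1 cross-covariance cov[data, labels]
U, s, Vt = np.linalg.svd(C, full_matrices=False)
direction = U[:, 0]                     # direction of maximum covariance
print(np.argmax(np.abs(direction)))     # feature 2 dominates this direction
```

With a single label, `direction` is just the normalized covariance vector; with multiple label columns the same SVD generalizes.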
Greedy Feature Ranking
• Rank features with a simple greedy approach using Directions of Maximum Covariance:
  – Rank features by their contribution to the first principal component of the covariance matrix cov[data, labels].
Feature Ranking Algorithm
Set F = all features
While F is not empty:
    CovMat = empirical covariance matrix over F
    V = principal component vector of CovMat via SVD
    Select the feature f with the largest contribution to V
    Modify (deflate) CovMat to eliminate redundancy with f
    F = F - {f}
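The ranking loop above might be implemented as follows. The poster does not spell out the deflation step, so this sketch makes an assumption: redundancy is removed by projecting the centered data orthogonal to each selected feature before re-ranking. All data is synthetic.

```python
import numpy as np

def greedy_feature_ranking(X, y):
    # Sketch of the greedy ranking loop; the deflation scheme here
    # (Gram-Schmidt projection against selected features) is an
    # assumption, not necessarily the poster's exact method.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    remaining = list(range(X.shape[1]))
    ranking = []
    while remaining:
        # Cross-covariance of each remaining feature with the labels;
        # for a single label, the leading principal direction of
        # cov[data, labels] is this vector, so rank by its entries.
        c = Xc[:, remaining].T @ yc / len(yc)
        pick = remaining[int(np.argmax(np.abs(c)))]
        ranking.append(pick)
        # Deflate: project the data orthogonal to the chosen feature,
        # so features redundant with it drop in later rounds.
        f = Xc[:, pick]
        if f @ f > 1e-12:
            Xc = Xc - np.outer(f @ Xc / (f @ f), f).T
        remaining.remove(pick)
    return ranking

rng = np.random.default_rng(1)
s1 = rng.normal(size=300)
s2 = rng.normal(size=300)
y = s1 + 0.5 * s2
X = np.column_stack([s1, s1.copy(), s2])   # feature 1 is redundant with 0
ranking = greedy_feature_ranking(X, y)
print(ranking)
```

Without deflation, the redundant copy (feature 1) would be ranked second; with it, the independently informative feature 2 is, which is the point of the redundancy modification.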
Feature Ranking Results
[Figure: bar chart of the relative relevance (0.1–0.9) of each feature, per user. Features shown: WordsInSubj, WordsInBody, VarWordsInBody, VarCharInSubj, ScriptsInEmail, NumToAddrInWindow, NumFromAddrInWindow, MeanWordsInBody, MeanCharInSubj, LinksInEmail, ImagesInEmail, HtmlInEmail, FreqEmailSent, CharsInSubj, AvgWordLength]
Application: Worm Detection
• Statistical learning on outgoing email can be applied to detect and prevent novel worm propagation.
  – Success depends on the ability of the features to identify anomalous behavior.
• Constructed training/test sets of real email traffic artificially 'infected' with viruses.
• Applied feature selection techniques, then tested with different models.
Example Results
• Features added greedily using the selection algorithm.
• Graphs show there exists an optimal set of features, beyond which performance decreases.
[Figure: performance curves for Support Vector Machines and a Naïve Bayes classifier]
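To illustrate one of the model families compared above, here is a minimal Gaussian Naïve Bayes classifier on two hypothetical behavioral features. This is a generic sketch, not the authors' implementation, and every number below is invented:

```python
import numpy as np

def fit_gnb(X, y):
    # Per-class feature means, variances (smoothed), and class priors.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[int(c)] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(y))
    return params

def predict_gnb(params, x):
    # Choose the class with the highest Gaussian log-likelihood + log prior.
    def log_post(mu, var, prior):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_post(*params[c]))

# Invented toy data: class 0 = 'normal' behavior, class 1 = 'infected'
# (columns loosely standing in for words-per-body and emails-per-window).
X = np.array([[10., 2.], [12., 3.], [9., 2.], [205., 38.], [198., 42.], [210., 41.]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gnb(X, y)
print(predict_gnb(model, np.array([11., 2.5])))   # → 0 (normal)
```

The naïve independence assumption keeps the model cheap to fit per user, which matters when a profile must be maintained for every sender.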
Conclusions and Future Work
• Conclusion: analysis of email behavior could have many applications.
  – Feature selection is extremely important to model performance.
• Future work: study the effects of feature selection on classification accuracy for other statistical models.
• Try similar analysis on existing anti-spam solutions.
• Cluster user behavior into sets of common models describing general behavior patterns.