Email Spam Filtering Computer Security Seminar

06/16/22 Email Spam Filtering, Computer Security Seminar. N. Muthiyalu Jothir – 271120, Media Informatics

description

Email Spam Filtering, Computer Security Seminar. N. Muthiyalu Jothir – 271120, Media Informatics. Agenda: What is Spam? Statistics. Who Benefits from it? Spam Filtering Techniques. Combining Filters. Conclusion.

Transcript of Email Spam Filtering Computer Security Seminar

Page 1: Email Spam Filtering Computer Security Seminar

04/22/23 Email Spam Filtering - Muthiyalu Jothir 1

Email Spam Filtering
Computer Security Seminar

N. Muthiyalu Jothir – 271120
Media Informatics

Page 2: Email Spam Filtering Computer Security Seminar

Agenda

• What is Spam?
• Statistics
• Who Benefits from it?
• Spam Filtering Techniques
• Combining Filters
• Conclusion

Page 3: Email Spam Filtering Computer Security Seminar

What is Spam?

• Spam: unsolicited email.
• Spam involves sending identical or nearly identical messages to thousands (or millions) of recipients.

Caution! "SPAM" (Spiced Ham) is also a popular American canned meat brand.

Page 4: Email Spam Filtering Computer Security Seminar

Problem

• With a tiny investment, a spammer can send over 100,000 bulk emails per hour.
• Junk mail wastes storage and transmission bandwidth:
  - the ISPs' investment
  - a cost we absorb as the ISPs' customers
• Spam is a problem because the cost is forced onto us, the recipients.

Page 5: Email Spam Filtering Computer Security Seminar

Statistics

• Email considered Spam: 40% of all email
• Daily Spam emails sent: 12.4 billion
• Daily Spam received per person: 6
• Annual Spam received per person: 2,200
• Spam cost to all non-corp. Internet users: $255 million
• Spam cost to all U.S. Corporations in 2002: $8.9 billion
• Estimated Spam increase by 2007: 63%
• Users who reply to Spam email: 28%
• Users who purchased from Spam email: 8%
• Wasted corporate time per Spam email: 4-5 seconds

Page 6: Email Spam Filtering Computer Security Seminar

Who benefits from Spam?

• Financial firms, e.g. mortgage lenders
• Lead generators (gain 2% of the loan value per customer data)
• Spammers (share the profit with the lead generators)
• Recipient

[Diagram labels: "Recipient replies here"; "Information about interested customers"]

Page 7: Email Spam Filtering Computer Security Seminar

Spam Control Techniques

Fight Back techniques
• Reporting Spam to ISP
• Fight back filters
• Slow Senders
• Law ???
• etc.

Filtering Techniques
• Challenge-Response Filtering
• Blacklists and White lists
• Content based filters
  - Rule based
  - Bayesian filters

Page 8: Email Spam Filtering Computer Security Seminar

Reporting Spam To ISPs

• The original spam solution.
• Legitimate ISPs respond to such complaints.
• Spammers get kicked off.

Disadvantage
• Spammers disguise themselves.
• Naïve users cannot interpret the email headers.

Page 9: Email Spam Filtering Computer Security Seminar

Filters that Fight Back (FFB)

• The majority of spam contains links to web pages.
• Spam filters could automatically retrieve the URLs and crawl those pages, which would increase the load on the spammer's server.
• If all the spam receivers do this at the same time, the server might crash, and so the cost of spamming increases.

Caution!

FFB usually works with blacklists (of malicious servers) in order to avoid attacking innocent servers.

Page 10: Email Spam Filtering Computer Security Seminar

Filtering Techniques

Page 11: Email Spam Filtering Computer Security Seminar

Spam Vs Ham

• Care must be taken in any spam filtering technique:
  "All the spam could be allowed to pass through, but not even a single legitimate mail should be filtered."
• False positive: a legitimate mail classified as spam.
• The lowest possible false positive rate is desired.

Caution: check your junk folder before deleting. Don't believe your spam filter blindly.

Page 12: Email Spam Filtering Computer Security Seminar

Challenge-Response Filtering

• Emails from unknown senders receive an auto-reply message asking the senders to verify themselves.
• Senders are "challenged" to type in a word that is hidden within a graphic or a sound file.
• Mail is forwarded to the receiver's inbox only after a successful "response".
• This technique filters almost all spam. No spammer would be interested in taking the extra effort to prove himself or herself.
• Commercial product: "Spamarrest"

Disadvantage
• This technique is rude.
• Sometimes senders don't reply, or forget to reply, to the challenge.
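The challenge-response flow above can be sketched as a small in-memory state machine. This is only an illustration under stated assumptions: the class name `ChallengeResponseFilter`, its methods, and the use of a random token in place of the word hidden in a graphic or sound file are all hypothetical, and no real mail delivery is involved.

```python
import secrets


class ChallengeResponseFilter:
    """Holds mail from unknown senders until they answer a challenge (hypothetical sketch)."""

    def __init__(self, known_senders):
        self.known_senders = set(known_senders)
        self.pending = {}  # sender -> (challenge token, list of held messages)
        self.inbox = []

    def receive(self, sender, message):
        """Deliver mail from known senders; hold and challenge everyone else."""
        if sender in self.known_senders:
            self.inbox.append((sender, message))
            return None
        # Reuse an outstanding challenge for this sender, or create a new one.
        token, held = self.pending.get(sender, (secrets.token_hex(4), []))
        held.append(message)
        self.pending[sender] = (token, held)
        return token  # stands in for the word hidden in a graphic or sound file

    def respond(self, sender, answer):
        """Release held mail only if the answer matches the sender's challenge."""
        if sender not in self.pending:
            return False
        token, held = self.pending[sender]
        if answer != token:
            return False
        del self.pending[sender]
        self.known_senders.add(sender)
        self.inbox.extend((sender, m) for m in held)
        return True
```

In this sketch a sender who never answers simply stays in `pending`, which mirrors the disadvantage noted on the slide: legitimate mail can be lost when the challenge is ignored or forgotten.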

Page 13: Email Spam Filtering Computer Security Seminar

Blacklists and White lists

• Blacklists of misbehaving servers or known spammers are collected by several sites.
• The sender ID in the email is compared with the blacklist.
• White lists are complementary to blacklists, and contain addresses of trusted contacts.
• Use blacklists and white lists for first-level filtering (before applying content checks), not as the only tool for making the decision.

Disadvantage
• Prone to misconfiguration: a legitimate server that was incorrectly inserted into a list may be unable to get off it.
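A minimal sketch of list-based first-level filtering, assuming the lists are plain sets of lowercase addresses or domains. The function name `first_level_filter`, the three return values, and the choice to consult the white list before the blacklist are assumptions made for illustration.

```python
def first_level_filter(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'check_content' for a sender address.

    whitelist and blacklist are sets of lowercase addresses or domains.
    """
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]
    # Assumption: trusted contacts win over a (possibly wrong) blacklist entry.
    if sender in whitelist or domain in whitelist:
        return "accept"
    if sender in blacklist or domain in blacklist:
        return "reject"
    # Unknown senders are passed on to the content-based filters.
    return "check_content"
```

Checking the white list first is one way to soften the disadvantage above: a trusted contact on an incorrectly blacklisted server is still accepted.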

Page 14: Email Spam Filtering Computer Security Seminar

Content based filters

• It is not a good idea to filter mails based on blacklists alone.
• Wiser decision: consider the actual content of the email.
• Almost all the successful spam filters use this technique.
• Major types: Rule-based and Bayesian

Page 15: Email Spam Filtering Computer Security Seminar

Rule Based Filters

Rule-based filters apply a set of static rules to decide whether a mail is spam or not.

Rules could be
• words and phrases
• lots of uppercase characters
• exclamation points
• special characters
• Web links
• HTML messages
• background colors
• crazy Subject lines
• etc.

Page 16: Email Spam Filtering Computer Security Seminar

Rule based filters

• Rules are given scores, based on importance.
• Incoming mails are parsed and checked for known malicious patterns.
• A total score is calculated for the triggered rules.
• If Final Score > Threshold, classify as spam. Otherwise, classify as legitimate mail.
• The threshold is decided by the user.
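The scoring procedure above can be sketched in a few lines. The rule names, patterns, scores, and the default threshold of 3.0 are invented for illustration and are not taken from any real filter.

```python
import re

# Hypothetical rules: (name, compiled pattern, score). Values are illustrative only.
RULES = [
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("FREE_MONEY", re.compile(r"\bfree money\b", re.IGNORECASE), 2.5),
    ("HAS_WEB_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
    ("SHOUTING", re.compile(r"\b[A-Z]{5,}\b"), 1.0),
]


def score_message(text, rules=RULES):
    """Return (total score, names of triggered rules) for a message body."""
    triggered = [(name, score) for name, pattern, score in rules if pattern.search(text)]
    return sum(score for _, score in triggered), [name for name, _ in triggered]


def classify(text, threshold=3.0):
    """Classify as spam when the total score exceeds the user-chosen threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"
```

For example, `classify("FREE MONEY!!! Click http://example.com")` triggers all four rules and exceeds the threshold, while a message that triggers no rule scores 0 and is treated as legitimate.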

Page 17: Email Spam Filtering Computer Security Seminar

Rule Based Filters

• "SpamAssassin", a popular spam filtering product, uses rule-based filtering.
• Perl regular expressions (regex) are used for pattern checking.

Example rules
• header __LOCAL_FROM_NEWS From =~ /news@example\.com/i
• body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
• meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
• score LOCAL_NEWS_SALES_FIGURES 0.8

Page 18: Email Spam Filtering Computer Security Seminar

Rule Based Filters

Advantage
• Easy to implement
• No training required

Disadvantage
• Static rules are too general
• Spammers find new ways to deceive the rules

Page 19: Email Spam Filtering Computer Security Seminar

Bayesian Filters

• Bayesian filters are the latest in spam filtering technology and the most successful.
• Bayes classifiers have been used extensively in the field of pattern recognition.
• Given an unlabeled example, the classifier calculates the most likely classification, together with a probability.

Page 20: Email Spam Filtering Computer Security Seminar

Bayesian Filters

Steps in Bayes filtering
• Training
• Validation
• Implementation

• Training starts with two collections of mail: one of spam and one of legitimate mail.
• For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences.
• Bayesian filters are quite accurate, and adapt automatically as spam evolves.
• False positives are minimized by Bayesian filtering because it considers evidence of innocence as well as evidence of spam.
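The training step can be sketched as follows. This is a simplified illustration: the helper names, the whitespace tokenizer, and the choice to count each word once per mail are assumptions, and both collections are assumed to be non-empty.

```python
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def train_word_probabilities(spam_mails, ham_mails):
    """Return {word: estimated spam probability} from two labelled collections.

    Each word is counted once per mail, and the probability is the share of
    the word's occurrence rate that comes from the spam collection.
    """
    spam_counts = Counter(w for mail in spam_mails for w in set(tokenize(mail)))
    ham_counts = Counter(w for mail in ham_mails for w in set(tokenize(mail)))
    probabilities = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[word] / len(spam_mails)
        ham_rate = ham_counts[word] / len(ham_mails)
        probabilities[word] = spam_rate / (spam_rate + ham_rate)
    return probabilities
```

A word seen only in spam gets probability 1.0, a word seen only in legitimate mail gets 0.0 (the "evidence of innocence" mentioned above), and a word equally common in both gets 0.5.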

Page 21: Email Spam Filtering Computer Security Seminar

Bayesian Filtering

Bayes probability:

\[
\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{spam}) \cdot \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}
\]

A probability closer to 1 is classified as spam, and one closer to 0 is classified as ham.

0.5 is set as the threshold.
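A minimal sketch of applying the formula with the 0.5 threshold. It assumes a naive Bayes model (words treated as independent given the class) with add-one smoothing, neither of which is stated on the slide; the function names are hypothetical, and both training collections are assumed to be non-empty.

```python
import math
from collections import Counter


def train(spam_mails, ham_mails):
    """Count words per class and estimate the prior Pr(spam)."""
    spam_words = Counter(w for mail in spam_mails for w in mail.lower().split())
    ham_words = Counter(w for mail in ham_mails for w in mail.lower().split())
    vocab = set(spam_words) | set(ham_words)
    prior_spam = len(spam_mails) / (len(spam_mails) + len(ham_mails))
    return spam_words, ham_words, vocab, prior_spam


def spam_probability(message, model):
    """Return Pr(spam | words) under the naive independence assumption."""
    spam_words, ham_words, vocab, prior_spam = model
    spam_total = sum(spam_words.values())
    ham_total = sum(ham_words.values())
    log_spam = math.log(prior_spam)        # log Pr(spam)
    log_ham = math.log(1.0 - prior_spam)   # log Pr(ham)
    for word in message.lower().split():
        if word not in vocab:
            continue  # unseen words carry no evidence in this sketch
        # Add-one smoothed estimates of Pr(word | spam) and Pr(word | ham).
        log_spam += math.log((spam_words[word] + 1) / (spam_total + len(vocab)))
        log_ham += math.log((ham_words[word] + 1) / (ham_total + len(vocab)))
    # Pr(words) = Pr(words | spam) Pr(spam) + Pr(words | ham) Pr(ham),
    # so the posterior reduces to a logistic function of the log-odds.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def classify(message, model, threshold=0.5):
    return "spam" if spam_probability(message, model) > threshold else "ham"
```

Working in log space avoids multiplying many small probabilities together; the final line converts the log-odds back into a probability between 0 and 1.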

Page 22: Email Spam Filtering Computer Security Seminar

Neural Network for Training

[Figure: Neural Network Structure]

Page 23: Email Spam Filtering Computer Security Seminar

Neural Networks for Training

• Neural networks are used to train the spam filter (rule-based or Bayesian); the network itself is not a filter.
• The inputs are words, rules, etc.
• The network is trained over multiple samples of the user's mails (both spam and ham).
• The weights of the links are altered until the desired output is obtained.

Page 24: Email Spam Filtering Computer Security Seminar

Supervised Learning

• Supervised learning: training with a "teacher" signal.
• Train the system until the weights of the edges are optimized and no longer change.

Caution! Take care not to overtrain the network.
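As an illustration of training with a teacher signal until the weights stop changing, here is a single-neuron perceptron. This is a deliberate simplification of the multi-layer network pictured in the slides; the function names, the binary features, and the `max_epochs` cap are assumptions.

```python
def train_perceptron(samples, labels, learning_rate=1.0, max_epochs=100):
    """Adjust weights until one full pass over the data changes nothing.

    samples: list of feature vectors (e.g. 1 if a word or rule fired, else 0)
    labels:  teacher signal, 1 for spam and 0 for ham
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):  # cap on the number of passes over the data
        changed = False
        for x, target in zip(samples, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            output = 1 if activation > 0 else 0
            error = target - output  # difference from the teacher signal
            if error != 0:
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                changed = True
        if not changed:  # weights unaltered for a whole pass: stop training
            break
    return weights, bias


def predict(x, weights, bias):
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
```

The loop ends either when a full pass leaves the weights unaltered or when `max_epochs` is reached. In practice, overtraining is usually guarded against by checking performance on held-out mail, which this sketch does not do.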

Page 25: Email Spam Filtering Computer Security Seminar

Combining Spam Filters

• Goal: the combined filter aims to improve on the performance of the individual filters.
• Combined Filter = Original Filter (OF) + Received Filter (RF)
• Maximum gain: when the received filter contains some feature sets not found in the original filter.

E.g.
Original Filter = {"Share Market", "Higher Studies"}
Received Filter = {"Share Market", "Job Alerts"}
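Using the example above, the gain from the received filter can be read off with plain set operations. The variable and function names here are hypothetical, and representing a filter by a set of feature labels is a simplification.

```python
original_filter = {"Share Market", "Higher Studies"}
received_filter = {"Share Market", "Job Alerts"}


def unique_features(received, original):
    """Features the received filter knows about that the original does not."""
    return received - original


def shared_features(original, received):
    """Features both filters already cover."""
    return original & received


def covered_features(original, received):
    """Features covered by at least one of the two filters."""
    return original | received
```

Here `unique_features(received_filter, original_filter)` is `{"Job Alerts"}`, which is the part of the received filter that adds something new.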

Page 26: Email Spam Filtering Computer Security Seminar

Challenges

• Decisions (Spam / Ham) are made by both filters individually.
• Decisions agree: no problem.
• Disagreement: due to the difference in feature sets.

Challenges
• "How do we select the correct decision or filter?"
• "Who selects it?"

Page 27: Email Spam Filtering Computer Security Seminar

Filter Selector (FS)

• Training phase: the FS predicts the unique features (e.g. words) of the RF.
• Parse the emails of the training set and extract the features.
• Build a 'bag' of (predicted) features for the RF.
• Text similarity comparison between the current e-mail's features and the feature sets of the filters.

Page 28: Email Spam Filtering Computer Security Seminar

Algorithm Flowchart

[Flowchart: 1. Training Phase; 2. Final Verdict]

Page 29: Email Spam Filtering Computer Security Seminar

TF-IDF Similarity Measure

• Commonly used in Information Retrieval applications.
• More frequent words would be key to accurate classification of emails.
• The feature set predicted by the FS is unique.
• "Query – Document" retrieval procedure:
  - 2 documents: the feature sets
  - Query: the current email
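The query-document comparison can be sketched with a small TF-IDF and cosine-similarity routine. This is an illustration only: the function names, the smoothed IDF formula (chosen so that a term present in both feature sets does not get zero weight), and the use of cosine similarity are assumptions, not the presentation's exact method.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return TF-IDF vectors (dicts) for tokenized documents, plus the IDF table."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF (an assumption) so shared terms keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * idf[t] for t, count in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_filter(email_terms, feature_sets):
    """Pick the filter whose feature set is most similar to the current email.

    feature_sets maps a filter name to its list of feature terms (the documents);
    email_terms is the tokenized current email (the query).
    """
    names = list(feature_sets)
    vectors, idf = tfidf_vectors([feature_sets[name] for name in names])
    tf = Counter(t for t in email_terms if t in idf)
    query = {t: (count / len(email_terms)) * idf[t] for t, count in tf.items()}
    scores = {name: cosine(query, vec) for name, vec in zip(names, vectors)}
    return max(scores, key=scores.get), scores
```

With feature sets `{"OF": ["share", "market", "higher", "studies"], "RF": ["share", "market", "job", "alerts"]}`, an email containing "job" and "alerts" is routed to the RF, since only the RF's feature set shares terms with it.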

Page 30: Email Spam Filtering Computer Security Seminar

Experiments & Results

Page 31: Email Spam Filtering Computer Security Seminar

Conclusion

• We discussed the techniques to "kill" spam.
• Comparison between various techniques.
• So far, Bayesian seems to be reliable.
• Discussed a new approach to combine filters.
• Future work:
  - Learning techniques for the Filter Selector
  - Better similarity measures

Page 32: Email Spam Filtering Computer Security Seminar

Thank You