Spam - USF Computer Scienceejung/courses/686/lectures/Spam.pdf · What is spam? An unsolicited...

Spam filtering

Peter LikarishBased on slides by EJ Jung

11/03/10

What is spam?

An unsolicited email• equivalent to Direct Mail in postal service• UCE (unsolicited commercial email)• UBE (unsolicited bulk email)• 82% of US email in 2004 [MessageLabs 2004]

Etymology• Monty Python’s spam episode• “ham” = not spam

Historical events

May 3rd, 1978• Sent to Arpanet west coast users• Invitation to see new models of DEC-20• Violated ARPA’s user policy• Vendors were chastised & ceased (for then)

March 31st, 1993 – first UUCP spam• Unix-to-Unix Copy

April, 1994 – Green Card Lottery spam• Message posted to EVERY UUCP group

Who’s the victim?

Internet Service Provider (ISP)• much higher bandwidth usage

Email service providers• need to support larger disk space• need to deploy anti-spam measures

Businesses• time spent on spam• damage by spams and viruses and worms

Does spam bother you?

Potentially malicious• viruses, worms• phishing emails

Fill up the disk space

Does it bother you more than DMs?• if so, why?

How much email is spam?

Source: Microsoft Security Intelligence Report ‘09(http://www.microsoft.com/downloads/details.aspx?FamilyID=037f3771-330e-4457-a52c-5b085dc0a4cd&displaylang=en)

>97%filtered!

Other numbers

Spamhaus• >90%• http://www.spamhaus.org/effective_filtering.html

Postini (Google owned)• >90%-95%• http://googleenterprise.blogspot.com/search/label/spam%20and%20security%20trends

Consensus: >90% of all email is spam

How many messages?

Is (was) exponentially increasing

Cost of spam

Very back of the envelope… 100,000,000,000 messages/monthRecipient

• $0.025• $2.5 billion/month for recipients

Sender• $0.00001 for sender to generate• $1 million/month for sender

Profit?• 1:200,000 = 500,000 sales. $2 is break-even point

Legal actions

CAN-SPAM Act of 2003• UBE must be labeled• Must have opt-out option

Some people went to jail…Some ISPs have been shut down

Some ways to blocking spam

Network-level• DNS blackholes/blacklisting• Edge filtering

Content-level• Rule-based• checksum• Machine learning

– Probabilistic– SVM

Cost-based approach

DNS filtering

Blacklist = list of known spammers• IP addresses/domains/senders• Top five spamming ISPs account for > 80% of spam

traffic!

Whitelist = list of known good senders• IP addresses/domains/senders

Further reading:• Zhang et al. Highly Predictive Blacklisting• (http://www.cyber-ta.org/releases/HPB/HighlyPredictiveBlacklists-SRI-TR-Format.pdf)

How much email is spam?

Source: Microsoft Security Intelligence Report ‘09(http://www.microsoft.com/downloads/details.aspx?FamilyID=037f3771-330e-4457-a52c-5b085dc0a4cd&displaylang=en)

>97%filtered!

Edge filtering

Filtering at “edges” of network.• Ingress: filtering traffic entering your network• Egress: filtering traffic exiting your network

Mailstream

Providers

Sender

Content

Attachments

Content filtering

Rule-based• regular expressions on headers, contents, …• blacklist• Hash/checksum database

Bayesian probability/networks• probabilistic approach

Support Vector Machine• High dimension hyper-plane!

Given new email message do I deliver or trash?

Spam I’ve seen

Rule-based filtering

Rules are expressed as regular expressions (orSNORT traces)• If contains “refinancing” & “mortgage”, trash• If contains non-English alphabets, trash• If attachment type = executable, trash

Limitations• Difficult to write rules general enough to catch spam

but not legit email• often miss obvious tricks

– misspelling on purpose

Checksums – Fuzzy hashing

Goal: resilient to spelling changesDoes it look like spam I’ve already seen?

Fuzzy vs. non-fuzzy hashing

Traditional hashing• Same: Fixed length output• Different: Unique input = (nearly) unique output• 1 bit change in input = COMPLETELY different hash

Fuzzy hashing• Same: Fixed length output• Different: “mostly the same” = SAME value

Fuzzy hashing example

Viagra

Vi4gr4

V!aGra

V i a g r a

Fuzzy hash Output

DEF8F32

DE28F3C

DGF8E32

DAF8W3B

Output = fixed length. Look at distance between outputs.Shorter distance = more similar. Input = message

Bayesian Theorem

Conditional probability

Bayes’ Theorem

)()()|(

BPBAPBAP !

)()()|()|(

BPAPABPBAP =

Bayesian in Spam [Graham 02, Sahami 98]

Given a word “stock” in an email, what is theprobability of this email being spam?

)()()|()|(

stockPspamPspamstockP

stockspamP =

we can compute these fromthe sample set of emails

More words

Does more words mean higher probability?• related words such as stock, jackpot, blue chip, …• not-related words such as stock, diet, …• n words require O(2n) probabilities to compute

Naïve Bayesian filter• assumes the independence between words• O(n) probabilities to compute

),()()|,(),/(

jackpotstockPspamPspamjackpotstockP

jackpotstockspamP =

Support Vector Machine

Given records, plot them in n-dimension spacesand find a “plane” that divides the plots thebest.

Feature selection

How do we choose which tokens to examine?

Words alone are not sufficient• phrases, punctuation, …• sender’s email address, IP address, domain name, …• URLs, images, tags, …

More sophisticated conditions• attachment?• contains images?

Hybrid approaches

SpamAssassin• open-source• combination of rule-based and Bayesian• commonly run at SMTP server• can be applied for an individual user

Vipul’s Razor• Collaborative, distributed, checksum (statistical &

randomized signature-based) anti-spam solution• Commercial version: Cloudmark, Inc

Charging for email

If you can’t win the game, change the rules…Would you pay $0.0025 to send an email?

If you sent 50 emails/day:• ~$46/year

Spammers:• 100,000,000,000 * $.0025 = $250,000,000 dollars• Assuming adoption rate of 1:200,000• Break even = $1,250

Spam - USF Computer Scienceejung/courses/686/lectures/Spam.pdf · What is spam? An unsolicited...

Documents

Transcript of Spam - USF Computer Scienceejung/courses/686/lectures/Spam.pdf · What is spam? An unsolicited...

Introduction Spam in Society Email Spam IM Spam Text Spam Blog Spamming Spam Blogs.

buytestbank.eu · Web viewPersonal information management (PIM) helps you create and maintain (1) to-do lists, (2) ... Spam is roughly equivalent to unsolicited telephone marketing

Link Analysis for Web Spam Detection - uniroma1.itwadam.dis.uniroma1.it/pubs/link-analysis-spam.pdf · Link Analysis for Web Spam Detection • 2:5 score will probably originate from

Alberta Unsolicited Proposal Framework and Guideline€¦ · 10 Chapter Three | Unsolicited Proposal Management Framework 3.0 - Unsolicited Proposal Management Framework 3.1 Framework

Computer SPAM; The Impact of Unsolicited Email on Information … · 2014. 10. 25. · Spam and Information Security Mahmood Shubbak Spring 2014 1 Abstract Spam, or the unsolicited

The Value of Unsolicited Buy Recommendations to Investors ... · in email spam messages advertising certain stock trades. Through careful analysis of a basket of sixteen stocks that

Unsolicited Communication / SPIT / multimedia-SPAM

Spam Regulations 2004 - Rockend SMS · The Spam Act says that unsolicited commercial electronic messagesmust not be sent. Messages should only be sent to an address when it is known

INFORMATION AND DATA REQUEST TO …relevant service over such competing service. Documents solely relating to unsolicited commercial e-mail (i.e., SPAM) and malicious software need

Network Security and Privacyshmat/courses/cs378/spam.pdf · Why Hide Sources of Spam? Many email providers blacklist servers and ISPs that generate a lot of spam •Use info from

The spam Packageftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/packages/spam.pdf · Version 0.12 Author Reinhard Furrer Maintainer Reinhard Furrer Depends R

Teaching / Consulting for IT Securitydownload.microsoft.com/download/0/A/A/0AADDF80-6B44-48D4... · 2018-10-13 · Xeduco Spam Email Irrelevant, unwanted, and unsolicited email to

A Spam-Detecting Artiﬁcial Immune Systemterri.zone12.com/doc/academic/TerriOda-MastersThesis-Spam.pdf · Chapter 1 Introduction 1.1 Motivation Several years ago, the word “spam”

dokkaras.comdokkaras.com/wp-content/uploads/2020/05/Email-spam-F… · Web viewINTRODUCTION. Email spam, also referred to as junk email, is unsolicited messages sent in bulk by

Modbus Unsolicited

The Spam/Anti-Spam Battlefield - Sans Technology Institute · The Spam/Anti-Spam Battlefield At times, ... Spamming is the abuse of electronic messaging systems to send unsolicited,

Spam Overview. What is Spam? Spam is unsolicited email in the form of: Commercial advertising Phishing Virus-generated Spam Scams E.g. Nigerian.

SECURITY FOCUS REPORT Spam in Today’s Business World fileSecurity Focus Report 1 Spam Trends in Today’s Business World Introduction Spam refers to unsolicited bulk email. These

Email Security Deployment Guide - cisco.com · An email solution will become unusable if spam—unsolicited and unwanted emails—is not filtered properly. The sheer volume of spam

Machine Learning Techniques in Spam Filteringkt.era.ee/hw/spam/spam.pdf · Finally, some ideas are given of how to construct a practically useful spam ﬁlter using the discussed