NetworkPaperthesis1

Group Details:-

Dhara Shah z3299353

Imad Hashmi z3193866

Zuo Cui z3261136

Our Paper:- Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov,
Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM
SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.
PDF: http://www.cse.unsw.edu.au/~cs9333/10s1/Research_Project/papers/Spamming_Botnets_Sigcomm_2008.pdf

The literature review is organised as follows:-

Introduction

Background and Previous Work

Focus on Technology used in Paper

Future and Related Work

Introduction

Email has become a widespread means of communication around the world, with
millions of messages transferred every minute, so it is unsurprising that the
service has long been abused. One of the most common abuses is spamming, in
which advertisers send unsolicited advertisements for their products to
legitimate email users. The following discussion covers the methods used by
anti-spam systems to detect spam emails and botnets.

Background and Previous Work

There has been a great deal of research on identifying and filtering email
spam. Based on the part of the email used for detection, this work can be
broadly classified into two categories: non-content-based and content-based.

Non-content-based filters, also known as address-based filters, examine
information in the email header such as the IP address or email address.
Blacklists and whitelists are the common techniques in this category. A
blacklist records the IP addresses or email addresses that send spam;
conversely, a whitelist contains all acceptable email addresses. Both can be
deployed on client computers or on email servers. Cook et al. (2006)
experimented with a domain-specific blacklist running on the mail server to
reduce the amount of spam entering the network. However, a blacklist can easily
cause false positives: once a sender is caught sending spam, its IP address or
email address is recorded in the blacklist, and all other legitimate mail from
that address is subsequently marked as spam.
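
The false-positive problem can be illustrated with a minimal Python sketch of an address-based filter. The `Email` record, the `address_filter` helper and the example addresses are all hypothetical and are not taken from Cook et al.; the sketch only assumes that the filter sees the sender address and the source IP from the header.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    sender: str     # sender address taken from the email header
    source_ip: str  # IP address of the server that delivered the message


def address_filter(email, blacklist, whitelist):
    """Classify an email from header information only (hypothetical helper)."""
    if email.sender in whitelist:
        return "accept"
    if email.sender in blacklist or email.source_ip in blacklist:
        return "reject"
    return "unknown"  # left for a later, content-based stage


blacklist = {"203.0.113.7"}         # an IP address once seen sending spam
whitelist = {"friend@example.org"}  # explicitly trusted sender

# A legitimate user behind the same blacklisted IP is rejected as well.
legit = Email(sender="alice@example.com", source_ip="203.0.113.7")
print(address_filter(legit, blacklist, whitelist))
```

Because the decision never looks at the message itself, every sender behind a listed address shares its fate, which is the weakness the content-based approaches try to address.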


Content-based detection filters spam by analysing the message content of
received email, which overcomes the drawbacks of non-content-based filters.
These filters scan the content for sensitive keywords to identify spam. This
type of filter includes heuristic filters and Bayesian filters.

Heuristic detection, also known as rule-based analysis, uses regular
expression rules to detect phrases or characters that are common in spam
mail. Rules can be defined over email header information, keywords or URLs in
the content. William Cohen (1996) successfully used learned rules to classify
emails into different folders, but there has been little related research on
rule-based spam detection.

However, detection precision relies on the rules set by the mail system's
administrators, so defining the rules takes a significant amount of time, and
the rules then need frequent refinement. If the preset rules on the mail
system are not updated, the filters will not work effectively against new spam
with new features. In addition, the rules are rigid and can easily cause false
positives.
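
A minimal Python sketch of such a rule-based filter is given below. The three rules, their weights and the threshold are invented for illustration only; a real deployment would use a much larger, regularly maintained rule set.

```python
import re

# Hypothetical hand-written rules: (compiled pattern, weight).
RULES = [
    (re.compile(r"\bfree\s+money\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    # A URL that points at a raw IP address instead of a host name.
    (re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}\b"), 2.5),
]
THRESHOLD = 3.0  # assumed cut-off; chosen only for this example


def rule_score(text):
    """Sum the weights of every rule whose pattern occurs in the text."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def is_spam(text):
    return rule_score(text) >= THRESHOLD


print(is_spam("Get FREE money now at http://10.0.0.1/win"))
print(is_spam("Lunch at noon?"))
# A legitimate message that happens to contain a listed keyword is also
# flagged, which shows how rigid rules lead to false positives.
print(is_spam("Lecture notes on the viagra clinical trial"))
```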

In addition, because content-based spam detection can be treated as a text
classification problem, several machine learning approaches have been applied
to it. Bayesian classification is one of them. In 1998, Bayesian
classification techniques were applied to spam filtering (Sahami et al. 1998).
The filter records the occurrence of certain words or phrases in the message
content and then estimates, from these statistics, the probability that a
message is spam. In the reported results, Bayesian filters eliminated more than
95% of spam in experiments and identified 80% of incoming junk mail in a real
scenario. Bayesian filters can clearly achieve a high accuracy in detecting
spam with plain-text content.
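
The idea can be sketched with a small naive Bayes classifier in Python. This is a textbook multinomial model with Laplace smoothing, not the exact model of Sahami et al.; the class name and the four training messages are made up for the example.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-count naive Bayes filter (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        # Assumes at least one training message per class.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            for word in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert the two log scores into a normalised probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


nb = NaiveBayesSpamFilter()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.spam_probability("buy cheap pills") > 0.5)
print(nb.spam_probability("monday meeting") < 0.5)
```

As the text notes, the quality of such a filter depends entirely on the training messages it has seen.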

Bayesian filtering is now widely used alongside other methods in many spam
detection technologies to improve accuracy. However, it has some issues. First,
as with other machine learning approaches, the accuracy of a Bayesian filter
depends on the quality of the training data and the training process. Second,
although Bayesian filters give high precision for plain-text content, they have
difficulty detecting the rapidly growing volume of spam that contains images.
Further research at Okayama University therefore addressed image spam (Uemura &
Tabata 2008). It designed a method that allows an existing Bayesian filter to
learn image information, such as the file size or name, and then evaluate the
probability based on what it has learned. Experiments with this method showed
that the false negative rate dropped while the false positive rate stayed
almost the same. This means the method can only play a supporting role in
Bayesian spam identification, because images provide less information for
distinguishing spam from legitimate mail.

Content-based detection has many advantages, but it costs considerable time and
processing resources because it goes through the complete email. There is a
need for an anti-spam system that combines the advantages of content-based and
non-content-based spam detection.

AutoRE, a system designed by the Microsoft research group and the subject of
our anchor paper, attempts to combine both types of detection, i.e.
content-based and non-content-based. We now discuss in detail how AutoRE
combines the two.

Focus on Technology used in Paper

Unlike previous solutions for detecting botnets (such as Spamhaus and other
blacklists), AutoRE creates its signatures and trains itself dynamically, in
real time. Given a set of emails, it works in three major steps:-

1. URL Pre-processing

2. Group Selector

3. Regular Expression Generation

It is important to understand that the goal is not to classify emails as spam
or not spam. By definition, any email sent routinely and in bulk is spam, but
such email is not necessarily malicious: a normal user might send a relevant
message to their entire contact list. Our focus is on spam generated by
botnets, which carries no relevant content and is sent only to accomplish some
malicious purpose. Because botnets are automated, programmed systems, there is
a pattern in their sending behaviour, and the steps listed above are designed
to capture that pattern. During URL pre-processing the following parameters are
considered:-

1. URL String

2. Source server IP address

3. Email sending time

All forwarded messages are discarded, because a legitimate forwarding server
could otherwise be mistaken for a botnet member. URL strings that are
suspiciously random or span multiple domains are extracted. URL strings such as
a.com or b.com are unlikely to come from botnets, since they are registered
domain names, which are not economically feasible for spammers. The URL strings
are then broken down and grouped by domain name. Since spam emails are observed
to advertise a particular product or campaign, domain-specific signatures are
created first. From these domain-specific signatures, domain-agnostic regular
expressions are then generated, which gives better results in the form of
reduced false positive rates and identifies botnets even when they change their
domains. Before a domain-specific signature is generalised into a regular
expression, it must be shown to be distributed, bursty and specific; only then
can it be classified as a spam signature.
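
The pre-processing and grouping described above can be sketched in Python as follows. This is only an illustration under our own assumptions: the `Email` record, the `forwarded` flag, the URL pattern and the function name are hypothetical, and the actual AutoRE pre-processor is more involved.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit

# Simplified URL matcher; assumed good enough for this sketch.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


@dataclass(frozen=True)
class Email:
    body: str
    source_ip: str           # source server IP address
    sent_at: float           # email sending time (Unix timestamp)
    forwarded: bool = False  # assumed to be known from the headers


def group_urls_by_domain(emails):
    """Map each domain to the (URL, source IP, sending time) records seen."""
    groups = defaultdict(list)
    for email in emails:
        if email.forwarded:
            continue  # forwarding servers must not look like bots
        for url in URL_PATTERN.findall(email.body):
            domain = urlsplit(url).hostname
            if domain:
                groups[domain].append((url, email.source_ip, email.sent_at))
    return dict(groups)


emails = [
    Email("Buy at http://shop.example/p/1 now", "198.51.100.1", 100.0),
    Email("Fwd: http://shop.example/p/2", "198.51.100.2", 200.0, forwarded=True),
    Email("See http://Other.example/x and http://shop.example/p/3",
          "198.51.100.3", 300.0),
]
for domain, records in group_urls_by_domain(emails).items():
    print(domain, len(records))
```

Each group then carries exactly the three parameters listed earlier, which is what the later signature checks operate on.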

When grouping, it is important to decide how to group the domains, since n
emails may contain up to n different domain names. The distributed property is
evaluated together with temporal correlation, and the bursty property is
evaluated over a span of 5 days, since most ASes were observed to be active for
at least 5 days.
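
A minimal Python sketch of such a check is shown below. It assumes that each observation for a candidate signature is already annotated with the AS number of the sending IP. The 5-day window follows the text above, while the `min_ases` default is purely an illustrative value of ours, and the "specific" property is not covered here.

```python
from datetime import datetime, timedelta


def is_distributed_and_bursty(observations, min_ases=20,
                              max_span=timedelta(days=5)):
    """observations: list of (as_number, sent_at) pairs for one signature.

    min_ases is an assumed, illustrative threshold for "distributed".
    """
    if not observations:
        return False
    ases = {as_number for as_number, _ in observations}
    times = [sent_at for _, sent_at in observations]
    distributed = len(ases) >= min_ases
    bursty = max(times) - min(times) <= max_span
    return distributed and bursty


start = datetime(2007, 7, 1)
# 25 different ASes sending within one day: distributed and bursty.
burst = [(64500 + i, start + timedelta(hours=i)) for i in range(25)]
# The same 25 ASes spread over 25 days: distributed but not bursty.
slow = [(64500 + i, start + timedelta(days=i)) for i in range(25)]
print(is_distributed_and_bursty(burst))
print(is_distributed_and_bursty(slow))
```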

Once the domains have been classified into groups, the next step is to generate
the regular expressions. Generating a regular expression rather than a token
conjunction helps reduce the false positive rate, because the keywords in a
token conjunction are words that may or may not be part of an email. After
creating the domain-specific groups, we create a signature for each group, and
classification is then no longer tied to the group: it is domain-agnostic. This
ensures that if the botnets change domains in future they will still be
detected, because URLs on the new domain will match the same regular expression
and group signature that classify them as spam. This is a distinctive feature
of AutoRE that helps it find the maximum number of spam emails with a minimal
false positive rate.

After the emails have been categorised and assigned their regular expressions,
it is important to verify that the emails classified as spam really are spam.
This takes two steps. First, the suspected IP addresses are queried against a
blacklist; those found in the list are directly classified as spam. For those
not found, behavioural tests are run to determine whether they are spam. The
behavioural test is carried out on each campaign and considers the following
points:-

1). Similarity of Email Properties

2). Similarity of Sending Time

3). Similarity of Email Sending Behaviour

Because the emails we are targeting are generated and sent by automated
systems, the properties listed above play a big role. However random the
sending algorithm is designed to be, the sheer frequency of sending means that
a pattern will emerge.
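
To make the earlier regular-expression generation step more concrete, the Python sketch below generalises a group of URLs into one domain-agnostic pattern and compares it with a token conjunction. It is a deliberate simplification of ours, not the AutoRE algorithm itself: it only handles URLs whose paths split into the same number of tokens, and all function names and example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def char_class(tokens):
    if all(t.isdigit() for t in tokens):
        return "[0-9]"
    if all(t.isalpha() for t in tokens):
        return "[A-Za-z]"
    return "[A-Za-z0-9]"


def generalize(urls):
    """Return a regex that ignores the domain and matches every URL's path."""
    paths = []
    for url in urls:
        parts = urlsplit(url)
        tail = parts.path + ("?" + parts.query if parts.query else "")
        paths.append(TOKEN.findall(tail))
    if len({len(p) for p in paths}) != 1:
        return None  # sketch limitation: token structures must line up
    pieces = []
    for column in zip(*paths):
        if len(set(column)) == 1:
            pieces.append(re.escape(column[0]))  # constant part, kept literally
        elif all(t.isalnum() for t in column):
            lengths = [len(t) for t in column]
            pieces.append(f"{char_class(column)}{{{min(lengths)},{max(lengths)}}}")
        else:
            return None
    return r"^https?://[^/]+" + "".join(pieces) + "$"


def matches_token_conjunction(url, tokens):
    """Order-insensitive baseline: every token must appear somewhere."""
    return all(token in url for token in tokens)


signature = generalize([
    "http://shop-a.example/p/12345/buy.html",
    "http://deals-b.example/p/987/buy.html",
])
new_domain = "http://new-domain.example/p/4242/buy.html"
benign = "http://blog.example/p/how-to-buy-a-car.html"
print(bool(re.match(signature, new_domain)), bool(re.match(signature, benign)))
print(matches_token_conjunction(benign, ["/p/", "buy", ".html"]))
```

In this toy case the regular expression still matches the campaign after a domain change and rejects the benign URL, whereas the token conjunction accepts the benign URL as well.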

The work does not end there: with this software we can study the
characteristics of botnets and predict the traffic and spam emails they are
going to generate. This study of botnets has revealed many facts that are
pointers for future research in anti-spam systems. In the next section we
present the results of the study on botnets and their use in technologies that
emerged after AutoRE.

Future and Related Work

Characteristics of Botnets and their use in present anti spam systems:-

1). Spam Sending Patterns over the network

This characteristic is used in A Dynamic Reputation Service for Spotting
Spammers [1]. SpamSpotter is a real-time (like AutoRE) reputation system for
filtering spam messages. It classifies email senders in real time based on
their global sending behaviour, an approach called behavioural detection.
SpamSpotter then applies a third-party behavioural machine learning algorithm
to this data to generate a reputation for each sender. A preferred algorithm in
SpamSpotter is SNARE [5], a network-level behavioural algorithm that identifies
spam senders from their email sending behaviour instead of their addresses or
the content they send. In some cases SNARE is efficient enough to identify a
spammer before it has sent a large number of email messages.

AutoRE also studied similar behaviour, though SpamSpotter goes a step further
by implementing the SNARE algorithm to calculate the reputation of a sender.

2). Distribution of IP Address

One of the botnet characteristics observed during the AutoRE experiments was
the distribution of IP addresses. This is an important characteristic to study,
as it can help us understand and stop the wide spread of botnets.

This property has been extended in Studying Spamming Botnets Using Botlab [6].
Botnets are the most widely used spamming technique today; it is estimated that
85% of the billions of spam messages are generated by botnets. The paper
presents a botnet monitoring platform called Botlab, which monitors all
incoming spam traffic at a certain location. It scans the spam messages and
obtains bot binaries through spam links. A human operator then runs specific
tools on these binaries to obtain information about the bots sending the spam.
Botlab then executes multiple captive, sandboxed nodes from various botnets,
allowing it to observe the precise outgoing spam feeds from these nodes. It
scours the spam feeds for URLs, gathers information on scams, and identifies
exploit links. Finally, it correlates the incoming and outgoing spam feeds to
identify the most active botnets and the set of compromised hosts comprising
each botnet.

Another extension that studies botnet characteristics and uses them for
detection is BotGraph: Large Scale Spamming Botnet Detection [2]. BotGraph
detects the abnormal sharing of IP addresses among account holders in an email
system. Applied to two months of Hotmail logs totalling 450GB of data, BotGraph
identified over 26 million bot accounts with a low false positive rate of
0.44%. BotGraph has also been implemented as a distributed, cluster-based
algorithm using the MapReduce technique. It can detect both botnet sign-ups and
already created botnet email accounts.
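
The idea of detecting abnormal IP sharing can be sketched in Python as a small graph problem: accounts are nodes, two accounts are linked when they share enough IP addresses, and large connected groups are flagged. This is a single-machine toy of ours, not BotGraph's distributed implementation; the thresholds, function name and login data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def suspicious_account_groups(logins, min_shared_ips=2, min_group_size=3):
    """logins: iterable of (account, ip) pairs. Thresholds are assumptions."""
    ips_by_account = defaultdict(set)
    for account, ip in logins:
        ips_by_account[account].add(ip)

    # Union-find over accounts, with path halving in find().
    parent = {account: account for account in ips_by_account}

    def find(account):
        while parent[account] != account:
            parent[account] = parent[parent[account]]
            account = parent[account]
        return account

    # Quadratic pairwise comparison: fine for a sketch, not for Hotmail scale.
    for a, b in combinations(sorted(ips_by_account), 2):
        if len(ips_by_account[a] & ips_by_account[b]) >= min_shared_ips:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for account in ips_by_account:
        groups[find(account)].add(account)
    return [group for group in groups.values() if len(group) >= min_group_size]


logins = [
    ("bot1", "ip1"), ("bot1", "ip2"),
    ("bot2", "ip1"), ("bot2", "ip2"),
    ("bot3", "ip1"), ("bot3", "ip2"),
    ("alice", "ip1"), ("alice", "ip9"),  # shares only one IP with the bots
    ("bob", "ip8"),
]
print(suspicious_account_groups(logins))
```

The pairwise comparison is the expensive part, which gives some intuition for why the authors turned to a distributed implementation for logs of that size.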

Another interesting finding from the AutoRE research, in the category of
network traffic analysis, was the increase in the use of static IP addresses
from Nov 2006 to July 2007. This finding helped improve blacklists by
populating them with static IP addresses. The research also suggested that
botnets are evolving and creating more sophisticated, polymorphic URLs to
bypass anti-spam systems.

One major disadvantage of AutoRE is that it has not yet been implemented in a
practical real-time setting. Its methods are still under investigation, and a
real-time deployment is still awaited.

References

1). Ramachandran, A, Hao, S, Khandelwal, H, Feamster, N & Vempala, S, ‘A Dynamic
Reputation Service for Spotting Spammers’, School of Computer Science, Georgia
Tech
http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf

2). BotGraph: Large Scale Spamming Botnet Detection
http://research.microsoft.com/pubs/79413/botgraph.pdf

3). Cohen, W 1996, ‘Learning Rules that Classify E-Mail’, Advances in Inductive
Logic Programming, pp. 124-143

4). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before
it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006
Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202

5). SNARE: Spatio-temporal Network-level Automatic Reputation Engine
http://hdl.handle.net/1853/25135

6). Studying Spamming Botnets Using Botlab
http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf

7). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a Bayesian-filter-based
Image Spam Filtering Method’, 2008 International Conference on Information
Security and Assurance, IEEE

8). Sahami, M, Dumais, S, Heckerman, D & Horvitz, E 1998, ‘A Bayesian approach to
filtering junk E-mail’, Learning for Text Categorization: Papers from the 1998
Workshop, Madison, Wisconsin
