Towards Modeling Legitimate and Unsolicited Email Traffic Using Social Network Properties

Farnaz Moradi, Tomas Olovsson, Philippas Tsigas


Legitimate and Unsolicited Email Traffic

The battle between spammers and anti-spam strategies is not over yet.

[Figure: number of emails per day over 14 days (y-axis scaled by 10^6), comparing all email with legitimate email]


Legitimate and Unsolicited Email Communications

• Human-generated communications create implicit social networks
• Spam is sent automatically
  – It is expected that it does not exhibit the social network properties of human-generated communications
• Spam can be identified based on how it is sent
  – It is expected that this behavior is more difficult for the spammers to change than the content of the email


Outline

• Email Dataset
• Email Networks
• Social Network Properties
• Implications
• Conclusions


Email Dataset

• SMTP packets were collected (port 25)
• Packets were aggregated into TCP flows
• Emails were reconstructed from flows
• Emails were classified into Accepted and Rejected by receiving mail servers
• Accepted emails were classified into Ham and Spam using a well-trained SpamAssassin
• Email addresses extracted from SMTP headers were automatically anonymized and packet content was removed

[Figure: measurement location on the OptoSUNET core network (SUNET customers, access routers, 2 core routers, 40 Gb/s and 10 Gb/s (x2) links, NORDUnet, main Internet)]

[Figure: data reduction. Packets: 797 M; Flows: 46.8 M; Emails: 20 M; Rejected: 16.6 M; Accepted: 3.4 M (1.5 M + 1.9 M, split into Ham and Spam)]
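The anonymization step listed above can be illustrated with a keyed hash. The slides do not say how addresses were anonymized, so the HMAC-SHA256 scheme, the `anonymize_address` name, the case normalization, and the key below are all assumptions made for illustration, not the authors' method.

```python
import hashlib
import hmac


def anonymize_address(address: str, key: bytes) -> str:
    """Map an email address to a stable pseudonym (assumed scheme, not from the slides)."""
    # Assumption: addresses are compared case-insensitively and without surrounding spaces.
    normalized = address.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; keeping the full digest lowers the collision risk.
    return digest[:16]


key = b"example-secret-key"  # hypothetical key, for illustration only
a = anonymize_address("Alice@example.org", key)
b = anonymize_address("alice@example.org ", key)
c = anonymize_address("bob@example.org", key)
```

The same address always maps to the same pseudonym, so the graph structure survives anonymization while the original addresses are not stored.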


Email Networks

• Implicit social networks:
  – Nodes (V): Email addresses
  – Edges (E): Transmitted emails
• Dataset A:
  – |V| = 10,544,647
  – |E| = 21,562,306
• Dataset B:
  – |V| = 4,525,687
  – |E| = 8,709,216
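A minimal sketch of how such an email network could be built from (sender, receiver) pairs. The function name and the toy input are hypothetical, and collapsing repeated emails between the same pair into a single directed edge is an assumption; the slides only state that edges are transmitted emails.

```python
from collections import defaultdict


def build_email_network(emails):
    """Build a directed graph from (sender, receiver) pairs."""
    out_edges = defaultdict(set)  # sender -> set of distinct receivers
    nodes = set()
    for sender, receiver in emails:
        nodes.add(sender)
        nodes.add(receiver)
        # Assumption: repeated emails between the same pair count as one edge.
        out_edges[sender].add(receiver)
    num_edges = sum(len(targets) for targets in out_edges.values())
    return nodes, out_edges, num_edges


emails = [("a", "b"), ("a", "c"), ("b", "a"), ("a", "b")]
nodes, out_edges, num_edges = build_email_network(emails)
```

Here `len(nodes)` plays the role of |V| and `num_edges` the role of |E|.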


Structural and Temporal Properties of Email Networks

• Do email networks exhibit similar structural and temporal properties to other social networks?
  – Scale free (power law degree distribution)
  – Small world (short path length and high clustering)
  – Connected components (giant core)


Scale-Free Networks

• Power law degree distribution

[Figure: in-degree and out-degree distributions (Frequency vs. Degree, log-log axes) for the Ham, Spam, Rejected, and Complete networks of Dataset A]


Scale-Free Networks

• Power law degree distribution

[Figure: in-degree and out-degree distributions (Frequency vs. Degree, log-log axes) for the Ham, Spam, Rejected, and Complete networks of Dataset B]
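The degree distributions plotted on these slides can be sketched as follows. This is an illustrative computation, not the authors' code: the function names are hypothetical, nodes with degree zero are left out because they cannot appear on a log axis, and the least-squares slope on log-log axes is only a rough check for power-law behavior (it needs at least two distinct degrees).

```python
import math
from collections import Counter


def degree_frequencies(edges):
    """Return (in_freq, out_freq): fraction of nodes having each degree >= 1."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)

    def normalize(deg):
        counts = Counter(deg.values())
        total = sum(counts.values())
        return {k: c / total for k, c in sorted(counts.items())}

    return normalize(in_deg), normalize(out_deg)


def loglog_slope(freq):
    """Least-squares slope of log10(frequency) against log10(degree)."""
    xs = [math.log10(k) for k in freq]
    ys = [math.log10(f) for f in freq.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "c")]
in_freq, out_freq = degree_frequencies(edges)

# Synthetic frequencies that follow k^-2 exactly, to check the slope estimate.
synthetic = {k: k ** -2.0 for k in (1, 2, 4, 8, 16)}
slope = loglog_slope(synthetic)
```

A distribution that is close to a straight line on log-log axes, with a negative slope, is consistent with the scale-free property named on the slide.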


Small-World Networks

• Small average shortest path length
• High average clustering coefficient

[Figure: average clustering coefficient vs. days (1 to 14) for the Ham and Spam networks, Dataset A and Dataset B]

[Figure: average shortest path length vs. days (1 to 14) for the Ham and Spam networks, Dataset A and Dataset B]
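The two small-world measures named above can be sketched as below. This is a minimal illustration under stated assumptions: edge direction is ignored, path lengths are averaged only over pairs that are actually connected, and the function names are hypothetical. The all-pairs breadth-first search is far too slow for graphs with millions of nodes, where some form of sampling would be needed.

```python
from collections import defaultdict, deque
from itertools import combinations


def undirected_adjacency(edges):
    """Adjacency sets with edge direction ignored (an assumption) and self-loops dropped."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean BFS distance over all connected ordered node pairs."""
    dist_sum, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs


triangle = undirected_adjacency([("a", "b"), ("b", "c"), ("c", "a")])
path = undirected_adjacency([("a", "b"), ("b", "c")])
```

A small-world network combines a short average path length with a clustering coefficient well above that of a random graph of the same size.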


Connected Components

• Giant connected component
• Power law component size distribution

[Figure: connected component size distribution (Frequency vs. CC size, log-log axes) for the Ham and Spam networks, Dataset A and Dataset B]

[Figure: relative GCC size vs. days (1 to 14) for the Ham and Spam networks, Dataset A and Dataset B]
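Component sizes and the relative size of the giant connected component (GCC) can be computed with a union-find structure, as sketched below. Treating components as weakly connected (direction ignored) and defining relative GCC size as the fraction of nodes in the largest component are assumptions; the function names are hypothetical.

```python
from collections import Counter


def component_sizes(edges):
    """Sizes of weakly connected components, largest first (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components

    sizes = Counter(find(node) for node in parent)
    return sorted(sizes.values(), reverse=True)


def relative_gcc_size(edges):
    """Fraction of all nodes that belong to the largest component."""
    sizes = component_sizes(edges)
    return sizes[0] / sum(sizes)


edges = [("a", "b"), ("b", "c"), ("d", "e")]
sizes = component_sizes(edges)
rel = relative_gcc_size(edges)
```

Counting how often each value occurs in `sizes` gives the component size distribution shown in the figure.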


Implications

• Spam does not exhibit the social network properties of human-generated communications
• The unsolicited email traffic causes anomalies in the structural properties of email networks
• These anomalies can be identified by using an outlier detection mechanism

[Figure: out-degree distribution of the Complete network with outliers marked (Frequency vs. Out-Degree, log-log axes), two panels]
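The slides do not specify the outlier detection mechanism. As a stand-in, the sketch below fits a straight line to the out-degree distribution on log-log axes and flags degrees whose observed frequency deviates from the fit by more than a chosen factor. The function name, the least-squares fit, and the threshold of 1.0 (a factor of 10) are all assumptions for illustration.

```python
import math


def degree_outliers(freq, threshold=1.0):
    """Flag degrees whose log10 frequency is further than `threshold` from a log-log line fit."""
    ks = sorted(freq)
    xs = [math.log10(k) for k in ks]
    ys = [math.log10(freq[k]) for k in ks]
    n = len(ks)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [k for k, x, y in zip(ks, xs, ys)
            if abs(y - (slope * x + intercept)) > threshold]


# Synthetic k^-2 distribution with an inflated frequency at out-degree 5.
freq = {k: k ** -2.0 for k in range(1, 11)}
freq[5] *= 100
outliers = degree_outliers(freq)

clean = degree_outliers({k: k ** -2.0 for k in range(1, 11)})
```

Nodes whose out-degree falls among the flagged values would then be the candidates for spamming nodes. A single fit is sensitive to the outliers themselves, so a robust fit may be preferable on real data.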


Identifying Spamming Nodes

[Figure: out-degree distributions with outliers marked (Frequency vs. Out-Degree, log-log axes) for the 1 day and 7 days networks of Dataset A]

Dataset  Network  Total spam  Spam sent by outliers (1<k<100)
A        1 day    68%         95.5%
A        7 days   70%         96.8%
A        14 days  70%         96.9%
B        1 day    40%         82.7%
B        7 days   35%         81.3%
B        14 days  39%         87.3%


Conclusions

• A network of legitimate email traffic can be modeled similarly to other social networks
  – Small-world, scale-free network
• A network of unsolicited traffic differs from social networks
  – Spammers do not emulate a social network
• This unsocial behavior of spam is not hidden in the mixture of email traffic
  – Spammers can be identified without inspecting the content of the emails

Thank You!