
Anti-Spam Filtering Using Neural Networks and Bayesian Classifiers

Yue Yang and Sherif Elfayoumy, Senior Member, IEEE

Abstract—Electronic mail is inarguably the most widely used Internet technology today. With the massive amount of information and the speed the Internet can handle, email and other online communication systems have revolutionized communication. However, some users abuse this technology by sending out thousands of spam emails that serve little purpose beyond inflating traffic and consuming bandwidth. This paper evaluates the effectiveness of email classifiers based on a feed-forward backpropagation neural network and on Bayesian classifiers. Results are evaluated using accuracy and sensitivity metrics. They show that the feed-forward backpropagation classifier provides relatively high accuracy and sensitivity, making it competitive with the best-known classifiers. Bayesian classifiers, though not as accurate, are very easy to construct and can readily adapt to changes in spam patterns.

I. INTRODUCTION

Today's economic climate demands that companies maximize resources and reduce operating expenses. A

high volume of unwanted, and sometimes offensive, spam email diminishes productivity and increases costs (by taking up bandwidth used for Internet access, and inundating mail server processing, storage, and backups). The growing problem of unsolicited bulk email, also known as “spam”, has generated a need for reliable and efficient email classifiers.

Filtering out spam emails before they reach the user is widely regarded as important and can be automated if intelligent software programs are developed and used. Many tools have been developed, or are being developed, to accomplish this goal. However, most programs rely on preset rules that do not adapt to individual users' characteristics or to changes in spammers' behavior. In this paper we investigate the effectiveness of feed-forward backpropagation neural network and Bayesian classifiers.

Sherif Elfayoumy is with the School of Computing, University of North Florida, Jacksonville, FL 32224 USA (phone: 904-620-2985; fax: 904-620-2988; e-mail: [email protected]).

Yue Yang is a graduate of the University of North Florida.

Many techniques currently exist for identifying spam emails, and numerous tools have been developed to sort emails as they arrive at the user's mailbox or mail client. Most spam filtering systems use keyword-based and rule-based techniques and employ heuristics such as identifying sending domains from which spam is known to originate, identifying email from relay hosts known to forward spam, identifying email from unknown or strangely formatted senders or from senders in certain domains, identifying keywords common to much spam (e.g., "make money fast" or "multi level marketing"), or applying other simple textual analysis such as looking for odd spacing or excessive capital letters and exclamation points [1]. The recognition capabilities of these heuristics require ongoing keyword and rule updating, and they carry the risk of erroneously misclassifying legitimate messages as spam [2]. In this paper, legitimate email is treated as the positive class; hence the consequences of classifying a legitimate message as spam (referred to as a false negative classification) are much greater than those of classifying a spam message as legitimate (referred to as a false positive classification).

II. NEURAL NETWORK CLASSIFICATION

The backpropagation learning method is an iterative

process used to train the feed-forward neural network for minimal response error. An input pattern is applied to the network and forward propagated through the network using the initial node connection weights. The output error is determined and then back-propagated to establish a new set of network connection weights. The process continues until a prescribed minimum error is achieved. Figure 1 shows a typical topology for a feed-forward network, consisting of an input layer, a hidden layer, and an output layer [3].

Fig. 1. A Feed-Forward Network Topology




The input layer of the feed-forward network is generally fully connected to all nodes in the subsequent hidden layer. Input vectors may be binary, integer, or real and are usually normalized. When inputs are applied to the network, they are scaled by the connection weights between the input layer and the first hidden layer. These weights are denoted w_ji, where i represents the ith input node and j represents the jth node in the first hidden layer. Each node j of the hidden layer acts as a summing node for all scaled inputs and then applies an output activation function. The summing function is given in equation 1, where w_ji is the connection weight between the jth node of the hidden layer and the ith node of the input layer, p_i is the input value at the ith node, and b_j is the bias at the jth node of the hidden layer.

$net_j = \sum_i w_{ji}\, p_i + b_j$    (1)

One of the most popular activation functions used in classification problems is the sigmoid function [4]. This function is given in equation 2, where o_j is the output of the jth node. The parameter θ_j serves as a bias and is used to shift the activation function along the horizontal axis. The parameter θ_0 modifies the shape of the sigmoid function: low values make the function resemble a threshold-logic unit, while high values result in a more gradually sloped function.

$o_j = \dfrac{1}{1 + e^{-(net_j + \theta_j)/\theta_0}}$    (2)

Subsequent layers work in a similar manner: the output from the preceding layer is summed at each node using equation 1, and the output of the node is determined using the activation function of equation 2. There is no limit on the number of nodes in a hidden layer, nor on the number of hidden layers. However, practical experience shows that excessively large networks are difficult to train and require large amounts of memory and execution time. The backpropagation training process begins by selecting a set of training input samples along with their corresponding output vectors [5]. An input sample is applied to a network whose connection weights have been randomly initialized. The resulting output is determined by calculating the feed-forward output at each node and forward propagating the layer results until the output layer nodes are activated. The actual output o_k is then compared to the target output t_k, and the error is calculated using the sum of squared errors (SSE) or another selected criterion. The relationship representing the SSE for an input pattern p is given in equation 3, where k indexes the nodes of the output layer.

$SSE_p = \frac{1}{2}\sum_k (t_k - o_k)^2$    (3)
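To make the forward pass concrete, the following minimal NumPy sketch evaluates equations 1-3 for a single pattern. It is illustrative only, not the authors' implementation; the layer sizes, zero-initialized biases, random weights, and θ_0 = 1 are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(net, theta, theta0=1.0):
    """Activation of equation 2: shifted sigmoid with shape parameter theta0."""
    return 1.0 / (1.0 + np.exp(-(net + theta) / theta0))

def forward_layer(p, W, b, theta, theta0=1.0):
    """One layer of the forward pass: net_j = sum_i w_ji * p_i + b_j (equation 1),
    followed by the activation of equation 2."""
    net = W @ p + b
    return sigmoid(net, theta, theta0)

def sse(target, output):
    """Sum of squared errors for one pattern (equation 3)."""
    return 0.5 * np.sum((target - output) ** 2)

# Toy usage with assumed sizes: 57 inputs, 7 hidden neurons, 2 output neurons
# (the topology eventually adopted later in the paper).
rng = np.random.default_rng(0)
p = rng.uniform(-1, 1, size=57)                       # one normalized input pattern
W1, b1, th1 = rng.normal(size=(7, 57)), np.zeros(7), np.zeros(7)
W2, b2, th2 = rng.normal(size=(2, 7)), np.zeros(2), np.zeros(2)
hidden = forward_layer(p, W1, b1, th1)
output = forward_layer(hidden, W2, b2, th2)
print(sse(np.array([1.0, 0.0]), output))              # error against a spam target (1, 0)
```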

Once the error is determined, the connection weights in the network are updated using a gradient descent approach, by back-propagating the weight changes starting at the output layer and working back toward the input layer. The incremental change in w_kj for a given pattern p is termed Δ_p w_kj and is proportional to −∂SSE/∂w_kj. The relationship for calculating the incremental change in connection weights between the last hidden layer and the output layer is given by equation 4.

$\Delta_p w_{kj}(m+1) = \eta\,\delta_k\, o_j + \alpha\,\Delta_p w_{kj}(m)$    (4)

where m = the iteration step in the training process, η = the learning rate, and α = the momentum factor applied to the previous increment Δ_p w_kj(m). The δ value at the output layer is given by equation 5.

$\delta_{pk} = (t_{pk} - o_{pk})\, o_{pk}\,(1 - o_{pk})$    (5)

The subscript p represents the pattern number. The values t_pk and o_pk represent the target output and the actual output, respectively, of the pth pattern at the kth output node. The incremental change in w_ji (between the hidden layer and the input layer) for a given pattern p is termed Δ_p w_ji and is identical to equation 4, with the exception that the delta term is redefined as given in equation 6.

$\delta_{pj} = o_{pj}\,(1 - o_{pj})\sum_k \delta_{pk}\, w_{kj}$    (6)

In general, the network connection weights are adjusted at the end of each training cycle. That is, after all training patterns have been presented to the network and the Δ_p w_kj values have been calculated, the summation of all pattern changes is applied to the network. This relationship is represented in equation 7.

$\Delta w_{kj} = \sum_p \Delta_p w_{kj}$    (7)

Once the network is updated, the process is repeated until either a specified error limit is achieved or the total number of training cycles (epochs) is completed.
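The whole training cycle can be sketched compactly in NumPy as shown below. This is a simplified reading of equations 4-7, not the paper's implementation: one hidden layer is assumed, the θ terms are folded into the biases, and the momentum term is applied to the accumulated epoch increments rather than to every per-pattern increment. The learning rate, momentum factor, and random data are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_epoch(W1, b1, W2, b2, X, T, eta=0.2, alpha=0.5, prev=None):
    """One training cycle (epoch): per-pattern deltas (equations 5 and 6) are
    accumulated and the summed change (equation 7), plus the momentum term of
    equation 4, is applied to the weights once at the end."""
    params = (W2, b2, W1, b1)
    if prev is None:
        prev = [np.zeros_like(m) for m in params]
    acc = [np.zeros_like(m) for m in params]
    total_sse = 0.0
    for p, t in zip(X, T):
        h = sigmoid(W1 @ p + b1)                  # hidden outputs (equations 1-2, theta folded into b)
        o = sigmoid(W2 @ h + b2)                  # output layer
        total_sse += 0.5 * np.sum((t - o) ** 2)   # equation 3
        d_out = (t - o) * o * (1 - o)             # equation 5
        d_hid = h * (1 - h) * (W2.T @ d_out)      # equation 6
        grads = (np.outer(d_out, h), d_out, np.outer(d_hid, p), d_hid)
        for a, g in zip(acc, grads):
            a += eta * g                          # eta * delta * o, summed over patterns
    new_prev = [a + alpha * m for a, m in zip(acc, prev)]   # momentum term of equation 4
    for w, dw in zip(params, new_prev):
        w += dw                                   # apply the accumulated change (equation 7)
    return total_sse, new_prev

# Illustrative run on random data shaped like this paper's problem (57 inputs, 2 outputs).
rng = np.random.default_rng(0)
X, T = rng.uniform(-1, 1, size=(100, 57)), np.tile([1.0, 0.0], (100, 1))
W1, b1 = rng.normal(scale=0.1, size=(7, 57)), np.zeros(7)
W2, b2 = rng.normal(scale=0.1, size=(2, 7)), np.zeros(2)
prev = None
for epoch in range(50):
    err, prev = train_epoch(W1, b1, W2, b2, X, T, prev=prev)
print(err)
```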

III. EXPERIMENTAL DESIGN

The dataset used in this research is available from the UCI

Machine Learning Repository [6]. The dataset was generated in June and July 1999. It consists of spam and legitimate emails collected by George Forman, Mark Hopkins, Erik Reeber, and Jaap Suermondt of Hewlett-Packard Labs. The spam emails came from their postmaster and from individuals who had reported spam. The legitimate emails were donated by George Forman, and hence the word "George" and the area code "650" are indicators of legitimate emails [7]. There are 4601 instances in the collection, including 2788



(60.6%) legitimate and 1813 (39.4%) spam emails. There are 58 attributes in the dataset: 48 continuous attributes representing word frequencies, 6 continuous attributes representing character frequencies (";", "$", ...), three continuous attributes that are statistics of runs of capital letters, and one nominal class label that denotes whether the email is considered spam or legitimate. No missing attribute values were found. Table 1 shows the 48 words and the 6 characters whose frequencies are represented by the attribute values. Min-max normalization is used to scale the attributes into the range [-1.0, 1.0].

Figure 2 depicts the general framework of the email classification process followed in this research. In this framework, a set of preprocessed email messages, referred to as "historical data", is used to construct the classifiers. To achieve the most reliable estimates, the historical data are randomly shuffled and partitioned into five independent subsets using 5-fold cross validation [8]. For each experiment, one subset is allocated for testing and the other four subsets are allocated for training the classifier. For each classification model, five experiments were conducted and their accuracy and sensitivity were calculated and compared.

Fig. 2. Email classification process model
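The shuffling, scaling, and partitioning described above can be sketched as follows. This is illustrative rather than the authors' code; the guard for constant-valued attributes and the fixed random seed are assumptions.

```python
import numpy as np

def minmax_scale(X, lo=-1.0, hi=1.0):
    """Min-max normalization of each attribute into [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # guard against constant columns
    return lo + (X - xmin) * (hi - lo) / span

def five_fold_indices(n, seed=0):
    """Shuffle the sample indices and split them into 5 near-equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

# Each experiment holds out one fold for testing and trains on the other four;
# with 4601 instances this yields folds of 921, 920, 920, 920, and 920 samples.
folds = five_fold_indices(4601)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # build the classifier on train_idx and evaluate it on test_idx here
```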

IV. NEURAL NETWORK TOPOLOGY

The neural network classifiers constructed in this research

were based on five experiments using the 5-fold cross validation described above. For each experiment, a classifier was built using a training dataset, and was tested using one test subset. The 57 continuous real attributes correspond to 57 input neurons in the input layer of the neural network. Since there

is no precise method for determining the number of hidden layers and the number of neurons per hidden layer, these must be determined experimentally. A comprehensive trial-and-error search was conducted: the training process was repeated with different network topologies and the accuracies of the resulting trained networks were estimated. Two groups of network topologies were trained. The first group included one hidden layer with a varying number of hidden neurons (2 to 56) and an output layer with either 2 or 3 output neurons. The second group included two hidden layers with 2 to 56 neurons per hidden layer and an output layer with 2 or 3 output neurons. One nominal class label was recorded for each training sample, denoting whether the email is considered spam (label = 1) or non-spam (label = 0). For topologies with 2 output neurons, the two outputs form a 2-bit binary code that indicates the sample's class: the target outputs for spam and non-spam are (1,0) and (0,1), respectively. For topologies with 3 output neurons, the three outputs form a 3-bit code, with targets (1,1,1) for spam and (0,0,0) for non-spam.

A. Number of Hidden Layers

During the preliminary experimentation, the following

behaviors were observed: i) a network with one hidden layer works very nearly as well as one with two hidden layers, ii) a very large number of neurons in the hidden layer did not reduce the error rate, and iii) a network with two output neurons works as well as one with three output neurons. Table 2 shows part of the training results, measured by root mean square (RMS) error.

Table 2. Preliminary Results

One hidden layer:

#Neurons in Hidden Layer    RMS (2 Output Neurons)    RMS (3 Output Neurons)
2                           1.0                       1.0
3                           1.0                       1.0
4                           0.886                     0.9
...                         ...                       ...
25                          1.091                     1.0
26                          0.9                       1.091
27                          0.886                     0.9
...                         ...                       ...

Two hidden layers:

#Neurons in Hidden Layer 1    #Neurons in Hidden Layer 2    RMS (2 Output Neurons)    RMS (3 Output Neurons)
...                           ...                           ...                       ...
30                            10                            0.886                     1.091
30                            20                            1.0                       1.091
40                            8                             1.091                     0.9
40                            28                            1.0                       1.0
40                            29                            1.0                       1.0
...                           ...                           ...                       ...

Based on these training results, a network topology with three layers, 57 neurons in the input layer, one hidden layer (with the number of hidden neurons to be determined later), and 2 or 3 output neurons in the output layer (with the suitable number of output neurons also to be estimated further) was used in the following experiments.

Table 1. The 48 words and 6 characters whose frequencies are used as attributes

 1. make       10. mail        19. you       28. 650          37. 1999      46. edu
 2. address    11. receive     20. credit    29. lab          38. parts     47. table
 3. all        12. will        21. your      30. labs         39. pm        48. conference
 4. 3d         13. people      22. font      31. telnet       40. direct    49. ;
 5. our        14. report      23. 0         32. 857          41. cs        50. (
 6. over       15. addresses   24. money     33. data         42. meeting   51. [
 7. remove     16. free        25. hp        34. 415          43. original  52. !
 8. internet   17. business    26. hpl       35. 85           44. project   53. $
 9. order      18. email       27. george    36. technology   45. re        54. #
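For illustration, such attributes can be computed from a raw message roughly as in the sketch below, following the attribute definitions given in the Spambase documentation [7] (word and character frequencies expressed as percentages, plus three capital-run statistics). The tokenization rule used here is an assumption, so the sketch will not reproduce the published dataset exactly.

```python
import re

WORDS = ["make", "address", "all", "3d", "our", "over", "remove", "internet",
         "order", "mail", "receive", "will", "people", "report", "addresses",
         "free", "business", "email", "you", "credit", "your", "font", "0",
         "money", "hp", "hpl", "george", "650", "lab", "labs", "telnet", "857",
         "data", "415", "85", "technology", "1999", "parts", "pm", "direct",
         "cs", "meeting", "original", "project", "re", "edu", "table", "conference"]
CHARS = [";", "(", "[", "!", "$", "#"]

def spambase_features(text):
    """Compute 57 continuous Spambase-style attributes for one message:
    48 word frequencies, 6 character frequencies, and 3 capital-run statistics."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    n_tokens, n_chars = max(len(tokens), 1), max(len(text), 1)
    word_freq = [100.0 * tokens.count(w) / n_tokens for w in WORDS]
    char_freq = [100.0 * text.count(c) / n_chars for c in CHARS]
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)] or [0]
    capital_stats = [sum(runs) / len(runs), max(runs), sum(runs)]
    return word_freq + char_freq + capital_stats

print(len(spambase_features("MAKE MONEY FAST!!! Order your FREE report today")))  # 57
```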



B. Number of Output Neurons

Table 3 shows part of the training results of the network

topology with 3 layers, 57 neurons in the input layer, 2 to 56 neurons in the hidden layer, and 2 or 3 neurons in the output layer, using the Stuttgart Neural Network Simulator (SNNS). As shown in Table 3, the SSE indicates how well the classifier separates spam email from non-spam email. The notation 57-7-2 stands for a three-layer network topology with 57 neurons in the input layer, one hidden layer with 7 neurons, and 2 output neurons in the output layer. This neural network topology is depicted in Figure 3.


Fig. 3. Neural Network Topology

From these results it was found that: i) a very large number of neurons in the hidden layer did not increase performance, ii) a network with two output neurons works as well as one with three output neurons, iii) the lowest SSE value, 249.3672, comes from the 57-7-2 topology, iv) if too many neurons are used in the hidden layer, it is hard for the network to make

generalizations and the SSE is high, and v) if too few neurons are used in the hidden layers, it is hard for the network to encode the significant features in the input data and the SSE is high. The network topology 57-7-2, shown in figure 3, was found to produce the least SSE and therefore was used in the rest of the experiments.

V. TRAINING RESULTS

In the subsequent experiments, SNNS was used to build

the neural network classifier. Based on the preliminary experimentation, it was decided to construct the classifier with 3 layers: 57 neurons in the input layer, one hidden layer with 7 neurons, and 2 output neurons in the output layer. This topology was used in the remaining experiments. The training results of the five experiments are summarized in Table 4, and Figure 4 shows all the parameters used to train the neural network. After the five neural network classifiers were trained, they were evaluated on the test datasets. Table 5 shows the SSE and RMS values for each experiment during testing. A review of the SSE values revealed that experiments 2 and 3 performed better than the other experiments, as they resulted in the lowest SSE. To analyze the testing results and estimate the quality of the classifier, the following metrics were used [9]: - Accuracy: the overall correctness of the classifier, calculated as shown in equation 8.

$\mathrm{accuracy} = \dfrac{\#\ \text{of valid predictions}}{\#\ \text{of predictions}} = \dfrac{t\_pos + t\_neg}{pos + neg}$    (8)

- Sensitivity: the true positive rate, calculated as shown in equation 9.

$\mathrm{sensitivity} = \dfrac{\#\ \text{of valid legitimate predictions}}{\#\ \text{of legitimate predictions}} = \dfrac{t\_pos}{pos}$    (9)

- Specificity: the true negative rate, calculated as shown in equation 10.

$\mathrm{specificity} = \dfrac{\#\ \text{of valid spam predictions}}{\#\ \text{of spam predictions}} = \dfrac{t\_neg}{neg}$    (10)

- Precision: the proportion of samples labeled as legitimate that actually are legitimate, calculated as shown in equation 11.

$\mathrm{precision} = \dfrac{t\_pos}{t\_pos + f\_pos}$    (11)

where t_pos = legitimate classified as legitimate, t_neg = spam classified as spam, f_pos = spam classified as legitimate, f_neg = legitimate classified as spam, pos = total number of legitimate emails = t_pos + f_neg + undetermined_legitimate, and neg = total number of spam emails = t_neg + f_pos + undetermined_spam.
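The four metrics translate directly into code using the counts defined above. The example call below reuses the experiment-2, threshold-0.3 counts reported later in Table 6 and reproduces the corresponding percentages.

```python
def evaluation_metrics(t_pos, t_neg, f_pos, f_neg, pos, neg):
    """Metrics of equations 8-11, with legitimate email as the positive class."""
    accuracy = (t_pos + t_neg) / (pos + neg)       # equation 8
    sensitivity = t_pos / pos                      # equation 9
    specificity = t_neg / neg                      # equation 10
    precision = t_pos / (t_pos + f_pos)            # equation 11
    return accuracy, sensitivity, specificity, precision

# Experiment 2 at threshold 0.3 (Table 6): yields 91.74%, 93.16%, 89.71%, 95.64%.
print(evaluation_metrics(t_pos=504, t_neg=340, f_pos=23, f_neg=21, pos=541, neg=379))
```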

Table 3. SNNS Training Results

Network Topology    Sum of Squared Errors    Mean Square Error    Root Mean Square Error
57-4-2              334.9326                 0.1185               0.3443
57-5-2              319.8133                 0.1156               0.3400
57-6-2              282.0718                 0.0998               0.3159
57-7-2              249.3672                 0.0942               0.3070
57-8-2              258.1944                 0.0998               0.3160
57-9-2              266.5492                 0.1055               0.3248
57-10-2             320.7199                 0.1301               0.3606
57-11-2             253.6127                 0.1054               0.3247
57-12-2             242.8391                 0.1035               0.3217
57-13-2             252.3179                 0.1104               0.3322
57-14-2             336.9609                 0.1514               0.3891
57-15-2             321.4394                 0.1484               0.3852
57-18-2             321.4626                 0.1619               0.4023
57-20-2             336.3409                 0.1802               0.4246
57-30-2             321.8009                 0.2542               0.5042
...                 ...                      ...                  ...
57-4-3              501.0330                 0.1776               0.4214
57-5-3              364.7724                 0.1322               0.3635
57-6-3              406.7516                 0.1507               0.3882
57-7-3              411.7015                 0.1561               0.3951
57-9-3              503.0331                 0.1999               0.4471
57-10-3             388.1664                 0.1581               0.3976
57-11-3             503.5966                 0.2104               0.4586
57-13-3             504.0605                 0.2219               0.4710
57-16-3             377.5131                 0.1807               0.4251
57-23-3             505.3958                 0.3041               0.5514
57-33-3             506.0824                 0.4811               0.6936
...                 ...                      ...                  ...




Fig. 4. Network Training Parameters

The output labels of the classifier correspond to the predicted class, spam or non-spam, and are compared to the nominal class labels in the test dataset. Table 6 summarizes the testing results. Threshold values of 0, 0.1, 0.2, and 0.3 were chosen for the five experiments, and t_pos, t_neg, f_pos, f_neg, the number of undetermined samples, pos, neg, specificity, precision, accuracy, and sensitivity were computed. The values in Table 6 reveal that, when a threshold of 0 is used, experiment 5 yields the highest precision (99.38%). When a threshold of 0.3 is used, experiment 2 yields the highest specificity (89.71%), highest accuracy (91.74%), and highest sensitivity (93.16%); experiment 4 yields the lowest accuracy (89.24%) and experiment 1 the lowest sensitivity (90.27%). The 0.3 threshold resulted in the highest accuracy and sensitivity across all five experiments and is therefore used in the remaining discussion. With a threshold of 0.3, a high overall accuracy (90.24%) and overall sensitivity (92.14%) are obtained. Overall accuracy and overall sensitivity are calculated as shown in equations 12 and 13.

$\mathrm{Overall\ Accuracy} = \dfrac{t\_pos + t\_neg}{pos + neg}$    (12)

$\mathrm{Overall\ Sensitivity} = \dfrac{t\_pos}{pos}$    (13)
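Equations 12 and 13 pool the counts of all five experiments. The short sketch below, using the threshold-0.3 counts from Table 6, reproduces the reported overall figures of 90.24% and 92.14%.

```python
def overall_metrics(results):
    """Pool the per-experiment counts and apply equations 12 and 13."""
    t_pos = sum(r["t_pos"] for r in results)
    t_neg = sum(r["t_neg"] for r in results)
    pos = sum(r["pos"] for r in results)
    neg = sum(r["neg"] for r in results)
    return (t_pos + t_neg) / (pos + neg), t_pos / pos   # equations 12 and 13

# Threshold-0.3 counts from Table 6 for the five experiments.
runs = [
    dict(t_pos=510, t_neg=314, pos=565, neg=355),
    dict(t_pos=504, t_neg=340, pos=541, neg=379),
    dict(t_pos=508, t_neg=318, pos=548, neg=372),
    dict(t_pos=527, t_neg=294, pos=572, neg=348),
    dict(t_pos=520, t_neg=317, pos=562, neg=359),
]
print(overall_metrics(runs))   # approximately (0.9024, 0.9214)
```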

VI. NAÏVE BAYES CLASSIFICATION

Naïve Bayesian classifiers are statistical classifiers. They

can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is made to simplify the computations involved. The naïve Bayesian classifier works as follows: 1. Each data sample is represented by an n-dimensional

feature vector, X=(x1, x2, …, xn), depicting n measurements made on the sample from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given an unknown data sample X (i.e., one having no class label), the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier assigns an unknown sample X to the class Ci if and only if

$P(C_i|X) > P(C_j|X)$ for $1 \le j \le m$, $j \ne i$.    (14)

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.

3. Bayes' theorem is given in equation 15. Since P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized.

$P(C_i|X) = \dfrac{P(X|C_i)\,P(C_i)}{P(X)}$    (15)

where Ci = a class, X = an unknown data sample, P(Ci|X) = the posterior probability of Ci conditioned on X, P(X|Ci) = the posterior probability of X conditioned on Ci, P(Ci) = the prior probability of Ci, and P(X) = the prior probability of X.

Table 4. Training Results

Exp#  t_neg  t_pos  f_neg  f_pos  Undetermined  neg   pos   Specificity  Precision  Accuracy  Sensitivity
1     1291   2058   81     84     167           1458  2223  88.55%       96.08%     90.98%    92.58%
2     1247   2117   70     103    144           1434  2247  86.96%       95.36%     91.39%    94.21%
3     1261   2098   77     97     148           1441  2240  87.51%       95.58%     91.25%    93.66%
4     1296   2078   65     89     153           1465  2216  88.46%       95.89%     91.66%    93.77%
5     1282   2083   72     97     146           1454  2226  88.17%       95.55%     91.44%    93.58%

Table 5. SNNS Testing Results

Experiment#    1         2        3        4         5
SSE            118.3266  98.0497  99.1199  124.4218  105.1701
MSE            0.2376    0.1969   0.199    0.2498    0.2108
RMSE           0.4874    0.4437   0.4461   0.4998    0.4591
# of Samples   920       920      920      920       921
Spam           355       379      372      348       359
Normal         565       541      548      572       562



4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci), and doing so would also require knowledge of the dependencies among attributes. In order to reduce the computation involved in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample; that is, there are no dependence relationships among the attributes. Thus,

$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$    (16)

The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training samples, where

i) If Ak is categorical, then

$P(x_k|C_i) = \dfrac{s_{ik}}{s_i}$    (17)

where s_ik is the number of training samples of class Ci having the value x_k for attribute A_k, and s_i is the number of training samples belonging to Ci.

ii) If Ak is continuous-valued, then the attribute is

typically assumed to have a Gaussian distribution so that

$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \dfrac{1}{\sqrt{2\pi}\,\sigma_{C_i}}\, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}$    (18)

where g(x_k, μ_Ci, σ_Ci) is the Gaussian (normal) density function for attribute A_k, while μ_Ci and σ_Ci are the mean and standard deviation, respectively, of the values of attribute A_k for training samples of class Ci.

5. In order to classify an unknown sample X, P(X|Ci)P(Ci) is evaluated for each class Ci. Sample X is then assigned to the class Ci for which P(X|Ci)P(Ci) is the maximum.
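As a concrete illustration of steps 1 through 5 with continuous attributes, the sketch below implements a Gaussian naïve Bayes classifier following equations 15, 16, and 18. The log-space evaluation and the small variance floor are numerical-stability measures added here, not part of the paper, and the random data merely stands in for the Spambase attributes.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes following equations 15, 16, and 18."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mu, self.sigma = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.prior[c] = len(Xc) / len(X)          # P(Ci)
            self.mu[c] = Xc.mean(axis=0)              # per-attribute mean for class Ci
            self.sigma[c] = Xc.std(axis=0) + 1e-6     # small floor avoids zero variance
        return self

    def log_posterior(self, x, c):
        # log P(Ci) + sum_k log g(x_k, mu_Ci, sigma_Ci)   (equations 16 and 18, in log space)
        g = -0.5 * ((x - self.mu[c]) / self.sigma[c]) ** 2 \
            - np.log(np.sqrt(2 * np.pi) * self.sigma[c])
        return np.log(self.prior[c]) + g.sum()

    def predict(self, X):
        # assign each sample to the class with the maximum posterior (equation 14)
        return np.array([max(self.classes, key=lambda c: self.log_posterior(x, c))
                         for x in X])

# Illustrative use on random data shaped like the 57 Spambase attributes.
rng = np.random.default_rng(1)
X, y = rng.uniform(-1, 1, size=(100, 57)), rng.integers(0, 2, size=100)
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(X[:5]))
```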

A naïve Bayesian classifier was constructed using the same datasets as the neural network classifier. The values t_pos, t_neg, f_pos, f_neg, the number of undetermined samples, pos, neg, specificity, precision, accuracy, and sensitivity were computed as shown in Table 7. The results reveal that experiment 5 has the highest specificity (56.07%) and the highest precision (76.73%), experiment 2 has the highest accuracy (75.43%), and experiment 4 has the highest sensitivity (90.16%).

Table 6. Neural Network Testing Results

Exp#  Threshold  t_neg  t_pos  f_neg  f_pos  Undetermined  neg  pos  Specificity  Precision  Accuracy  Sensitivity
1     0          0      0      0      0      920           355  565  0%           0%         0%        0%
1     0.1        276    476    24     14     130           355  565  77.75%       97.14%     81.74%    84.25%
1     0.2        298    504    30     19     69            355  565  83.94%       96.37%     87.17%    89.20%
1     0.3        314    510    35     21     40            355  565  88.45%       96.05%     89.57%    90.27%
2     0          0      0      0      0      920           379  541  0%           0%         0%        0%
2     0.1        296    463    13     13     135           379  541  78.10%       97.27%     82.5%     85.58%
2     0.2        325    487    19     21     68            379  541  85.75%       95.87%     88.26%    90.02%
2     0.3        340    504    21     23     32            379  541  89.71%       95.64%     91.74%    93.16%
3     0          0      347    0      9      564           372  548  0%           97.47%     37.72%    63.32%
3     0.1        273    484    6      22     135           372  548  73.39%       95.65%     82.28%    88.32%
3     0.2        306    497    9      24     84            372  548  82.26%       95.39%     87.28%    90.63%
3     0.3        318    508    14     26     54            372  548  85.48%       95.13%     89.78%    92.70%
4     0          0      256    0      6      658           348  572  0%           97.71%     27.83%    44.76%
4     0.1        267    506    15     25     107           348  572  76.72%       95.29%     84.02%    88.46%
4     0.2        284    522    22     31     61            348  572  81.61%       94.39%     87.61%    91.26%
4     0.3        294    527    23     33     43            348  572  84.48%       94.11%     89.24%    92.13%
5     0          0      159    0      1      761           359  562  0%           99.38%     17.26%    28.29%
5     0.1        283    488    15     21     114           359  562  78.83%       95.87%     83.71%    86.83%
5     0.2        304    510    20     23     64            359  562  84.68%       95.68%     88.38%    90.75%
5     0.3        317    520    22     24     38            359  562  88.30%       95.59%     90.88%    92.53%

Table 7. Naïve Bayes Testing Results

Exp#  t_neg  t_pos  f_neg  f_pos  Undetermined  neg  pos  Specificity  Precision  Accuracy  Sensitivity
1     176    463    81     186    14            369  551  47.70%       71.34%     69.46%    84.03%
2     206    488    49     169    8             376  544  54.79%       74.28%     75.43%    89.71%
3     204    447    85     168    16            381  539  53.54%       72.68%     70.76%    82.93%
4     165    522    49     172    12            341  579  48.39%       75.22%     74.67%    90.16%
5     194    498    74     151    4             346  575  56.07%       76.73%     75.14%    86.61%


Experiment 1 has the lowest accuracy (69.46%) and experiment 3 has the lowest sensitivity (82.93%). The overall accuracy is 73.09% and the overall sensitivity is 86.73%.

VII. CONCLUSION

The disadvantage of feed-forward backpropagation is its

long training times and the lack of learning from new emails after the classifier is developed. Our evaluation shows that the feed-forward backpropagation neural network classifier achieves an overall accuracy of 90.24% and an overall sensitivity of 92.14%. Neural networks involve long training times and are therefore more suitable for applications where this is feasible. Advantages of neural networks include their high tolerance of noisy data and their ability to generalize and classify patterns on which they have not been trained. A description of how the accuracy and sensitivity of the feed-forward backpropagation neural network were computed was provided in the previous sections. In the future, we will look into techniques for improving the classifier's accuracy and sensitivity. Bagging and boosting may be used for this purpose; each combines a series of T learned classifiers, C1, C2, …, CT, with the aim of creating an improved composite classifier, C*, for example by voting on the class of each new sample. In theory, naïve Bayesian classifiers have the minimum error rate in comparison to all other classifiers. In practice, however, this is not always the case, owing to inaccuracies in the assumptions made for their use, such as class conditional independence, and to the lack of available probability data. The naïve Bayesian classifier remains a viable option, since its accuracy and sensitivity are close to those of existing commercial products, and it can enhance and adapt its learning from new examples. This feature is important because spammers are expected to keep changing their behavior and the appearance of spam email.
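As a sketch of the composite-classifier idea (a hypothetical helper, not taken from the paper), the predictions of T trained classifiers could be combined by a simple majority vote as follows:

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Combine T trained classifiers into a composite C* by simple voting,
    one common way bagging-style ensembles make their final prediction."""
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical use: each classifier maps a feature vector to "spam" or "legitimate".
classifiers = [lambda x: "spam", lambda x: "legitimate", lambda x: "spam"]
print(majority_vote(classifiers, sample=None))   # -> "spam"
```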

REFERENCES

[1] P. Graham, "A plan for spam," http://www.paulgraham.com/spam.html
[2] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and C. Spyropoulos, "An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages," Proc. SIGIR 2000, Athens, Greece, pp. 160-167, 2000.
[3] Y. Pao, "Adaptive Pattern Recognition and Neural Networks," Addison-Wesley, New York, NY, 1989.
[4] C. Ji and S. Ma, "Performance and Efficiency: Recent Advances in Supervised Learning," Proc. IEEE, vol. 87, pp. 1519-1535, 1999.
[5] L. H. Tsoukalas and R. E. Uhrig, "Fuzzy and Neural Approaches in Engineering," John Wiley & Sons, Inc., 1996.
[6] "UCI Machine Learning Repository Content Summary," http://www.ics.uci.edu/~mlearn/MLSummary.html
[7] "Spambase Documentation," http://www.ics.uci.edu/~mlearn/MLRepository.html
[8] B. Efron and R. Tibshirani, "An Introduction to the Bootstrap," Chapman and Hall, New York, 1993.
[9] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2001.
