Data Mining on SpamBase, Wine Quality and Communities and Crime Datasets

Application of Data Mining Techniques like Linear Discriminant Analysis (LDA), K-Means Clustering, Multiple Linear Regression, Principal Component Analysis (PCA) and Logistic Regression on Datasets

Ankit Ghosalkar, Nikita Shivdikar, Pallavi Tilak and Rohan Dalvi

Abstract—Data mining techniques are used for a wide variety of purposes. Techniques such as classification, association and clustering are widely used for finding interesting patterns in data. In this project, we have applied data mining techniques, namely Logistic Regression, Linear Discriminant Analysis (LDA), K-Means Clustering, Principal Component Analysis (PCA) and Multiple Linear Regression, to three datasets: the Wine Quality dataset, the Spambase dataset and the Communities and Crime dataset. We have evaluated the performance of these techniques on separate training and test sets, and analysed the experimental results to determine the drawbacks of each implemented technique on each dataset.

Index Terms—Data Mining, Classification, Clustering, Logistic Regression, Linear Discriminant Analysis.

I. Introduction

The amount of data generated from various sources is increasing by leaps and bounds. This data would be of no use if it did not reflect any useful information; thus there is a need to extract knowledge from it. This need led to the evolution of data mining. Data mining refers to extracting knowledge, or finding interesting patterns, in data. The knowledge obtained is then stored in a knowledge database for future use, where it can be utilized by business analysts for decision making. Traditionally, the task of data mining was complex and computationally expensive: data analysts had to manually dredge through the data to find interesting patterns. With small datasets the task was simple, but with larger datasets it was very time consuming. Additionally, experts were required to mine such huge datasets and extract useful knowledge. Thus there arose a need for new data mining techniques.

The new data mining techniques are more efficient and automated. Techniques such as classification, clustering, association and regression have made the task of data mining much simpler and more efficient. Classification is used for supervised learning, where the class labels are known and we must predict the class of new instances. Various algorithms carry out the task of classification, such as 1R, Naive Bayes and decision trees, each with its own advantages and disadvantages. Clustering is used for unsupervised learning, where the class label is not known in advance and we group together instances that are similar to each other, so that at the end of clustering we obtain distinct clusters. The algorithms used for clustering include K-Means clustering, hierarchical clustering and spectral clustering. Again, depending on the size and characteristics of the data being analysed, we have to choose the appropriate algorithm. We can also use association for discovering relevant rules from the dataset.

Before applying any of these algorithms, we need to process the raw data. Raw data collected from various sources might contain missing values, or its format might be improper. This constitutes the data pre-processing stage, wherein the data is cleaned (e.g. outliers are detected) and prepared for mining. After applying data mining techniques to the data, we evaluate the performance of the models built and analyse the results of the experiments performed, that is, we use the built model to classify new instances and measure the accuracy and error rate of the classifiers. This gives insight into how well a particular classifier performs on a particular dataset and what measures need to be taken to tune its parameters. It also indicates which classifier would perform better on the given dataset. The experimental results are used for various purposes; for example, advertising companies can analyse customer data and purchasing trends for targeted marketing. Other domains in which data mining is widely used include banking, sports, the mobile industry, networking, education, business and government.

In this project, we have mainly focussed on classification, namely binary classification using Logistic Regression and Linear Discriminant Analysis, along with K-Means Clustering, Principal Component Analysis and Multiple Linear Regression. We obtained the datasets from the UCI Machine Learning Repository website. The description of these datasets is as follows:

Wine Quality dataset: Wine is increasingly enjoyed by a large number of customers. Therefore, wine industries are inventing new strategies to increase the production and sale of wine. A key aspect to be taken into consideration during wine production is quality. Evaluating the quality of wine is an important task for wine industries


because they have to ensure that the wine produced is unadulterated and safe for human health. Wine is assessed by various physicochemical tests, which include determining its density, alcohol content and pH value, and by sensory tests conducted by human experts. This dataset is of particular interest because it holds valuable information with respect to wine quality assessment. The dataset contains 4898 instances/samples of white wine from the north of Portugal, tested for their quality. The first 11 attributes describe the various parameters of wine assessment, and the 12th attribute gives the wine quality on a scale ranging from 0 (very bad) to 10 (excellent). We have used Linear Discriminant Analysis (LDA) for the classification of wine samples, and have also applied Principal Component Analysis (PCA) and K-Means Clustering to this dataset. Applying data mining techniques to this raw data helps extract useful knowledge, which can be put to good use by wine businesses. Evaluating quality will improve decision making, such as identifying the factors that are most relevant, setting prices accordingly and gaining customer satisfaction. It would also be interesting to know how this knowledge can help control parameters during the wine production process to improve quality, for example, increasing or decreasing the concentration of residual sugar, alcohol or other ingredients, and measuring how the concentration of each ingredient affects the popularity of the wine.

SpamBase dataset: The number of spam emails an internet user receives every day has increased tremendously. This dataset consists of spam and non-spam emails collected from various sources. This multivariate dataset has 4601 instances and 57 attributes; the attribute characteristics are integer or real. The predictor variables are described as follows: word_freq_WORD, the percentage of words in the e-mail that match WORD; char_freq_CHAR, the percentage of characters in the e-mail that match CHAR; capital_run_length_average, the average length of uninterrupted sequences of capital letters; capital_run_length_longest, the length of the longest uninterrupted sequence of capital letters; capital_run_length_total, the total number of capital letters in the e-mail; and Class, 1 for an email considered spam and 0 for non-spam. For the purpose of classification, we have used Logistic Regression, a well-known binary classification technique, i.e. for when there are only two class labels. The application of data mining to the Spambase dataset can be used in the design of spam filters.


Communities and Crime dataset: This data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The dataset contains 1994 instances with 128 attributes and has a large number of missing values. It is a multivariate dataset, and the attribute characteristics are real. It contains attributes that help predict crime or have some connection with crime. We have applied Multiple Linear Regression to this dataset to build a model that correctly predicts the value of the response variable, the total number of violent crimes per 100K population. The attributes of this dataset are as follows [3]:

state: US state (by number)
county: numeric code for county - contains many missing values
community: numeric code for community - not predictive and many missing values (numeric)
communityname: community name - not predictive - for information only (string)
fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)
population: population for community (numeric - decimal)
householdsize: mean people per household (numeric - decimal)
racepctblack: percentage of population that is african american (numeric - decimal)
racePctWhite: percentage of population that is caucasian (numeric - decimal)
racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
numbUrban: number of people living in areas classified as urban (numeric - decimal)
pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
medIncome: median household income (numeric - decimal)
pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
pctWFarmSelf: percentage of households with farm or self employment income in 1989 (numeric - decimal)
pctWInvInc: percentage of households with investment / rent income in 1989 (numeric - decimal)
pctWSocSec: percentage of households with social security income in 1989 (numeric - decimal)
pctWPubAsst: percentage of households with public assistance income in 1989 (numeric - decimal)
pctWRetire: percentage of households with retirement income in 1989 (numeric - decimal)
medFamInc: median family income (differs from household income for non-family households) (numeric - decimal)
perCapInc: per capita income (numeric - decimal)
whitePerCap: per capita income for caucasians (numeric - decimal)
blackPerCap: per capita income for african americans (numeric - decimal)
indianPerCap: per capita income for native americans (numeric - decimal)
AsianPerCap: per capita income for people with asian heritage (numeric - decimal)
OtherPerCap: per capita income for people with 'other' heritage (numeric - decimal)
HispPerCap: per capita income for people with hispanic heritage (numeric - decimal)
NumUnderPov: number of people under the poverty level (numeric - decimal)
PctPopUnderPov: percentage of people under the poverty level (numeric - decimal)
PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education (numeric - decimal)
PctNotHSGrad: percentage of people 25 and over that are not high school graduates (numeric - decimal)
PctBSorMore: percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)
PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed (numeric - decimal)
PctEmploy: percentage of people 16 and over who are employed (numeric - decimal)
PctEmplManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
PctEmplProfServ: percentage of people 16 and over who are employed in professional services (numeric - decimal)
PctOccupManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
PctOccupMgmtProf: percentage of people 16 and over who are employed in management or professional occupations (numeric - decimal)
MalePctDivorce: percentage of males who are divorced (numeric - decimal)
MalePctNevMarr: percentage of males who have never married (numeric - decimal)
FemalePctDiv: percentage of females who are divorced (numeric - decimal)
TotalPctDiv: percentage of population who are divorced (numeric - decimal)
PersPerFam: mean number of people per family (numeric - decimal)
PctFam2Par: percentage of families (with kids) that are headed by two parents (numeric - decimal)
PctKids2Par: percentage of kids in family housing with two parents (numeric - decimal)
PctYoungKids2Par: percent of kids 4 and under in two parent households (numeric - decimal)
PctTeen2Par: percent of kids age 12-17 in two parent households (numeric - decimal)
PctWorkMomYoungKids: percentage of moms of kids 6 and under in labor force (numeric - decimal)
PctWorkMom: percentage of moms of kids under 18 in labor force (numeric - decimal)
NumIlleg: number of kids born to never married (numeric - decimal)
PctIlleg: percentage of kids born to never married (numeric - decimal)
NumImmig: total number of people known to be foreign born (numeric - decimal)
PctImmigRecent: percentage of immigrants who immigrated within last 3 years (numeric - decimal)
PctImmigRec5: percentage of immigrants who immigrated within last 5 years (numeric - decimal)
PctImmigRec8: percentage of immigrants who immigrated within last 8 years (numeric - decimal)
PctImmigRec10: percentage of immigrants who immigrated within last 10 years (numeric - decimal)
PctRecentImmig: percent of population who have immigrated within the last 3 years (numeric - decimal)
PctRecImmig5: percent of population who have immigrated within the last 5 years (numeric - decimal)
PctRecImmig8: percent of population who have immigrated within the last 8 years (numeric - decimal)
PctRecImmig10: percent of population who have immigrated within the last 10 years (numeric - decimal)
PctSpeakEnglOnly: percent of people who speak only English (numeric - decimal)
PctNotSpeakEnglWell: percent of people who do not speak English well (numeric - decimal)
PctLargHouseFam: percent of family households that are large (6 or more) (numeric - decimal)
PctLargHouseOccup: percent of all occupied households that are large (6 or more people) (numeric - decimal)
PersPerOccupHous: mean persons per household (numeric - decimal)
PersPerOwnOccHous: mean persons per owner occupied household (numeric - decimal)
PersPerRentOccHous: mean persons per rental household (numeric - decimal)
PctPersOwnOccup: percent of people in owner occupied households (numeric - decimal)
PctPersDenseHous: percent of persons in dense housing (more than 1 person per room) (numeric - decimal)
PctHousLess3BR: percent of housing units with less than 3 bedrooms (numeric - decimal)


MedNumBR: median number of bedrooms (numeric - decimal)
HousVacant: number of vacant households (numeric - decimal)
PctHousOccup: percent of housing occupied (numeric - decimal)
PctHousOwnOcc: percent of households owner occupied (numeric - decimal)
PctVacantBoarded: percent of vacant housing that is boarded up (numeric - decimal)
PctVacMore6Mos: percent of vacant housing that has been vacant more than 6 months (numeric - decimal)
MedYrHousBuilt: median year housing units built (numeric - decimal)
PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)
PctWOFullPlumb: percent of housing without complete plumbing facilities (numeric - decimal)
OwnOccLowQuart: owner occupied housing - lower quartile value (numeric - decimal)
OwnOccMedVal: owner occupied housing - median value (numeric - decimal)
OwnOccHiQuart: owner occupied housing - upper quartile value (numeric - decimal)
RentLowQ: rental housing - lower quartile rent (numeric - decimal)
RentMedian: rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)
RentHighQ: rental housing - upper quartile rent (numeric - decimal)
MedRent: median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)
MedRentPctHousInc: median gross rent as a percentage of household income (numeric - decimal)
MedOwnCostPctInc: median owners cost as a percentage of household income - for owners with a mortgage (numeric - decimal)
MedOwnCostPctIncNoMtg: median owners cost as a percentage of household income - for owners without a mortgage (numeric - decimal)
NumInShelters: number of people in homeless shelters (numeric - decimal)
NumStreet: number of homeless people counted in the street (numeric - decimal)
PctForeignBorn: percent of people foreign born (numeric - decimal)
PctBornSameState: percent of people born in the same state as currently living (numeric - decimal)
PctSameHouse85: percent of people living in the same house as in 1985 (5 years before) (numeric - decimal)
PctSameCity85: percent of people living in the same city as in 1985 (5 years before) (numeric - decimal)
PctSameState85: percent of people living in the same state as in 1985 (5 years before) (numeric - decimal)
LemasSwornFT: number of sworn full time police officers (numeric - decimal)
LemasSwFTPerPop: sworn full time police officers per 100K population (numeric - decimal)
LemasSwFTFieldOps: number of sworn full time police officers in field operations (on the street as opposed to administrative etc.) (numeric - decimal)
LemasSwFTFieldPerPop: sworn full time police officers in field operations (on the street as opposed to administrative etc.) per 100K population (numeric - decimal)
LemasTotalReq: total requests for police (numeric - decimal)
LemasTotReqPerPop: total requests for police per 100K population (numeric - decimal)
PolicReqPerOffic: total requests for police per police officer (numeric - decimal)
PolicPerPop: police officers per 100K population (numeric - decimal)
RacialMatchCommPol: a measure of the racial match between the community and the police force; high values indicate proportions in community and police force are similar (numeric - decimal)
PctPolicWhite: percent of police that are caucasian (numeric - decimal)
PctPolicBlack: percent of police that are african american (numeric - decimal)
PctPolicHisp: percent of police that are hispanic (numeric - decimal)
PctPolicAsian: percent of police that are asian (numeric - decimal)
PctPolicMinor: percent of police that are minority of any kind (numeric - decimal)
OfficAssgnDrugUnits: number of officers assigned to special drug units (numeric - decimal)
NumKindsDrugsSeiz: number of different kinds of drugs seized (numeric - decimal)
PolicAveOTWorked: police average overtime worked (numeric - decimal)
LandArea: land area in square miles (numeric - decimal)
PopDens: population density in persons per square mile (numeric - decimal)
PctUsePubTrans: percent of people using public transit for commuting (numeric - decimal)
PolicCars: number of police cars (numeric - decimal)
PolicOperBudg: police operating budget (numeric - decimal)
LemasPctPolicOnPatr: percent of sworn full time police officers on patrol (numeric - decimal)
LemasGangUnitDeploy: gang unit deployed (numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time)
LemasPctOfficDrugUn: percent of officers assigned to drug units (numeric - decimal)
PolicBudgPerPop: police operating budget per population (numeric - decimal)
ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal) - response variable

The various tools that we have used for data mining are WEKA, Matlab and R. These tools make the task of data mining easier, faster and more efficient. The results generated were numeric values, confusion matrices, or graphs and plots, which were analysed and evaluated for performance and accuracy as well as compared with other approaches. The rest of the article is organized as follows: Section 2 describes data collection and pre-processing. Section 3 describes in detail the techniques used for data mining on the Wine Quality, Spambase and Communities and Crime datasets, as well as the results of the experiments performed on them. Section 4 compares these techniques with respect to various performance measures. Section 5 concludes the paper and gives directions for future work.

II. Data Collection and Pre-Processing

The Wine Quality, Spambase and Communities and Crime datasets were obtained from the UCI Machine Learning Repository website. The Spambase dataset obtained from the site was in the proper format as expected and did not contain any missing values, so we did not carry out any pre-processing tasks on it. The Wine Quality dataset, however, was not in the proper format: each instance was stored in a single cell of an Excel spreadsheet, with its values separated by semicolons, and was thus hard to read. To bring it into a proper readable format, we wrote the following code in R, which copied all the semicolon-separated values into a new comma-separated file.

wine <- read.table("winequality-white.csv", sep = ";", header = TRUE)
write.table(wine, "winequality-whitewine.csv", sep = ",")
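The same conversion can be sketched with Python's standard csv module; this is a hypothetical equivalent of the R snippet above, not part of the original pipeline, and the function name is ours:

```python
import csv

def convert_delimiter(src_path, dst_path, src_sep=";", dst_sep=","):
    """Re-save a delimited text file using a different field separator."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=src_sep)
        writer = csv.writer(dst, delimiter=dst_sep)
        for row in reader:
            writer.writerow(row)

# e.g. convert_delimiter("winequality-white.csv", "winequality-whitewine.csv")
```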

Further, the Wine Quality dataset did not contain any missing values, so this cleaned dataset was ready for data mining. The Communities and Crime dataset had many missing values across many attributes. Also, since we are doing Multiple Linear Regression on this dataset, we eliminated the non-predictive nominal attribute communityname from the dataset. We also eliminated every attribute with more than 85% missing values. The eliminated attributes were county, community, communityname, LemasSwornFT, LemasSwFTPerPop, LemasSwFTFieldOps, LemasTotalReq, LemasTotReqPerPop, PolicReqPerOffic, PolicPerPop, RacialMatchCommPol, PctPolicWhite, PctPolicBlack, PctPolicHisp, PctPolicAsian, PolicCars, PolicOperBudg, LemasPctPolicOnPatr, LemasGangUnitDeploy and PolicBudgPerPop. Finally, all the datasets were cleaned and ready for data mining.
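The 85%-missing rule above can be sketched as follows; this is an illustrative stdlib-Python version (the function name and the "?" missing-value marker are our assumptions, not from the paper):

```python
def drop_sparse_columns(rows, threshold=0.85, missing="?"):
    """Given a dataset as a list of dicts (one per instance), return the
    names of the columns whose fraction of missing values is <= threshold."""
    if not rows:
        return []
    keep = []
    for col in rows[0].keys():
        n_missing = sum(1 for r in rows if r[col] == missing)
        if n_missing / len(rows) <= threshold:
            keep.append(col)
    return keep
```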

III. Application of Data Mining Techniques on the Wine Quality, Spambase and Communities and Crime Datasets

In this section, we describe the analyses that were carried out, along with our findings.

A. Logistic Regression on Spambase Dataset

Logistic Regression is a binary classification technique. We use this technique when the response variable is binary (i.e. 0 or 1) and we have a collection of real-valued explanatory variables. Given a vector X, the task is to predict its class as accurately as possible. The response variable is related to the predictor variables through a relationship of the form

log(p(X) / (1 - p(X))) = β0 + β1X1 + β2X2 + ... + βpXp

where p(X) is the probability that the response equals 1.
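Inverting the log-odds relationship above gives the class probability via the logistic (sigmoid) function; a minimal sketch, where the coefficient values passed in are purely illustrative:

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(x, beta0, betas):
    """P(class = 1 | x) for a logistic model with intercept beta0
    and one coefficient per explanatory variable."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)
```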

Fig. 1. Analysis of the Spambase dataset: Building the Model

Applying the Logistic Regression technique in R, we first fit the full model, that is, the model that includes all the explanatory variables of the Spambase dataset. The function glm in R is used for model fitting in Logistic Regression. We use the following commands in R:

spam <- read.csv("spambase.csv", header = TRUE)
glm.spam <- glm(Class ~ ., data = spam, family = binomial("logit"))
summary(glm.spam)

R output after fitting the full model:


After studying this model, we observed that the model is significant, as its residual deviance is 1815.8. However, the model contains many noise (insignificant) variables, which can be inferred from their large p-values. Hence we need a model that contains only significant variables. For this, we use variable selection, wherein only significant variables are retained in the model while the rest are eliminated. The following R code selects the variables that have a high correlation with the response variable:

XY <- spam
p <- ncol(XY)
good.var <- c()
for (i in 1:(p - 1)) {
  if (abs(cor(XY[, i], XY[, p])) >= 0.30)
    if (cor.test(XY[, i], XY[, p])$p.value < 0.05) {
      good.var <- c(good.var, colnames(XY)[i])
    }
}
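The correlation filter in the R loop can be sketched in plain Python as below. This is an illustrative version of only the |r| >= 0.30 filter; the cor.test p-value check is omitted, since it would require a t-distribution, and the function names are our own:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_correlated(columns, response, r_min=0.30):
    """columns: dict of name -> list of values.
    Keep the names whose |correlation with the response| is at least r_min."""
    return [name for name, vals in columns.items()
            if abs(pearson(vals, response)) >= r_min]
```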

The output of this step: four variables, namely word_freq_remove, word_freq_your, word_freq_000 and char_freq_..4, were deemed significant. Hence we rebuild the model using only these variables.

glm.newmodel <- glm(Class ~ word_freq_remove + word_freq_your + word_freq_000 + char_freq_..4, data = spam, family = binomial("logit"))
summary(glm.newmodel)

R output after refitting the model:


The summary of this model reveals that the model is significant, since its residual deviance is less than the degrees of freedom. Also, all the variables are significant, since their p-values are close to zero. Hence we can use this model for further analysis.

Prediction with Logistic Regression

The model obtained above can now be used for prediction. For each instance, we first compute the log-odds ratio,

and then predict the probability of the class (0 for non-spam or 1 for spam) of each instance.

We have used the following R commands for our predictions:

log.odds.ratio <- predict(glm.newmodel, spam[, -57])
probabilities <- predict(glm.newmodel, spam[, -57], type = "response")
predicted <- rep(0, 4601)
positive <- which(probabilities > 0.5)
predicted[positive] <- 1
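The thresholding step in those commands, assigning class 1 when the predicted probability exceeds 0.5, can be sketched as a small Python function (the name is ours, for illustration only):

```python
def threshold_predict(probabilities, cutoff=0.5):
    """Map predicted probabilities to class labels: 1 if p > cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]
```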

Performance of Logistic Regression on the Spambase dataset:

We obtained the confusion matrix showing the actual and predicted values:

Actual <- spam$Class
confmat <- table(Actual, predicted)
confmat

From the above confusion matrix, we have computed various parameters:
True Positives (TP): 1042 emails correctly predicted as spam.
False Positives (FP): 137 emails predicted as spam were actually non-spam.
True Negatives (TN): 2651 emails predicted as non-spam were indeed non-spam.
False Negatives (FN): 771 emails predicted as non-spam were actually spam.
Precision = TP/(TP+FP) = 0.8837 = 88.37%
Recall = TP/(TP+FN) = 0.5747 = 57.47%
F-measure = 2 x (Precision x Recall) / (Precision + Recall) = 0.6964

Percentage of Correct Classification (PCC) = (TN+TP)/number of instances = 80.27%
Error rate (misclassification rate) = 1 - PCC = 19.73%
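These figures can be reproduced with a short Python sketch from the four confusion-matrix counts; the values agree with those reported above up to rounding.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F-measure and accuracy (PCC)
    from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    pcc = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_measure, pcc

# Counts reported for the Spambase logistic-regression model
p, r, f, pcc = classification_metrics(tp=1042, fp=137, tn=2651, fn=771)
print(round(p, 4), round(r, 4), round(f, 4), round(pcc, 4))
```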

ROC (Receiver Operating Characteristic) Curve
The ROC curve is widely used to measure the quality of a classification technique. This curve plots the True Positive Rate against the False Positive Rate. The following plot is the ROC curve with respect to the new model obtained.
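Such a curve is traced by sweeping a decision threshold over the predicted scores; a minimal pure-Python sketch with hypothetical scores and labels:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by thresholding at each distinct
    score, from the most permissive threshold to the strictest."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical predicted probabilities and true classes (1 = spam)
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_points(scores, labels))
# → [(1.0, 1.0), (0.5, 1.0), (0.5, 0.5), (0.0, 0.5)]
```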


This ROC curve illustrates the performance of the Spambase classifier.

B. Linear Discriminant Analysis (LDA) on Wine Quality dataset

LDA is similar to Principal Component Analysis. The difference is that PCA does more of attribute classification whereas LDA does more of data classification. Like PCA, LDA performs dimensionality reduction while preserving much of the information. The models obtained using LDA sometimes show higher accuracy than more complex models. The LDA technique is used for classification purposes. LDA finds a discriminant function of two predictors X and Y and results in a new set of transformed values that gives more accurate discrimination than either predictor variable alone. It tries to find directions along which the classes are best separated. Additionally, it considers not only the scatter within the classes but also the scatter between the classes.

LDA Method
1: Let N be the number of classes
2: Let mu_i be the mean vector of class i, i = 1, 2, ..., N
3: Let n_i be the number of samples within class i, i = 1, 2, ..., N
4: Let L be the total number of samples

We now compute the scatter matrix within the class using the following formula:

Sw = sum over i = 1..N of sum over x in class i of (x - mu_i)(x - mu_i)^T

Scatter matrix between the classes:

Sb = sum over i = 1..N of n_i (mu_i - mu)(mu_i - mu)^T

Next we find the mean of the entire dataset:

mu = (1/L) sum over all samples x of x

LDA tries to minimize the scatter within the class while maximizing the scatter between the classes, reducing the variation due to sources while retaining the class separability:

maximize det(Sb)/det(Sw)

Linear Transformation:

The linear transformation is given by the matrix U whose columns are the eigenvectors of inv(Sw)Sb. There are at most N-1 generalized eigenvectors with nonzero eigenvalues. A generalized eigenvector u satisfies:

Sb u = lambda Sw u

If Sw is non-singular, the eigenvalue problem reduces to:

inv(Sw) Sb u = lambda u
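For two classes, the direction maximizing the criterion above reduces to w proportional to inv(Sw)(mu_1 - mu_2). A pure-Python sketch on hypothetical 2-D toy data (not the wine data used in the project):

```python
def mean_vec(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]

def scatter(points, mu):
    """Sum of outer products (x - mu)(x - mu)^T over a class."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S

# Hypothetical two-class toy data
c1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
c2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 8.0]]
m1, m2 = mean_vec(c1), mean_vec(c2)

# Within-class scatter Sw = S1 + S2
S1, S2 = scatter(c1, m1), scatter(c2, m2)
Sw = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]

# Two-class discriminant direction: w = inv(Sw) (m1 - m2)
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
inv = [[Sw[1][1] / det, -Sw[0][1] / det],
       [-Sw[1][0] / det, Sw[0][0] / det]]
diff = [m1[0] - m2[0], m1[1] - m2[1]]
w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1]]
print(w)
```

Projecting each class onto w gives two non-overlapping ranges, i.e. the classes are linearly separated along the discriminant direction.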

This LDA technique is implemented on the Wine quality dataset to classify the instances based on wine quality. The MASS library in R is used to perform LDA on the wine quality dataset:
wine.lda <- lda(quality ~ ., data = wine)
wine.lda
We select the first two significant components, as shown in the R output below.

From this output, we can infer that the first discriminant function obtained is a linear combination of the variables, that is:
(1.864)fixed.acidity - (4.755)volatile.acidity - (7.046)citric.acid + (1.89)residual.sugar - (5.294)chlorides + (1.0608)free.sulfur.dioxide - (1.229)total.sulfur.dioxide - (3.445)density + (1.698)pH + (1.61)sulphates + (5.3604)alcohol.

Next, we need to calculate the values of the first discriminant function for each instance in the dataset.
wine.lda.vals <- predict(wine.lda, wine[1:11])
wine.lda.vals$x[, 1]
R output for the first few instances:


Also, from the proportion of trace, we get the percentage of separation between the groups achieved by each discriminant function. For example, the first discriminant function achieves a separation of 83.12% while the second discriminant function achieves a separation of 11.83%. Therefore, to achieve a good separation between the groups, we need to use both discriminant functions.

Results of Linear Discriminant Analysis:
The result of applying LDA on the wine quality dataset is shown by the stacked histogram of the values of the discriminant function for the instances of the different groups. We use the function ldahist() in R to make a stacked histogram of the values of the first discriminant function.

Prediction using LDA
To assess the accuracy of the discriminator in classification, we have taken two new instances and used LDA to classify them.

#R code

disc <- lda(quality ~ ., data = testwine)
predict(disc)
predict(disc)$class

newInstance1: fixed.acidity=7.1, volatile.acidity=0.24, citric.acid=0.41, residual.sugar=17.8, chlorides=0.046, free.sulfur.dioxide=39, total.sulfur.dioxide=145, density=0.9998, pH=3.32, sulphates=0.39, alcohol=8.7

predict(disc, newInstance)

class predicted =5; actual class=5

newInstance2: fixed.acidity=8.1, volatile.acidity=0.27, citric.acid=0.41, residual.sugar=1.45, chlorides=0.033, free.sulfur.dioxide=11, total.sulfur.dioxide=63, density=0.9908, pH=2.99, sulphates=0.56, alcohol=12

class predicted =5; actual class=4

Cross Validation using LDA:
The classifier is trained on one part of the data and used to predict the rest of the data.
#R code
trainSet <- sample(1:2020, 1010)
table(wine$quality[trainSet])

classifier <- lda(quality ~ ., data = wine, subset = trainSet)
Predicted <- predict(classifier, wine[-trainSet, ])$class
Actual <- wine$quality[-trainSet]
table(Actual, Predicted)

Output obtained:

The confusion matrix obtained above shows the number of instances that were predicted correctly and the number of instances that were misclassified. Thus, the LDA technique is widely used for classification, especially when there is a larger number of classes.

C. K-means clustering on Wine quality dataset

Clustering is defined as the grouping of similar things together. It is often confused with classification. Clustering is an unsupervised exploratory procedure whereas classification is supervised and used for prediction purposes. K-means is an unsupervised clustering algorithm. It assigns the data to a particular cluster given k clusters, where k is decided a priori. The first step is to decide on the ideal choices for the k centroids. The next step is to associate the values of the dataset with the nearest possible centroid, iteratively, by computing the distance. The main objective of clustering is optimizing the objective function, here a squared-error function. It can be employed in detecting anomalies in the data, for example fraudulent transactions or scams. We have used the K-means clustering technique on the wine quality dataset to cluster the wine samples into seven clusters representing the quality level of the wine sample (ranging from 3 to 9). The following is the pseudo code of the K-means clustering algorithm.

Pseudo code:
1: assign each tuple to a randomly chosen cluster
2: calculate the centroid for each cluster
3: loop until no new centroid is obtained
4:   assign each instance to the closest cluster
5:   (the cluster with the closest centroid to the tuple)
6:   update each centroid of the cluster
7:   (based on the new cluster assignments)
8: end loop
9: return clusters
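The pseudo code above can be sketched in pure Python. This is an illustrative toy example, not the Matlab/R implementation used in the project; a deterministic round-robin assignment stands in for the random initialization so the result is reproducible.

```python
def kmeans(points, k, iters=100):
    """Lloyd's k-means on a list of 2-D points, following the pseudo code:
    initial assignment, then alternating assign/update steps."""
    assign = [i % k for i in range(len(points))]   # round-robin init
    for _ in range(iters):
        cents = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:                 # empty cluster: re-seed on a point
                members = [points[c]]
            cents.append([sum(m[0] for m in members) / len(members),
                          sum(m[1] for m in members) / len(members)])
        # re-assign each point to the cluster with the closest centroid
        new = [min(range(k),
                   key=lambda c: (p[0] - cents[c][0]) ** 2
                               + (p[1] - cents[c][1]) ** 2)
               for p in points]
        if new == assign:                   # no assignment changed: converged
            break
        assign = new
    return assign, cents

# Two well-separated hypothetical blobs
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, cents = kmeans(pts, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```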


We have implemented K-means clustering in Matlab and obtained the following output:

Fig. 2. K-means clustering

The plot obtained did not clearly indicate the seven clusters. In order to get a better visualization of the seven clusters, we used a library called rattle in R. The following plot was obtained after running K-means in R.

Fig. 3. K-means clustering

D. Principal Component Analysis (PCA) on Wine Quality dataset

PCA stands for Principal Component Analysis. It is the process of transforming a set of observations of correlated variables into sets of values called principal components. The number of principal components is less than or equal to the number of initial variables. This construction gives us the first principal component, which possesses the largest variance. PCA identifies patterns in data on the basis of the variance in their similarities and differences. It is popular as it works well on data of high dimensions. Moreover, compression of the data by reduction of dimensions incurs little loss of information when the discarded components carry little of the variance.

Principal Component:
A linear combination of the original variables. The first principal component explains most of the variation in the data, the second principal component explains most of the remaining variance, and so on.

Standard Deviation (s):

s = sqrt( sum of (x - xbar)^2 / (n - 1) )

Variance:
Variance is a measure of the spread of the data in a data set.
Variance = s^2

Pseudo Code of PCA:
Step 1: Calculate the mean of the data set.
Step 2: Subtract the mean from each original value, giving (X - XMEAN).
Step 3: Calculate the standard deviation (s) and variance from the values above. This step performs the centering of the data.
Step 4: The covariance of the data is calculated by the formula
cov(X, Y) = sum of (X - XMEAN)(Y - YMEAN) / (n - 1)
Step 5: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Step 6: Choose the principal components. Select the eigenvectors with the largest eigenvalues to be the principal components.
Step 7: Derive the new data set.
We have implemented PCA on the wine quality dataset in Matlab. The following plots were obtained:
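Steps 1 to 6 of this pseudo code can be sketched in pure Python for the leading component, using power iteration in place of a full eigen-decomposition. This is an illustrative sketch with hypothetical 2-D data, not the Matlab implementation used in the project.

```python
import math

def first_principal_component(data, iters=200):
    """First principal component of a list of 2-D points via power
    iteration on the sample covariance matrix (leading component only)."""
    n = len(data)
    mean = [sum(p[j] for p in data) / n for j in range(2)]
    centered = [[p[0] - mean[0], p[1] - mean[1]] for p in data]
    # 2x2 sample covariance matrix, divisor n - 1
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(2)] for i in range(2)]
    v = [1.0, 1.0]                      # arbitrary start vector
    for _ in range(iters):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]  # converges to dominant eigenvector
    return v

# Hypothetical points whose variance lies mainly along the x-axis
pts = [[0, 0], [2, 0.1], [4, -0.1], [6, 0.2], [8, -0.2]]
pc1 = first_principal_component(pts)
print(pc1)  # first component points almost exactly along the x-axis
```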


Fig. 4. First two principle Components

In the case of the Wine Quality dataset, the attributes alcohol content and fixed acidity are observed to best describe the wine quality. Thus these two features are used as the dimensions when plotting the graph. The lower the acidity, the better the quality, whereas an alcohol content between 10 and 12 percent is considered to be good. The wine is considered very good if the alcohol content is more than 12 percent and the acidity is as low as possible. PCA is widely used for exploratory data analysis because it finds the most significant variables that explain most of the variance in the data. So when the dataset is huge, PCA can make the task much easier. PCA is used in diverse fields. It reduces a complex dataset to a lower dimension to reveal the interesting patterns in the data.

E. Multiple Linear Regression on Communities and Crimes Dataset

Linear Regression is an approach for modelling the relationship between the response (dependent) variable and one or more predictor (independent) variables. There are two types of linear regression techniques, namely Simple Linear Regression, wherein there is only one predictor variable, and Multiple Linear Regression, wherein there is more than one predictor variable. The Simple Linear Regression model is as follows:

Y = b0 + b1*x + e

where Y is the response variable, b0 and b1 are the intercept and the slope respectively, x is the predictor variable, and e is the noise term. The Multiple Linear Regression model is as follows:

Y = b0 + b1*x1 + b2*x2 + ... + bp*xp + e

In this model we have more than one predictor variable, each with its own slope coefficient. The Multiple Linear Regression technique on the Communities and Crimes dataset has helped in finding out the significant attributes that are responsible for predicting the response (i.e. ViolentCrimesPerPop) more accurately.
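For the simple one-predictor case, the least-squares estimates have the closed form b1 = sum of (x - xbar)(y - ybar) / sum of (x - xbar)^2 and b0 = ybar - b1*xbar. A minimal Python sketch with hypothetical, noise-free data:

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x (closed-form estimates)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical noise-free data lying exactly on the line y = 2 + 3x
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
print(simple_ols(x, y))  # → (2.0, 3.0)
```

On noise-free data the fit recovers the intercept and slope exactly; with noise, the estimates minimize the sum of squared residuals.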

Analysis of the Communities and Crimes Dataset: Building the model
First we have built the model using all the attributes of the dataset. The following R commands were used:
lm.mo <- lm(ViolentCrimesPerPop ~ ., data = crimes)
summary(lm.mo)
R output:


From the model obtained above, we can infer that the model has a significant number of noise variables. In order to eliminate the noise variables, we have implemented a variable selection algorithm wherein we select only the significant variables that have a high correlation with the response variable. We perform the correlation test on the predictor variables and then rebuild the model with the significant variables alone. Re-building the model with significant variables:

R-output:

The model obtained above is significant, as we can see that the p-value of the model is close to 0, as are those of all the predictor variables. We have also obtained a large value of the F-statistic, which indicates significance.

Prediction using the above model:

Given the set of values of the new instance: state = 1, racepctblack = 0.48, PctEmploy = 0.57, MalePctNevMarr = 0.45, PctWorkMom = 0.54, NumStreet = 0.09. We can now predict the value of the response variable by using the following multiple linear regression equation that we have obtained from the above model:

ViolentCrimesPerPop = 0.3208 - (0.00214)state + (0.4989)racepctblack - (0.1681)PctEmploy + (0.1072)MalePctNevMarr - (0.1592)PctWorkMom + (0.4610)NumStreet = 0.46607

The predicted value of ViolentCrimesPerPop for the new instance is 0.466 whereas the actual value is 0.5. Hence we can say that the model does fairly well when given a new instance. Thus, using Multiple Linear Regression, we have obtained one of the optimal prediction models, though there can be other optimal models similar to the one we have derived. From this model we can infer that ViolentCrimesPerPop depends on the predictor variables state, racepctblack, PctEmploy, MalePctNevMarr, PctWorkMom and NumStreet.
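This arithmetic can be checked with a few lines of Python, using the coefficients and instance values above:

```python
# Coefficients from the refitted model (intercept listed first)
coef = {"intercept": 0.3208, "state": -0.00214, "racepctblack": 0.4989,
        "PctEmploy": -0.1681, "MalePctNevMarr": 0.1072,
        "PctWorkMom": -0.1592, "NumStreet": 0.4610}
# Attribute values of the new instance
instance = {"state": 1, "racepctblack": 0.48, "PctEmploy": 0.57,
            "MalePctNevMarr": 0.45, "PctWorkMom": 0.54, "NumStreet": 0.09}

# Multiple-linear-regression prediction: intercept + sum of coef * value
pred = coef["intercept"] + sum(coef[k] * v for k, v in instance.items())
print(round(pred, 5))  # → 0.46608, matching the 0.466 reported above
```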


Multiple linear regression is a very flexible method.The predictor variables can be numeric or categorical,and interactions between variables can be incorporated.This technique makes use of the data very efficiently.The models obtained are used for the prediction of newinstances. One of the disadvantages of Multiple LinearRegression is that it is sensitive to outliers.

IV. Performance of the data mining techniques discussed above

Beginning with the Spambase dataset, the performance of Logistic Regression on this dataset is 80.27%, which is quite good. The reason behind the high performance is that Logistic Regression is used for binary classification and performs well on datasets that have binary class labels (0 or 1), as is the case for the Spambase dataset. Other classification techniques might not give such high classification performance as logistic regression. The logistic regression technique is also robust: the predictor variables do not have to be normally distributed or have equal variance in each group. However, there are several drawbacks to logistic regression. Logistic regression can perform well on a dataset that has a large number of predictor variables, but there are many situations where it does not. The parameter estimation procedure of logistic regression relies heavily on having an adequate number of samples for each combination of predictor variables, so small sample sizes can lead to widely inaccurate parameter estimates. Thus, before using logistic regression, we should first make sure that the sample size is large enough.
Data mining techniques like K-means clustering, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) were performed on the Wine quality dataset. From the performance of all these techniques, it was observed that LDA performs better than the other two. K-means clustering does not perform well: as we have seen, at the first attempt the clusters obtained were not distinct. This is due to the fact that the data points overlap and do not have a clean separation from other data points. PCA and LDA are quite similar to each other. However, PCA is not always optimal for classification, whereas LDA is better suited to it.
In PCA, the shape and the location of the original dataset change when dimensions are reduced, whereas LDA does not change the location of the original dataset but instead provides a better separation between the classes, as we have clearly seen in the analysis of wine quality using LDA, which gave 83.12% separation. The main advantage of PCA is that it is completely non-parametric, that is, it can produce results given any dataset. PCA serves to represent the data in a simpler, reduced form. But in this case, given the nature and characteristics of the wine quality dataset, the LDA technique best suits the classification of wine samples.
Lastly, we analyse the implementation of Multiple Linear Regression (MLR) on the Communities and Crimes dataset.

The model built using MLR provides good prediction with a low error rate. The MLR technique is widely used for building models that have more than one predictor variable. While implementing MLR, we first need to make sure that the class we are predicting is numeric and continuous. One of the drawbacks of MLR is that all the attributes must be numeric. Therefore, to satisfy this condition, we had to eliminate one nominal attribute from the dataset before applying the MLR technique. On average, this technique is better suited to datasets whose attributes are numeric and whose response variable is continuous.

V. Conclusion and Future work

We have studied the application of data mining techniques like K-means clustering, Logistic Regression, Principal Component Analysis, Multiple Linear Regression, and Linear Discriminant Analysis on UCI machine learning datasets, namely the Wine quality, Spambase, and Communities and Crimes datasets. We have analysed the performance of the algorithms on these datasets. We were also able to draw various inferences regarding the selection of a specific data mining technique for a particular dataset depending on the characteristics of the dataset. Thus, with the selection of appropriate techniques, we were able to obtain the expected results. We also compared the performance of three techniques, K-means clustering, PCA, and LDA, on the Wine quality dataset and concluded that LDA performs better than the other two. Our future work would include the analysis of more datasets and the application of data mining techniques in addition to the ones described in this project.

References

[1] http://archive.ics.uci.edu/ml/datasets/Wine
[2] http://archive.ics.uci.edu/ml/datasets/Spambase
[3] http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
[4] http://www.mathworks.com/help/stats/princomp.html
[5] http://udel.edu/~mcdonald/statlogistic.html
[6] http://www.wikipedia.org/