
Data Mining Course Project:

Income Analysis

SFWR ENG / COM SCI 4TF3

December 10, 2002

Produced By:
Ajanthan Rajalingam 9808797
Damandeep Matharu 9803391
Kobi Vinayagamoorthy 9813645
Narinderpal Ghoman 9806817


1. Introduction
   1.1 Background
   1.2 Scope
   1.3 Objectives
   1.4 Dataset Acquisition
2. Data Preprocessing
   2.1 Data Preparation for Removing Outliers and Missing Values
   2.2 Outliers
      2.2.1 Method 1: Combined computer and human inspection
      2.2.2 Method 2: Removing Outliers
   2.3 Missing Values
   2.4 Data Preparation for WEKA
3. Learning Schemes
   3.1 1R algorithm
   3.2 ID3 algorithm
   3.3 J48 algorithm with Default Parameters
   3.4 J48 algorithm with Non-Default Parameters
4. Performance Experiments and Postprocessing
   4.1 Training Set
   4.2 Testing Set
   4.3 Stratified tenfold cross-validation
5. Conclusion
6. References
7. Appendix


1. Introduction

1.1 Background

Society produces enormous amounts of raw data in order to record facts and, later, to find the patterns underlying those facts. Collected raw data are useless without techniques to automatically extract information from them. Data Mining is the process of extracting previously unknown and potentially useful knowledge from such large collections of data. Data Mining has several major components, including Classification, Association Rules and Sequence Analysis.

A classification rule partitions the given data into predefined classes. In this process, a set of data (the training set) is analyzed and a set of grouping rules is generated which can be used to classify future data (the testing set). For example, a car dealer may classify cars into classes or subclasses such as good cars and bad cars, together with the indications that describe each class.

An association rule captures association relationships among a set of attributes in the given data. In this process, a set of association rules is generated at multiple levels of abstraction from the relevant set(s) of attributes. For example, a car dealer may discover a set of symptoms that often occur together with certain kinds of cars and further study the reasons behind them; this may disclose useful patterns that distinguish a good car from a bad one.

In sequence analysis, patterns that occur in sequence are discovered. For example, a customer rents "Rocky 1", then "Rocky 2", and then "Rocky 3".

1.2 Scope

The scope of this paper is limited to classification algorithms. The following is a list of the data classification methods studied in this paper.

Data Classification Methods

• 1R algorithm

The 1R algorithm generates a set of simple rules that all test one particular attribute, and builds a one-level decision tree. It treats missing values as a separate attribute value. This algorithm shows that it is easy to get reasonable performance on a variety of classification problems by analyzing only one attribute. However, its error rate is substantially higher than that of C4.5 decision trees.

• ID3 algorithm

The ID3 algorithm is a decision-tree building algorithm. It determines the classification of objects by testing the values of their properties. It builds a decision tree for the given data in a top-down fashion, starting from a set of objects and a specification of properties. At each node of the tree, one property is tested, chosen to maximize information gain (equivalently, to minimize the entropy of the resulting subsets), and the results are used to split the object set; a small sketch of this criterion is given after this list. The process is applied recursively until the set in a given sub-tree is homogeneous (i.e. it contains objects belonging to the same category), at which point it becomes a leaf node of the decision tree. The ID3 algorithm uses a greedy search: it selects a test using the information gain criterion and never revisits that choice. This makes it very efficient in terms of processing time. A disadvantage, however, is that the information gain criterion has a strong bias in favour of tests with more outcomes over tests with fewer outcomes. This can produce unnecessarily wide decision trees with many leaves containing single instances, which may lead to over-training. The ID3 algorithm also cannot deal with numeric attributes, missing values or noisy data.

• C4.5 algorithm

The C4.5 algorithm generates a decision tree for the given data by recursively splitting that data, growing the tree with a depth-first strategy. C4.5 considers all the possible tests that can split the data and selects the test that scores best under its splitting criterion, the gain ratio. This criterion removes ID3's bias in favour of wide decision trees. For each discrete attribute, one test is used that produces as many outcomes as there are distinct values of the attribute. For each continuous attribute, the data is sorted and the entropy gain is calculated for binary cuts at each distinct value in one scan of the sorted data; this process is repeated for all continuous attributes. The C4.5 algorithm allows pruning of the resulting decision trees. Pruning increases the error rate on the training data but, importantly, decreases the error rate on unseen testing data. C4.5 can also deal with numeric attributes, missing values, and noisy data.

• J48 algorithm

The J48 algorithm is WEKA's Java implementation of the C4.5 algorithm (based on a slightly improved revision of C4.5), and it is supplied with the WEKA tool set.
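To make the attribute-selection criterion used by ID3 and C4.5 concrete, the following is a minimal Java sketch of entropy and information gain for a nominal split. It is not part of the project's code, and the class counts in main are a hypothetical example.

public class InfoGain {

    // Entropy of a class distribution, given the instance counts per class.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));   // log base 2
        }
        return h;
    }

    // Information gain of splitting a node into the given child distributions.
    static double informationGain(int[] parentCounts, int[][] childCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double remainder = 0.0;
        for (int[] child : childCounts) {
            int size = 0;
            for (int c : child) size += c;
            remainder += ((double) size / total) * entropy(child);
        }
        return entropy(parentCounts) - remainder;
    }

    public static void main(String[] args) {
        // Hypothetical node of 14 instances (9 "<=50K", 5 ">50K") split on a
        // three-valued attribute.
        int[] parent = {9, 5};
        int[][] children = { {2, 3}, {4, 0}, {3, 2} };
        System.out.println("Gain = " + informationGain(parent, children));
    }
}

ID3 picks the attribute with the largest gain; C4.5 additionally divides the gain by the entropy of the split itself (the gain ratio), which is what removes the bias towards many-valued attributes.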


1.3 Objectives

The ultimate goal of this project is to determine whether an individual’s annual income exceeds 50,000 US dollars. This goal is achieved by the following objectives:

First, use the training dataset to train the machine to predict whether a person makes over 50K a year. If a data mining algorithm is unable to handle missing values and/or outliers, write our own code to clean the missing values and outliers.
Second, use the testing dataset to test whether the machine can predict if a person makes over 50K a year.
Third, based on the experimental results, compare the accuracy of the data mining algorithms used.
Fourth, compare and analyze the complexity of the implemented data mining algorithms, and consider what kind of technology they use to achieve high accuracy while avoiding over-fitting.

1.4 Dataset Acquisition

The dataset used in this paper is called "adult", and it is used to predict whether an individual's annual income exceeds 50,000 US dollars. The dataset contains:

• 48,842 instances (32,561 in the training set and 16,281 in the testing set)
• 14 attributes (6 continuous numerical and 8 nominal)
• 2 classes
• missing values (7%)

This dataset was extracted from the Data Extraction System (DES) of the US Census Bureau:

http://www.census.gov/ftp/pub/DES/www/welcome.html

The dataset can also be downloaded from the following ftp site: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult/


The following table provides details about the 14 attributes that will be used to train and test the outcome (the outcome is ≤ $50,000 or > $50,000):

Age (Numerical, years): 16 to 150 years
Workclass (Nominal): Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
Fnlwgt (Numerical, zip code): 000000 to 999999
Education (Nominal): Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num (Numerical, years): 0 to 17 years
Marital-status (Nominal): Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Occupation (Nominal): Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Relationship (Nominal): Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race (Nominal): White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex (Nominal): Female, Male
Capital-gain (Numerical, US dollars): 0 to 50K
Capital-loss (Numerical, US dollars): 0 to 50K
Hours-per-week (Numerical, hours): 0 to (24 * 7) hours
Native-country (Nominal): United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands


1.4.1 Training Set

The training set is the set of independent instances used to form the classifier. Generally, the larger the training dataset, the better the classifier. Two thirds of the whole dataset is used for training (i.e. 32,561 instances). Due to the large size of the training set, we expect to obtain a good classifier.

1.4.2 Testing Set

The testing set is the set of independent instances that play no part in the formation of the classifier. The larger the testing dataset, the more accurate the error estimate. One third of the whole dataset is used for testing (i.e. 16,281 instances). Due to the large size of the testing set, we expect to estimate the error accurately.

2. Data Preprocessing

2.1 Data Preparation for Removing Outliers and Missing Values

The original dataset contained some missing values and outliers, and we were required to develop our own code to clean up the dataset. In the original dataset, the 14 attributes of each instance were separated by commas. We converted the dataset into a Microsoft Excel spreadsheet and separated the 14 attributes into separate columns. We then used our own code, developed in Visual Basic, to clean outliers and missing values from the dataset. The software code used to clean up the data is provided in Appendices 2 and 3.

2.2 Outliers

Data objects that vary considerably and/or fall outside the expected or acceptable range can be considered outliers. Such data objects do not comply with the general behavior or model of the data. Outliers can be caused by measurement or execution errors. They can worsen the performance of data mining algorithms and may cause the algorithms to produce inaccurate decisions. For this project, various techniques were employed to reduce the effect of outliers on the performance of the data mining algorithms.


Mining outliers from a dataset involves two main sub-problems:

1) Defining the data that can be considered inconsistent in the given dataset.
2) Finding an efficient method for mining the data found to be inconsistent.

The solution to these problems varies with every dataset. The objective (i.e. the purpose) of mining the data must also be clearly stated in order to make decisions about outlier detection and its treatment effectively. The task of classifying outliers becomes even trickier when the outliers are hidden in trends or in seasonal or cyclic changes. Multi-dimensional data analysis can sometimes reveal outliers that would otherwise go undetected. For the Census Bureau data project, outliers were identified by analyzing multi-dimensional attributes using clustering, a distance-based approach and visualization. A combination of MS Access and MS Excel was used to gain an in-depth understanding of the different attributes. SQL queries were used to perform multi-dimensional analysis, and a clustering method was used to construct graphs for detecting outliers.

2.2.1 Method 1: Combined computer and human inspection

Outlier detection for the numerical attributes was conducted by running multi-dimensional query analyses, inspecting cluster graphs, and using a distance-based method to cluster the data and normalize the outliers.

I) Age Analysis

A single-dimensional analysis of the age attribute shows that the minimum age is seventeen years and the maximum age is ninety years (SQL query in Appendix 2, Query 1.1; the result of the query is not shown here). We did not find any outliers in this attribute, and we assume that the ages provided by the dataset are a correct representation of the original participants of the census.


II) Compare WorkClass with Capital Loss

Query: Counts all the rows of the capital loss attribute, categorized by the work class attribute. Code: Appendix 2 Query 2.1

WorkClass          Count_CapitalLoss
Federal-gov        960
Local-gov          2093
Never-worked       7
Private            22696
Self-emp-inc       1116
Self-emp-not-inc   2541
State-gov          1298
Without-pay        14

Query: Finds the minimum and the maximum value of the capital loss attribute, categorized by the work class attribute. Code: Appendix 2 Query 2.2

WorkClass          Count_CapitalLoss   MIN    MAX
Federal-gov        58                  625    3683
Local-gov          127                 323    2444
Private            982                 155    4356
Self-emp-inc       84                  1485   2559
Self-emp-not-inc   152                 880    2824
State-gov          58                  625    3683

The query analysis did not show any visible outliers, so further tests were conducted by plotting the age attribute against the capital loss attribute (Appendix 1 Graph 1: Age vs Capital Loss). The cluster graph clearly shows outliers identifiable by human inspection: most of the points fall within the range of 1340 to 2824, and only a few points lie outside this desired range. Using the normalization method, all values that fall outside the desired range were brought into it. The Visual Basic code is provided in Appendix 2 (Code: Capital Loss Outlier Treatment).

Pseudo code:
  Outlier detection: Outlier_Upper = Capital Loss > 2824
  Action: adjust the values in Outlier_Upper by randomly assigning a value between 2559 and 2824,
          i.e. 2559 <= replace_value <= 2824.
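For illustration, the following is a minimal Java sketch of the treatment described in the pseudo code above; the project's actual implementation is the Visual Basic code in Appendix 2, and the sample values in main are hypothetical.

import java.util.Random;

public class CapitalLossOutlierTreatment {
    private static final int LOWER = 2559;
    private static final int UPPER = 2824;
    private static final Random RNG = new Random();

    // Values above the upper bound are replaced with a random value in [LOWER, UPPER].
    static int normalize(int capitalLoss) {
        if (capitalLoss > UPPER) {
            return LOWER + RNG.nextInt(UPPER - LOWER + 1);
        }
        return capitalLoss;   // values inside the desired range are left unchanged
    }

    public static void main(String[] args) {
        int[] sample = {0, 1902, 3683, 4356};
        for (int v : sample) {
            System.out.println(v + " -> " + normalize(v));
        }
    }
}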


After the change

After the treatment of the capital loss outliers, another cluster graph was constructed; see Appendix 1 Graph 2: Age vs Capital Loss (fixed).

III) Compare Age with Capital gain

Query: Finds the minimum and the maximum value of the capital gain attribute, categorized by the work class attribute. Code: Appendix 2 Query 3.1

WorkClass          Count_CapitalGain   Min    Max
Federal-gov        94                  1471   99999
Local-gov          194                 594    99999
Private            1732                114    99999
Self-emp-inc       199                 1151   99999
Self-emp-not-inc   261                 401    99999
State-gov          106                 114    99999
Without-pay        2                   2414   4416

This query shows that the outliers are the values of 99999.

Query: Finds the minimum and the maximum value when capital gain < 99999. Code: Appendix 2 Query 3.2

WorkClass          Count_CapitalGain   Min    Max
Federal-gov        93                  1471   20051
Local-gov          188                 594    20051
Private            1650                114    41310
Self-emp-inc       164                 1151   27828
Self-emp-not-inc   232                 401    41310
State-gov          105                 114    25236
Without-pay        2                   2414   4416

Using the results of the above query, we can get rid of the 99999 values; the code is shown in Appendix 2 (Code: Capital Gain Outlier Treatment).

After the change

After the outliers are fixed, a new cluster graph is drawn (Appendix 1 Graph: Age vs Capital Gain (fixed)). The following query now shows the treated values for the capital gain attribute.

Query: Finds the minimum and the maximum value of the capital gain attribute. Code: Appendix 2 Query 3.3


WorkClass          Count_CapitalGain   Min    Max
Federal-gov        94                  1471   20051
Local-gov          194                 594    20051
Private            1732                114    41310
Self-emp-inc       199                 1151   27828
Self-emp-not-inc   261                 401    41310
State-gov          106                 114    25236
Without-pay        2                   2414   4416

IV) Hours per week analysis

Query: Finds the minimum, maximum and average hours worked, categorized by work class. Code: Appendix 2 Query 4.1

WorkClass          Count_HoursWorkedPerWeek   MinHours   MaxHours   AverageHours
Federal-gov        960                        4          99         41.379166666666
Local-gov          2093                       2          99         40.982799808886
Never-worked       7                          4          40         28.428571428571
Private            22696                      1          99         40.267095523440
Self-emp-inc       1116                       1          99         48.818100358422
Self-emp-not-inc   2541                       1          99         44.421881149153
State-gov          1298                       1          99         39.031587057010
Without-pay        14                         10         65         32.714285714285

The results of the query indicate that the Never-worked value of the work class attribute has non-zero values for hours worked. This is an example of how multi-dimensional analysis can reveal outliers that would otherwise go undetected. To fix the problem, a value of zero was assigned to the hours worked field wherever the work class value is Never-worked (Appendix 2 Code: Hours worked (Never-worked only) outlier treatment).

After the change

After the fix, the following query is analyzed. Query: Finds the minimum, maximum and average hours worked, categorized by work class. Code: Appendix 2 Query 4.2

WorkClass          Count_HoursWorkedPerWeek   MinHours   MaxHours   AverageHours
Federal-gov        960                        4          99         41.379166666666
Local-gov          2093                       2          99         40.982799808886
Never-worked       7                          0          0          0
Private            22696                      1          99         40.267095523440
Self-emp-inc       1116                       1          99         48.818100358422
Self-emp-not-inc   2541                       1          99         44.421881149153
State-gov          1298                       1          99         39.031587057010
Without-pay        14                         10         65         32.714285714285


The results of the query indicate that the value of 99 is an outlier. Correcting this outlier requires further analysis. A scatter graph was created by plotting the age attribute against the hours worked (Appendix 1 Graph: Age vs Hours Worked). The graph shows that the majority of the points lie in the 1 - 60 range; however, there is a significant number of points in the 61 - 80 and 81 - 99 ranges. To find the correct normalizing factor, further queries were constructed to analyze the number of points in these ranges.

Query: Finds the minimum and maximum hours worked in the range of 60 - 80, categorized by work class. Code: Appendix 2 Query 4.3

WorkClass          Count_HoursWorked   Min   Max
Federal-gov        15                  65    72
Local-gov          43                  61    78
Private            410                 62    78
Self-emp-inc       87                  62    78
Self-emp-not-inc   168                 62    78
State-gov          30                  61    77
Without-pay        1                   65    65

Query: Finds the minimum and maximum hours worked in the range of 81 - 99, categorized by work class. Code: Appendix 2 Query 4.4

WorkClass          Count_WorkHours   Min   Max
Local-gov          5                 90    97
Private            59                81    98
Self-emp-inc       10                81    98
Self-emp-not-inc   41                84    98
State-gov          3                 84    90

Using the results of the above queries, a normalizing factor of 20 was chosen. The code shown in Appendix 2 (Code: Hours Worked outlier treatment) normalizes the values that fall in the range of 80 - 99; the new values fall in the range of 60 - 80.
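As an illustration of this step, a minimal Java sketch of the shift by the normalizing factor is shown below; the project's implementation is the Visual Basic code in Appendix 2.

public class HoursWorkedOutlierTreatment {
    static final int NORMALIZING_FACTOR = 20;

    // Values of 80 or more are reduced by the normalizing factor,
    // bringing them back into the 60 - 80 range.
    static int normalize(int hoursPerWeek) {
        return hoursPerWeek >= 80 ? hoursPerWeek - NORMALIZING_FACTOR : hoursPerWeek;
    }

    public static void main(String[] args) {
        System.out.println(normalize(99));   // prints 79
        System.out.println(normalize(40));   // prints 40 (unchanged)
    }
}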

After the change

Query: Finds the minimum, maximum and average hours worked, categorized by work class. Code: Appendix 2 Query 4.5


WorkClass          Count_HoursWorkedPerWeek   MinHours   MaxHours   AverageHours
Federal-gov        960                        4          79         41.275
Local-gov          2093                       2          79         40.829909221213
Never-worked       7                          0          0          0
Private            22696                      1          79         40.121695452943
Self-emp-inc       1116                       1          79         48.137096774193
Self-emp-not-inc   2541                       1          79         43.713498622589
State-gov          1298                       1          79         38.908320493066
Without-pay        14                         10         65         32.714285714285

A new cluster graph was drawn using the age and hours worked attributes (Appendix 1 Graph: Age vs Hours Worked (fixed)).

2.2.2 Method 2: Removing Outliers

We plotted the raw data onto a graph and examined the data distribution. From the graph, we found that 99999 is an outlier. Using this information, we wrote a program to remove 99999 from the dataset. The software code used to remove the outliers is provided in Appendix 3. The following figure is a sample graph that was used to examine the data distribution.

Figure 2.2.2: Graph showing data distribution
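The removal program itself is in Appendix 3; the following is only a minimal Java sketch of the idea, under the assumption that the comma-separated file uses the attribute order of Section 1.4 (so capital-gain is field index 10) and that rows carrying the outlier value are dropped. The file names are hypothetical.

import java.io.*;

public class RemoveOutliers {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("adult.data"));
             PrintWriter out = new PrintWriter(new FileWriter("adult_clean.data"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");
                // Drop any row whose capital-gain equals the outlier value 99999.
                if (fields.length > 10 && fields[10].trim().equals("99999")) {
                    continue;
                }
                out.println(line);
            }
        }
    }
}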


2.3 Missing Values

A major part of the data cleaning process involves detecting and correcting missing values in the dataset. Missing values may occur for various reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets. The following table shows the different methods that were used to fix missing values in the dataset for this project.

Missing value correction methods

NUMERICAL AND NOMINAL ATTRIBUTES
Replace with Missing Value Attribute: takes an input file with missing values denoted "?" and replaces each of them with a "Missing" attribute value.

NOMINAL ATTRIBUTES
ReplaceNomMostFrequent: replaces all missing nominal values in the dataset with the most frequent value of that attribute.

NOMINAL ATTRIBUTES
ReplaceNomMostFrequentSameClass: replaces all missing nominal values in the dataset with the most frequent value of that attribute within the same class as the instance containing the missing value.

NUMERICAL ATTRIBUTES
Replacedbyglobalvalues(): replaces all missing numerical values in the dataset with a global value entered by the user.

NUMERICAL ATTRIBUTES
Replacedbyaverage_of_present_feature(): replaces all missing numerical values in the dataset with the average of the present values of that feature.

NUMERICAL ATTRIBUTES
Replacedbyaverage_of_corresponding_class(): replaces all missing numerical values in the dataset with the average of the corresponding class.
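To illustrate the ReplaceNomMostFrequent idea, the following is a minimal Java sketch; it is not the project's Visual Basic code (Appendix 3), and the sample workclass column in main is hypothetical.

import java.util.*;

public class ReplaceNomMostFrequent {
    // Replace every "?" in a nominal column with the most frequent non-missing value.
    static void replace(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            if (!"?".equals(v)) counts.merge(v, 1, Integer::sum);
        }
        String mostFrequent = Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        for (int i = 0; i < column.length; i++) {
            if ("?".equals(column[i])) column[i] = mostFrequent;
        }
    }

    public static void main(String[] args) {
        String[] workclass = {"Private", "?", "Self-emp-inc", "Private", "?"};
        replace(workclass);
        System.out.println(Arrays.toString(workclass));   // the "?" entries become "Private"
    }
}

The same-class variant differs only in that the frequency counts are restricted to instances sharing the class of the instance with the missing value.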


2.4 Data Preparation for WEKA

To use the data mining software WEKA, the dataset must be converted into ARFF format. After fixing the missing values and outliers, the dataset was saved as a comma-delimited file using Microsoft Excel; this converted the column-separated attributes back into comma-separated attributes. Microsoft Word was then used to insert the header structure that makes up the beginning of an ARFF file, and the dataset was saved as an ARFF file. A sample view of the ARFF file for the adult dataset is shown below in Figure 2.4.1. The ARFF format is as follows:

% ARFF file for adult data with some numeric features
%
@relation outlier
@attribute age integer
@attribute workclass {Priv, Self-emp, Gov}
@attribute fnlwgt integer
@attribute education {Bach, college, HS, Masters, Doct}
@attribute education-num integer
@attribute marital-status {Div, Sin, Sep, Mar}
@attribute occupation {Tech, Craft, Sales, Exec}
@attribute relationship {Wife, Husband, relative}
@attribute race {White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black}
@attribute sex {F, M}
@attribute capital-gain integer
@attribute capital-loss integer
@attribute hours-per-week integer
@attribute native-country {US, Eng, CA}
@attribute income {<=50K, >50K}
@data
39, Gov, 7, Bach, 13, Mar, Tech, Wife, White, M, 2174, 0, 40, US, <=50K
50, Priv, 5, HS, 9, Div, Craft, relative, White, M, 0, 0, 40, US, <=50K
53, Private, 4, HS, 7, Sep, Sales, Husband, Black, M, 0, 0, 40, US, <=50K

Figure 2.4.1: A sample view of the ARFF file for the adult dataset
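Once in ARFF format, the file can be loaded either through the WEKA GUI or from Java. The following is a minimal sketch assuming the WEKA 3 Java API; the file name adult.arff is hypothetical.

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // The Instances(Reader) constructor parses an ARFF file.
        Instances data = new Instances(new BufferedReader(new FileReader("adult.arff")));
        // The income attribute (the last one) is the class to predict.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}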


3. Learning Schemes

The classifiers we tried are 1R, ID3, J48 with default parameters, and J48 with non-default parameters. We ran tests on the following variations of the training and testing data:

I) Dataset 1: Replace missing values with the term "missing value" and delete any outliers. (For 1R and J48)

II) Dataset 2: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; and delete any outliers. (For 1R and J48)

III) Dataset 3: For nominal attributes, replace missing values with the most frequently occurring value for that specific class; for numerical attributes, replace missing values with the mean value; and delete any outliers. (For 1R and J48)

IV) Dataset 4: Replace missing values with the term "missing value"; delete any outliers; then change all numerical attributes into nominal attributes. (For ID3, 1R and J48)

V) Dataset 5: No change to the original training data and test data (keeping all missing values and outliers). (For 1R and J48)

VI) Dataset 6: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; and fix all outliers. (For 1R and J48)

VII) Dataset 7: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; delete any outliers; and then convert all numerical attributes into nominal attributes. (For ID3, 1R and J48)

VIII) Dataset 8: For nominal attributes, replace missing values with the most frequently occurring value for that specific class; for numerical attributes, replace missing values with the mean value; delete any outliers; and then convert all numerical attributes into nominal attributes. (For ID3, 1R and J48)


We will describe the experiments done using the datasets described above in the following sections.

3.1 1R algorithm

The first algorithm used was the 1R algorithm, which produces very simple rules based on one attribute only. Although it makes little sense to use this scheme for the final prediction, it is very useful for establishing a baseline performance benchmark before progressing to more sophisticated learning schemes; simplicity often pays off, and it is a good idea to start with the simplest test first. All of the datasets produced a similar outcome with the 1R algorithm. The following shows a section of the 1R result for the training set without removing the outliers or the missing values (Dataset 5). The complete 1R result is provided in Appendix 4.

=== Classifier model (full training set) ===

capital-gain:
< 3048.0 -> <=50K
< 3120.0 -> >50K
< 4668.5 -> <=50K
< 4826.0 -> >50K
< 4932.5 -> <=50K
< 4973.5 -> >50K
< 5119.0 -> <=50K
< 5316.5 -> >50K
< 5505.5 -> <=50K
< 6457.5 -> >50K
< 7073.5 -> <=50K
< 10543.0 -> >50K
< 10585.5 -> <=50K
< 30961.5 -> >50K
< 70654.5 -> <=50K
>= 70654.5 -> >50K
(26059/32245 instances correct)

Correctly Classified Instances    26059    80.8156 %

From the output, we see that the attribute selected by the 1R algorithm was capital-gain, and the algorithm correctly classified 80.8156% of the instances.


The following shows a section of the 1R result for the testing set, again without removing the outliers or the missing values. The complete 1R result is provided in Appendix 4.

=== Classifier model (full training set) ===

capital-gain:
< 3048.0 -> <=50K
< 3120.0 -> >50K
< 4668.5 -> <=50K
< 4826.0 -> >50K
< 4932.5 -> <=50K
< 4973.5 -> >50K
< 5119.0 -> <=50K
< 5316.5 -> >50K
< 5505.5 -> <=50K
< 6457.5 -> >50K
< 7073.5 -> <=50K
< 10543.0 -> >50K
< 10585.5 -> <=50K
< 30961.5 -> >50K
< 70654.5 -> <=50K
>= 70654.5 -> >50K
(26059/32245 instances correct)

Correctly Classified Instances    13198    81.0638 %

From the output, we see that the algorithm correctly classified 81.0638% of the instances. However, the 1R algorithm was unable to find a single attribute that accurately determines whether an individual's income exceeds 50,000 US dollars.
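The runs above were performed through the WEKA tool set. For reference, the same train-and-test experiment can be driven from Java; this is a minimal sketch assuming the WEKA 3 API (package names differ slightly between releases), with hypothetical ARFF file names.

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;

public class OneRExperiment {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("adult_train.arff")));
        Instances test  = new Instances(new BufferedReader(new FileReader("adult_test.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        OneR oneR = new OneR();               // builds rules on a single attribute
        oneR.buildClassifier(train);
        System.out.println(oneR);             // prints the chosen attribute and its rules

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(oneR, test);       // error estimate on the unseen test set
        System.out.println(eval.toSummaryString());
    }
}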

3.2 ID3 algorithm

The ID3 algorithm builds a decision tree for the given data in a top-down fashion. We ran tests using only a few of the training and testing datasets described previously, because the ID3 algorithm requires all attributes to be nominal. The datasets used to test ID3 were Dataset 5, Dataset 7, and Dataset 8. The output obtained from Dataset 7 is examined here, since it is used in the performance analysis in Section 4.

Dataset 7: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; delete any outliers; and then convert all numerical attributes into nominal attributes.


When the output is produced, we noticed two main aspects about this algorithm (ID3):

i) The produced decision tree is extremely big.
ii) The algorithm suffers from the over-training problem.

The following shows a section of the ID3 result for the training set. The complete result is provided in Appendix 4.

| workclass = Private
| | race = White
| | | sex = Female
| | | | native-country = United-States: <=50K
| | | | native-country = Cambodia: null
| | | | native-country = England: <=50K
| | | | native-country = Puerto-Rico: null
| | | | native-country = Canada: null
| | | | native-country = Germany: null
| | | | native-country = Outlying-US(Guam-USVI-etc): null
| | | | native-country = India: null
| | | | native-country = Japan: null
| | | | native-country = Greece: null
| | | | native-country = South: null
| | | | native-country = China: null
| | | | native-country = Cuba: null
| | | | native-country = Iran: null
| | | | native-country = Honduras: null
| | | | native-country = Philippines: null
| | | | native-country = Italy: null
| | | | native-country = Poland: null
| | | | native-country = Jamaica: null
| | | | native-country = Vietnam: null
| | | | native-country = Mexico: null
| | | | native-country = Portugal: null
| | | | native-country = Ireland: null
| | | | native-country = France: null
| | | | native-country = Peru: null
| | | | native-country = Hong: null
| | | | native-country = Holand-Netherlands: null
| | | | native-country = Missing: null

Correctly Classified Instances    29020    90.4332 %

From the output, we see that wide decision trees with many leaves containing single instances are produced, which may lead to over-training. We also see that the algorithm correctly classified 90.4332% of the instances.


The following shows a section of the ID3 result for the testing set. The complete result is provided in Appendix 4.

| | occupation = Other-service
| | | native-country = United-States
| | | native-country = Cambodia: null
| | | native-country = England: >50K
| | | native-country = Puerto-Rico: <=50K
| | | native-country = Canada: null
| | | native-country = Germany: null
| | | native-country = Outlying-US(Guam-USVI-etc): null
| | | native-country = India: null
| | | native-country = Japan: null
| | | native-country = Greece: null
| | | native-country = South: null
| | | native-country = China: null
| | | native-country = Cuba: null
| | | native-country = Iran: null
| | | native-country = Honduras: null
| | | native-country = Philippines: null
| | | native-country = Italy: null
| | | native-country = Poland: null
| | | native-country = Jamaica: null
| | | native-country = Vietnam: null
| | | native-country = Hungary: null
| | | native-country = Guatemala: null
| | | native-country = Nicaragua: null
| | | native-country = Scotland: <=50K
| | | native-country = Thailand: null
| | | native-country = Yugoslavia: null
| | | native-country = El-Salvador: null
| | | native-country = Trinadad&Tobago: null
| | | native-country = Peru: null
| | | native-country = Hong: null
| | | native-country = Holand-Netherlands: null
| | | native-country = Missing: null

Correctly Classified Instances    11972    73.5336 %

From the output, we again see that wide decision trees with many leaves containing single instances are produced. We also see that the algorithm's accuracy has gone down significantly from the training set to the testing set, to 73.5336% of the instances.

3.3 J48 algorithm with Default Parameters

J48 builds a decision tree model by analyzing the training data, and uses this model to classify the testing data (user data). We used the default parameters provided in WEKA: for example, the confidence threshold for pruning is set to 0.25, the minimum number of instances per leaf is 2, and reduced-error pruning is set to false. Please see Appendix 4 for a screenshot of the WEKA default parameter settings. We ran tests using the different training and testing datasets described previously.

Dataset 5: We use the whole training data and test data (keeping all missing values and outliers).

When the output is produced, we noticed two main aspects about this algorithm (J48 with default parameters):

iii) The produced decision tree is moderately big.
iv) The algorithm suffers from the over-fitting problem.

The following shows a section of the J48 (default parameters) result for the training set, without removing the outliers or the missing values. The complete result is provided in Appendix 4.

=== Classifier model (full training set) ===

J48 pruned tree
------------------

occupation = Adm-clerical
| workclass = Private
| | sex = Female
| | | education = HS-grad
| | | | age <= 41
| | | | | hours-per-week <= 53: >50K (19.11/4.05)
| | | | | hours-per-week > 53: <=50K (2.0)

Number of Leaves  : 545
Size of the tree  : 680

Correctly Classified Instances    28299    87.7624 %

From the result shown above, we notice that females under 41 years of age who work 53 hours per week or less are classified as making more money than those who work more than 53 hours per week. This is a clear case of over-fitting, since we would expect females who work more hours to make more money. We also see that the algorithm correctly classified 87.7624% of the instances.


The following shows a section of the J48 (default parameters) result for the testing set, without removing the outliers or the missing values. The complete result is provided in Appendix 4.

=== Classifier model (full training set) ===

J48 pruned tree
------------------

occupation = Adm-clerical
| workclass = Private
| | sex = Female
| | | education = HS-grad
| | | | age <= 41
| | | | | hours-per-week <= 53: >50K (19.11/4.05)
| | | | | hours-per-week > 53: <=50K (2.0)

Number of Leaves  : 688
Size of the tree  : 874

Correctly Classified Instances    13878    85.6825 %

From the result shown above, we again notice that females under 41 years of age who work 53 hours per week or less are classified as making more money than those who work more than 53 hours per week. This is a clear case of over-fitting, since we would expect females who work more hours to make more money. We also see that the algorithm correctly classified 85.6825% of the instances.

3.4 J48 algorithm with Non-Default Parameters

We used all the default parameters provided in WEKA, except that we set reduced-error pruning to true (a sketch of setting this option through the WEKA Java interface is shown after the list below). Please see Appendix 4 for a screenshot of the WEKA non-default parameter settings. We ran tests using the different training and testing datasets described previously. The output obtained from Dataset 7 is examined here, since it is used in the performance analysis in Section 4.

Dataset 7: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; delete any outliers; and then convert all numerical attributes into nominal attributes.

When the output is produced, we noticed two main aspects about this algorithm (J48 with non-default parameters):

v) The produced decision tree is moderately big.
vi) The algorithm suffers less from the over-fitting problem, due to reduced-error pruning.
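For reference, the same non-default configuration can be set from Java; this is a minimal sketch assuming the WEKA 3 API (the project set the option through the WEKA GUI).

import weka.classifiers.trees.J48;

public class J48NonDefault {
    public static void main(String[] args) throws Exception {
        J48 j48 = new J48();
        j48.setReducedErrorPruning(true);   // the only change from the WEKA defaults
        // All other parameters keep their defaults, e.g. a minimum of two instances per leaf.
        System.out.println(java.util.Arrays.toString(j48.getOptions()));
        // The configured classifier can then be trained and evaluated exactly as in the
        // 1R sketch in Section 3.1.
    }
}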


The following shows a section of the J48 (non-default parameters) result for the training set. The complete result is provided in Appendix 4.

=== Classifier model (full training set) ===

J48 pruned tree
------------------

marital-status = Married-civ-spouse
| education-num = LowED
| | capital-gain = LowCapitalgain
| | | capital-loss = Lowcapitalloss
| | | | hours-per-week = HighHoursperweek
| | | | | native-country = United-States
| | | | | | age = Old
| | | | | | | education = Bachelors: <=50K (0.0)
| | | | | | | education = Some-college: <=50K (0.0)
| | | | | | | education = 11th: <=50K (56.0/7.0)
| | | | | | | education = HS-grad
| | | | | | | | workclass = Private
| | | | | | | | | race = White
| | | | | | | | | | occupation = Tech-support: >50K (12.0/5.0)

Number of Leaves  : 1327
Size of the tree  : 1479

Correctly Classified Instances    27325    85.1511 %

From the result shown above, we see that older individuals from the United States who hold a high-school diploma, are white, work in the private sector and have tech-support occupations are classified as making over 50,000 US dollars. There does not seem to be any over-fitting in this output. We also see that the algorithm correctly classified 85.1511% of the instances.

4. Performance Experiments and Postprocessing

The three different learning schemes used are J48 (Default and Non-Default), 1R and ID3. In this section, we compare the performance of these schemes. First, we compare the accuracy of each learning scheme on the training set. Secondly, we compare the learning schemes on the testing set. We then use the stratified tenfold cross-validation method to compare the learning schemes.


4.1 Training Set

Using the training set, the following table shows the percentage of correctly classified instances for each dataset (a dash marks combinations that were not run).

Data set   J48 (Default)   J48 (Non-Default)   1R        ID3
1          88.0773         87.2079             80.8663   -
2          88.0555         87.4696             80.8663   -
3          88.1957         87.6504             80.8663   -
4          -               -                   -         90.7645
5          87.7624         87.4244             80.8156   -
6          80.839          86.1798             80.9342   -
7          84.8457         85.1511             -         90.4332
8          -               -                   -         90.6825

We can see that J48 (Default) is better on almost all of the datasets. For the datasets that contain ID3 results, ID3 is better. We need to perform more tests using the testing set to confirm these results.

4.2 Testing Set

Data set   J48 (Default)   J48 (Non-Default)   1R        ID3
1          85.6825         85.7937             81.1447   -
2          85.5467         85.7564             81.1447   -
3          85.8307         85.948              81.1447   -
4          -               -                   -         78.3613
5          85.9898         85.824              81.0638   -
6          88.0593         87.2056             81.2358   -
7          84.0059         85.0059             -         73.5336
8          -               -                   -         73.552

The testing set shows that J48 (Non-Default) performs better. We will perform further analysis using stratified tenfold cross-validation to confirm this result.


4.3 Stratified tenfold cross-validation

We saw the performance of each algorithm using the training and testing data. It is possible that the sample used for training (or testing) is not representative. One way to remove the bias caused by any particular sample is to run many iterations with different random samples. To reduce the effect of any uneven representation in the training or testing set, we use the stratified tenfold cross-validation method: the data is divided randomly into ten parts, each part is held out in turn, and the learning scheme is trained on the remaining nine parts; its error rate is then calculated on the holdout set. The procedure is executed a total of ten times, and the error estimates from the ten portions are averaged to yield an overall error estimate. To analyze the performance of the different algorithms, we ran the tenfold cross-validation ten times. The results are recorded below.

Data set: For nominal attributes, replace missing values with the most frequently occurring value for that attribute; for numerical attributes, replace missing values with the mean value; and delete any outliers.

Number of runs   J48 (Default)   J48 (Non-Default)   1R
1                86.1172         85.7245             80.7354
2                85.9884         85.6457             80.7354
3                86.1254         85.7234             80.7354
4                86.1148         85.8985             80.7354
5                86.2141         85.7198             80.7354
6                85.9945         86.0127             80.7354
7                86.1358         85.6654             80.7354
8                86.1854         85.8695             80.7354
9                86.1175         85.7236             80.7354
10               86.2117         85.9114             80.7354

Mean             86.12048        85.78945            80.7354
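A single run like the ones above can be reproduced with WEKA's Evaluation class; this is a minimal sketch assuming the WEKA 3 Java API (crossValidateModel stratifies the folds by class), with a hypothetical file name and random seed.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("adult_train.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // Stratified tenfold cross-validation of J48 with default parameters.
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("Correctly classified: " + eval.pctCorrect() + " %");
    }
}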

To compare the different learning schemes, we use the paired Student's t-test:

K = number of samples
X = set X, Y = set Y
Di = Xi - Yi
Δ = Mean(X) - Mean(Y)
t = Δ / [Standard Deviation(Di) / sqrt(K)]
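The t statistic defined above can be computed directly. The following minimal Java sketch (not the report's code) uses the J48 (Default) and 1R columns from the cross-validation table above.

public class PairedTTest {
    // Paired t statistic: Di = Xi - Yi, delta = mean(D), t = delta / (stddev(D) / sqrt(K)).
    static double tStatistic(double[] x, double[] y) {
        int k = x.length;
        double[] d = new double[k];
        double meanD = 0.0;
        for (int i = 0; i < k; i++) {
            d[i] = x[i] - y[i];
            meanD += d[i] / k;
        }
        double var = 0.0;
        for (double di : d) var += (di - meanD) * (di - meanD) / (k - 1);  // sample variance
        return meanD / (Math.sqrt(var) / Math.sqrt(k));
    }

    public static void main(String[] args) {
        double[] j48Default = {86.1172, 85.9884, 86.1254, 86.1148, 86.2141,
                               85.9945, 86.1358, 86.1854, 86.1175, 86.2117};
        double[] oneR = new double[10];
        java.util.Arrays.fill(oneR, 80.7354);
        System.out.println("t = " + tStatistic(j48Default, oneR));  // roughly 218, as in the table below
    }
}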


Differences (D)          J48 (Non-Default) vs 1R   J48 (Default) vs J48 (Non-Default)   J48 (Default) vs 1R
D1                       4.9891                    0.3927                               5.3818
D2                       4.9103                    0.3427                               5.253
D3                       4.988                     0.402                                5.39
D4                       5.1631                    0.2163                               5.3794
D5                       4.9844                    0.4943                               5.4787
D6                       5.2773                    0.0182                               5.2591
D7                       4.93                      0.4704                               5.4004
D8                       5.1341                    0.3159                               5.45
D9                       4.9882                    0.3939                               5.3821
D10                      5.176                     0.3003                               5.4763
Standard Deviation (D)   0.123303233               0.137871188                          0.07812191
T-Test                   129.6                     7.59                                 217.98

At the 95% confidence level, the critical value is 1.83. If the t-test value is greater than this critical value (or smaller than its negative), the null hypothesis of equal performance is rejected. For the above t-tests, the null hypothesis is rejected in every case; in other words, we are 95% confident that, for each pair X vs Y, X performs better than Y.

This analysis shows that the performance of J48 (Non-Default) is better than 1R. The performance of J48 (Default) is slightly better than J48 (Non-Default).


Data set: For nominal attributes, replace missing values with the most frequently occurring value for that specific class; For numerical attributes, replace missing values with the mean value; delete any outliers; and then convert all numerical attributes into nominal attributes.

Group   J48 (Default)   J48 (Non-Default)   1R        ID3
1       83.8828         83.4466             78.1178   78.6538
2       83.6874         83.1975             78.1178   78.4536
3       83.8763         83.2587             78.1178   78.5478
4       83.9187         83.6465             78.1178   78.4368
5       83.7234         83.3725             78.1178   78.5971
6       83.8354         83.5734             78.1178   78.6234
7       83.8497         83.4973             78.1178   78.6547
8       83.7698         83.5078             78.1178   78.4653
9       83.9977         83.4473             78.1178   78.5901
10      83.9004         83.4099             78.1178   78.6177

Mean    83.84416        83.43575            78.118    78.56403

We calculate the differences between each pair of schemes and the standard deviation of those differences:

Pair 1: J48 (Default) vs J48 (Non-Default)
Pair 2: J48 (Default) vs 1R
Pair 3: J48 (Default) vs ID3
Pair 4: J48 (Non-Default) vs 1R
Pair 5: J48 (Non-Default) vs ID3
Pair 6: 1R vs ID3

Differences (D)           Pair 1     Pair 2     Pair 3     Pair 4     Pair 5     Pair 6
D1                        0.4362     5.765      5.229      5.3288     4.7928     0.536
D2                        0.4899     5.5696     5.2338     5.0797     4.7439     0.3358
D3                        0.6176     5.7585     5.3285     5.1409     4.7109     0.43
D4                        0.2722     5.8009     5.4819     5.5287     5.2097     0.319
D5                        0.3509     5.6056     5.1263     5.2547     4.7754     0.4793
D6                        0.262      5.7176     5.212      5.4556     4.95       0.5056
D7                        0.3524     5.7319     5.195      5.3795     4.8426     0.5369
D8                        0.262      5.652      5.3045     5.39       5.0425     0.3475
D9                        0.5504     5.8799     5.4076     5.3295     4.8572     0.4723
D10                       0.4905     5.7826     5.2827     5.2921     4.7922     0.4999
Standard Deviation (Di)   0.127342   0.094053   0.105522   0.135553   0.154421   0.083567
T-Test                    10.14      192.52     158.23     124.05     99.76      -16.87

At the 95% confidence level, the null hypothesis of equal performance is rejected for every pair.


The analysis shows that J48 (Non-Default) performs better than the 1R and ID3 algorithms, and that the performance of J48 (Non-Default) and J48 (Default) is relatively similar. Since J48 (Non-Default) also overcomes the problem of over-fitting, it is the best option for our data.

5. Conclusion

We obtained the raw data and wrote our own code to fix outliers and missing values in the training and testing sets. We then used the training dataset to train the machine to predict whether a person makes over 50K a year, and used the testing dataset to test whether the machine can make this prediction. Based on the experimental results, we compared the accuracy and performance of several data mining algorithms: 1R, ID3, J48 with default parameters, and J48 with non-default parameters.

Our results show that the 1R algorithm was unable to find a single attribute that accurately determines whether an individual's income exceeds 50,000 US dollars, but it was useful for establishing a baseline performance benchmark before progressing to more sophisticated learning schemes. The ID3 algorithm produced wide decision trees with many leaves containing single instances, and its accuracy dropped significantly from the training set to the testing set, down to 73.5336% of the instances. The J48 algorithm with default parameters correctly classified 85.6825% of the test instances, higher than all other algorithms tested; however, it suffered from over-fitting. J48 with non-default parameters gave slightly lower accuracy (85.1511%) than J48 with default parameters, but did not suffer from the over-fitting problem. The results show that, for our data, out of all the algorithms compared, J48 with non-default parameters is the best algorithm when considering a sufficiently good accuracy rate while avoiding over-fitting.


6. References

[1] Witten, I. and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java. San Francisco: Morgan Kaufmann Publishers.
[2] Devore, J. 2000. Probability and Statistics for Engineering and the Sciences. Pacific Grove: Duxbury.
[3] Peng, J. 2002. Introduction to Machine Learning and Data Mining, Course Notes (CS4TF3). McMaster University.
[4] Joshi, K. 1997. Analysis of Data Mining Algorithms. http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm


7. Appendix