
IDS 572 – Data Mining for Business

Fall 2015

Homework #2

Submitted By-

Group 11

Ankit Bhardwaj ([email protected])

Arpit Gulati ([email protected])

Nitish Puri ([email protected])

Problem 1

a) Input the data set. Set the role of INCOME to target. Use a partition node to divide the data into 60% train, 40% test.

Ans: The Salary-class.csv file was read in through a Var. File source node, which was then connected to a Partition node to divide the data into 60% train and 40% test.

b) Create the default C&R decision tree. How many leaves are in the tree?

Ans: There are a total of 7 leaves, which can be seen by browsing the decision tree with the viewer option in the C&R model.

c) What are the major predictors of INCOME?

Ans: The major predictors of INCOME are MSTATUS, C-GAIN, DEGREE and JOBTYPE.

d) Give three rules that describe who is likely to have an INCOME > 50K and who is likely to have an income <= 50K. These rules should be relevant (support at least 5% in the training sample) and strong (either confidence more than 75% for "> 50K" or 90% for "<= 50K"). If there are no three rules that meet these criteria, give the three best rules you can.

Rule 1 for INCOME <= 50K
Support = 6835/13559 = 0.50409 (50.409%)
Confidence = 6835/7170 = 0.95328 (95.328%)
If MSTATUS in [ " Divorced" " Married-spouse-absent" " Never-married" " Separated" " Widowed" ] and C-GAIN <= 7139.500, then INCOME <= 50K

Rule 2 for INCOME > 50K
Support = 942/13559 = 0.06947 (6.947%)
Confidence = 942/1307 = 0.72073 (72.073%)
If MSTATUS in [ " Married-AF-spouse" " Married-civ-spouse" ] and DEGREE in [ " Bachelors" " Doctorate" " Masters" " Prof-school" ] and C-GAIN <= 5095.500 and JOBTYPE in [ " Armed-Forces" " Exec-managerial" " Handlers-cleaners" " Prof-specialty" " Protective-serv" " Sales" " Tech-support" ], then INCOME > 50K

Rule 3 for INCOME <= 50K
Support = 2932/13559 = 0.21624 (21.624%)
Confidence = 2932/4141 = 0.70804 (70.804%)
If MSTATUS in [ " Married-AF-spouse" " Married-civ-spouse" ] and DEGREE in [ " 10th" " 11th" " 12th" " 1st-4th" " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm" " Assoc-voc" " HS-grad" " Preschool" " Some-college" ] and C-GAIN <= 5095.500, then INCOME <= 50K

Note: Rules 2 and 3 fall short of the confidence targets (75% for "> 50K", 90% for "<= 50K"); they are the best rules available from this tree.
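For reference, the support and confidence of a rule like Rule 1 can be re-derived directly from the training partition. Below is a minimal pandas sketch, assuming the training rows sit in a DataFrame whose column names match the stream (MSTATUS, C-GAIN, INCOME); the spelling of the target category is an assumption. With the actual 13,559-record training partition it should reproduce the 50.409% support and 95.328% confidence quoted above.

import pandas as pd

def rule_support_confidence(df, antecedent, consequent):
    # Support = rows covered by the rule and correctly classified / all rows
    # Confidence = rows covered by the rule and correctly classified / rows covered
    covered = df[antecedent(df)]
    correct = covered[consequent(covered)]
    support = len(correct) / len(df)
    confidence = len(correct) / len(covered) if len(covered) else float("nan")
    return support, confidence

# Rule 1: MSTATUS in {Divorced, ...} and C-GAIN <= 7139.5  =>  INCOME <= 50K
single = [" Divorced", " Married-spouse-absent", " Never-married", " Separated", " Widowed"]
train = pd.read_csv("Salary-class.csv")  # hypothetical: in the stream this would be the 60% training partition
sup, conf = rule_support_confidence(
    train,
    antecedent=lambda d: d["MSTATUS"].isin(single) & (d["C-GAIN"] <= 7139.5),
    consequent=lambda d: d["INCOME"] == " <=50K",  # assumed spelling of the target category
)
print(f"support = {sup:.3%}, confidence = {conf:.3%}")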

e) Create two more C&R trees. The first is just like the default tree except you do not "prune tree to avoid overfitting" (on the Basic tab). The other does prune, but you require 500 records in a parent branch and 100 records in a child branch. How do the three trees differ (briefly)? Which seems most accurate on the training data? Which seems most accurate on the test data?

Ans:

The tree that does not prune has a maximum tree depth of 7 and 17 leaves.

The tree that does prune but requires 500 records in a parent branch and 100 records in a child branch has a tree depth of 4 and 7 leaves.

The default tree has a tree depth of 4 and 7 leaves.

The default tree and the 500/100 pruned tree are therefore very similar, with the same depth and number of leaves. The unpruned tree differs: it grows to depth 7 with 17 leaves, and its predictor importance values also differ from those of the pruned trees.

When all three trees are connected to an analysis node, the results show that each tree classifies 84.96% of the training records and 84.13% of the testing records correctly, and the three models are in 100% agreement with each other. Since the default model and the 500/100 pruned model achieve the same accuracy with a depth of only 4, they are the more efficient choice and are less likely to overfit.
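For readers without Modeler, a rough scikit-learn analogue of the three configurations compared above is sketched below; it assumes the predictors have already been numerically encoded and is not a reproduction of the C&RT node itself.

from sklearn.tree import DecisionTreeClassifier

# Unpruned tree: grow until leaves are pure (tends to overfit, like the 17-leaf tree above)
unpruned = DecisionTreeClassifier(criterion="gini")

# Default-style pruned tree via cost-complexity pruning (closest sklearn analogue of C&RT pruning);
# the alpha value is an illustrative choice
pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001)

# Pruned tree with the 500-record parent / 100-record child constraints from part (e)
constrained = DecisionTreeClassifier(
    criterion="gini",
    min_samples_split=500,  # a node needs at least 500 records before it may be split (parent branch)
    min_samples_leaf=100,   # every child node must keep at least 100 records (child branch)
    ccp_alpha=0.001,
)

# Each model would then be fit on the 60% partition and scored on the 40% hold-out, e.g.
# model.fit(X_train, y_train); model.score(X_test, y_test)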

Problem 2

a) Input the zoo1.csv to train the decision tree classifier (C5.0) and come up with a decision tree to classify a new record into one of the categories (pick the "favor accuracy" option in the C5.0 node). Make sure you examine the data first and think about what field(s) to use for the classification scheme.

Ans: After examining the data it is evident that the attribute Animal has a distinct value for every record, so it should be excluded when building the tree.

b) Rename the generated node as "fulltree" and fully unfold it while browsing it. Use this to draw the full tree - how many leaves does it have? What is the classification accuracy on the training dataset? You can check this through an analysis node or through a table.

Ans:

Total number of leaves in fulltree is 7.

Classification accuracy on Training Data is 100%.

c) Next, reset the option in C5.0 to choose "ruleset" as opposed to "decision tree" and generate a new node - rename this "fullrules." Once again fully unfold the ruleset and write out the rules for each type.

Ans: Selected "Ruleset" as opposed to "Decision tree".

Rules for each type:

Rules for amphibian - contains 1 rule(s)
Rule 1 for amphibian: if feathers = FALSE and milk = FALSE and aquatic = TRUE and breathes = TRUE then amphibian

Rules for bird - contains 1 rule(s)
Rule 1 for bird: if feathers = TRUE then bird

Rules for fish - contains 1 rule(s)
Rule 1 for fish: if backbone = TRUE and breathes = FALSE then fish

Rules for insect - contains 1 rule(s)
Rule 1 for insect: if feathers = FALSE and milk = FALSE and airborne = TRUE then insect

Rules for invertebrate - contains 1 rule(s)
Rule 1 for invertebrate: if airborne = FALSE and backbone = FALSE then invertebrate

Rules for mammal - contains 1 rule(s)
Rule 1 for mammal: if milk = TRUE then mammal

Rules for reptile - contains 1 rule(s)
Rule 1 for reptile: if feathers = FALSE and milk = FALSE and aquatic = FALSE and backbone = TRUE then reptile

Default: mammal
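For clarity, the ruleset can also be transcribed as a small classification function. The sketch below is a Python rendering of the rules above, with conflict handling simplified to first-match (C5.0 itself weighs overlapping rules by confidence) and attribute values assumed to be plain booleans.

def classify_zoo(rec):
    # Transcription of the fullrules ruleset; rec is a dict of boolean attributes
    if rec["feathers"]:
        return "bird"
    if rec["milk"]:
        return "mammal"
    if not rec["feathers"] and not rec["milk"] and rec["aquatic"] and rec["breathes"]:
        return "amphibian"
    if rec["backbone"] and not rec["breathes"]:
        return "fish"
    if not rec["feathers"] and not rec["milk"] and rec["airborne"]:
        return "insect"
    if not rec["airborne"] and not rec["backbone"]:
        return "invertebrate"
    if not rec["feathers"] and not rec["milk"] and not rec["aquatic"] and rec["backbone"]:
        return "reptile"
    return "mammal"  # default class from the ruleset

# Example: a frog-like record is classified as amphibian
print(classify_zoo({"feathers": False, "milk": False, "aquatic": True,
                    "breathes": True, "backbone": True, "airborne": False}))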

d) Compare your results from parts (b) and (c) and comment on them.

Ans: Both models achieve 100% accuracy on the training dataset. One difference visible in the analysis is that the two models assign different predictor importance values to the attributes.

Predictor importance for fulltree: importance is concentrated on three attributes - milk, backbone and feathers.

Predictor importance for fullrules: importance is spread across milk, feathers, backbone, airborne, breathes and aquatic.

e) Next, use the "fulltree" node and an analysis node to classify the records in the testing dataset, zoo2.csv (to do this just disconnect the zoo1.csv data source node and instead connect a new data source node at the beginning of the data stream with zoo2.csv as the variable). Compare the classification accuracy here with what you saw in part (b) and comment. What are the misclassified animals?

Ans:

Using the analysis node we can see that the tree predicts 90% of the records correctly and 10% incorrectly, i.e. 3 records are misclassified. In part (b), by contrast, the accuracy on the training data was 100%.

The three misclassified animals are:
1) Flea - classified as invertebrate, should be insect.
2) Seasnake - classified as fish, should be reptile.
3) Termite - classified as invertebrate, should be insect.

f) Suppose you wished to use a single level tree (i.e., 1R - just one attribute to classify) and you use the full data set (zoo.csv) to determine this. Which of the three attributes "milk", "feathers" and "aquatic" yields the best results? Why do you think the results are so skewed in each case?

Ans:

Case 1 - taking "Milk" as the only attribute:
Accuracy is 60.4% when taking only "Milk" as the attribute for predicting "Type", i.e. 61 out of 101 records are predicted correctly.

Case 2 - taking "Feathers" as the only attribute:
Accuracy is 60.4% when taking only "Feathers" as the attribute for predicting "Type", i.e. 61 out of 101 records are predicted correctly.

Case 3 - taking "Aquatic" as the only attribute:
Accuracy is 47.52% when taking only "Aquatic" as the attribute for predicting "Type", i.e. 48 out of 101 records are predicted correctly.

Clearly, the attributes "Milk" and "Feathers" yield the best result (60.4%) compared with "Aquatic" (47.52%). The results are skewed in each case because a single attribute does not carry enough information to separate all the animal types correctly. In all three cases the model predicts mammal for most observations: mammal is the majority type when milk is true, when feathers are false, and when aquatic is false, so the class distribution pushes each one-attribute tree heavily toward mammal.
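These 1R accuracies can be re-checked with a few lines of pandas: a one-level tree simply predicts the majority type within each value of the chosen attribute. The sketch below assumes zoo.csv has a type column and milk/feathers/aquatic columns (the column names are assumptions).

import pandas as pd

def one_r_accuracy(df, attribute, target="type"):
    # 1R: predict the majority class observed for each value of the attribute
    majority = df.groupby(attribute)[target].agg(lambda s: s.mode().iloc[0])
    predictions = df[attribute].map(majority)
    return (predictions == df[target]).mean()

zoo = pd.read_csv("zoo.csv")
for attr in ["milk", "feathers", "aquatic"]:
    print(attr, round(one_r_accuracy(zoo, attr), 4))
# With the counts reported above, milk and feathers should give about 0.604 and aquatic about 0.475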

Problem 3

(a) What is the gini impurity value for this data set?

Solution: Here, buys-computer is the target variable.

Of all 14 records, we have 7 Yes and 7 No.

p(yes) = 7/14 = 0.5
p(no) = 7/14 = 0.5

The Gini measure of a node is one minus the sum of the squares of the proportions of the classes.

Gini impurity value for the data set: 1 - (0.5² + 0.5²) = 0.5
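The value can be verified with a tiny helper (plain Python, no libraries needed):

def gini(counts):
    # Gini impurity: one minus the sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([7, 7]))  # 7 Yes and 7 No in the full data set -> 0.5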

(b) Suppose a decision tree algorithm did the initial split on income. For each of the children describe the number of records, number of yeses, number of nos, and gini impurity value.

Ans:

Records description from the decision tree drawn:

3 records have high income, and none of them buys a computer.
Out of 6 records with medium income, 4 buy a computer and 2 do not.
Out of 5 records with low income, 3 buy a computer and 2 do not.

[Tree after the split on Income: High -> 0 Yes, 3 No (pure); Medium -> 4 Yes, 2 No; Low -> 3 Yes, 2 No]

Gini index of the leaf nodes:

Gini(High income node) = 0 (pure)

Gini(Medium income node) = 1 - ((2/6)² + (4/6)²) = 4/9

Gini(Low income node) = 1 - ((2/5)² + (3/5)²) = 12/25

Gini(Income split) = weighted average of the children's Gini values
= (3/14)*0 + (6/14)*(4/9) + (5/14)*(12/25)
= 4/21 + 6/35 = 0.3619

(c) What is the gain in gini impurity obtained by splitting on income as in part (b)?

Ans: Gain = Gini index of the entire data set - Gini index after splitting on 'Income'
= 0.5 (calculated in part (a)) - 0.3619 (calculated in part (b))
= 0.1381
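Both numbers can be reproduced programmatically. The sketch below repeats the Gini helper from part (a) and applies the class counts from part (b):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Children of the split on income, as (Yes, No) counts from part (b)
children = {"high": (0, 3), "medium": (4, 2), "low": (3, 2)}
n = sum(sum(c) for c in children.values())  # 14 records in total

weighted = sum(sum(c) / n * gini(c) for c in children.values())
gain = gini((7, 7)) - weighted
print(round(weighted, 4), round(gain, 4))  # 0.3619  0.1381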

(d) Continue building the tree. If you do not split on a node, explain why. Draw the final tree that you obtained.

Ans: We will grow the decision tree further, keeping 'Income' as the initial split variable, and try different combinations of 'Student' and 'Credit Rating' to see what we get on the target variable.

Approach to splitting a node:

1. Try candidate splitting variables one at a time (trial and error).
2. The moment we get a pure subset, we do not split that node further.

[Starting point, from part (b): Income split into High (0 Yes, 3 No, pure), Medium (4 Yes, 2 No) and Low (3 Yes, 2 No)]

Step 1: Choosing 'Credit Rating' as the next split variable for the MEDIUM income group, we look at the values of the target variable for the combinations <Medium income, Fair credit rating> and <Medium income, Excellent credit rating>.

[Tree fragment: Medium income split by Credit Rating into Fair and Excellent; the resulting subsets are not all pure]

Conclusion: We rule out 'Credit Rating' as the next splitting variable for the MEDIUM income group, since we do not get pure subsets.

Step 2: Choosing 'Student' as the next split variable for the MEDIUM income group, we look at the values of the target variable for the combinations <Medium income, Student = Yes> and <Medium income, Student = No>.

[Tree fragment: Medium income split by Student into Yes and No; both child nodes are pure]

[Intermediate tree diagrams: the Credit Rating split of the Medium income group leaves impure children, whereas the Student split of the Medium income group yields pure children (Student = Yes: 4 Yes, 0 No; Student = No: 0 Yes, 2 No)]

Step 3: Choosing 'Credit Rating' as the next split variable for the LOW income group, we check the values of the target variable for the combinations <Low income, Fair credit rating> and <Low income, Excellent credit rating>.

FINAL DECISION TREE

[Final tree: Income at the root; the High branch is already pure, the Medium branch splits on Student (both children pure), and the Low branch splits on Credit Rating (both children pure)]

Based on the decision tree we obtained, we can infer the following:

1) If income is high, the customer will not buy a computer.
2) If income is medium and the customer is a student, the customer will buy a computer.
3) If income is medium and the customer is not a student, the customer will not buy a computer.
4) If income is low and the credit rating is fair, the customer will not buy a computer.
5) If income is low and the credit rating is excellent, the customer will buy a computer.

[Final tree with record counts: Income = High -> 0 Yes, 3 No; Income = Medium (4 Yes, 2 No) -> Student = Yes: 4 Yes, 0 No; Student = No: 0 Yes, 2 No; Income = Low (3 Yes, 2 No) -> Credit Rating = Fair: 0 Yes, 2 No; Credit Rating = Excellent: 3 Yes, 0 No]
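The final tree translates directly into a short classification function. The sketch below is a Python transcription of the five inferences above; attribute names and value spellings are illustrative.

def buys_computer(income, student, credit_rating):
    # Split on income first, then on student (medium) or credit rating (low)
    if income == "high":
        return "no"
    if income == "medium":
        return "yes" if student else "no"
    # income == "low": decided by credit rating
    return "yes" if credit_rating == "excellent" else "no"

print(buys_computer("medium", student=True, credit_rating="fair"))  # -> yes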

Problem 4

a) Describe the purpose of separating the data into training and testing data.

The purpose of separating the data into training and testing data is to estimate the accuracy of predictions. We measure the performance of a model in terms of its error rate: the percentage of incorrectly classified instances. To measure performance we divide the data set into:

Training data: the training set (seen data) used to build the model (determine its parameters).

Test data: the test set (unseen data) used to measure the model's performance (holding the parameters constant). Test data stands in for the data we will get in the future: we do not know its dependent-variable (Y) value and we predict it using our model.
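As an illustration of this seen/unseen evaluation, the minimal scikit-learn sketch below uses stand-in data; in the homework streams the 60/40 partition node plays the same role.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (in Problem 1 this would be the encoded Salary-class.csv fields)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 60% training / 40% testing, mirroring the partition node
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)  # parameters estimated on seen data only
error_rate = 1 - model.score(X_test, y_test)            # proportion misclassified on unseen data
print(f"test error rate: {error_rate:.2%}")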

Sometimes it is useful to temporarily "split" a dataset in order to compare analytic output across different subsets of data. This can be useful when you want to compare frequency distributions or descriptive statistics with respect to the categories of some variable (e.g., Gender), or want to filter the results so that you can pull out only the information relevant to the group of interest.

Outcomes of a predictive model on the divided data set:

Success: an instance of the data set is predicted correctly.
Error: an instance of the data set is predicted incorrectly.
Error rate: the proportion of errors made over the whole set of instances.

b) Which problem do we try to address when using pruning? Please explain.

Overfitting is the problem addressed by pruning. Overfitting happens when we include branches in the decision tree that fit the training data too specifically. As a result, the error on the training data set is reduced at the cost of an increased error on the test data set. To avoid overfitting we apply pruning. Pruning helps us to:

Remove branches of the decision tree that have low statistical significance.

Avoid overfitting while building a decision tree for a particular data set.

Things to consider while pruning:

Never remove branches (attributes) that are predictive in nature.

The aim of pruning is to discard parts of a classification model that describe random variation in the training sample rather than true features of the underlying domain.

Two pruning strategies:

• Pre-pruning: done during the construction of the tree. Some criterion is used to stop expanding nodes early (allowing a certain level of "impurity" in each node).

• Post-pruning: done after the construction of the tree. Branches are removed from the bottom up, up to a certain limit, using criteria similar to those of pre-pruning.
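The two strategies map onto familiar tree-growing parameters. The scikit-learn sketch below contrasts them; the thresholds and the pruning alpha are illustrative choices, not values taken from the homework.

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop expanding nodes early via thresholds applied during tree construction
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # cap on tree depth
    min_samples_leaf=50,         # every leaf must keep at least 50 records
    min_impurity_decrease=0.01,  # allow some residual "impurity" in a node
)

# Post-pruning: grow the full tree, then remove low-value branches bottom-up
# (cost-complexity pruning; a larger ccp_alpha removes more branches)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.005)

# Both would be fit on the training partition and compared on the test partition, e.g.
# model.fit(X_train, y_train); model.score(X_test, y_test)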