Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Induction - Sunil Nair...

Classification of Breast Cancer Dataset using Decision Tree Induction

Sunil Nair, Abel Gebreyesus
Masters of Health Informatics, Dalhousie University
HINF6210 Project Presentation, November 25, 2008

05/03/23 HINF6210/Project presentation/Abel/Sunil


Agenda

• Objective
• Dataset
• Approach
• Classification Methods
• Decision Tree
• Problems
• Future direction



Introduction

• Breast cancer prognosis
• Breast cancer incidence is high
• Improvement in diagnostic methods
• Early diagnosis and treatment
• But recurrence is high
• Good prognosis is important



Objective

• Significance of project
• Previous work done using this dataset
• Most previous work indicated room for improvement in classifier accuracy



Breast Cancer Dataset

• Number of instances: 699
• Number of attributes: 10, plus the class attribute
• Class distribution:
    Benign (2): 458 (65.5%)
    Malignant (4): 241 (34.5%)
• Missing values: 16
• Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg



Attributes

#    Attribute                      Domain
1    Sample code number             id number
2    Clump Thickness                1-10
3    Uniformity of Cell Size        1-10
4    Uniformity of Cell Shape       1-10
5    Marginal Adhesion              1-10
6    Single Epithelial Cell Size    1-10
7    Bare Nuclei                    1-10
8    Bland Chromatin                1-10
9    Normal Nucleoli                1-10
10   Mitoses                        1-10
11   Class                          Benign (2), Malignant (4)

• Attributes indicate cellular characteristics
• Variables are continuous, ordinal with 10 levels



Attributes / class distribution

• Dataset unbalanced



Our Approach

• Data Pre-processing
• Comparison between Classification techniques
• Decision Tree Induction
    • Attribute Selection
    • J48
    • Evaluation



Data Pre-processing

• Filter out the ID column
• Handle Missing Values
• WEKA



Data preprocessing

Two options to manage missing data:

• Replace the missing values with the WEKA ReplaceMissingValues filter (weka.filters.unsupervised.attribute.ReplaceMissingValues)
    • Missing values of nominal and numeric attributes are replaced with the modes and means
• Remove (delete) the tuples with missing values
    • All 16 missing values are in the Bare Nuclei attribute
• Outliers
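The two options above can be sketched in plain Python. This is only an illustration, not the WEKA implementation: the toy rows and the `bare_nuclei` and `cls` field names are hypothetical, and the mean fill mirrors only what the filter does for a numeric attribute.

```python
from statistics import mean

# Hypothetical toy rows standing in for the dataset; None marks a missing value.
rows = [
    {"bare_nuclei": 1, "cls": 2},
    {"bare_nuclei": None, "cls": 2},
    {"bare_nuclei": 10, "cls": 4},
    {"bare_nuclei": 4, "cls": 4},
]


def remove_missing(rows, attr):
    """Option 'remove': drop every tuple whose value for `attr` is missing."""
    return [r for r in rows if r[attr] is not None]


def replace_missing_with_mean(rows, attr):
    """Option 'replace' for a numeric attribute: fill gaps with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]


removed = remove_missing(rows, "bare_nuclei")
replaced = replace_missing_with_mean(rows, "bare_nuclei")
print(len(removed), replaced[1]["bare_nuclei"])
```

Removing keeps only fully observed tuples at the price of a smaller dataset; replacing keeps every tuple but introduces values that were never measured.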



Comparison chart – Handle Missing Value

Performance evaluation

Dataset           # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
Complete          14        8%    94%              87%
MissingRemoved    11        5%    96%              90%
MissingReplaced   14        7%    95%              89%

Confusion matrix (test split)

Class    B      M     Total
B        160    7     167
M        3      63    66
Total    163    70    233

• Total correctly classified instances: 223
• Accuracy rate: 223 / 233 = 95.71%

How many predictions are correct by chance?

• Expected accuracy rate = Kappa statistic
• The Kappa statistic measures the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance.
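As a rough check of the figures on this slide, the accuracy rate and the Kappa statistic can be recomputed from the confusion matrix counts. The sketch below uses the standard Cohen's kappa formula; the function name is illustrative and not taken from WEKA.

```python
def accuracy_and_kappa(matrix):
    """Return (observed accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if the predictions were independent of the true class.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - chance) / (1 - chance)
    return observed, kappa


# Counts taken from the confusion matrix on this slide (classes B and M).
acc, kappa = accuracy_and_kappa([[160, 7], [3, 63]])
print(f"accuracy={acc:.4f} kappa={kappa:.4f}")
```

With these counts the accuracy is about 0.957 and kappa about 0.896, close to the 96% and 90% listed in the MissingRemoved row.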



Data Pre-processing

[Figures: Missing Value Replaced - Mean-Mode; Missing Value Removed - Mean-Mode]



Agenda

• Objective
• Dataset
• Approach
    • Data Pre-Processing
• Classification Methods
• Decision Tree
• Problems
• Future direction



Classification Methods Comparison

Test set performance evaluation

Classifier               # Total Inst.   MAE   Act. Acc. Rate   Exp. Acc. Rate
Naïve Bayes              233             4%    96%              90%
Neural Network           233             10%   91%              79%
Support Vector Machine   233             3%    97%              94%
DT-J48                   233             4%    97%              92%



Classification using Decision Tree

• Decision Tree: WEKA J48 (C4.5)
    • Divide and conquer algorithm
    • Convert tree to classification rules
    • J48 can handle numeric attributes
• Attribute Selection: Information gain



Attributes Selected – most IG

Performance evaluation

Dataset               # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
Attributes Selected   11        4%    97%              92%
MissingRemoved        11        5%    96%              90%
MissingReplaced       14        7%    95%              89%

weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

Rank   Attribute                      Information Gain
1      Uniformity of Cell Size        0.675
2      Uniformity of Cell Shape       0.66
3      Bare Nuclei                    0.564
4      Bland Chromatin                0.543
5      Single Epithelial Cell Size    0.505
6      Normal Nucleoli                0.466
7      Clump Thickness                0.459
8      Marginal Adhesion              0.443
9      Mitoses                        0.198
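To make the ranking criterion concrete, here is a minimal sketch of information gain: the entropy of the class minus the weighted entropy left after splitting on an attribute. It assumes each attribute value is treated as its own nominal category, which may differ from how WEKA handles numeric attributes, and the toy data is hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute separates the classes, the other does not.
labels = [2, 2, 2, 2, 4, 4, 4, 4]
informative = [1, 1, 2, 1, 9, 10, 9, 10]
noisy = [1, 2, 1, 2, 1, 2, 1, 2]

ig_informative = information_gain(informative, labels)
ig_noisy = information_gain(noisy, labels)
ranking = sorted(
    [("informative", ig_informative), ("noisy", ig_noisy)],
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

An attribute whose values split the instances into pure class groups gets the maximum gain, while one that leaves the class mix unchanged gets a gain of zero, which is why it lands at the bottom of a ranking like the one above.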



The DT – IG/Attribute selection Visualization



Decision Tree - Problems

• Concerns
    • Missing values
    • Pruning: preprune or postprune
    • Estimating error rates
• Unbalanced dataset
    • Bias in prediction
    • Overfitting (in test set)
    • Underfitting



Confusion Matrix – Performance Evaluation

• The overall accuracy rate is the number of correct classifications divided by the total number of classifications:
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Error Rate = 1 - Accuracy
• Not a correct measure if the dataset is unbalanced (classes are unequally represented)

                 Predicted Class
Actual Class     B (2)    M (4)
B (2)            TP       FN
M (4)            FP       TN
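The point about unbalanced data can be illustrated by setting the overall accuracy next to per-class rates and a majority-class baseline. This sketch reuses the counts from the earlier confusion matrix and assumes its rows are the actual classes, with Benign as the positive class as in the layout above; the function and key names are illustrative.

```python
def imbalance_report(tp, fn, fp, tn):
    """Accuracy alongside measures that are less sensitive to class imbalance."""
    total = tp + fn + fp + tn
    benign_rate = tp / (tp + fn)      # share of actual B predicted as B
    malignant_rate = tn / (fp + tn)   # share of actual M predicted as M
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": 1 - (tp + tn) / total,
        "benign_rate": benign_rate,
        "malignant_rate": malignant_rate,
        "balanced_accuracy": (benign_rate + malignant_rate) / 2,
        # Accuracy of always predicting the larger class.
        "majority_baseline": max(tp + fn, fp + tn) / total,
    }


report = imbalance_report(tp=160, fn=7, fp=3, tn=63)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Always predicting Benign would already be right for about 72% of these instances, so an accuracy figure is best read against that baseline and the per-class rates.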



Unbalanced dataset problem

Solution: Stratified Sampling Method

• Partitioning of the dataset based on class
• Random sampling process
• Create training and test sets with equal-size classes
• Testing set data independent from training set
• Standard verification technique
• Best error estimate
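A minimal sketch of a stratified split follows: partition the instances by class, shuffle within each class, and draw the test set from every class. Note that this is the common proportion-preserving variant, which may differ from the exact procedure used in the project; the one-third test fraction and the seed are assumptions, while the class counts come from the dataset slide.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly its share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)  # partition the dataset by class
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class distribution from the dataset slide: 458 benign (2), 241 malignant (4).
labels = [2] * 458 + [4] * 241
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the two index lists never overlap, the test data stays independent of the training data.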



Stratified Sampling Method



Performance Evaluation

Test set performance evaluation

Dataset        # Instances   # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
Training set   476           13        2%    99%              97%
Testing set    412           13        3%    96%              92%



Tree Visualization



Unbalanced dataset Problem

Solution: Cost Matrix

• Cost-sensitive classification
• Costs not known
• Complete financial analysis needed, i.e. the cost of:
    • Using the ML tool
    • Gathering training data
    • Using the model
    • Determining the attributes for test
• Cross-validation once all costs are known
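To show what a cost matrix changes, the sketch below weights each cell of a confusion matrix by a cost instead of counting all errors equally. The cost values are purely hypothetical, since the slide states the real costs are not known, and the confusion matrix counts are reused from the earlier slide on the assumption that its rows are the actual classes.

```python
def total_cost(confusion, costs):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Rows: actual B, actual M. Columns: predicted B, predicted M.
confusion = [[160, 7], [3, 63]]

# Uniform costs: every misclassification counts as 1.
uniform = [[0, 1], [1, 0]]
# Hypothetical costs: a missed malignant case is assumed 5 times worse than a false alarm.
weighted = [[0, 1], [5, 0]]

uniform_cost = total_cost(confusion, uniform)
weighted_cost = total_cost(confusion, weighted)
print(uniform_cost, weighted_cost)
```

Under uniform costs the total is just the number of errors; under the weighted matrix the three missed malignant cases outweigh the seven benign false alarms, which is the kind of trade-off a cost-sensitive classifier would optimize.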



Future direction

• The overall accuracy of the classifier needs to be increased
• Cluster-based stratified sampling
    • Partitioning the original dataset using the K-means algorithm
• Multiple classifier model
    • Bagging and boosting techniques
• ROC (Receiver Operating Characteristic)
    • Plotting the TP rate (Y-axis) over the FP rate (X-axis)
    • Advantage: does not regard class distribution or error costs



ROC Curve - Visualization

[Figures: ROC curve for Benign class; ROC curve for Malignant class]

• Area under the curve (AUC)
• The larger the area, the better the model
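As an illustration of how such a curve and its area are obtained, the sketch below sweeps a threshold over classifier scores, records the FP rate and TP rate at each step, and integrates with the trapezoidal rule. The scores and labels are hypothetical, not output from the project's models.

```python
def roc_points(scores, labels, positive):
    """ROC points (FP rate, TP rate) from sweeping a threshold over the scores."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all instances tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for the Malignant (4) class; higher means more likely malignant.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [4, 4, 2, 4, 2, 4, 2, 2]

points = roc_points(scores, labels, positive=4)
area = auc(points)
print(points, area)
```

A model that ranks every malignant case above every benign one reaches an area of 1.0, while random scoring hovers around 0.5.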



Questions / Comments

Thank You!
