Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Induction - Sunil Nair...

Classification of Breast Cancer dataset using Decision Tree Induction Sunil Nair Abel Gebreyesus Masters of Health Informatics Dalhousie University HINF6210 Project Presentation – November 25, 2008

Page 1: Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Induction - Sunil Nair Health Informatics Dalhousie University

Classification of Breast Cancer dataset using Decision Tree Induction

Sunil Nair, Abel Gebreyesus
Masters of Health Informatics, Dalhousie University
HINF6210 Project Presentation – November 25, 2008

05/03/23 HINF6210/Project presentation/Abel/Sunil

Agenda

• Objective
• Dataset
• Approach
• Classification Methods
• Decision Tree
• Problems
• Future direction

Introduction

Breast Cancer prognosis
• Breast cancer incidence is high
• Improvement in diagnostic methods
• Early diagnosis and treatment
• But recurrence is high
• Good prognosis is important

Objective

• Significance of project
• Previous work done using this dataset
• Most previous work indicated room for improvement in the accuracy of the classifier

Breast Cancer Dataset

• # of Instances: 699
• # of Attributes: 10 plus class attribute
• Class distribution:
    Benign (2): 458 (65.5%)
    Malignant (4): 241 (34.5%)
• Missing Values: 16

Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg

Attributes

#    Attribute                     Domain
1    Sample code number            id number
2    Clump Thickness               1-10
3    Uniformity of Cell Size       1-10
4    Uniformity of Cell Shape      1-10
5    Marginal Adhesion             1-10
6    Single Epithelial Cell Size   1-10
7    Bare Nuclei                   1-10
8    Bland Chromatin               1-10
9    Normal Nucleoli               1-10
10   Mitoses                       1-10
11   Class                         Benign (2), Malignant (4)

• Indicate cellular characteristics
• Variables are continuous, ordinal with 10 levels

Attributes / class distribution

• Dataset unbalanced

Our Approach

• Data Pre-processing
• Comparison between Classification techniques
• Decision Tree Induction
    • Attribute Selection
    • J48
    • Evaluation

Data Pre-processing

• Filter out the ID column
• Handle Missing Values

WEKA

Data preprocessing

Two options to manage missing data:

• WEKA "ReplaceMissingValues" filter (weka.filters.unsupervised.attribute.ReplaceMissingValues): missing nominal and numeric attribute values are replaced with the mode and the mean, respectively
• Remove (delete) the tuples with missing values

All 16 missing values occur in the attribute Bare Nuclei.

• Outliers
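The two options for handling missing data can be sketched in a few lines of Python. This is only an illustration of the idea, not the WEKA filter itself: the rows are made-up toy values, the function names are hypothetical, and only the numeric (mean) case is shown.

```python
from statistics import mean

# Toy rows (illustrative only, not taken from the dataset):
# (clump_thickness, bare_nuclei); None marks a missing value.
rows = [(5, 1), (3, None), (8, 10), (1, 2), (4, None)]

def remove_missing(rows):
    """Option 2: drop every tuple that has at least one missing value."""
    return [r for r in rows if None not in r]

def replace_missing_with_mean(rows):
    """Option 1, numeric case: fill each gap with the mean of its column."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(r, means)) for r in rows]

removed = remove_missing(rows)
replaced = replace_missing_with_mean(rows)
print(len(removed), "rows kept after removal")
print("second row after replacement:", replaced[1])
```

Removal shrinks the dataset, while replacement keeps every tuple at the price of introducing estimated values, which is the trade-off the comparison chart examines.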

Comparison chart – Handle Missing Value

PERFORMANCE EVALUATION

DATASET           # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Complete          14        8%    94%              87%
MissingRemoved    11        5%    96%              90%
MissingReplaced   14        7%    95%              89%

Confusion Matrix

Class   B     M    Total
B       160   7    167
M       3     63   66
Total   163   70   233

Total Correctly Classified Instances (test split) = 223
Accuracy Rate: 223 / 233 = 95.71%

How many predictions by chance?

Expected Accuracy Rate = Kappa statistic. It measures the agreement between the predicted and actual categorization of the data while correcting for predictions that occur by chance.
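As a rough check, the kappa statistic can be recomputed from the confusion matrix on this slide. The sketch below assumes the standard Cohen's kappa formula, (observed - chance) / (1 - chance), and assumes that rows are actual classes and columns are predicted classes; the function name is hypothetical.

```python
# Confusion matrix from the slide, assumed layout: matrix[actual][predicted].
matrix = {"B": {"B": 160, "M": 7}, "M": {"B": 3, "M": 63}}

def cohen_kappa(matrix):
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[c][c] for c in classes) / total
    # Chance agreement: product of the row and column totals for each class.
    expected = sum(
        sum(matrix[c].values()) * sum(matrix[r][c] for r in classes)
        for c in classes
    ) / total ** 2
    return observed, expected, (observed - expected) / (1 - expected)

observed, expected, kappa = cohen_kappa(matrix)
print(f"accuracy={observed:.4f} chance={expected:.4f} kappa={kappa:.4f}")
```

With these counts kappa comes out at roughly 0.90, in line with the 90% expected accuracy rate listed for the MissingRemoved dataset.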

Data Pre-processing

Missing Value Replaced - Mean-Mode
Missing Value Removed - Mean-Mode

Agenda

• Objective
• Dataset
• Approach
• Data Pre-Processing
• Classification Methods
• Decision Tree
• Problems
• Future direction

Classification Methods Comparison

Test Set PERFORMANCE EVALUATION

CLASSIFIER               # Total Inst.   MAE   Act. Acc. Rate   Exp. Acc. Rate
Naïve Bayes              233             4%    96%              90%
Neural Network           233             10%   91%              79%
Support Vector Machine   233             3%    97%              94%
DT-J48                   233             4%    97%              92%

Classification using Decision Tree

• Decision Tree – WEKA J48 (C4.5)
• Divide and conquer algorithm
• Convert tree to classification rules
• J48 can handle numeric attributes
• Attribute Selection - Information gain
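Information gain, the attribute selection measure named above, is the entropy of the class minus the weighted entropy that remains after splitting on an attribute. A minimal sketch follows, assuming each attribute value is treated as its own category (WEKA's evaluator handles numeric attributes differently); the data and function names are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data (illustrative only): one attribute value and one class per instance.
cell_size = [1, 1, 1, 8, 8, 10]
labels = ["B", "B", "B", "M", "M", "M"]
gain = information_gain(cell_size, labels)
print("information gain:", gain)
```

An attribute that separates the classes perfectly, as in this toy case, gains the full class entropy; an attribute with the same value for every instance gains nothing.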

Attributes Selected – most IG

PERFORMANCE EVALUATION

DATASET               # RULES   MAE   Act. Acc. Rate   Exp. Acc. Rate
Attributes Selected   11        4%    97%              92%
MissingRemoved        11        5%    96%              90%
MissingReplaced       14        7%    95%              89%

weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

Rank   Attribute                     Information Gain
1      Uniformity of Cell Size       0.675
2      Uniformity of Cell Shape      0.66
3      Bare Nuclei                   0.564
4      Bland Chromatin               0.543
5      Single Epithelial Cell Size   0.505
6      Normal Nucleoli               0.466
7      Clump Thickness               0.459
8      Marginal Adhesion             0.443
9      Mitoses                       0.198

The DT – IG/Attribute selection Visualization

Decision Tree - Problems

Concerns
• Missing values
• Pruning – Preprune or postprune
• Estimating error rates
• Unbalanced Dataset
• Bias in prediction
• Overfitting – in test set
• Underfitting

Confusion Matrix – Performance Evaluation

• The overall accuracy rate is the number of correct classifications divided by the total number of classifications: (TP + TN) / (TP + TN + FP + FN)
• Error Rate = 1 - Accuracy
• Accuracy is not a reliable measure for an unbalanced dataset, where classes are unequally represented

                     Predicted Class
                     B (2)   M (4)
Act. Class   B (2)   TP      FN
             M (4)   FP      TN
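The formulas above can be worked through with the counts from the test-split confusion matrix shown earlier in the deck. The sketch assumes that matrix uses the same layout as the one on this slide (rows are actual classes, benign is the positive class); the per-class recall values are an added illustration of why accuracy alone can mislead on unbalanced data.

```python
# Cell counts from the earlier test-split confusion matrix, assumed layout:
# rows = actual class, columns = predicted class, benign (B) = positive.
TP, FN = 160, 7   # actual benign: predicted benign / predicted malignant
FP, TN = 3, 63    # actual malignant: predicted benign / predicted malignant

accuracy = (TP + TN) / (TP + TN + FP + FN)
error_rate = 1 - accuracy

# Per-class rates are not dominated by the majority class the way accuracy is.
benign_recall = TP / (TP + FN)
malignant_recall = TN / (TN + FP)

print(f"accuracy={accuracy:.4f} error rate={error_rate:.4f}")
print(f"benign recall={benign_recall:.4f} malignant recall={malignant_recall:.4f}")
```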

Unbalanced dataset problem

Solution: Stratified Sampling Method

• Partitioning of dataset based on class
• Random sampling process
• Create training and test sets with equal-size classes
• Testing set data independent from training set
• Standard verification technique
• Best error estimate
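One possible reading of the steps above is sketched below: partition the instances by class, draw the same number of training instances at random from each class, and send everything else to the test set so the two sets stay independent. The function name, the `per_class` parameter, and the toy data are assumptions for illustration; the slides do not give the exact procedure used.

```python
import random
from collections import defaultdict

def stratified_split(instances, label_of, per_class, seed=0):
    """Partition by class, then draw `per_class` training instances per class.

    Instances that are not drawn go to the test set, so the sets never overlap.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[label_of(inst)].append(inst)
    train, test = [], []
    for members in by_class.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        train.extend(shuffled[:per_class])
        test.extend(shuffled[per_class:])
    return train, test

# Toy instances (illustrative only): (id, class) pairs with a 2:1 imbalance.
data = [(i, "B") for i in range(20)] + [(i, "M") for i in range(20, 30)]
train, test = stratified_split(data, label_of=lambda inst: inst[1], per_class=6)
print(len(train), "training instances,", len(test), "test instances")
```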

Stratified Sampling Method

Performance Evaluation

Test Set PERFORMANCE EVALUATION

Dataset        # Instances   # Rules   MAE   Act. Acc. Rate   Exp. Acc. Rate
Training set   476           13        2%    99%              97%
Testing set    412           13        3%    96%              92%

Tree Visualization

Unbalanced dataset Problem

Solution: Cost Matrix

• Cost sensitive classification
• Costs not known
• Complete financial analysis needed, i.e. the cost of:
    • Using the ML tool
    • Gathering training data
    • Using the model
    • Determining the attributes for test
• Cross validation once all costs are known
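To make the idea of cost sensitive classification concrete, the sketch below picks the prediction with the lowest expected cost instead of the most probable class. Since the slide states that the real costs are not known, every number in the cost matrix is a placeholder, and the function name is hypothetical.

```python
# Hypothetical cost matrix, cost[actual][predicted]. The real costs are not
# known, so these values are placeholders for illustration only.
cost = {
    "B": {"B": 0.0, "M": 1.0},   # benign case flagged as malignant
    "M": {"B": 5.0, "M": 0.0},   # malignant case missed
}

def min_cost_class(class_probs, cost):
    """Return the prediction with the lowest expected cost under `class_probs`."""
    def expected_cost(predicted):
        return sum(p * cost[actual][predicted] for actual, p in class_probs.items())
    return min(cost, key=expected_cost)

# A case that is 70% likely benign is still predicted malignant here,
# because a missed malignant case is assumed to cost five times more.
choice = min_cost_class({"B": 0.7, "M": 0.3}, cost)
print("prediction:", choice)
```

Under this scheme the cost matrix, rather than the class distribution, determines where the decision boundary falls, which is why the costs have to be established first.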

Future direction

• The overall accuracy of the classifier needs to be increased
• Cluster-based stratified sampling
    • Partitioning the original dataset using the K-means algorithm
• Multiple classifier model
    • Bagging and boosting techniques
• ROC (Receiver Operating Characteristic)
    • Plotting the TP rate (Y-axis) over the FP rate (X-axis)
    • Advantage: does not regard class distribution or error costs

ROC Curve - Visualization

For Benign class
For Malignant class

• Area under the curve (AUC)
• The larger the area, the better the model
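A ROC curve and its area can be computed by sweeping a decision threshold over the classifier's scores. The sketch below is a minimal version that assumes no tied scores and uses the trapezoidal rule for the area; the scores, labels, and function names are illustrative only.

```python
def roc_points(scores, labels, positive):
    """Sweep the threshold over the scores; return (FP rate, TP rate) points.

    Assumes no tied scores; ties would need to be grouped into one step.
    """
    pos = sum(1 for label in labels if label == positive)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy scores (illustrative only): predicted probability of the malignant class.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = ["M", "M", "B", "M", "B", "B"]
area = auc(roc_points(scores, labels, positive="M"))
print("AUC:", area)
```

An area of 1.0 means every positive instance is ranked above every negative one, while 0.5 corresponds to random ranking.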

Questions / Comments

Thank You!