Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Induction - Sunil Nair...
Classification of Breast Cancer dataset
using Decision Tree Induction
Sunil Nair, Abel Gebreyesus
Masters of Health Informatics, Dalhousie University
HINF6210 Project Presentation, November 25, 2008
05/03/23 HINF6210/Project presentation/Abel/Sunil
Agenda
• Objective
• Dataset
• Approach
• Classification Methods
• Decision Tree
• Problems
• Future direction
Introduction
Breast cancer prognosis
• Breast cancer incidence is high
• Improvement in diagnostic methods
• Early diagnosis and treatment
• But recurrence is high
• Good prognosis is important
Objective
• Significance of project
• Previous work done using this dataset
• Most previous work indicated room for improvement in the accuracy of the classifier
Breast Cancer Dataset
• # of Instances: 699
• # of Attributes: 10 plus class attribute
• Class distribution:
  - Benign (2): 458 (65.5%)
  - Malignant (4): 241 (34.5%)
• Missing Values: 16
Wisconsin Breast Cancer Database (1991) University of Wisconsin Hospitals, Dr. William H. Wolberg
Attributes
1 Sample code number id number
2 Clump Thickness 1-10
3 Uniformity of Cell Size 1-10
4 Uniformity of Cell Shape 1-10
5 Marginal Adhesion 1-10
6 Single Epithelial Cell Size 1-10
7 Bare Nuclei 1-10
8 Bland Chromatin 1-10
9 Normal Nucleoli 1-10
10 Mitoses 1-10
11 Class Benign (2), Malignant (4)
• Indicate cellular characteristics
• Variables are continuous, ordinal with 10 levels
Attributes / class distribution
• Dataset unbalanced
Our Approach
• Data Pre-processing
• Comparison between Classification techniques
• Decision Tree Induction
  - Attribute Selection
  - J48
  - Evaluation
Data Pre-processing
• Filter out the ID column
• Handle Missing Values
• WEKA
Data preprocessing
Two options to manage missing data in WEKA:
• Replace: the "ReplaceMissingValues" filter (weka.filters.unsupervised.attribute.ReplaceMissingValues) replaces missing nominal values with the mode and missing numeric values with the mean
• Remove: delete the tuples with missing values
• All 16 missing values are in the attribute Bare Nuclei
• Outliers
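The two missing-value options can be illustrated outside WEKA with a minimal Python sketch. The toy rows, the column layout and the helper names `drop_missing` and `replace_missing_with_mean` are assumptions made for illustration; they are not the presenters' data or WEKA's implementation.

```python
from statistics import mean

# Hypothetical toy rows: (clump_thickness, bare_nuclei, class_label).
# None marks a missing value, as in the Bare Nuclei attribute.
rows = [
    (5, 1, 2),
    (8, 10, 4),
    (3, None, 2),
    (7, 9, 4),
    (1, 2, 2),
]

def drop_missing(rows):
    """Remove option: delete every tuple that contains a missing value."""
    return [r for r in rows if None not in r]

def replace_missing_with_mean(rows, col):
    """Replace option for a numeric column: fill gaps with the column mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = mean(observed)
    return [
        tuple(fill if (i == col and v is None) else v for i, v in enumerate(r))
        for r in rows
    ]

removed = drop_missing(rows)
replaced = replace_missing_with_mean(rows, col=1)
print(len(removed), replaced[2])
```

Removing tuples shrinks the dataset, while mean replacement keeps every instance at the price of an imputed value.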
Comparison chart: Handle Missing Values

PERFORMANCE EVALUATION
Dataset          # Rules  MAE  Act. Acc. Rate  Exp. Acc. Rate
Complete         14       8%   94%             87%
MissingRemoved   11       5%   96%             90%
MissingReplaced  14       7%   95%             89%

Confusion Matrix
Class  B    M   Total
B      160  7   167
M      3    63  66
Total  163  70  233

Total correctly classified instances (test split) = 223
Accuracy rate: 223 / 233 = 95.71%
How many predictions are correct by chance?
Expected Accuracy Rate = Kappa statistic: it measures the agreement between the predicted and actual categorization of the data while correcting for agreement that occurs by chance.
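As a worked check, the kappa statistic can be computed directly from the confusion matrix on this slide. The sketch below assumes rows are actual classes and columns are predicted classes (kappa is the same either way); the function name `cohen_kappa` is illustrative. The result is roughly 0.90, which is consistent with the 90% listed under Exp. Acc. Rate for the MissingRemoved dataset.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Counts taken from the confusion matrix on the slide (B, M).
confusion = [[160, 7], [3, 63]]
kappa = cohen_kappa(confusion)
print(f"kappa = {kappa:.3f}")
```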
Data Pre-processing
Missing Value Replaced - Mean-Mode
Missing Value Removed - Mean-Mode
Agenda
• Objective
• Dataset
• Approach
  - Data Pre-Processing
• Classification Methods
• Decision Tree
• Problems
• Future direction
Classification Methods Comparison

Test Set PERFORMANCE EVALUATION
Classifier              # Total Inst.  MAE  Act. Acc. Rate  Exp. Acc. Rate
Naïve Bayes             233            4%   96%             90%
Neural Network          233            10%  91%             79%
Support Vector Machine  233            3%   97%             94%
DT-J48                  233            4%   97%             92%
Classification using Decision Tree
• Decision Tree: WEKA J48 (C4.5)
  - Divide-and-conquer algorithm
  - Converts the tree to classification rules
  - J48 can handle numeric attributes
• Attribute Selection: Information gain
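Information gain, the attribute-selection criterion named above, is the drop in class entropy after splitting on an attribute. The sketch below is a simplified illustration that treats each attribute value as a discrete category; J48 itself splits numeric attributes on thresholds, which is not reproduced here. The toy data and function names are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy sample: three benign (B) and three malignant (M) cases.
labels = ["B", "B", "B", "M", "M", "M"]
cell_size = [1, 1, 2, 8, 9, 9]  # separates the two classes completely
mitoses = [1, 1, 1, 1, 1, 2]    # barely separates them

print(information_gain(cell_size, labels), information_gain(mitoses, labels))
```

An attribute that separates the classes cleanly gets a high gain and is ranked first, which is the ordering the Ranker search reports.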
Attributes Selected: most IG

PERFORMANCE EVALUATION
Dataset              # Rules  MAE  Act. Acc. Rate  Exp. Acc. Rate
Attributes Selected  11       4%   97%             92%
MissingRemoved       11       5%   96%             90%
MissingReplaced      14       7%   95%             89%

weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

Rank  Attribute                    Information Gain
1     Uniformity of Cell Size      0.675
2     Uniformity of Cell Shape     0.66
3     Bare Nuclei                  0.564
4     Bland Chromatin              0.543
5     Single Epithelial Cell Size  0.505
6     Normal Nucleoli              0.466
7     Clump Thickness              0.459
8     Marginal Adhesion            0.443
9     Mitoses                      0.198
The DT - IG/Attribute selection: Visualization
Decision Tree - Problems
Concerns
• Missing values
• Pruning: preprune or postprune
• Estimating error rates
• Unbalanced Dataset
  - Bias in prediction
  - Overfitting (in test set)
  - Underfitting
Confusion Matrix: Performance Evaluation
The overall accuracy rate is the number of correct classifications divided by the total number of classifications:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error Rate = 1 - Accuracy
• Not a correct measure if the dataset is unbalanced (classes are unequally represented)

Actual Class  Predicted B (2)  Predicted M (4)
B (2)         TP               FN
M (4)         FP               TN
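A short sketch of why accuracy alone can mislead on an unbalanced dataset. It reuses the counts from the confusion matrix shown earlier in the deck (benign taken as the positive class, following the layout above) and compares the accuracy with a hypothetical baseline that always predicts the majority class. The per-class rates are an added illustration, not figures from the slides.

```python
def accuracy(tp, fn, fp, tn):
    """Overall accuracy: correct classifications over all classifications."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from the earlier confusion matrix, with benign as the positive class.
tp, fn, fp, tn = 160, 7, 3, 63

acc = accuracy(tp, fn, fp, tn)
error_rate = 1 - acc

# A classifier that always answers "benign" already matches the class share
# reported for the full dataset: 458 of 699 instances.
majority_baseline = 458 / 699

# Per-class rates show how each class fares on its own.
benign_rate = tp / (tp + fn)
malignant_rate = tn / (tn + fp)

print(f"accuracy={acc:.4f} error={error_rate:.4f} baseline={majority_baseline:.3f}")
print(f"benign={benign_rate:.3f} malignant={malignant_rate:.3f}")
```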
Unbalanced dataset problem
Solution: Stratified Sampling Method
• Partitioning of the dataset based on class
• Random sampling process
• Create training and test sets with equal-size classes
• Testing set data independent from training set
• Standard verification technique
• Best error estimate
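One possible reading of the steps above, sketched in plain Python: partition the records by class, shuffle each partition, and draw the same number of records per class for disjoint training and test sets. The function name `balanced_split`, the per-class sizes and the synthetic records are assumptions for illustration and do not reproduce the set sizes reported in the presentation.

```python
import random
from collections import defaultdict

def balanced_split(records, label_of, per_class_train, per_class_test, seed=0):
    """Draw equal-size, non-overlapping train/test samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)  # partition by class

    train, test = [], []
    for label, members in by_class.items():
        if len(members) < per_class_train + per_class_test:
            raise ValueError(f"class {label} has too few records")
        rng.shuffle(members)  # random sampling within the class
        train.extend(members[:per_class_train])
        test.extend(members[per_class_train:per_class_train + per_class_test])
    return train, test

# Synthetic (id, class) records with the class sizes reported for the dataset.
records = [(i, 2) for i in range(458)] + [(i, 4) for i in range(458, 699)]
train, test = balanced_split(
    records, label_of=lambda r: r[1], per_class_train=160, per_class_test=80
)
print(len(train), len(test))
```

Taking the test records from the slice after the training slice keeps the two sets independent.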
Stratified Sampling Method
Performance Evaluation
Test Set PERFORMANCE EVALUATION
Dataset       # Instances  # Rules  MAE  Act. Acc. Rate  Exp. Acc. Rate
Training set  476          13       2%   99%             97%
Testing set   412          13       3%   96%             92%
Tree Visualization
Unbalanced dataset Problem
Solution: Cost Matrix
• Cost-sensitive classification
• Costs not known
• Complete financial analysis needed, i.e. the cost of:
  - Using the ML tool
  - Gathering training data
  - Using the model
  - Determining the attributes for test
• Cross-validation once all costs are known
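To make the idea of a cost matrix concrete, the sketch below weights the earlier confusion matrix by per-error costs and picks the prediction with the lowest expected cost. Since the slide states that the real costs are not known, the values in `costs` are purely hypothetical placeholders, and the function names are illustrative.

```python
def total_cost(confusion, costs):
    """Sum of counts weighted by the cost of each (actual, predicted) cell."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

def min_cost_class(probs, costs):
    """Index of the prediction with the lowest expected cost.

    probs[i] is the estimated probability that the actual class is i.
    """
    n_classes = len(costs[0])
    expected = [
        sum(probs[i] * costs[i][j] for i in range(len(probs)))
        for j in range(n_classes)
    ]
    return expected.index(min(expected))

# Rows: actual B, M. Columns: predicted B, M (counts from the earlier slide).
confusion = [[160, 7], [3, 63]]
# Hypothetical costs: missing a malignant case is 5x worse than a false alarm.
costs = [[0, 1], [5, 0]]

print(total_cost(confusion, costs))
print(min_cost_class([0.7, 0.3], costs))  # index 1 = malignant
```

With these placeholder costs, a case that is 70% likely to be benign is still labeled malignant, because the expected cost of a missed malignancy is higher.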
Future direction
• The overall accuracy of the classifier needs to be increased
• Cluster-based Stratified Sampling
  - Partitioning the original dataset using the K-means algorithm
• Multiple Classifier model
  - Bagging and Boosting techniques
• ROC (Receiver Operating Characteristic)
  - Plotting the TP Rate (Y-axis) over the FP Rate (X-axis)
  - Advantage: does not regard class distribution or error costs
ROC Curve - Visualization
Panels: For Benign class, For Malignant class
• Area under the curve (AUC)
• The larger the area, the better the model
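A minimal sketch of how an ROC curve and its AUC can be derived from classifier scores: sort the instances by score, sweep the threshold from high to low, record (FP rate, TP rate) after each distinct score, and integrate with the trapezoid rule. The scores and labels are made-up toy values, not output from the models in this presentation.

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) pairs; labels use 1 for positive, 0 for negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all instances tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Toy scores for the positive class and the true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(f"AUC = {auc(curve):.3f}")
```

An AUC of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.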
Questions / Comments
Thank You!