Prognostic Prediction of Breast Cancer Using C5 Sakina Begum May 1, 2001.

Breast Cancer Diagnosis

Second leading cause of cancer death in women.

Fine Needle Aspirate (FNA):
    extract cells and fluid from the mass using a thin needle
    examine the cells under a microscope

Early detection of breast cancer depends on accurate diagnosis.

Ability to correctly diagnose cancer using FNA and visual interpretation varies from 65% to 98%.

University of Wisconsin hospitals use Xcyt.

Xcyt uses information about cell characteristics from FNA, together with the multisurface method, to determine whether a tumor is benign or malignant. I wanted to do the same thing using C5.

Data Preparation

File has 569 patients, with 32 attributes for each patient:
    ID
    diagnosis
    10 average cell characteristics
    10 standard deviations, one for each cell characteristic
    10 “worst” cell characteristics

Two files:
    All 32 attributes
    12 attributes (including 10 average cell characteristics)
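The attribute layout above can be sketched as column index ranges. This is a minimal sketch that assumes the columns appear in the order listed (ID, diagnosis, then the three groups of ten); the names `column_groups`, `std_dev`, and `reduced` are illustrative and do not come from the slides.

```python
# Assumed column order: ID, diagnosis, 10 averages,
# 10 standard deviations, 10 "worst" values.
N_FEATURES = 10

def column_groups(n_features=N_FEATURES):
    """Map each attribute group to its 0-based column indices."""
    groups = {"id": [0], "diagnosis": [1]}
    start = 2
    for name in ("average", "std_dev", "worst"):
        groups[name] = list(range(start, start + n_features))
        start += n_features
    return groups

groups = column_groups()

# Full file: every attribute.
total = sum(len(cols) for cols in groups.values())

# Reduced file: ID, diagnosis, and the 10 averages.
reduced = groups["id"] + groups["diagnosis"] + groups["average"]

print(total, len(reduced))
```

Under that ordering, the reduced file is simply the first 12 columns of the full file, which is why a fixed field list is enough to build it.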

sed and awk are programmable UNIX utilities that perform actions on lines that match a particular condition.

awk -f awkfile -F, data1 > data2

where awkfile contains:

{print($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)}

842302 M 17.99 10.38 122.8 1001 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871

sed 's/ /,/g' data2 > cancer.data

842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
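The two-step awk and sed conversion can also be expressed as a single pass that keeps the first 12 comma-separated fields of each record. The following is a hedged Python sketch of that idea, not the method used in the slides; `keep_first_fields` is a hypothetical helper, and the trailing `x1,x2` fields are placeholders standing in for the remaining attributes of the row.

```python
def keep_first_fields(line, n=12, sep=","):
    """Keep the first n separator-delimited fields of one record."""
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[:n])

# The first 12 fields are the sample row shown on the slide.
# "x1,x2" are placeholders, not real data values.
row = ("842302,M,17.99,10.38,122.8,1001,0.1184,0.2776,"
       "0.3001,0.1471,0.2419,0.07871,x1,x2")

print(keep_first_fields(row))
```

Because the fields are never re-joined with spaces, this variant does not need a second substitution step to restore the commas.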

Data Mining

C5 extracts informative patterns from data.

-f identifies the application name (called a filestem).

-r causes rules to be derived from trees.

-S x constructs a classifier from a random sample containing x% of the cases in the data file. The classifier is then evaluated on a non-overlapping set of test cases.
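To show how the three options combine, here is a small sketch that assembles a command line from them. Several things are assumptions rather than facts from the slides: the program name `c5.0`, the filestem `cancer` (inferred from the file name cancer.data), the 70% sample, and the range check on the percentage. The helper `c5_command` is hypothetical, and the sketch only builds the argument list without running anything.

```python
def c5_command(filestem, percent, rules=True, program="c5.0"):
    """Assemble a C5 command line from the options described above."""
    # Assumption: a useful sample leaves some cases for testing.
    if not 0 < percent < 100:
        raise ValueError("sampling percentage must be between 0 and 100")
    args = [program, "-f", filestem]
    if rules:
        args.append("-r")            # derive rules from the tree
    args += ["-S", str(percent)]     # train on percent% of the cases
    return args

cmd = c5_command("cancer", 70)
print(" ".join(cmd))
```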

Successive runs of C5 with sampling will usually produce different results.

I used sampling sizes of 10%, 30%, 50%, 70%, and 90%.

I ran C5 three times at each sampling size.

By default, the random sample changes every time a classifier is constructed.
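The experiment design (five sampling sizes, three runs each, a fresh random sample per run) can be illustrated with a short simulation. This sketch only mimics the splitting step and does not call C5; `sample_split` is a hypothetical helper, and rounding the training-set size to the nearest case is an assumption about how the percentage is applied.

```python
import random

def sample_split(n_cases, percent, seed=None):
    """Split case indices into a random training sample and a disjoint test set."""
    rng = random.Random(seed)
    n_train = round(n_cases * percent / 100)   # assumed rounding rule
    train = set(rng.sample(range(n_cases), n_train))
    test = set(range(n_cases)) - train
    return train, test

for percent in (10, 30, 50, 70, 90):
    for run in (1, 2, 3):
        # No seed is passed, so each run draws a different sample.
        train, test = sample_split(569, percent)
        print(percent, run, len(train), len(test))
```

Since each run trains and tests on different cases, the error rates of the three runs at one sampling size are expected to differ, which is why the results are averaged.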

[Figure: 32 attributes. Percent of errors in tree (y-axis) versus percent of data used for training (x-axis), for run 1, run 2, and run 3.]

[Figure: 12 attributes. Percent of errors in tree (y-axis) versus percent of data used for training (x-axis), for run 1, run 2, and run 3.]

[Figure: Average percent of errors in tree (y-axis) versus percent of data used for training (x-axis), for the 12-attribute and 32-attribute files.]

[Figure: Decision tree. Internal nodes test concave points, area, perimeter, texture, symmetry, and compactness; leaves are labeled M (malignant) or B (benign). Split thresholds shown in the figure: 0.049, 693.7, 102.1, 102.8, 19.73, 15.45, 0.211, 0.085, 0.123.]

Each rule consists of:
    an arbitrary rule number
    statistics
    one or more conditions that must be satisfied
    the class predicted by the rule
    the confidence with which the prediction is made

Statistics:
    n/m, where n is the number of training cases covered by the rule and m is the number of those cases that do not belong to the class predicted by the rule
    lift, the result of dividing the rule's estimated accuracy by the relative frequency of the predicted class
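The lift statistic can be worked through numerically. Here n is the number of training cases a rule covers and m is the number of those cases that do not belong to the predicted class. The sketch assumes the estimated accuracy is the Laplace ratio (n - m + 1) / (n + 2), which is how the C5.0 documentation describes it; the slide itself does not give the formula. The function names and all numbers in the example are hypothetical.

```python
def laplace_accuracy(n, m=0):
    """Estimated rule accuracy, assuming the Laplace ratio (n - m + 1) / (n + 2)."""
    return (n - m + 1) / (n + 2)

def lift(n, m, class_count, total_cases):
    """Estimated accuracy divided by the relative frequency of the predicted class."""
    relative_frequency = class_count / total_cases
    return laplace_accuracy(n, m) / relative_frequency

# Hypothetical rule: covers 98 training cases, 2 of them in the wrong class.
# Hypothetical training set: 200 cases, 50 in the predicted class.
accuracy = laplace_accuracy(98, 2)
rule_lift = lift(98, 2, class_count=50, total_cases=200)
print(round(accuracy, 2), round(rule_lift, 2))
```

A lift above 1 means the rule predicts its class more accurately than always guessing that class would, given how common the class is in the training set.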

Conclusion

The decision tree gives an average error rate of 6% to 7%.

The classifier may be overtrained. Selecting only a few cell features may give better results.

Developers of Xcyt obtained best results using three features: worst area, worst smoothness, and average texture.

Lessons Learned

Became familiar with C5.

Importance of knowledge of the domain.

Further work:
    Build a classifier using a different subset of features.
    Use the adaptive boosting option.

References

W. N. Street, O. L. Mangasarian, W. H. Wolberg. An Inductive Learning Approach to Prognostic Prediction.

O. L. Mangasarian, W. N. Street, W. H. Wolberg. Breast Cancer Diagnosis and Prognosis via Linear Programming.

Machine Learning for Cancer Diagnosis and Prognosis: http://www.cs.wisc.edu/~olvi/uwmp/cancer.html