CANB 7640 Final Project Presentation · shelmi.com/education/CANB.pdf

Post on 08-Oct-2020

CANB 7640 Final Project Presentation

Chronic Obstructive Pulmonary Disease (COPD) is an umbrella term used to describe progressive lung diseases including emphysema, chronic bronchitis, refractory (non-reversible) asthma, and some forms of bronchiectasis. This disease is characterized by increasing breathlessness. (Source: http://www.copdfoundation.org/)

3rd leading cause of death in the US

• Dr. Farnoush Banaei-Kashani (PhD), Assistant Professor

• Shahab Helmi, PhD Student

• Dr. Katerina Kechris (PhD), Associate Professor

• Dr. Russell Bowler (MD, PhD), Professor

• Sean Jacobson (MS), Data Analyst

Predict how COPD progresses over time

Reverse engineering -> what are the causes? (Future work)

Mainly from http://www.copdgene.org/ (PRIVATE)

Metabolomics

Genetics

Genomics

Proteomics

Clinical

CT Scan

The dataset used in this project has 5000 samples and each sample has around 150 features.

SID NewGold 1 NewGold 2 …

Data preprocessor:
• Handling null values
• Data normalization
• Discretization
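The three preprocessing steps can be illustrated with a minimal Python sketch. The slides do not say which strategies were used (the project itself was written in C#), so mean imputation, min-max scaling, equal-width binning, and all function names here are assumptions chosen only for illustration.

```python
from statistics import mean

def fill_nulls(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def discretize(column, bins=4):
    """Map values in [0, 1] to equal-width bin indices 0..bins-1."""
    return [min(int(v * bins), bins - 1) for v in column]

# Hypothetical feature column with one missing measurement.
raw = [2.0, None, 4.0, 10.0]
filled = fill_nulls(raw)
scaled = min_max_normalize(filled)
binned = discretize(scaled)
print(filled, scaled, binned)
```

Normalizing before a distance-based method such as KNN keeps a feature with a large numeric range from dominating the distance.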

Overlap Module

Predictor:
• KNN
• KNN + Decision Tree
• Naïve Bayes

KNN

KNN + Decision Tree 1

A B … D3

KNN + Decision Tree 2

A B … D3

A-A A-B … A-D3

$\min\left( \frac{\sum_{1}^{k} \mathrm{distance}(\mathit{test}, \mathit{train})}{k} \right)$
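One possible reading of the formula above: for each candidate class, average the distances from the test sample to its k nearest training samples of that class, then predict the class with the smallest average. The slides do not confirm this reading. The Euclidean distance, the function names, and the toy data below are assumptions for illustration only.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_min_mean_distance(test, train, k):
    """Return the label whose k nearest training samples have the smallest mean distance.

    `train` is a list of (features, label) pairs. If a label has fewer than
    k samples, the mean is taken over the samples that exist.
    """
    distances_by_label = defaultdict(list)
    for features, label in train:
        distances_by_label[label].append(euclidean(test, features))

    best_label, best_score = None, math.inf
    for label, dists in distances_by_label.items():
        nearest = sorted(dists)[:k]
        score = sum(nearest) / len(nearest)
        if score < best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: two hypothetical classes labeled "A" and "B".
train = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.0), "B")]
prediction = predict_min_mean_distance((0.5, 0.5), train, k=2)
print(prediction)
```

Unlike majority-vote KNN, this variant compares classes by how close their nearest members are, so a class with few training samples is not automatically outvoted.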

C# and LINQ

Microsoft SQL Server

Train-Test Ratio

90-10

80-20

70-30
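As a rough illustration of the three ratios, the sketch below splits a 5000-sample dataset (the size quoted earlier) into training and test parts. Whether the original experiments shuffled the samples, and how, is not stated in the slides. The seeded shuffle and the function name are assumptions.

```python
import random

def train_test_split(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and cut it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

samples = list(range(5000))  # stand-in for the 5000 dataset rows
sizes = {}
for name, fraction in (("90-10", 0.9), ("80-20", 0.8), ("70-30", 0.7)):
    train, test = train_test_split(samples, fraction)
    sizes[name] = (len(train), len(test))
    print(name, len(train), len(test))
```

A smaller training share leaves KNN fewer neighbors to draw on but gives a larger, less noisy test estimate, which is one reason to report several ratios.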

Features

All 150

Numerical-only

Categorical-only

Genetic-only

Disease history-only

[Figure: 90-10 – KNN – All features. Accuracy 1 and Accuracy 2 for k=1 through k=10; vertical axis from 0 to 90.]

[Figure: accuracy for K=1 through K=10, comparing KNN, KNN+DT1, and KNN+DT2. NB = 54%.]

K       KNN     KNN+DT1   KNN+DT2
K=1     61.4    63.3      64.6
K=2     65      66        65.5
K=3     65      66        66.7
K=4     65.5    67.7      67.6
K=5     65.5    67.8      68.7
K=6     67      68.4      69.1
K=7     67.8    69        69.6
K=8     67.8    69.2      69.9
K=9     67.5    69.5      69.9
K=10    67      69.6      69.8

[Figure: accuracy for k=1 through k=10 by feature set (All, Categorical, Numerical, Genetic, Disease); vertical axis from 0 to 80.]

Feature           Max Accuracy 1   Max Accuracy 2
Disease History   70.5%            85.25%
All               69.9%            84.65%
Categorical       69.1%            84.05%
Numerical         68.1%            83.95%
Genetic           64%              82.9%

Working with domain experts

Better feature selection (medical doctors)

Better data preprocessing (statisticians) + PCA

Testing all feature combinations!

Curse of dimensionality (2^150 combinations) -> smart algorithm

But may solve the mystery of COPD progression

More classification algorithms, such as SVMs, NNs, …
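To make the scale of testing every combination of 150 features concrete, the following sketch counts the feature subsets and estimates how long exhaustive evaluation would take. The rate of one billion subsets per second is a purely hypothetical assumption, not a measured figure.

```python
n_features = 150

# Each feature is either included or excluded, giving 2^150 subsets
# (this count includes the empty subset).
n_subsets = 2 ** n_features

# Hypothetical evaluation speed, assumed only for this estimate.
subsets_per_second = 10 ** 9
seconds_per_year = 60 * 60 * 24 * 365

years = n_subsets // (subsets_per_second * seconds_per_year)
print(f"{n_subsets:.3e} subsets, about {years:.1e} years at the assumed rate")
```

Even at that optimistic rate an exhaustive search is out of reach, which is why a smarter search over feature subsets would be needed.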