Datamining Project: Update Marcus Gutierrez, Jessica Rebollosa, Vince (Wenqing) Sun, Lauren Waldrop http://dataminingmed.weebly.com



Recap

• Data
▫ Non-homogenous datasets (Clinical Trial/Pubmed)
▫ Cancer-related
▫ Relations (Explicit links)

• Motivation
▫ Implicit links between clinical trials and pubmed articles may exist

• Aim
▫ Provide scientists in the biological community insight into related clinical trials and/or other publications of interest


Data Pre-Processing

• First Trial terms:
1. if. radiation therapy
2. i),. gemcitabine/cisplatin
3. weeks until. disease progression
4. this. regimen
5. serum levels of. interleukin-6
6. biliary adenocarcinomas
7. adenocarcinoma treated
8. post-operative adjuvant paclitaxel + cisplatin
9. phase ii trial of post-operative
10. cardia receiving. post-operative adjuvant paclitaxel
11. gastro-esophageal junction or cardia


Data Pre-Processing

• LingPipe code gives terms such as:

Running the Statistical Named Entity Recognizer, with Training a Named Entity Recognizer, using two models: pos-en-bio-medpost.HiddenMarkovModel and pos-en-general-brown.HiddenMarkovModel

1. brain metastases
2. patients undergo
3. prophylactic cranial irradiation
4. brain
5. disease small cell lung cancer
6. cranial irradiation
7. health economics
8. therapy vs progression
9. administration


Refer to: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html


Data Pre-Processing

• Final terms:

Trials:

1. bronchi

2. bronchial

3. bsh

4. cachexia

5. calcimimetic

6. calcium

7. hybridization

8. hydrochloride

9. hydrocortisone

10. hydroxyproline

11. hypercortisolism


Refer to: https://github.com/tnunes/becas-python

Pubmed:

1. abdomen

2. acc

3. acetate

4. acitretin

5. actinomycin

6. add

7. dermatitis

8. desmoid

9. desmolase

10. desmoplastic

11. detoxification

Semantic group

Identified entity types:
➢ Chemicals
➢ Enzymes
➢ Genes
➢ Protein
➢ Disease or Syndrome
➢ Anatomical structure
➢ Body System


Entity Extraction - TFIDF

• Using textmodeler code
▫ Extract entities
▫ Calculate TFIDF

• Examples of Features:
▫ “thyroid cancer”
▫ “stem cell”
▫ “cell lung cancer tumor cells tumor cells” X
▫ “arms arm arm oxaliplatin arm arm” X

• Number of Unique Entities:
▫ Pubmed: 1696
▫ Trials: 1492


Term Extraction - TFIDF

• Implement Simple Code
▫ Term Extraction
▫ TFIDF Calculation

• Examples of Features:
▫ “mesothelioma”
▫ “adenocarcinoma”
▫ “neoplasia”

• Number of Unique Entities:
▫ Pubmed: 818
▫ Trials: 802
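To make the TFIDF calculation step concrete, here is a minimal sketch of one common weighting (relative term frequency times the log of inverse document frequency). The slides do not say which TFIDF variant the project code uses, so the formula, the function name `tfidf`, and the three toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy documents built from the example features on this slide.
docs = [
    ["mesothelioma", "adenocarcinoma", "mesothelioma"],
    ["adenocarcinoma", "neoplasia"],
    ["neoplasia", "adenocarcinoma", "neoplasia"],
]
weights = tfidf(docs)
```

With this variant, a term that occurs in every document (here “adenocarcinoma”) gets weight zero, while a term concentrated in one document gets the highest weight.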


Results

For the variance analysis, we removed the maximum threshold and used only a minimum threshold to see whether the results improve.


Results

We then chose thresholds of 0.00008 and 0.00007, and noticed that the two ACS figures are very similar.


Results

We then removed the terms whose variance was lower than the threshold and obtained the clusters before dependent clustering (K=10). After dependent clustering, however, there was only one giant cluster.
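The variance-based term removal can be sketched as follows. This is an illustration under assumptions, not the project's code: the matrix layout (one row per document, one column per term), the function name `filter_by_variance`, and the toy weights are hypothetical; only the threshold 0.00008 comes from the slides.

```python
from statistics import pvariance

def filter_by_variance(matrix, terms, min_variance):
    """Keep the term columns whose variance across documents is >= min_variance."""
    keep = [
        j for j in range(len(terms))
        if pvariance([row[j] for row in matrix]) >= min_variance
    ]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix

# Hypothetical TFIDF weights: rows are documents, columns are terms.
terms = ["trial", "mesothelioma", "neoplasia"]
matrix = [
    [0.010, 0.00, 0.03],
    [0.010, 0.05, 0.00],
    [0.011, 0.00, 0.02],
]
kept_terms, kept_matrix = filter_by_variance(matrix, terms, min_variance=0.00008)
```

In this toy matrix, “trial” has nearly the same weight in every document, so its variance falls below the minimum threshold and the column is dropped.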


Results

We then used the same data set with preprocessing: we removed terms such as “and” and “or”.
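A minimal sketch of this preprocessing step, assuming whitespace tokenization. The stop-word set holds only the two words named on the slide; the real list is not given, and the function name and sample sentence are hypothetical.

```python
# Only "and" and "or" are named on the slide; a real stop-word list would be longer.
STOP_WORDS = {"and", "or"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "paclitaxel and cisplatin or gemcitabine".split()
cleaned = remove_stop_words(tokens)
```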


Results

This is the variance using the preprocessed data.

We set the threshold to 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.0002.

We set the threshold candidates to: 0.00003, 0.00006, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.00020, 0.00021, 0.00022, 0.00023, 0.00024, 0.00025.


Results

K=5


Results

Before dependent clustering


Results

After dependent clustering


Results

When we restricted the data to medical terms, using the dictionary for preprocessing, and used entities as the features, we got:


Results

The clustering results before dependent clustering:


Results

When we restricted the data to medical terms, using the dictionary for preprocessing, and used each term as a feature, we got:


Results

Before dependent clustering


Heterogeneous Naïve Bayes Classification

Find the Probability that a relation exists for every document in corpus B, given every document in corpus A


Corpus A

doc t1 t2 t3 t4

A rat cat cat bat

B rat rat bat

C dog dog cat

D bird bat bat dog

Z cat bird dog

Corpus B

doc t1 t2 t3 t4 t5

1 trial boy boy sick

2 trial healthy girl

3 trial cancer treatment girl

4 trial cancer brain cancer

5 trial blind boy girl girl

6 trial brain cancer blind

Relational

Doc (A) Doc (B)

A 2,4

B 1,6

C 1,2

D 4


doc (A) class t1 t2 t3 t4

A trial rat cat cat bat

A healthy rat cat cat bat

A girl rat cat cat bat

A trial rat cat cat bat

A cancer rat cat cat bat

A brain rat cat cat bat

A cancer rat cat cat bat

B trial rat rat bat

B boy rat rat bat

B boy rat rat bat

B sick rat rat bat

B trial rat rat bat

B brain rat rat bat

B cancer rat rat bat

B blind rat rat bat

C trial dog dog cat

C boy dog dog cat

C boy dog dog cat

C sick dog dog cat

C trial dog dog cat

C healthy dog dog cat

C girl dog dog cat

D trial bird bat bat dog

D cancer bird bat bat dog

D brain bird bat bat dog

D cancer bird bat bat dog
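The table above can be read as an expansion of the relational table: each corpus A document is repeated once for every term of every corpus B document it links to, with that term as the class label. A minimal sketch of the expansion, using the toy corpora from the slides (the variable and function names are hypothetical):

```python
corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
# Explicit links from the relational table; doc Z has no known links.
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

def expand_relations(corpus_a, corpus_b, relations):
    """Build (doc A id, class label, doc A terms) training rows."""
    rows = []
    for doc_a, linked in relations.items():
        for doc_b in linked:
            for label in corpus_b[doc_b]:
                rows.append((doc_a, label, corpus_a[doc_a]))
    return rows

rows = expand_relations(corpus_a, corpus_b, relations)
```

Doc Z contributes no rows because it has no explicit links; it only appears later as a document to be scored.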



docs     doc 1     doc 2     doc 3     doc 4     doc 5     doc 6

doc A    0.001645  0.001628  0.002001  0.002752  0.001866  0.002091

doc B    0.011329  0.004655  0.007571  0.013608  0.013202  0.016166

doc C    0.010044  0.008304  0.006061  0.003992  0.011153  0.003543

doc D    0.000146  0.000146  0.000435  0.000851  0.000146  0.000561

doc Z    0.000584  0.000584  0.001033  0.001655  0.000584  0.001206


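Naïve Bayes Formulation

Corpus A, doc A: rat cat cat bat

Corpus B, doc 1: trial boy boy sick

\[
P(\text{doc 1} \mid \text{doc A}) \propto 0.001645
\]

\[
\begin{aligned}
  & P(\text{trial}) \cdot P(\text{trial} \mid \text{rat}) \cdot P(\text{trial} \mid \text{cat}) \cdot P(\text{trial} \mid \text{cat}) \cdot P(\text{trial} \mid \text{bat}) \\
+ {} & P(\text{boy}) \cdot P(\text{boy} \mid \text{rat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{bat}) \\
+ {} & P(\text{boy}) \cdot P(\text{boy} \mid \text{rat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{bat}) \\
+ {} & P(\text{sick}) \cdot P(\text{sick} \mid \text{rat}) \cdot P(\text{sick} \mid \text{cat}) \cdot P(\text{sick} \mid \text{cat}) \cdot P(\text{sick} \mid \text{bat})
\end{aligned}
\]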
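The sum-of-products shape of this formulation can be sketched in a few lines. The slides do not state how P(class) and P(class | term) are estimated, so the row-based counts below are an assumption, as are all function names; the resulting numbers are therefore not expected to reproduce the score table from the slides.

```python
from math import prod

corpus_a = {
    "A": ["rat", "cat", "cat", "bat"], "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"], "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"], 2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"], 4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"], 6: ["trial", "brain", "cancer", "blind"],
}
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

# One training row (class label, terms of the corpus A doc) per term of each linked doc.
rows = [(label, set(corpus_a[a]))
        for a, linked in relations.items() for b in linked for label in corpus_b[b]]

def prior(label):
    """Assumed estimate of P(class): share of training rows with this label."""
    return sum(1 for c, _ in rows if c == label) / len(rows)

def conditional(label, term):
    """Assumed estimate of P(class | term): share of rows containing term with this label."""
    with_term = [c for c, terms in rows if term in terms]
    if not with_term:
        return 0.0  # unseen term; no smoothing in this sketch
    return with_term.count(label) / len(with_term)

def score(doc_a_terms, doc_b_terms):
    """Sum over the terms of the corpus B doc of P(c) times the product of P(c | t)."""
    return sum(
        prior(c) * prod(conditional(c, t) for t in doc_a_terms)
        for c in doc_b_terms
    )

score_a1 = score(corpus_a["A"], corpus_b[1])
```

Repeated terms are kept on both sides, so “cat” contributes its factor twice and “boy” contributes its whole product twice, matching the expansion written out above.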


Naïve Bayes Laplace Smoothing

• Laplace smoothing better handles the terms that do not appear at all; however, we lose even more accuracy.

• This raises the question: Do we need to improve accuracy?
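For reference, a minimal sketch of add-one (Laplace) smoothing applied to a single conditional estimate. The counts are hypothetical, and the slides do not show how smoothing was wired into the project's classifier, so this only illustrates the general technique.

```python
def smoothed_probability(count, total, n_classes, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * n_classes)."""
    return (count + alpha) / (total + alpha * n_classes)

# Hypothetical counts: 15 training rows contain a term, spread over 8 class labels.
unseen = smoothed_probability(0, 15, 8)  # class never seen with the term
seen = smoothed_probability(4, 15, 8)    # class seen 4 times with the term
```

An unseen pair no longer gets probability zero, so a single missing term cannot wipe out a whole product; the price is that every observed estimate is pulled toward the uniform value.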


Naïve Bayes Accuracy

• We MAY not need to improve accuracy
• We are more interested in relative ratings

4.214256697426845E-61 9.545714918515275E-62 6.375720538007726E-69 …
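Because only the relative ratings matter, scores this small can still be ranked directly. The sketch below orders the three values shown on the slide; the candidate labels are hypothetical, since the slide does not say which document pairs the values belong to. The base-10 logarithms are included only to make the gaps easier to read.

```python
import math

# Scores copied from the slide; the labels are placeholders.
scores = {
    "candidate 1": 4.214256697426845e-61,
    "candidate 2": 9.545714918515275e-62,
    "candidate 3": 6.375720538007726e-69,
}

# Rank from most to least likely link by comparing the raw scores.
ranked = sorted(scores, key=scores.get, reverse=True)

# Log scale: differences of a few units correspond to orders of magnitude.
log_scores = {name: math.log10(value) for name, value in scores.items()}
```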


Naïve Bayes Future Improvement

• Improve accuracy
• Improve speed
• Determine criteria for predicting new links
• Find out if new links improve or harm dependent clustering


Contributions from each Member

Columns: SW/P Removal (1), Non-BT Removal (2), VTR (3), TFIDF Terms, TFIDF Entities, DC (4), DenC (5), NB (6), DA & DV (7), Web-site (8)

Jessica X X X X X

Lauren X X X X X X

Marcus X

Vince X X X X

1: Stop Word/Punctuation Removal
2: Non-biological term removal
3: Variance Term Removal
4: Dependent Clustering
5: Density Clustering
6: Naïve Bayes – New Algorithm
7: Data Analysis & Data Visualization
