Datamining Project: Update Marcus Gutierrez, Jessica Rebollosa, Vince (Wenqing) Sun, Lauren Waldrop...

Post on 13-Jan-2016


Datamining Project: Update
Marcus Gutierrez, Jessica Rebollosa, Vince (Wenqing) Sun, Lauren Waldrop

http://dataminingmed.weebly.com

Recap
• Data
▫ Non-homogeneous datasets (Clinical Trial/Pubmed)
▫ Cancer-related
▫ Relations (Explicit links)
• Motivation
▫ Implicit links between clinical trials and Pubmed articles may exist
• Aim
▫ Provide scientists in the biological community insight into related clinical trials and/or other publications of interest

2

Data Pre-Processing
• First Trial terms:
1. if. radiation therapy
2. i),. gemcitabine/cisplatin
3. weeks until. disease progression
4. this. regimen
5. serum levels of. interleukin-6
6. biliary adenocarcinomas
7. adenocarcinoma treated
8. post-operative adjuvant paclitaxel + cisplatin
9. phase ii trial of post-operative
10. cardia receiving. post-operative adjuvant paclitaxel
11. gastro-esophageal junction or cardia

3

Data Pre-Processing
• LingPipe code gives terms such as the following.

Running the Statistical Named Entity Recognizer, trained with two models: pos-en-bio-medpost.HiddenMarkovModel and pos-en-general-brown.HiddenMarkovModel
1. brain metastases
2. patients undergo
3. prophylactic cranial irradiation
4. brain
5. disease small cell lung cancer
6. cranial irradiation
7. health economics
8. therapy vs progression
9. administration

4

Refer to: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

Data Pre-Processing
• Final terms:

Trials:

1. bronchi

2. bronchial

3. bsh

4. cachexia

5. calcimimetic

6. calcium

7. hybridization

8. hydrochloride

9. hydrocortisone

10. hydroxyproline

11. hypercortisolism

5

Refer to: https://github.com/tnunes/becas-python

Pubmed:

1. abdomen

2. acc

3. acetate

4. acitretin

5. actinomycin

6. add

7. dermatitis

8. desmoid

9. desmolase

10. desmoplastic

11. detoxification

Semantic group
Identified entity types:
➢ Chemicals
➢ Enzymes
➢ Genes
➢ Protein
➢ Disease or Syndrome
➢ Anatomical structure
➢ Body System

Entity Extraction - TFIDF
• Using textmodeler code
▫ Extract entities
▫ Calculate TFIDF
• Examples of Features:
▫ “thyroid cancer”
▫ “stem cell”
▫ “cell lung cancer tumor cells tumor cells” X
▫ “arms arm arm oxaliplatin arm arm” X
• Number of Unique Entities:
▫ Pubmed: 1696
▫ Trials: 1492

6

Term Extraction - TFIDF
• Implement Simple Code
▫ Term Extraction
▫ TFIDF Calculation
• Examples of Features:
▫ “mesothelioma”
▫ “adenocarcinoma”
▫ “neoplasia”
• Number of Unique Entities:
▫ Pubmed: 818
▫ Trials: 802
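The term-level TFIDF step could look roughly like the following. This is a minimal sketch and not the project's code: the whitespace tokenizer, the weighting (tf = count / document length, idf = log(N / df)), and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / doc length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents (hypothetical, not taken from the project data).
toy_docs = [
    "mesothelioma trial cisplatin",
    "adenocarcinoma trial paclitaxel",
    "neoplasia trial",
]
toy_weights = tfidf(toy_docs)
```

Under this weighting a term that occurs in every document, such as "trial" in the toy data, gets weight zero, while a term unique to one document gets the largest weight in that document.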

7

8

Results
For the variance analysis, we removed the maximum threshold and used only a minimum threshold to see whether there were any improvements.

9

Results
Then we chose thresholds of 0.00008 and 0.000070.

We noticed that the two ACS figures are very similar.

10

Results
Then we removed the terms with variance lower than the threshold and obtained the clusters before dependent clustering, with K=10. After dependent clustering, however, there is only one giant cluster.
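Removing low-variance terms with a minimum threshold only can be sketched as below. The function name, the use of population variance, and the toy matrix are assumptions for illustration; the slides do not say how the variance was computed.

```python
from statistics import pvariance


def filter_by_variance(matrix, terms, min_variance):
    """Keep only the terms (columns) whose variance is at least min_variance.

    matrix: list of rows, one per document, each a list of TFIDF weights.
    terms:  column labels, in the same order as the columns of matrix.
    """
    columns = list(zip(*matrix))
    keep = [j for j, col in enumerate(columns) if pvariance(col) >= min_variance]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix


# Toy weights (hypothetical): three documents, three terms.
toy_terms = ["trial", "cisplatin", "and"]
toy_matrix = [
    [0.010, 0.00, 0.002],
    [0.010, 0.05, 0.002],
    [0.011, 0.00, 0.002],
]
kept_terms, kept_matrix = filter_by_variance(toy_matrix, toy_terms, 0.00008)
```

In the toy data only "cisplatin" varies enough across documents to pass the 0.00008 threshold; the near-constant columns are dropped.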

11

Results

12

Results

13

Results
Now we use the same data set with preprocessing: we removed terms such as “and” and “or”.
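A stop-word and punctuation removal pass of this kind might look like the sketch below. The stop-word list and the function name are illustrative assumptions; the list actually used in the project is not given.

```python
import string

# Illustrative stop-word list (assumption); the project's list is not shown.
STOP_WORDS = {"and", "or", "the", "of", "a", "in"}


def remove_stop_words(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


cleaned = remove_stop_words("Gemcitabine and cisplatin, or radiation therapy.")
```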

14

Results
This is the variance using the preprocessed data.

We set the threshold to 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.0002.

We set the threshold candidates to 0.00003, 0.00006, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.00020, 0.00021, 0.00022, 0.00023, 0.00024, 0.00025.

15

Results

K=5

16

Results

17

Results

Before dependent clustering

18

Results

After dependent clustering

19

Results
When we considered only the medical terms, using the dictionary to preprocess the data, and used entities as the features, we get:

20

Results
The clustering results before dependent clustering:

21

Results
When we considered only the medical terms, using the dictionary to preprocess the data, and used each term as a feature, we get:

22

Results
Before dependent clustering

Heterogeneous Naïve Bayes Classification

Find the Probability that a relation exists for every document in corpus B, given every document in corpus A

23

Corpus A

doc  t1    t2    t3   t4
A    rat   cat   cat  bat
B    rat   rat   bat
C    dog   dog   cat
D    bird  bat   bat  dog
Z    cat   bird  dog

Corpus B

doc  t1     t2       t3         t4      t5
1    trial  boy      boy        sick
2    trial  healthy  girl
3    trial  cancer   treatment  girl
4    trial  cancer   brain      cancer
5    trial  blind    boy        girl    girl
6    trial  brain    cancer     blind

Relational

Doc (A)  Doc (B)
A        2,4
B        1,6
C        1,2
D        4

24

doc (A)  class    t1    t2   t3   t4
A        trial    rat   cat  cat  bat
A        healthy  rat   cat  cat  bat
A        girl     rat   cat  cat  bat
A        trial    rat   cat  cat  bat
A        cancer   rat   cat  cat  bat
A        brain    rat   cat  cat  bat
A        cancer   rat   cat  cat  bat
B        trial    rat   rat  bat
B        boy      rat   rat  bat
B        boy      rat   rat  bat
B        sick     rat   rat  bat
B        trial    rat   rat  bat
B        brain    rat   rat  bat
B        cancer   rat   rat  bat
B        blind    rat   rat  bat
C        trial    dog   dog  cat
C        boy      dog   dog  cat
C        boy      dog   dog  cat
C        sick     dog   dog  cat
C        trial    dog   dog  cat
C        healthy  dog   dog  cat
C        girl     dog   dog  cat
D        trial    bird  bat  bat  dog
D        cancer   bird  bat  bat  dog
D        brain    bird  bat  bat  dog
D        cancer   bird  bat  bat  dog
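The table above appears to pair each linked Corpus A document with every term of the Corpus B documents it is related to, using the Corpus B term as the class. A minimal sketch of that expansion, using the toy corpora from the slides (the function and variable names are assumptions):

```python
corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
# Explicit links: Corpus A document -> related Corpus B documents.
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}


def build_training_rows(corpus_a, corpus_b, relations):
    """Return (doc_a, class, features) rows.

    One row per term occurrence in each linked Corpus B document; the
    features are the terms of the Corpus A document.
    """
    rows = []
    for doc_a, linked_docs in relations.items():
        for doc_b in linked_docs:
            for term_b in corpus_b[doc_b]:
                rows.append((doc_a, term_b, corpus_a[doc_a]))
    return rows


training_rows = build_training_rows(corpus_a, corpus_b, relations)
```

Document Z has no explicit links, so it contributes no training rows; it only appears later as a document to be scored.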

25

Corpus A

doc  t1    t2    t3   t4
A    rat   cat   cat  bat
B    rat   rat   bat
C    dog   dog   cat
D    bird  bat   bat  dog
Z    cat   bird  dog

Corpus B

doc  t1     t2       t3         t4      t5
1    trial  boy      boy        sick
2    trial  healthy  girl
3    trial  cancer   treatment  girl
4    trial  cancer   brain      cancer
5    trial  blind    boy        girl    girl
6    trial  brain    cancer     blind

docs   doc 1     doc 2     doc 3     doc 4     doc 5     doc 6
doc A  0.001645  0.001628  0.002001  0.002752  0.001866  0.002091
doc B  0.011329  0.004655  0.007571  0.013608  0.013202  0.016166
doc C  0.010044  0.008304  0.006061  0.003992  0.011153  0.003543
doc D  0.000146  0.000146  0.000435  0.000851  0.000146  0.000561
doc Z  0.000584  0.000584  0.001033  0.001655  0.000584  0.001206

26

Naïve Bayes Formulation

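\[
\begin{aligned}
P(\text{doc 1} \mid \text{doc A}) \propto{} & P(\text{trial}) \cdot P(\text{trial} \mid \text{rat}) \cdot P(\text{trial} \mid \text{cat}) \cdot P(\text{trial} \mid \text{cat}) \cdot P(\text{trial} \mid \text{bat}) \\
{}+{} & P(\text{boy}) \cdot P(\text{boy} \mid \text{rat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{bat}) \\
{}+{} & P(\text{boy}) \cdot P(\text{boy} \mid \text{rat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{cat}) \cdot P(\text{boy} \mid \text{bat}) \\
{}+{} & P(\text{sick}) \cdot P(\text{sick} \mid \text{rat}) \cdot P(\text{sick} \mid \text{cat}) \cdot P(\text{sick} \mid \text{cat}) \cdot P(\text{sick} \mid \text{bat})
\end{aligned}
\]

For doc A and doc 1 the score is 0.001645.

doc A: rat cat cat bat

doc 1: trial boy boy sick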
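The sum-of-products score above can be sketched in code as follows. The slides do not state how P(class) and P(class | term) are estimated, so the estimators here are assumptions: P(class) is the share of training rows with that class, and P(class | term) is the share of that class among training rows whose features contain the term. With these assumed estimators the output is not expected to reproduce the table's values exactly.

```python
corpus_a = {"A": ["rat", "cat", "cat", "bat"], "B": ["rat", "rat", "bat"],
            "C": ["dog", "dog", "cat"], "D": ["bird", "bat", "bat", "dog"]}
corpus_b = {1: ["trial", "boy", "boy", "sick"], 2: ["trial", "healthy", "girl"],
            3: ["trial", "cancer", "treatment", "girl"],
            4: ["trial", "cancer", "brain", "cancer"],
            6: ["trial", "brain", "cancer", "blind"]}
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

# Training rows: (class, features) for every term of every linked B document.
rows = [(term_b, corpus_a[a]) for a, linked in relations.items()
        for b in linked for term_b in corpus_b[b]]


def p_class(cls):
    """Assumed prior: share of training rows labelled with cls."""
    return sum(1 for c, _ in rows if c == cls) / len(rows)


def p_class_given_term(cls, term):
    """Assumed conditional: share of cls among rows whose features contain term."""
    classes = [c for c, feats in rows if term in feats]
    return classes.count(cls) / len(classes) if classes else 0.0


def score(features_a, terms_b):
    """One summand per term of the B document, one factor per term of the A document."""
    total = 0.0
    for cls in terms_b:
        product = p_class(cls)
        for term in features_a:
            product *= p_class_given_term(cls, term)
        total += product
    return total


score_a1 = score(corpus_a["A"], corpus_b[1])
```

A class that never appears in the training rows, such as "treatment", contributes zero to the sum under these estimators, which is the gap the smoothing on the next slide addresses.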

27

Naïve Bayes Laplace Smoothing

• This better handles the terms that do not appear at all; however, we lose even more accuracy.

• This raises the question: Do we need to improve accuracy?
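Assuming this slide refers to Laplace (additive) smoothing, the idea can be sketched as below. The function name and the counts in the example are illustrative assumptions: the counts come from the toy training table, taking P(class | rat) over the 15 training rows that contain "rat" and the 9 distinct Corpus B terms as classes.

```python
def smoothed_conditional(pair_count, term_count, n_classes, alpha=1.0):
    """Additive (Laplace) estimate of P(class | term).

    pair_count: training rows containing the term that carry this class
    term_count: training rows containing the term
    n_classes:  number of distinct classes
    alpha:      pseudo-count added to every class (1.0 = add-one smoothing)
    """
    return (pair_count + alpha) / (term_count + alpha * n_classes)


# Class counts among the 15 toy training rows that contain "rat".
rat_counts = {"trial": 4, "cancer": 3, "brain": 2, "boy": 2, "healthy": 1,
              "girl": 1, "sick": 1, "blind": 1, "treatment": 0}
rat_total = sum(rat_counts.values())

smoothed = {cls: smoothed_conditional(n, rat_total, len(rat_counts))
            for cls, n in rat_counts.items()}
```

The unseen class "treatment" moves from 0 to 1/24, while "trial" moves from 4/15 to 5/24: every observed estimate is pulled toward the uniform value in exchange for removing the zeros.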

28

Naïve Bayes Accuracy

• We MAY not need to improve accuracy
• We are more interested in relative ratings

4.214256697426845E-61 9.545714918515275E-62 6.375720538007726E-69 …
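When only the relative ordering matters, very small scores can be compared in log space so that they do not underflow to zero. The sketch below covers a single product of probabilities only; the formulation in these slides sums several products, which would additionally need a log-sum-exp step. The names and the toy factors are assumptions for illustration.

```python
import math


def log_product(probabilities):
    """Log of the product of the factors; -inf if any factor is zero."""
    if any(p == 0.0 for p in probabilities):
        return float("-inf")
    return sum(math.log(p) for p in probabilities)


def rank(candidates):
    """Return candidate names sorted from highest to lowest log-score."""
    return sorted(candidates, key=lambda name: log_product(candidates[name]),
                  reverse=True)


# Toy factors (hypothetical): 150 small probabilities per candidate.
candidates = {
    "doc 1": [1e-3] * 150,
    "doc 2": [2e-3] * 150,
    "doc 3": [1e-3] * 149 + [0.0],
}
ranking = rank(candidates)
underflowed = math.prod(candidates["doc 1"])  # plain product underflows to 0.0
```

The plain products of "doc 1" and "doc 2" both underflow to 0.0 and cannot be told apart, while their log-scores still order them correctly.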

29

Naïve Bayes Future Improvement

• Improve accuracy
• Improve speed
• Determine criteria for predicting new links
• Find out if new links improve or harm dependent clustering

30

Contributions from each Member

Columns: SW/P Removal (1), Non-BT Removal (2), VTR (3), TFIDF Terms, TFIDF Entities, DC (4), DenC (5), NB (6), DA & DV (7), Web-site (8)

Jessica X X X X X

Lauren X X X X X X

Marcus X

Vince X X X X

1: Stop Word/Punctuation Removal
2: Non-biological term removal
3: Variance Term Removal
4: Dependent Clustering
5: Density Clustering
6: Naïve Bayes – New Algorithm
7: Data Analysis & Data Visualization

Contribution

31