Early detection of cancer using NLP / Limor Lahiani

Early Cancer Diagnosis: using NLP to analyze biomedical literature

image: inside.miroculus.com

http://inside.miroculus.com/

@LimorLah

https://il.linkedin.com/in/limorl

catalystcode

http://limorl.com

Limor Lahiani, SDE Manager, DX Partner Catalyst, Microsoft

Partner.

Generalize.

Share.

image: miroculus.com/open

DISEASE DECODED: a simple blood test to detect disease at the molecular level

http://miroculus.com/open

image: miroculus.com/introducing-loom

http://inside.miroculus.com/introducing-loom/

mir-1245a, mir-146a, mir-17, mir-210, mir-24-2

image credit: miroculus.com

BRCA2

http://inside.miroculus.com/introducing-loom/

which microRNA is related to which gene?

miRNA, genes, diseases

diagram: scheduler, delta querying, doc processing, classifying relations, graph API

entity extraction, relation classifier

corpus

Relation Extraction (corpus-to-graph)

domain-specific generic

microRNA-gene relation classifier

Machine Learning 101

designing algorithms for inferring unknowns from knowns

supervised learning: given known labeled data, find a function that maps the data to its labels

unsupervised learning: given unlabeled data, find patterns or explain key features in the data

classification: spam detection, handwriting, …

regression: stock prediction, demand forecasting, …

clustering: similar profiles, genetic clustering

dimension reduction

anomaly detection: fraud detection, fault detection

matrix factorization for collaborative filtering

semi-supervised learning, active learning

given a sentence which contains a microRNA and a gene, determine whether the microRNA is related to the gene

relation extraction classifier

positive example

We report here the involvement of miR-146a and miR-146b-5p that bind to the same site in the 3'UTR of BRCA1 and down-regulate its expression as demonstrated using reporter assays. PubMed #21472990

BRCA1

mir-146a

mir-146b

http://www.ncbi.nlm.nih.gov/pubmed/21472990/

non-positive example

"The biological effects of miR-132 were assessed in CRC cell lines using the transwell assay" PubMed #24914372

http://www.ncbi.nlm.nih.gov/pubmed/24914372

training data → feature extraction → ML model

break into sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

distant supervision / positive unlabeled learning

unstructured: sentences

Up-regulation of mirna-1245 targets BRCA2

mirna-342 regulates BRCA1 expression

…

We didn’t find correlation between mirna-200 and BRCA1

We tested for mirna-100 and BRCA1

structured: known relations db

mirna-1245 BRCA2

mirna-342 BRCA1

…

distant supervision

Up-regulation of mirna-1245 targets BRCA2: POSITIVE

mirna-342 regulates BRCA1 expression: POSITIVE

…

We didn’t find correlation between mirna-200 and BRCA1: NON_POSITIVE

We assessed mirna-100 and BRCA1: NON_POSITIVE
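The labeling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the regular expressions stand in for a real entity extractor (the talk names GNAT), and `KNOWN_RELATIONS`, `MIRNA_RE`, `GENE_RE` and `label_sentence` are hypothetical names introduced here.

```python
import re

# Hypothetical "known relations db": (miRNA, gene) pairs taken from the slide.
KNOWN_RELATIONS = {("mirna-1245", "BRCA2"), ("mirna-342", "BRCA1")}

# Toy entity patterns (assumption); real entity extraction is far more involved.
MIRNA_RE = re.compile(r"\bmirna-\d+\b", re.IGNORECASE)
GENE_RE = re.compile(r"\bBRCA[12]\b")


def label_sentence(sentence, known=KNOWN_RELATIONS):
    """Label a sentence POSITIVE if it mentions a (miRNA, gene) pair from the db."""
    mirnas = [m.lower() for m in MIRNA_RE.findall(sentence)]
    genes = GENE_RE.findall(sentence)
    for mirna in mirnas:
        for gene in genes:
            if (mirna, gene) in known:
                return "POSITIVE"
    # Not in the db: unlabeled rather than proven negative, hence NON_POSITIVE.
    return "NON_POSITIVE"


sentences = [
    "Up-regulation of mirna-1245 targets BRCA2",
    "mirna-342 regulates BRCA1 expression",
    "We didn't find correlation between mirna-200 and BRCA1",
    "We assessed mirna-100 and BRCA1",
]
labels = [label_sentence(s) for s in sentences]
```

The label is decided only by whether the pair appears in the database, not by what the sentence says, which is why the second class is called non-positive rather than negative.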

training data → feature extraction → ML model

distant supervision, entity replacement, tokenizing (nltk)

mirna-335 was found to regulate BRCA1

ENTITYM was found to regulate ENTITYG

high levels of expression of miRNA-335 and miRNA-342 were found together with low levels of BRCA1

high levels of expression of ENTITYM and OTHER_ENTITY were found together with low levels of ENTITYG

high levels of expression of OTHER_ENTITY and ENTITYM were found together with low levels of ENTITYG
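A possible way to produce these masked sentences, one per (miRNA, gene) pair, is sketched below. It assumes miRNA mentions follow the `miRNA-<number>` pattern used on the slide; `MIRNA_RE` and `replace_entities` are hypothetical names, and only non-target miRNAs are mapped to OTHER_ENTITY because that is the only case the slide shows.

```python
import re

# Toy miRNA pattern (assumption): matches mentions such as "miRNA-335" or "mirna-335".
MIRNA_RE = re.compile(r"\bmiRNA-\d+\b", re.IGNORECASE)


def replace_entities(sentence, target_mirna, target_gene):
    """Mask the target pair as ENTITYM / ENTITYG and any other miRNA as OTHER_ENTITY."""

    def mask_mirna(match):
        if match.group(0).lower() == target_mirna.lower():
            return "ENTITYM"
        return "OTHER_ENTITY"

    masked = MIRNA_RE.sub(mask_mirna, sentence)
    return re.sub(r"\b%s\b" % re.escape(target_gene), "ENTITYG", masked)


sentence = (
    "high levels of expression of miRNA-335 and miRNA-342 "
    "were found together with low levels of BRCA1"
)
# One training sample per candidate (miRNA, gene) pair in the sentence.
sample_335 = replace_entities(sentence, "miRNA-335", "BRCA1")
sample_342 = replace_entities(sentence, "miRNA-342", "BRCA1")
```

Masking the entity names keeps the classifier from memorizing specific miRNAs or genes and pushes it toward the wording that connects them.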

training data → feature extraction → ML model

distant supervision, entity replacement, tokenizing (nltk), trimming

mirna-335 was found to regulate BRCA1

ENTITY1 was found to regulate ENTITY2

We report here the involvement of ENTITYM that bind to the same site in the ENTITYG and down-regulate its expression as demonstrated using reporter assays.

cleaning, stemming, & normalizing

training data → feature extraction → ML model

distant supervision, entity replacement, tokenizing (nltk), trimming, cleaning, stemming, & normalizing

bag-of-words (scikit-learn): 1-gram, 2-gram, 3-gram, …

syntactic-based (spacy.io): part-of-speech tagging, dependency parse tree

word embedding (doc2vec, gensim): word vector representation

king – man + woman = queen

paris – france + spain = madrid
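The word-vector arithmetic can be illustrated with hand-made vectors. Everything in this sketch is an assumption made for illustration: the two dimensions (royalty, gender) and their values are invented, whereas real embeddings such as doc2vec are learned from text and have many more dimensions.

```python
import math

# Hypothetical 2-d toy vectors: (royalty, gender). Not learned from any corpus.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.5, 1.0),
    "princess": (0.5, -1.0),
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
```

With these toy values `result` is "queen", because subtracting "man" and adding "woman" flips only the gender coordinate.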

ENTITYM was found to regulate ENTITYG

[1, 1, 1, 1, 1]
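A bag-of-words vector like the one above can be built with the standard library alone. This is a rough sketch, not the scikit-learn pipeline named in the talk: `ngrams`, `bag_of_ngrams` and `vectorize` are hypothetical helpers, and since the slide does not say which tokens were kept, this version keeps every token and so yields one more entry than the slide's vector.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(sentence, n_values=(1, 2, 3)):
    """Count every 1-, 2- and 3-gram (by default) in a lowercased sentence."""
    tokens = sentence.lower().split()
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts


def vectorize(counts, vocabulary):
    """Turn n-gram counts into a fixed-order vector over the given vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


sample = "ENTITYM was found to regulate ENTITYG"
unigram_counts = bag_of_ngrams(sample, n_values=(1,))
vocabulary = sorted(unigram_counts)
vector = vectorize(unigram_counts, vocabulary)
```

In practice the vocabulary is built from the whole training set, so each sentence vector is mostly zeros with counts only at the n-grams it contains.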


split into 75% training, 25% evaluation
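One simple way to make such a split is shown below. The talk does not say whether the data was shuffled or stratified by label, so the seeded shuffle is an assumption, and `train_eval_split` and the toy `samples` list are hypothetical.

```python
import random


def train_eval_split(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of the samples and cut it into training and evaluation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy labeled samples standing in for (masked sentence, label) pairs.
samples = [
    ("sentence %d" % i, "POSITIVE" if i % 2 == 0 else "NON_POSITIVE")
    for i in range(8)
]
train, evaluation = train_eval_split(samples)
```

The evaluation part is held back from training so that the reported score reflects sentences the model has not seen.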

F1 score evaluation for all feature combinations

$$\text{precision} = \frac{\text{truePositives}}{\text{predictedPositives}}$$

$$\text{recall} = \frac{\text{truePositives}}{\text{allPositives}}$$
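The precision and recall definitions above, together with the standard F1 score (the harmonic mean of the two), can be computed as in the sketch below. The function name and the toy label lists are hypothetical; the POSITIVE and NON_POSITIVE labels follow the slides.

```python
def precision_recall_f1(y_true, y_pred, positive="POSITIVE"):
    """Compute precision, recall and F1 for the positive class."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    predicted_positives = sum(1 for p in y_pred if p == positive)
    all_positives = sum(1 for t in y_true if t == positive)

    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / all_positives if all_positives else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example: one of two predicted positives is correct, one of two real positives is found.
y_true = ["POSITIVE", "POSITIVE", "NON_POSITIVE", "NON_POSITIVE"]
y_pred = ["POSITIVE", "NON_POSITIVE", "POSITIVE", "NON_POSITIVE"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

F1 is a reasonable single number here because it drops sharply when either precision or recall is low.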

features                            f1-score
BOW 1-3 grams                       0.87
BOW 1-3 grams + POS Tags 3-gram     0.87
BOW 1-3 grams + Doc2Vec             0.87
BOW 1-gram                          0.8
BOW 2-gram                          0.85
BOW 3-gram                          0.83
Doc2Vec                             0.65
POS Tags 3-gram                     0.62

final results

build on others: research academic work

try a simple approach first, before Deep Learning

sharing is caring

CatalystCode/corpus-to-graph-pipeline

CatalystCode/corpus-to-graph-ml

https://aka.ms/dxdevblog

from information to intelligence

image: Social_Network_Analysis_Visualization

https://commons.wikimedia.org/wiki/File:Social_Network_Analysis_Visualization.png

questions?

thanks ;)
