Early detection of cancer using NLP / Limor Lahiani
Uploaded by geektimecoil
Early Cancer Diagnosis: using NLP to analyze biomedical literature
image: inside.miroculus.com
@LimorLah
https://il.linkedin.com/in/limorl
catalystcode
http://limorl.com
Limor Lahiani
SDE Manager, DX Partner Catalyst, Microsoft
Partner.
Generalize.
Share.
image: miroculus.com/open
DISEASE DECODED
a simple blood test to detect disease at the molecular level
image: miroculus.com/introducing-loom
mir-1245a
mir-146a
mir-17
mir-210
mir-24-2
image credit: miroculus.com
BRCA2
which microRNAs are related to which genes?
miRNA
genes
diseases
[pipeline diagram labels: scheduler, delta querying, doc processing, entity extraction, relation classifier, classifying relations, corpus, graph API]
Relation Extraction (corpus-to-graph)
domain-specific generic
microRNA-gene relation classifier
Machine Learning 101
designing algorithms for inferring unknowns from knowns
supervised learning
Given known labeled data, find a function that maps inputs to labels
unsupervised learning
Given unlabeled data, find patterns or explain key features in the data
classification: spam detection, handwriting, …
regression: stock prediction, demand forecasting, …
clustering: similar profiles, genetic clustering
dimension reduction
anomaly detection: fraud detection, fault detection
matrix factorization for collaborative filtering
semi-supervised learning
active learning
given a sentence which contains a microRNA M and a gene G, determine whether M is related to G
relation extraction classifier
positive example
We report here the involvement of miR-146a and miR-146b-5p that bind to the same site in the 3'UTR of BRCA1 and down-regulate its expression as demonstrated using reporter assays. PubMed #21472990
BRCA1
mir-146a
mir-146b
non-positive example
"The biological effects of miR-132 were assessed in CRC cell lines using the transwell assay" PubMed #24914372
training data
feature extraction
ML model
break to sentences (TextBlob)
extract entities (GNAT)
positive + non-positive samples
distant supervision
distant supervision / positive unlabeled learning
unstructured: sentences
Up-regulation of mirna-1245 targets BRCA2
mirna-342 regulates BRCA1 expression
…
We didn’t find correlation between mirna-200 and BRCA1
We tested for mirna-100 and BRCA1

structured: known relations db
mirna-1245   BRCA2
mirna-342    BRCA1
…            …

distant supervision
Up-regulation of mirna-1245 targets BRCA2 | POSITIVE
mirna-342 regulates BRCA1 expression | POSITIVE
… | …
We didn’t find correlation between mirna-200 and BRCA1 | NON_POSITIVE
We assessed mirna-100 and BRCA1 | NON_POSITIVE
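The labeling step above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the two regular expressions are hypothetical stand-ins for a real entity extractor such as GNAT, and `KNOWN_RELATIONS` is a toy copy of the two relations shown on the slide. A sentence is labeled POSITIVE for a (microRNA, gene) pair only if that pair appears in the known relations db; everything else is NON_POSITIVE.

```python
import re

# Toy "known relations db" copied from the slide; a real one would be much larger.
KNOWN_RELATIONS = {("mirna-1245", "brca2"), ("mirna-342", "brca1")}

# Hypothetical patterns standing in for a proper entity extractor (assumption).
MIRNA_PATTERN = re.compile(r"\bmirna-\d+\b", re.IGNORECASE)
GENE_PATTERN = re.compile(r"\bBRCA\d\b")


def label_sentence(sentence, known_relations):
    """Return (mirna, gene, label) for every microRNA-gene pair in the sentence."""
    mirnas = [m.lower() for m in MIRNA_PATTERN.findall(sentence)]
    genes = [g.lower() for g in GENE_PATTERN.findall(sentence)]
    labeled = []
    for mirna in mirnas:
        for gene in genes:
            # Distant supervision: the label comes from the db, not from reading the sentence.
            label = "POSITIVE" if (mirna, gene) in known_relations else "NON_POSITIVE"
            labeled.append((mirna, gene, label))
    return labeled


if __name__ == "__main__":
    for text in [
        "Up-regulation of mirna-1245 targets BRCA2",
        "We assessed mirna-100 and BRCA1",
    ]:
        print(text, label_sentence(text, KNOWN_RELATIONS))
```

The label says nothing about the wording of the sentence itself, which is why the slide uses NON_POSITIVE rather than NEGATIVE: an unlisted pair may still be a true relation that the db does not know about.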
entity replacement
tokenizing (nltk)
mirna-335 was found to regulate BRCA1
ENTITYM was found to regulate ENTITYG
high levels of expression of miRNA-335 and miRNA-342 were found together with low levels of BRCA1
high levels of expression of ENTITYM and OTHER_ENTITY were found together with low levels of ENTITYG
high levels of expression of OTHER_ENTITY and ENTITYM were found together with low levels of ENTITYG
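A minimal sketch of the entity replacement shown above, assuming the entity mentions have already been found by an extractor. The function name `replace_entities` and its parameters are hypothetical, chosen for illustration only. The target pair becomes ENTITYM / ENTITYG and every other mention becomes OTHER_ENTITY, so one sentence yields one sample per candidate pair.

```python
import re


def replace_entities(sentence, mentions, target_mirna, target_gene):
    """Replace the target pair with ENTITYM / ENTITYG and other mentions with OTHER_ENTITY."""
    if not mentions:
        return sentence

    def placeholder(match):
        text = match.group(0).lower()
        if text == target_mirna.lower():
            return "ENTITYM"
        if text == target_gene.lower():
            return "ENTITYG"
        return "OTHER_ENTITY"

    # Longest mentions first, so a short name never matches inside a longer one.
    alternatives = "|".join(
        re.escape(m) for m in sorted(mentions, key=len, reverse=True)
    )
    # The lookarounds keep matches from starting or ending inside a longer token.
    pattern = re.compile(r"(?<![\w-])(?:" + alternatives + r")(?![\w-])", re.IGNORECASE)
    return pattern.sub(placeholder, sentence)


if __name__ == "__main__":
    sentence = ("high levels of expression of miRNA-335 and miRNA-342 "
                "were found together with low levels of BRCA1")
    mentions = ["miRNA-335", "miRNA-342", "BRCA1"]
    for mirna in ["miRNA-335", "miRNA-342"]:
        print(replace_entities(sentence, mentions, mirna, "BRCA1"))
```

Replacing names with placeholders keeps the classifier from memorizing specific microRNAs or genes, so it has to learn from the wording around them.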
trimming
mirna-335 was found to regulate BRCA1
ENTITY1 was found to regulate ENTITY2
We report here the involvement of ENTITYM that bind to the same site in the ENTITYG and down-regulate its expression as demonstrated using reporter assays.
cleaning, stemming, & normalizing
bag-of-words (scikit-learn): 1-gram, 2-gram, 3-gram, …
syntactic-based (spacy.io): part-of-speech tagging, dependency parse tree
word embedding (doc2vec, gensim): word vector representation
king – man + woman = queen
paris – france + spain = madrid
ENTITYM was found to regulate ENTITYG
[1, 1, 1, 1, 1]
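To make the bag-of-words idea concrete, here is a small standard-library sketch of binary n-gram features. The slides name scikit-learn for this step; the helper names below (`ngrams`, `build_vocabulary`, `vectorize`) are hypothetical, and whitespace tokenization is a simplifying assumption. Each position in the vector says whether one n-gram from the vocabulary occurs in the sentence.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(sentences, n_values=(1, 2, 3)):
    """Map every n-gram seen in the training sentences to a column index."""
    vocabulary = {}
    for sentence in sentences:
        tokens = sentence.split()
        for n in n_values:
            for gram in ngrams(tokens, n):
                vocabulary.setdefault(gram, len(vocabulary))
    return vocabulary


def vectorize(sentence, vocabulary, n_values=(1, 2, 3)):
    """Binary bag-of-words: 1 if the n-gram occurs in the sentence, else 0."""
    vector = [0] * len(vocabulary)
    tokens = sentence.split()
    for n in n_values:
        for gram in ngrams(tokens, n):
            if gram in vocabulary:
                vector[vocabulary[gram]] = 1
    return vector


if __name__ == "__main__":
    train = ["ENTITYM was found to regulate ENTITYG"]
    vocab = build_vocabulary(train, n_values=(2,))
    print(vectorize("ENTITYM was found to regulate ENTITYG", vocab, n_values=(2,)))
    print(vectorize("ENTITYM was found to inhibit ENTITYG", vocab, n_values=(2,)))
```

N-grams that never appeared in the training sentences are simply ignored at prediction time, which is the usual behavior for a fixed vocabulary.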
split to 75% training, 25% evaluation
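A 75/25 split can be sketched with the standard library alone. The function name `train_eval_split` and the fixed seed are assumptions made for illustration; the slides do not say how the split was done or whether it was shuffled.

```python
import random


def train_eval_split(samples, train_fraction=0.75, seed=0):
    """Shuffle the samples reproducibly and cut them into training and evaluation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, evaluation = train_eval_split(range(100))
    print(len(train), len(evaluation))
```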
F1 score evaluation for all feature combinations
$$\text{precision} = \frac{\text{truePositives}}{\text{predictedPositives}} \qquad \text{recall} = \frac{\text{truePositives}}{\text{allPositives}}$$
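The F1 score is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall). A short sketch of all three metrics follows; the function name and the toy labels in the usage example are illustrative only and are not results from the talk.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="POSITIVE"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_positives = sum(1 for _, p in pairs if p == positive)
    all_positives = sum(1 for t, _ in pairs if t == positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / all_positives if all_positives else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = ["POSITIVE", "POSITIVE", "POSITIVE", "NON_POSITIVE", "NON_POSITIVE"]
    guess = ["POSITIVE", "POSITIVE", "NON_POSITIVE", "POSITIVE", "NON_POSITIVE"]
    print(precision_recall_f1(truth, guess))
```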
features                           f1-score
BOW 1-3 grams                      0.87
BOW 1-3 grams + POS Tags 3-gram    0.87
BOW 1-3 grams + Doc2Vec            0.87
BOW 1-gram                         0.8
BOW 2-gram                         0.85
BOW 3-gram                         0.83
Doc2Vec                            0.65
POS Tags 3-gram                    0.62
final results
build on others: research academic work
try a simple approach first, before Deep Learning
sharing is caring
CatalystCode/corpus-to-graph-pipeline
CatalystCode/corpus-to-graph-ml
https://aka.ms/dxdevblog
from information to intelligence
image: Social_Network_Analysis_Visualization
questions?
thanks ;)