Biomedical Relation Extraction for Knowledge Graph Completion

Post on 15-Apr-2017

Bio-RE for KG completion

Gotta catch’em all!™

Claudiu Mihăilă

Because it matters

Drug development process

Signalling pathways

Scientia potentia est

• ∼26M biomedical articles indexed

• ∼3500 articles per day in 2015

• More information than any one person can comprehend

Structured databases

• High number of DBs

• Manually curated by experts

but

• Long backlog

• Limited coverage, reflecting bias of the curators

• Limited linkage to literature evidence

ML Objective

Resistance to IL-10 inhibition of interferon gamma production and expression of suppressor of cytokine signalling 1 in CD4+ T cells from patients with rheumatoid arthritis.

Supervised ML approaches

BioNLP-ST’16 – Task GE4 – NFκB KB construction

• 2-stage classification with SVM/LR + post-processing

• RNNs with LSTM units on words, PoSs, dependencies

System  P     R     F1
Evex    0.47  0.32  0.38
TEES    0.45  0.33  0.38
VERSE   0.60  0.23  0.33
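As a quick sanity check on the scores above, F1 is the harmonic mean of precision (P) and recall (R). The sketch below recomputes it from the reported P and R values; the function and variable names are illustrative and not taken from the shared-task evaluation tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
reported = {
    "Evex": (0.47, 0.32),
    "TEES": (0.45, 0.33),
    "VERSE": (0.60, 0.23),
}

# Recompute F1 for each system, rounded to two decimals like the slide.
f1_by_system = {
    name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()
}

for name, f1 in f1_by_system.items():
    print(f"{name}: F1 = {f1:.2f}")
```

The recomputed values agree with the reported F1 column: VERSE trades recall for precision and ends up with the lowest F1 of the three.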

Supervised ML

• High confidence annotations

• Literature evidence marked up

but

• Limited coverage/high bias

• Limited number of corpora

• Expensive and slow to produce training data

Distant Supervision: Combining DBs and Supervised ML

Align relation candidates extracted from text with known relationships, and use a structured database as the distant supervision training signal

• no bias to specific genre

• cheap and fast to produce fairly large training sets

Distant Supervision Example

PubMed

Mutations in the gene encoding the TAR DNA-binding protein 43 have been identified in some familial amyotrophic lateral sclerosis (ALS).

A novel missense mutation in a highly conserved region of TDP-43 was identified in a patient with sporadic ALS.

We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.

DB: (IS_ASSOCIATED_WITH, TDP43, ALS)
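The alignment in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system behind the slides: the alias dictionary is hypothetical and hand-written from the three sentences, and plain string matching stands in for real entity recognition and resolution.

```python
import re

# Known fact from the structured database (from the slide).
KNOWN_FACTS = {("IS_ASSOCIATED_WITH", "TDP43", "ALS")}

# Hypothetical alias dictionary; a real system would rely on
# entity recognition and resolution instead of a fixed list.
ALIASES = {
    "TDP43": ["TAR DNA-binding protein 43", "TDP-43", "TARDBP"],
    "ALS": ["amyotrophic lateral sclerosis", "ALS"],
}

SENTENCES = [
    "Mutations in the gene encoding the TAR DNA-binding protein 43 have "
    "been identified in some familial amyotrophic lateral sclerosis (ALS).",
    "A novel missense mutation in a highly conserved region of TDP-43 was "
    "identified in a patient with sporadic ALS.",
    "We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.",
]


def mentions(entity: str, sentence: str) -> bool:
    """True if any alias of the entity occurs as a whole token sequence."""
    for alias in ALIASES[entity]:
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, sentence):
            return True
    return False


def label_sentences(sentences, facts):
    """Label a sentence with a DB relation when it mentions both arguments."""
    examples = []
    for sentence in sentences:
        for relation, head, tail in sorted(facts):
            if mentions(head, sentence) and mentions(tail, sentence):
                examples.append((sentence, head, tail, relation))
    return examples


training_examples = label_sentences(SENTENCES, KNOWN_FACTS)
for sentence, head, tail, relation in training_examples:
    print(f"({relation}, {head}, {tail}) <- {sentence[:50]}...")
```

All three sentences become positive training examples because each mentions both arguments. Note that this heuristic labels every co-mention as positive, including the screening sentence, which is a known source of label noise in distant supervision.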

Famous work

DeepDive (Stanford/Lattice)

• Gene-gene interaction from PLOS biomedical journals

• Uses BioGRID for distant supervision

Literome

• Protein regulation event extraction from PubMed abstracts

• Uses the Pathway Interaction Database for distant supervision

Information Extraction Pipeline

PM, PMC

Entity Recognition

Entity Resolution

Syntactic Parsing

OpenIE Relations

Knowledge graph
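To make the flow of stages concrete, here is a toy composition in Python. Everything in it is an assumption for illustration: the lexicon is hypothetical, dictionary lookup stands in for entity recognition and resolution, and sentence-level co-occurrence (with an invented `CO_OCCURS_WITH` label) stands in for syntactic parsing and OpenIE, which are deliberately skipped.

```python
from collections import defaultdict

# Hypothetical lexicon: surface form -> (entity type, canonical id).
LEXICON = {
    "TDP-43": ("GENE", "TDP43"),
    "TARDBP": ("GENE", "TDP43"),
    "ALS": ("DISEASE", "ALS"),
}


def recognise_entities(sentence):
    """Entity recognition stand-in: lexicon surface forms found in the text."""
    tokens = sentence.replace(".", " ").replace(",", " ").split()
    return [tok for tok in tokens if tok in LEXICON]


def resolve_entities(surface_forms):
    """Entity resolution stand-in: map surface forms to canonical entries."""
    return [LEXICON[form] for form in surface_forms]


def extract_relations(entities):
    """Relation stand-in: pair every gene with every disease in a sentence.

    Real syntactic parsing and OpenIE are not performed here.
    """
    genes = {eid for etype, eid in entities if etype == "GENE"}
    diseases = {eid for etype, eid in entities if etype == "DISEASE"}
    return [(g, "CO_OCCURS_WITH", d) for g in sorted(genes) for d in sorted(diseases)]


def build_graph(sentences):
    """Run each sentence through the stages and collect edges per head node."""
    graph = defaultdict(set)
    for sentence in sentences:
        entities = resolve_entities(recognise_entities(sentence))
        for head, relation, tail in extract_relations(entities):
            graph[head].add((relation, tail))
    return graph


graph = build_graph([
    "A novel missense mutation in a highly conserved region of TDP-43 was "
    "identified in a patient with sporadic ALS.",
    "We screened the TARDBP mutation in 721 Japanese ALS by direct sequencing.",
])
print(dict(graph))
```

Because both gene mentions resolve to the same canonical identifier, the two sentences collapse into a single edge in the graph, which is the point of running entity resolution before relation extraction.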

Distant supervision Pipeline

Bio DBs

PM, PMC

Knowledge Graph

Distant Supervision

Extracted Relations

Curation

Information Extraction

Curation

Examples of learned relations

(IS_ASSOCIATED_WITH, CAH, WNK1)

. . . mineralocorticoid excess can be caused by congenital adrenal hyperplasia (CAH) . . .due to mutations in the WNK1, WNK4, KLHL3, CUL3 genes.

PM:22932914

(IS_ASSOCIATED_WITH, Brachydactyly, CHSY1)

Our results place Chsy1 as an essential regulator of joint patterning and provide a mouse model of human brachydactylies caused by mutations in CHSY1.

PM:22280990

(IS_ASSOCIATED_WITH, Cushing Syndrome, KCNJ5)

. . . these mutations, in addition to mutations in the KCNJ5 gene . . ., may be responsible for the tumorigenesis of APAs and CPAs with subclinical Cushing’s syndrome.

PM:26743443

Challenges

• Error propagation from upstream tasks

• Cross-sentence relations

• Long tails - overfitting to the more common entity/entity pairs

• Speculation, negation, changes over time, conflicting information

Thank you for your attention! Any questions?