Generating Biomedical Hypotheses Using Semantic Web Technologies

46
Generating (and Testing) Biomedical Hypotheses Using Semantic Web Technologies 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University @micheldumontier::AAAI-FSW:Nov 16, 2013

description

With its focus on investigating the nature and basis for the sustained existence of living systems, modern biology has always been a fertile, if not challenging, domain for formal knowledge representation and automated reasoning. Over the past 15 years, hundreds of projects have developed or leveraged ontologies for entity recognition and relation extraction, semantic annotation, data integration, query answering, consistency checking, association mining and other forms of knowledge discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, discover significant biological associations across these data using a set of partially overlapping ontologies, and identify new avenues for drug discovery by applying measures of semantic similarity over phenotypic descriptions. As the portfolio of Semantic Web technologies continue to mature in terms of functionality, scalability and an understanding of how to maximize their value, increasing numbers of biomedical researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of the capability and behavior of biological systems.

Transcript of Generating Biomedical Hypotheses Using Semantic Web Technologies

Page 1: Generating Biomedical Hypotheses Using Semantic Web Technologies

1

Generating (and Testing) Biomedical Hypotheses Using Semantic Web Technologies

Michel Dumontier, Ph.D.

Associate Professor of Medicine (Biomedical Informatics)Stanford University

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 2: Generating Biomedical Hypotheses Using Semantic Web Technologies

2 @micheldumontier::AAAI-FSW:Nov 16, 2013

Page 3: Generating Biomedical Hypotheses Using Semantic Web Technologies

Science, it works ******

@micheldumontier::AAAI-FSW:Nov 16, 20133

Page 4: Generating Biomedical Hypotheses Using Semantic Web Technologies

We use the biomedical literature to find out what we believe to be true

4 @micheldumontier::AAAI-FSW:Nov 16, 2013

Page 5: Generating Biomedical Hypotheses Using Semantic Web Technologies

5

we need to access to the most recent biomedical data (problems: access, format, identifiers & linking)

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 6: Generating Biomedical Hypotheses Using Semantic Web Technologies

6

we also need access to the most effective software to analyze, predict and evaluate

(problems: OS, versioning, input/output formats)

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 7: Generating Biomedical Hypotheses Using Semantic Web Technologies

7

we build support for our hypotheses by constructing and (partly) reusing fairly sophisticated workflows

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 8: Generating Biomedical Hypotheses Using Semantic Web Technologies

8

Wouldn’t it be great if we could just find the evidence required to support or dispute a scientific hypothesis using the most up-to-date and relevant data, tools and scientific

knowledge?

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 9: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 20139

madness!

1. Build a massive network of interconnected data and software using web standards

2. Develop methods to generate and test hypotheses across these data

1. tactical formalization2. uncovering associations3. evidence gathering

3. Contribute back to the global knowledge graph

Page 10: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201310

The Semantic Web is the new global web of knowledge

standards for publishing, sharing and querying facts, expert knowledge and services

scalable approach for the discoveryof independently formulated

and distributed knowledge

Page 11: Generating Biomedical Hypotheses Using Semantic Web Technologies

11

we’re building an incredible network of linked data

@micheldumontier::AAAI-FSW:Nov 16, 2013Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch

Page 12: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201312

Bio2RDF Linked Data Map

Bio2RDF Release 2: Improved coverage, interoperability and provenance of Life Science Linked Data . ESWC 2013.

Page 13: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201313

At the heart of Linked Data for the Life Sciences

• Free and open source• Uses Semantic Web standards• Release 2 (Jan 2013): 1B+ interlinked statements from 19 conventional and high

value datasets• Includes provenance, statistics• Partnerships with EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool

providers

chemicals/drugs/formulations, genomes/genes/proteins, domainsInteractions, complexes & pathwaysanimal models and phenotypesDisease, genetic markers, treatmentsTerminologies & publications

A Callahan, J Cruz-Toledo, M Dumontier. Querying Bio2RDF Linked Open Data with a Global Schema. 2013. Journal of Biomedical Semantics. 4(Suppl 1):S1.

Page 14: Generating Biomedical Hypotheses Using Semantic Web Technologies

Dumontier::FDA:Nov 15,201314

Resource Description Framework

• It’s a language to represent knowledge– Logic-based formalism -> automated reasoning– graph-like properties -> data analysis

• Good for– Describing in terms of type, attributes, relations– Integrating data from different sources– Sharing the data (W3C standard)– Reuse what is available, develop what you’re

missing.

Page 15: Generating Biomedical Hypotheses Using Semantic Web Technologies

Dumontier::FDA:Nov 15,201315

drugbank:DB00586

drugbank_vocabulary:Drug

rdf:type

drugbank_target:290

drugbank_vocabulary:Target

rdf:type

drugbank_vocabulary:targets

rdfs:label

Prostaglandin G/H synthase 2 [drugbank_target:290]

rdfs:label

Diclofenac [drugbank:DB00586]

Page 16: Generating Biomedical Hypotheses Using Semantic Web Technologies

Dumontier::FDA:Nov 15,201316

The linked data network expands with every reference

drugbank:DB00586

pharmgkb_vocabulary:Drug

rdf:type

rdfs:labeldiclofenac [drugbank:DB00586]

pharmgkb:PA449293

drugbank_vocabulary:Drug

pharmgkb_vocabulary:xref

diclofenac [pharmgkb:PA449293]rdfs:label

DrugBank

PharmGKB

Page 17: Generating Biomedical Hypotheses Using Semantic Web Technologies

17

Federated Queries over Independent SPARQL EndPoints

Get all protein catabolic processes (and more specific) from biomodels

SELECT ?go ?label count(distinct ?x) WHERE { service <http://bioportal.bio2rdf.org/sparql> { ?go rdfs:label ?label . ?go rdfs:subClassOf ?tgo ?tgo rdfs:label ?tlabel . FILTER regex(?tlabel, "^protein catabolic process") } service <http://biomodels.bio2rdf.org/sparql> { ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go . ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> . }}

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 18: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201318

Bio2RDF: 1M+ SPARQL queries per monthUniProt: 6.4M queries per month

EBI: 3.5M queries (Oct 2013)

Page 19: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201319

Directions

• research– approaches for data publication – methods for large scale data reduction– methods for mining associations

• service– coverage, quality and licensing– better user interfaces– sustainability of service

Page 20: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201320

tactical formalization

a. discovery of drug and disease pathway associations

b. prediction of drug targets using phenotypes

Page 21: Generating Biomedical Hypotheses Using Semantic Web Technologies

21

ontology as a strategy to formally

represent and integrate knowledge

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 22: Generating Biomedical Hypotheses Using Semantic Web Technologies

22

Have you heard of OWL?

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 23: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201323

Identification of disease enriched pathways

def: A biological pathway is constituted by a set of molecular components that undertake some biological transformation to achieve a stated objective

glycolysis : a pathway that converts glucose to pyruvate

Page 24: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201324

aberrant pathways

which pathways are associated with disease?Which pathways can be perturbed by drugs?

drug

pathway

disease

Page 25: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201325

Identification of drug and disease enriched pathways

• Approach– Integrate 3 datasets: DrugBank, PharmGKB and

CTD– Integrate 7 terminologies: MeSH, ATC, ChEBI,

UMLS, SNOMED, ICD, DO– Identify significant pathways via enrichment

analysis over the fully inferred knowledge base• Formalized as an OWL-EL ontology

– 650,000+ classes, 3.2M subClassOf axioms, 75,000 equivalentClass axioms

Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012.

Page 26: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201326

drug

disease

drugpathway

mercaptopurine [pharmgkb:PA450379]

mercaptopurine [drugbank:DB01033]

purine-6-thiol[CHEBI:2208]

Class Equivalence

mercaptopurine[ATC:L01BB02]

Top Level Classes(disjointness) drug diseasegenepathway

property chains

Class subsumption

mercaptopurine [mesh:D015122]

gene

Reciprocal Existentials

Page 27: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201327

Benefits: Enhanced Query Capability

– Use any mapped terminology to query a target resource.– Use knowledge in target ontologies to formulate more

precise questions• ask for drugs that are associated with diseases of the joint:

‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124).

– Learn relationships that are inferred by automated reasoning.

• alcohol (ChEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236)

• ‘parasitic infectious disease’ (do:0001398) retrieves 129 disease associated drugs, 15 more than are directly associated.

Page 28: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201328

Knowledge Discovery through Data Integration and Enrichment Analysis

• OntoFunc: Tool to discover significant associations between sets of objects and ontology categories. enrichment of attribute among a selected set of input items as compared to a reference set. hypergeometric or the binomial distribution, Fisher's exact test, or a chi-square test.

• We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease.– Mood disorder (do:3324) associated with Zidovudine Pathway

(pharmgkb:PA165859361). Zidovudine is for treating HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia

• We found 13,826 pathway-chemical associations– Clopidogrel (chebi:37941) associated with Endothelin signaling

pathway (pharmgkb:PA164728163). Endothelins are proteins that constrict blood vessels and raise blood pressure. Clopidogrel inhibits platelet aggregation and prolongs bleeding time.

Page 29: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201329

PhenomeDrug

A computational approach to predict drug targets, drug effects, and drug indications using phenotypic information

Mouse model phenotypes provide information about human drug targets.Hoehndorf R, Hiebert T, Hardy NW, Schofield PN, Gkoutos GV, Dumontier M.Bioinformatics. 2013.

Page 30: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201330

animal models provide insight for on target effects

• In the majority of 100 best selling drugs ($148B in US alone), there is a direct correlation between knockout phenotype and drug effect

• Gastroesophageal reflux– Proton pump inhibitors (e.g Prilosec - $5B)– KO of alpha or beta unit of H/K ATPase are achlorhydric (normal pH) in

stomach compared to wild type (pH 3-5) when exposed to acidic solution

• Immunological Indications– Anti-histamines (Claritin, Allegra, Zyrtec)– KO of histamine H1 receptor leads to decreased responsiveness of

immune system– Predicts on target effects : drowsiness, reduced anxiety

Zambrowicz and Sands. Nat Rev Drug Disc. 2003.

Page 31: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201331

Identifying drug targets from mouse knock-out phenotypes

drug

gene

phenotypes effects

human gene

non-functional gene model

ortholog

similar

inhibits

Main idea: if a drug’s phenotypes matches the phenotypes of a null model, this suggests that the drug is an inhibitor of the gene

Page 32: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 2013

Terminological Interoperability(we must compare apples with apples)

Mouse Phenotypes

Drug effects(mappings from UMLS to DO, NBO, MP)

Mammalian Phenotype OntologyPhenomeNet

PhenomeDrug

Page 33: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201333

Semantic SimilarityGiven a drug effect profile D and a mouse model M, we compute the semantic similarity as an information weighted Jaccard metric.

The similarity measure used is non-symmetrical and determines the amount of information about a drug effect profile D that is covered by a set of mouse model phenotypes M.

Page 34: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 2013

null models do well in predicting targets of drug inhibitors

• 14,682 drugs; 7,255 mouse genotypes• Validation against known and predicted inhibitor-target pairs

– 0.739 ROC AUC for human targets (DrugBank)– 0.736 ROC AUC for mouse targets (STITCH)– 0.705 ROC AUC for human targets (STITCH)

• diclofenac (STITCH:000003032) – NSAID used to treat pain, osteoarthritis and rheumatoid arthritis– Drug effects include liver inflammation (hepatitis), swelling of liver

(hepatomegaly), redness of skin (erythema)– 49% explained by PPARg knockout

• peroxisome proliferator activated receptor gamma (PPARg) regulates metabolism, proliferation, inflammation and differentiation,

• Diclofenac is a known inhibitor

– 46% explained by COX-2 knockout • Diclofenac is a known inhibitor

Page 35: Generating Biomedical Hypotheses Using Semantic Web Technologies

Finding evidence to support or dispute a hypothesis takes some digging around

@micheldumontier::AAAI-FSW:Nov 16, 201335

Page 36: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201336

• Tyrosine Kinase Inhibitors (TKI)– Imatinib, Sorafenib, Sunitinib, Dasatinib, Nilotinib, Lapatinib– Used to treat cancer– Linked to cardiotoxicity.

• FDA launched drug safety program to detect toxicity – Need to integrate data and ontologies (Abernethy, CPT 2011)– Abernethy (2013) suggest using public data in genetics,

pharmacology, toxicology, systems biology, to predict/validate adverse events

• What evidence could we gather to give credence that TKI’s causes non-QT cardiotoxicity?

FDA Use Case: TKI non-QT Cardiotoxicity

Page 37: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201337

• The goal of HyQue is retrieve and evaluate evidence that supports/disputes a hypothesis– hypotheses are described as a set of events

• e.g. binding, inhibition, phenotypic effect

– events are associated with types of evidence • a query is written to retrieve data• a weight is assigned to provide significance

• Hypotheses are written by people who seek answers• data retrieval rules are written by people who know the

data and how it should be interpreted

HyQue

1. HyQue: Evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.2. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.

Page 38: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201338

HyQue: A Semantic Web Application

Software

OntologiesData

Hypothesis Evaluation

Page 39: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201339

SPIN functions to evaluate TKI Cardiotoxicity

• Effects [r3.sider, lit] • Does the drug have known cardiotoxic effects?

• Models [r2.mgi] • Are there cardiotoxic phenotypes in null mouse models of target genes?

• Assays [ebi.chembl] • Is hERG inhibited?

• Is the drug TUNEL positive?

• Targets [r2.drugbank, lit] • Does the drug inhibit cardiotoxic-related proteins?

• RAF1, PDFGFR, VEGFR, AMPK

Page 40: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201340

Page 41: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201341

http://bio2rdf.org/drugbank:DB01268

Page 42: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201342

Page 43: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201343

Page 44: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201344

In Summary

• This talk was about making sense of the structured data we already have

• RDF-based Linked Open Data acts as a substrate for query answering and task-based formalization in OWL

• Discovery through the generation of testable hypotheses in the target domain.

• The next big challenge lies in capturing the output of computational experiments just as much as extracting/formalizing user-contributed knowledge .

Page 45: Generating Biomedical Hypotheses Using Semantic Web Technologies

45

Acknowledgements

Bio2RDF Release 2: Allison Callahan, Jose Cruz-Toledo, Peter Ansell

Aberrant Pathways: Robert Hoehndorf, Georgios Gkoutos

PhenomeDrug: Tanya Hiebert, Robert Hoehndorf, Georgios Gkoutos, Paul Schofield

HyQue: Alison Callahan, Nigam Shah

@micheldumontier::AAAI-FSW:Nov 16, 2013

Page 46: Generating Biomedical Hypotheses Using Semantic Web Technologies

@micheldumontier::AAAI-FSW:Nov 16, 201346

[email protected]

Website: http://dumontierlab.com Presentations: http://slideshare.com/micheldumontier