Exploiting semantic networks of public data for systems chemical biology

Indiana University School of Informatics and Computing

David Wild, http://djwild.info

Assistant Professor and Director, Cheminformatics & Chemogenomics Research Group (CCRG), Indiana University School of Informatics and Computing

“Information is cheap. Understanding is expensive” (Karl Fast)

What do we mean by system?

For our purposes, the network of relationships of chemicals, drugs, targets, genes, expression profiles, pathways, (publications), diseases and side-effects in the body

What’s a semantic network?

A network of nodes and edges represented in RDF format, annotated with node / edge labels using an ontology (OWL)

Stored in an RDF triple store

Searchable using the SPARQL query language

What we have created at Indiana

A Semantic Linked Dataset called Chem2Bio2RDF that integrates multiple public experimental and literature-derived datasets relating to chemical compounds, drugs, targets, genes, expression profiles, pathways, diseases and drug side effects

A set of semantic algorithms and tools for visualizing, analyzing and predicting relationships in the semantic linked data and relating this data to publications

Semantic Technologies: an enabler for integration

Allows simple, flexible description of heterogeneous graphs of data relationships (RDF), optionally following the rules of an ontology (OWL)

Strengths:

Merging datasets and moving data between repositories is technically straightforward – dataset mappings are themselves described in RDF (and OWL). RDF and OWL are highly standardized and allow precise representation

Powerful cross-dataset searching with SPARQL

Increasing availability of powerful off-the-shelf searching and visualization tools (TopBraid, etc)

Allows application of graph theory algorithms to data

Can express data provenance in RDF

Weaknesses:

Just emerging from early adopters phase – received bad press in pharma as hyped too early

Triple stores historically less efficient than relational DBMSs (but rapidly changing)

Most focus has been on data and integration rather than algorithms to use the data

Difficulty weighting edges in a relational graph

Systems Chemical Biology and Semantic Technologies map quite nicely as they are both about complex networks (http://blog.project-sierra.de/archives/1639)

Current activity in Semantic Web & Drug Discovery

OpenPHACTS (www.openphacts.org): >€10m European project to create an “open pharmacological space” (OPS) using Triple Stores and Semantic Web technologies. Chem2Bio2RDF has been integrated into the OPS

SWHCLSIG (http://www.w3.org/wiki/HCLSIG): W3C special interest group, which has created BioRDF (RDF representations of biological data) and LODD (Linking Open Drug Data)

CSHALS (http://www.iscb.org/cshals2011): Conference on semantics in healthcare and the life sciences

Pistoia Alliance (http://www.pistoiaalliance.org/): Industry alliance for collaboration and integration of drug discovery data

JCI RDF in chemistry (http://www.jcheminf.com/series/acsrdf2010): Journal of Cheminformatics thematic series

Systems chemical biology + Semantic Web

Drug Discovery Today, 2012, in press.

Big Data in the public domain

There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters there is in the public domain information on:

~30 million compounds and ~500,000 bioassays (PubChem, ChemSpider)

~60 million compound bioactivities (PubChem Bioassay, ChEMBL, Matador, etc)

~5,000 drugs (DrugBank)

~9 million protein sequences (SwissProt) and ~60,000 3D structures (PDB)

~14 million human nucleotide sequences (EMBL)

~20 million life science publications (PubMed)

Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …)

Chem2Bio2RDF – www.chem2bio2rdf.org

Semantically integrates 42 heterogeneous public datasets related to drug discovery in a fast Virtuoso triple-store with SPARQL endpoint (linked from main site)

Datasets cover chemistry, chemogenomics, biology, systems & pathways, pharmacology, phenotypes, toxicology, glycomics and publications, and biological entities of compounds, drugs, targets, genes, pathways, diseases and side-effects

Major datasets include PubChem, ChEMBL, DrugBank, PharmGKB, BindingDB, STITCH, CTD, KEGG, SWISSPROT, PDB, SIDER, PubMed. Full set at http://chem2bio2rdf.wikispaces.com/Datasets

Holds data on ~31m chemical structures, ~5,000 marketed drugs, ~59m bioactivity data points and ~19m publications

Linked into LOD cloud, and may form part of OpenPHACTS repository

Permits SPARQL searching using Chem2Bio2OWL ontology. For more information, see BMC Bioinformatics 2010, 11, 255.

Representing inter-dataset relationships in RDF

RDF describes noun-verb-noun relationships in many formats

CETRORELIX is_active_against HCGR

<CETRORELIX> <is_active_against> <HCGR>

Many types of relationship can be described, including heterogeneous ones. Power is increased using URIs and ontologies (OWL)

A URI gives a unique identifier for a noun or verb clause, e.g. http://chem2bio2rdf.org/drugbank/resource/drugbank_interaction/269

A Same As relationship can map equivalent items in different datasets

An ontology describes valid nouns and verbs, and can describe equivalent classes

A set of RDF statements comprises an RDF graph

Nodes and edges can be labeled but not cleanly weighted (although a weighting ontology does exist)
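The statements above can be sketched in Python without any RDF library, treating each statement as a (subject, predicate, object) tuple. This is only an illustration: every identifier below (the `drugbank:`, `pubchem:` and `uniprot:` prefixed names, and the `same_as` and `tested_in` predicates) is a hypothetical placeholder, not an actual Chem2Bio2RDF URI.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# All identifiers here are illustrative placeholders, not real Chem2Bio2RDF URIs.
TRIPLES = {
    ("drugbank:CETRORELIX", "is_active_against", "uniprot:HCGR"),
    ("drugbank:CETRORELIX", "same_as", "pubchem:CETRORELIX"),
    ("pubchem:CETRORELIX", "tested_in", "pubchem:assay_1"),
}


def equivalents(node, graph):
    """Return node plus everything reachable from it over same_as edges."""
    found = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p != "same_as":
                continue
            # same_as is symmetric, so follow the edge in both directions.
            for a, b in ((s, o), (o, s)):
                if a == current and b not in found:
                    found.add(b)
                    frontier.append(b)
    return found


def statements_about(node, graph):
    """All non-same_as statements whose subject is node or an equivalent of it."""
    names = equivalents(node, graph)
    return sorted(t for t in graph if t[0] in names and t[1] != "same_as")


for triple in statements_about("drugbank:CETRORELIX", TRIPLES):
    print(triple)
```

Following the Same As edge is what lets a question asked about the DrugBank entry also pick up the statement recorded against the equivalent entry in another dataset.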

Example RDF relationships in Chem2Bio2RDF

Chem2Bio2OWL – Semantic annotation

Ontology describes meaning independent of dataset. Data dependent relationships are then mapped to classes in the ontology (“annotation”).

Example: the drugbank:DrugBankTarget maps to “Binding” class

Described in Chen et al., Journal of Cheminformatics, 2012, 4:6 and http://chem2bio2owl.wikispaces.com

Fills a gap in current ontologies: covers relationship of chemical compounds and drugs to targets, genes, assays and side-effects

Aligned with other ontologies: released on NCBO Bioportal (http://bioportal.bioontology.org/ontologies/1615)

Simplifies SPARQL searching by integrating equivalent classes across datasets (no longer need to explicitly specify datasets and fields)

Increases power of SPARQL searching allowing inclusion of data and relational classes (e.g. activator vs antagonist)
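A rough Python sketch of what annotation buys: once dataset-specific types are mapped to ontology classes, a search can name the class instead of each dataset and field. Only the drugbank:DrugBankTarget to “Binding” mapping comes from the slide; the other type names, the `SideEffect` class label and all records are hypothetical.

```python
# Mapping from dataset-specific types to ontology classes ("annotation").
# Only the first entry is from the slide; the others are hypothetical.
ANNOTATION = {
    "drugbank:DrugBankTarget": "Binding",
    "bindingdb:Interaction": "Binding",        # hypothetical
    "sider:SideEffectRecord": "SideEffect",    # hypothetical
}

# Hypothetical records: (dataset-specific type, subject, object).
RECORDS = [
    ("drugbank:DrugBankTarget", "drugA", "targetX"),
    ("bindingdb:Interaction", "compoundB", "targetX"),
    ("sider:SideEffectRecord", "drugA", "nausea"),
]


def find_by_class(ontology_class, rows):
    """Return (subject, object) pairs whose dataset type maps to the given class."""
    return [
        (subject, obj)
        for dataset_type, subject, obj in rows
        if ANNOTATION.get(dataset_type) == ontology_class
    ]


print(find_by_class("Binding", RECORDS))
```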

SPARQL – a semantic query language

PREFIX pubchem: <http://chem2bio2rdf.org/pubchem/resource/>
PREFIX kegg: <http://chem2bio2rdf.org/kegg/resource/>
PREFIX uniprot: <http://chem2bio2rdf.org/uniprot/resource/>
SELECT ?compound_cid (count(?compound_cid) as ?active_assays)
FROM <http://chem2bio2rdf.org/pubchem>
FROM <http://chem2bio2rdf.org/kegg>
FROM <http://chem2bio2rdf.org/uniprot>
WHERE {
  ?bioassay pubchem:CID ?compound_cid .
  ?bioassay pubchem:outcome ?activity . FILTER (?activity=2) .
  ?bioassay pubchem:Score ?score . FILTER (?score>50) .
  ?bioassay pubchem:gi ?gi .
  ?uniprot uniprot:gi ?gi .
  ?pathway kegg:protein ?uniprot .
  ?pathway kegg:Pathway_name ?pathway_name .
  FILTER regex(?pathway_name,"MAPK signaling pathway","i") .
}
GROUP BY ?compound_cid
HAVING (count(*)>1)
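To make the logic of the query easier to follow, here is a plain-Python sketch of the same filter, join and group-by steps over toy data. It does not contact the SPARQL endpoint; the rows, compound IDs, gi numbers and protein accessions are all hypothetical, and reading outcome 2 as "active" is an interpretation suggested by the `?active_assays` variable name.

```python
import re
from collections import Counter

# Hypothetical toy rows standing in for the pubchem, uniprot and kegg graphs.
BIOASSAYS = [
    {"cid": 1001, "outcome": 2, "score": 80, "gi": 11},
    {"cid": 1001, "outcome": 2, "score": 65, "gi": 12},
    {"cid": 1002, "outcome": 2, "score": 90, "gi": 11},
    {"cid": 1003, "outcome": 1, "score": 95, "gi": 12},  # fails outcome filter
    {"cid": 1001, "outcome": 2, "score": 40, "gi": 11},  # fails score filter
]
GI_TO_UNIPROT = {11: "P00001", 12: "P00002", 13: "P00003"}
PATHWAYS = [
    {"name": "MAPK signaling pathway", "proteins": {"P00001", "P00002"}},
    {"name": "Cell cycle", "proteins": {"P00003"}},
]


def active_compounds(pattern="MAPK signaling pathway", min_score=50):
    """Count matching assay rows per compound, keeping compounds with more than one."""
    wanted = re.compile(pattern, re.IGNORECASE)  # regex(..., "i")
    counts = Counter()
    for row in BIOASSAYS:
        # FILTER (?activity=2) and FILTER (?score>50)
        if row["outcome"] != 2 or row["score"] <= min_score:
            continue
        protein = GI_TO_UNIPROT.get(row["gi"])  # join on ?gi
        for pathway in PATHWAYS:                # join on ?uniprot
            if protein in pathway["proteins"] and wanted.search(pathway["name"]):
                counts[row["cid"]] += 1
    # GROUP BY ?compound_cid HAVING (count(*)>1)
    return {cid: n for cid, n in counts.items() if n > 1}


print(active_compounds())
```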

Semantic network algorithms & tools from IU

Association Search – visualize literature supported associations between any two entities (compound, drug, gene, pathway, disease, side effect). PLoS One, 2011, 6(12), e27506.

Semantic Link Association Prediction (SLAP) – find most highly associated entities (compound, drug, gene, pathway, disease, side effect) to any other entity, based on probabilistic weightings of graph edges based on public experimental datasets. PLoS Computational Biology, in review.

BioLDA – find most highly associated entities to any other entity based on a complex topic model analysis of the literature (PubMed). PLoS One, 2011, 6 (3), e17243

See also: WENDI (J. Cheminf., 2010,2,6); Chemogenomic Explorer (BMC Bio. 2011,12,256), ChemLDA, ChemBioGrid (J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307)

All algorithms and tools available on http://djwild.info

Association Search

Identifies genes specifically involved in the relationship of drug Rosiglitazone and side-effect Myocardial Infarction. Shown paths are from public datasets but have support in the literature (via BioLDA)
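One simple way to picture this kind of search is as path enumeration between two nodes of a graph. The sketch below is not the published Association Search algorithm; it just lists short simple paths between a drug and a side-effect in a small undirected graph. The edge list is a toy example assembled for illustration, not data from Chem2Bio2RDF.

```python
from collections import defaultdict

# Toy undirected edges, chosen for illustration only.
EDGES = [
    ("Rosiglitazone", "APOE"),
    ("Rosiglitazone", "ADIPOQ"),
    ("Rosiglitazone", "CYP2C8"),
    ("APOE", "Myocardial Infarction"),
    ("ADIPOQ", "Myocardial Infarction"),
]


def simple_paths(edges, start, goal, max_edges=3):
    """Enumerate simple paths from start to goal using at most max_edges edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue  # path budget used up
        for neighbour in sorted(graph[node]):
            if neighbour not in path:  # keep paths simple (no revisits)
                stack.append(path + [neighbour])
    return sorted(paths)


for path in simple_paths(EDGES, "Rosiglitazone", "Myocardial Infarction", max_edges=2):
    print(" -> ".join(path))
```

The intermediate nodes on the returned paths are the candidate genes linking the drug to the side-effect.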

Ibuprofen and Parkinson’s Disease

Identified 70 genes associated with Ibuprofen and Parkinson’s disease, 9 of which are related to inflammation (IL1A, IL1B, IL1RN, IL6, LTA, NFKB1, NFKBIA, PTGS2, TNF)

Clear direct association between PTGS2 (COX2) and Parkinson’s Disease via CTD (leading to literature)

A single gene, AMBP, is differentially associated with Ibuprofen and Parkinson’s Disease but not with other NSAIDs (AMBP has shown potential as a Parkinson’s biomarker)

Thiazolidinediones and Myocardial Infarction

Gene / Drug | Rosiglitazone | Troglitazone | Pioglitazone
SAA2 | Strong; “Discussed” (PharmGKB) | V. weak | V. weak
APOE | Strong; “Discussed” (PharmGKB + Matador) | V. weak | V. weak
ADIPOQ | Strong; Positive (PharmGKB) | V. weak | Strong; Positive (PharmGKB)
CYP2C8 | Strong; Changes metabolism (CTD) | V. weak | Strong; Changes metabolism (CTD)

APOE, ADIPOQ, LDL, HDL, Rosi and Pio

Pioglitazone AND Rosiglitazone increase ADIPOQ, which results in increased HDL (good) cholesterol

Rosiglitazone only interacts with APOE and results in an increase in LDL (bad) cholesterol

Semantic Link Association Prediction (SLAP)

Predicts a probability of association of a compound and a target based on the network paths between them that involve drugs, targets, pathways, diseases, tissues, GO terms, chemical ontologies, substructure and drug side-effects

It can be primarily considered as a “missing link prediction”

Data source is a subset of the Chem2Bio2RDF network including 250,000 compounds with known bioactivities and the targets known to be associated with these drugs

Raw Score is a measure of the significance of a single path between a compound and target, based on topology and semantics of the path nodes and edges. Raw scores are normally distributed within a path pattern

Association Score is a sum of z-scores of raw scores relative to a distribution of random pair scores for different paths and path patterns. Association scores form a normal distribution

Association Significance is a significance p-value of an association score based on the normal distribution of association scores.

Example: Troglitazone and PPARG. Association score: 2385.9. Association significance: 9.06 x 10^-6 => missing link predicted
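The three quantities defined above can be sketched in a few lines of Python. This is a simplified reading of the slide text, not the SLAP implementation: the path pattern names, the per-pattern means and standard deviations, and the background distribution parameters are all hypothetical, and the toy numbers are not meant to reproduce the Troglitazone / PPARG values.

```python
from statistics import NormalDist

# Hypothetical (mean, stdev) of raw scores for random compound-target pairs,
# one entry per path pattern.
PATTERN_STATS = {
    "compound-target-compound-target": (1.0, 0.5),
    "compound-pathway-target": (2.0, 1.0),
}


def association_score(paths, pattern_stats):
    """Sum of z-scores of raw path scores, each relative to its own path pattern."""
    total = 0.0
    for pattern, raw_score in paths:
        mean, stdev = pattern_stats[pattern]
        total += (raw_score - mean) / stdev
    return total


def association_significance(score, background_mean, background_stdev):
    """Upper-tail p-value of a score under a normal background of association scores."""
    return 1.0 - NormalDist(background_mean, background_stdev).cdf(score)


# Two hypothetical paths between one compound and one target.
paths = [
    ("compound-target-compound-target", 2.0),
    ("compound-pathway-target", 4.0),
]
score = association_score(paths, PATTERN_STATS)
p_value = association_significance(score, background_mean=0.0, background_stdev=2.0)
print(score, p_value)
```

Standardizing within each path pattern is what makes raw scores from different kinds of path comparable before they are added together.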

SLAP web tool http://chem2bio2rdf.org/slap

SLAP – target profile for Ibuprofen (vs acetaminophen and aspirin)

COX2 – main target

COX1

Regulate neurotransmitter release

Dopamine receptor

Serotonin receptors

Cannabinoid receptors

Muscarinic receptor (motor control)

SLAP - Biologically Similar Drugs

Dopamine agonist, used in Parkinson’s

Dopamine agonist, used in Parkinson’s

Troglitazone vs Rosiglitazone

Chemogenomic Explorer

Uses WENDI (J Cheminf., 2010, 2:6) web service to generate XML of related biological data for a compound using Chem2Bio2RDF

XML is converted to RDF with a WENDI ontology

Applies RDF inference engine and rule set to infer compound-disease relationships based on evidence paths (e.g. similar compound is active in an assay associated with a gene which is associated with a disease). These are represented as new RDF.

Facet browser allows clustering, filtering and exploration of evidence paths by disease association

For more information, see Zhu, Q. et al., BMC Bioinformatics, 2011, 12:256
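The example evidence path given above (similar compound, active in an assay, associated with a gene, associated with a disease) can be written as a single chained rule. The sketch below hard-codes that one rule in Python rather than using an RDF inference engine; the `may_relate_to` predicate and every entity name are hypothetical.

```python
# Hypothetical lookup tables, one per relationship type in the rule.
SIMILAR_TO = {"query_compound": ["compound_A", "compound_B"]}
ACTIVE_IN = {"compound_A": ["assay_1"], "compound_B": []}
ASSAY_GENE = {"assay_1": ["gene_X"]}
GENE_DISEASE = {"gene_X": ["disease_1", "disease_2"]}


def infer_compound_disease(query, similar_to, active_in, assay_gene, gene_disease):
    """Infer (compound, predicate, disease) statements, each with its evidence path."""
    inferred = []
    for neighbour in similar_to.get(query, []):
        for assay in active_in.get(neighbour, []):
            for gene in assay_gene.get(assay, []):
                for disease in gene_disease.get(gene, []):
                    evidence = (neighbour, assay, gene)
                    inferred.append((query, "may_relate_to", disease, evidence))
    return inferred


for statement in infer_compound_disease(
    "query_compound", SIMILAR_TO, ACTIVE_IN, ASSAY_GENE, GENE_DISEASE
):
    print(statement)
```

Keeping the evidence path alongside each inferred statement is what allows results to be filtered and explored by path afterwards.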

Chemogenomic Explorer Interface

BioLDA – semantic Bioterm literature extraction

BioLDA Topic Model of PubMed Literature

Latent Dirichlet Allocation (LDA) identifies “latent topics” by word association: a kind of fuzzy clustering. Each word can have associations with multiple topics, with a varying degree of strength

Term-topic edges are labeled with probability (i.e. strength of a relationship to a topic). Term-term edges are labeled with KL-divergence (measure of distance)

Considered BioTerms rather than free text, and applied to 336,899 MedLine abstracts on 50 topics published in 2009

Based on work done by Jie Tang on social networks (see www.arnetminer.com)

More information can be found in PLoS One, 2011, 6 (3), e17243
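For reference, the KL-divergence used to label term-term edges can be computed as below for two terms, each described by a probability distribution over topics. The term names and topic probabilities are hypothetical, and whether BioLDA uses the one-directional or a symmetrized form is not stated on the slide, so both are shown.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized variant, since KL(P || Q) differs from KL(Q || P) in general."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical topic distributions for two bioterms over two topics.
term_a = [0.5, 0.5]
term_b = [0.25, 0.75]

print(kl_divergence(term_a, term_b))
print(symmetric_kl(term_a, term_b))
```

A value of zero means the two terms have identical topic distributions; larger values mean the terms are further apart.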

Example: Topic 10

BLASC

Calculates KL-divergence score for any bioterm pairs (drugs, genes, side-effects, pathways, etc)

Available from http://djwild.info

Applications in drug discovery processes

Integrative virtual screening:

SLAP / BLASC association with targets and/or known ligands

Comparison with QSAR models, LBVS, docking and pharmacophore search

Harmonic data fusion

Applied to PXR antagonists (Univ. Cincinnati) and Mycobacterium tuberculosis inhibition (OSDD)

Polypharmacology:

Drug indication network based on SLAP target profiles

Adverse effect network based on SLAP off-target profiles

Searching and exploring mechanisms of action:

Association search, BLASC, SLAP

Examples tested: Thiazolidinediones and Myocardial Infarction; Ibuprofen and Parkinson’s Disease

Other work in progress:

Improvement of SLAP algorithms

Mapping of patient and metagenomics data

Try the tools out – djwild.info

Cheminformatics Education at Indiana University

LuLu eBook - $29, http://slg.djwild.info

Free cheminformatics learning resources

http://icep.wikispaces.com

Residential Ph.D. program in Informatics with a Cheminformatics specialty

Distance Graduate Certificate program in Chemical Informatics

http://djwild.info/ed
