Exploiting semantic networks of public data for systems chemical biology
Indiana University School of Informatics and Computing
David Wild, http://djwild.info
Assistant Professor and Director, Cheminformatics & Chemogenomics Research Group (CCRG), Indiana University School of Informatics and Computing
“Information is cheap. Understanding is expensive” (Karl Fast)
What do we mean by system?
For our purposes, the network of relationships of chemicals, drugs, targets, genes, expression
profiles, pathways, (publications), diseases and side-effects in the body
What’s a semantic network?
A network of nodes and edges represented in RDF format, annotated with node / edge labels using an ontology (OWL)
Stored in an RDF triple store
Searchable using the SPARQL query language
What we have created at Indiana
A Semantic Linked Dataset called Chem2Bio2RDF that integrates multiple public experimental and literature-derived datasets relating to chemical compounds, drugs, targets, genes, expression profiles, pathways, diseases and drug side effects
A set of semantic algorithms and tools for visualizing, analyzing and predicting relationships in the semantic linked data and relating this data to publications
Semantic Technologies: an enabler for integration
Allows simple, flexible description of heterogeneous graphs of data relationships (RDF), optionally following the rules of an ontology (OWL)
Strengths
  Merging datasets and moving data between repositories is technically straightforward – dataset mappings are themselves described in RDF (and OWL)
  RDF and OWL are highly standardized and allow precise representation
  Powerful cross-dataset searching with SPARQL
  Increasing availability of powerful off-the-shelf searching and visualization tools (TopBraid, etc.)
  Allows application of graph theory algorithms to data
  Can express data provenance in RDF
Weaknesses
  Just emerging from the early-adopter phase – received bad press in pharma as hyped too early
  Triple stores historically less efficient than relational DBMSs (but rapidly changing)
  Most focus has been on data and integration rather than algorithms to use the data
  Difficulty weighting edges in a relational graph
Systems Chemical Biology and Semantic Technologies map quite nicely as they are both about complex networks
http://blog.project-sierra.de/archives/1639
Current activity in Semantic Web & Drug Discovery
OpenPHACTS (www.openphacts.org)
  >€10m European project to create an “open pharmacological space” (OPS) using Triple Stores and Semantic Web technologies
  Chem2Bio2RDF has been integrated into the OPS
SWHCLSIG (http://www.w3.org/wiki/HCLSIG)
  W3C special interest group, which has created BioRDF (RDF representations of biological data) and LODD (Linking Open Drug Data)
CSHALS (http://www.iscb.org/cshals2011)
  Conference on semantics in healthcare and the life sciences
Pistoia Alliance (http://www.pistoiaalliance.org/)
  Industry alliance for collaboration and integration of drug discovery data
JCI RDF in chemistry (http://www.jcheminf.com/series/acsrdf2010)
  Journal of Cheminformatics thematic series
Systems chemical biology + Semantic Web
Drug Discovery Today, 2012, in press.
Big Data in the public domain
There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters there is in the public domain information on:
  ~30 million compounds and ~500,000 bioassays (PubChem, ChemSpider)
  ~60 million compound bioactivities (PubChem Bioassay, ChEMBL, Matador, etc.)
  ~5,000 drugs (DrugBank)
  ~9 million protein sequences (SwissProt) and ~60,000 3D structures (PDB)
  ~14 million human nucleotide sequences (EMBL)
  ~20 million life science publications (PubMed)
  Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …)
Chem2Bio2RDF – www.chem2bio2rdf.org
Semantically integrates 42 heterogeneous public datasets related to drug discovery in a fast Virtuoso triple-store with SPARQL endpoint (linked from main site)
Datasets cover chemistry, chemogenomics, biology, systems & pathways, pharmacology, phenotypes, toxicology, glycomics and publications, and biological entities of compounds, drugs, targets, genes, pathways, diseases and side-effects
Major datasets include PubChem, ChEMBL, DrugBank, PharmGKB, BindingDB, STITCH, CTD, KEGG, SWISSPROT, PDB, SIDER, PubMed. Full set at http://chem2bio2rdf.wikispaces.com/Datasets
Holds data on ~31m chemical structures, ~5,000 marketed drugs, ~59m bioactivity data points and ~19m publications
Linked into LOD cloud, and may form part of OpenPHACTS repository
Permits SPARQL searching using Chem2Bio2OWL ontology. For more information, see BMC Bioinformatics 2010, 11, 255.
Representing inter-dataset relationships in RDF
RDF describes noun-verb-noun relationships in many formats
  CETRORELIX is_active_against HCGR
  <CETRORELIX> <is_active_against> <HCGR>
Many types of relationship can be described, including heterogeneous ones.
Power is increased using URIs and ontologies (OWL)
  URI gives unique identifier for a noun or verb clause
  http://chem2bio2rdf.org/drugbank/resource/drugbank_interaction/269
  Same As relationship can map equivalent items in different datasets
  Ontology describes valid nouns and verbs, and can describe equivalent classes
A set of RDF statements comprises an RDF graph
  Nodes and edges can be labeled but not cleanly weighted (although a weighting ontology does exist)
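To make the noun-verb-noun idea concrete, here is a minimal sketch in plain Python that holds statements as (subject, predicate, object) tuples and uses a Same As statement to merge two names for the same item. Only the CETRORELIX statement comes from the slide; the `datasetA:` and `sideeffect:X` identifiers and the helper functions are hypothetical, and a real deployment would use a triple store rather than a Python set.

```python
SAME_AS = "sameAs"

# A tiny RDF-style graph: each statement is (subject, predicate, object).
triples = {
    ("CETRORELIX", "is_active_against", "HCGR"),                  # from the slide
    ("datasetA:cetrorelix", SAME_AS, "CETRORELIX"),               # hypothetical mapping
    ("datasetA:cetrorelix", "has_side_effect", "sideeffect:X"),   # hypothetical statement
}

def canonical_map(triples):
    """Map each alias to its canonical name using sameAs statements (one hop only)."""
    return {s: o for s, p, o in triples if p == SAME_AS}

def merge_same_as(triples):
    """Rewrite subjects and objects to canonical names and drop the sameAs edges."""
    mapping = canonical_map(triples)
    return {
        (mapping.get(s, s), p, mapping.get(o, o))
        for s, p, o in triples
        if p != SAME_AS
    }

def edges_from(triples, node):
    """Return the labeled outgoing edges of a node as sorted (predicate, object) pairs."""
    return sorted((p, o) for s, p, o in triples if s == node)

merged = merge_same_as(triples)
print(edges_from(merged, "CETRORELIX"))
```

After merging, both statements hang off the single node CETRORELIX, which is the effect a Same As mapping is meant to have when datasets are combined.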
Example RDF relationships in Chem2Bio2RDF
Chem2Bio2OWL – Semantic annotation
Ontology describes meaning independent of dataset. Data dependent relationships are then mapped to classes in the ontology (“annotation”).
Example: the drugbank:DrugBankTarget maps to “Binding” class
Described in Chen et al., Journal of Cheminformatics, 2012, 4:6 and http://chem2bio2owl.wikispaces.com
Fills a gap in current ontologies: covers relationship of chemical compounds and drugs to targets, genes, assays and side-effects
Aligned with other ontologies: released on NCBO BioPortal (http://bioportal.bioontology.org/ontologies/1615)
Simplifies SPARQL searching by integrating equivalent classes across datasets (no longer need to explicitly specify datasets and fields)
Increases power of SPARQL searching allowing inclusion of data and relational classes (e.g. activator vs antagonist)
SPARQL – a semantic query language
PREFIX pubchem: <http://chem2bio2rdf.org/pubchem/resource/>
PREFIX kegg: <http://chem2bio2rdf.org/kegg/resource/>
PREFIX uniprot: <http://chem2bio2rdf.org/uniprot/resource/>

SELECT ?compound_cid (count(?compound_cid) as ?active_assays)
FROM <http://chem2bio2rdf.org/pubchem>
FROM <http://chem2bio2rdf.org/kegg>
FROM <http://chem2bio2rdf.org/uniprot>
WHERE {
  # keep assay results with outcome code 2 and a score above 50
  ?bioassay pubchem:CID ?compound_cid .
  ?bioassay pubchem:outcome ?activity . FILTER (?activity=2) .
  ?bioassay pubchem:Score ?score . FILTER (?score>50) .
  # link the assay's protein (gi) to UniProt, then to a KEGG pathway
  ?bioassay pubchem:gi ?gi .
  ?uniprot uniprot:gi ?gi .
  ?pathway kegg:protein ?uniprot .
  ?pathway kegg:Pathway_name ?pathway_name .
  FILTER regex(?pathway_name,"MAPK signaling pathway","i") .
}
GROUP BY ?compound_cid
# keep compounds with more than one matching result
HAVING (count(*)>1)
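For readers less familiar with SPARQL, the same filter-join-group logic can be sketched in plain Python over toy records. Everything here is hypothetical (the CIDs, gi values, accession strings and the `active_compounds` helper); the sketch also counts each assay row once, whereas the query would repeat a row if its protein appeared in several matching pathways.

```python
from collections import Counter

# Hypothetical toy records standing in for the three graphs named in the query.
bioassays = [            # (CID, outcome, score, gi)
    (101, 2, 80, "gi1"),
    (101, 2, 65, "gi2"),
    (101, 1, 90, "gi1"),  # outcome is not 2: filtered out
    (202, 2, 75, "gi1"),  # only one qualifying row: removed by the HAVING step
    (303, 2, 40, "gi2"),  # score is not above 50: filtered out
]
gi_to_uniprot = {"gi1": "P00001", "gi2": "P00002", "gi3": "P00003"}
pathway_proteins = {
    "MAPK signaling pathway": {"P00001", "P00002"},
    "Other pathway": {"P00003"},
}

def active_compounds(pathway_pattern, min_score=50, min_count=2):
    """Count qualifying assay rows per compound for proteins in matching pathways."""
    pattern = pathway_pattern.lower()
    # FILTER regex(..., "i"): case-insensitive match on the pathway name
    proteins = set()
    for name, members in pathway_proteins.items():
        if pattern in name.lower():
            proteins |= members
    counts = Counter()
    for cid, outcome, score, gi in bioassays:
        # FILTER (?activity=2), FILTER (?score>50), and the gi -> UniProt -> pathway join
        if outcome == 2 and score > min_score and gi_to_uniprot.get(gi) in proteins:
            counts[cid] += 1
    # GROUP BY ?compound_cid HAVING (count(*)>1)
    return {cid: n for cid, n in counts.items() if n >= min_count}

print(active_compounds("MAPK signaling pathway"))
```

The point of the semantic approach is that the triple store performs these joins across datasets directly, without the hand-built lookup tables used in this sketch.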
Semantic network algorithms & tools from IU
Association Search – visualize literature supported associations between any two entities (compound, drug, gene, pathway, disease, side effect). PLoS One, 2011, 6(12), e27506.
Semantic Link Association Prediction (SLAP) – find most highly associated entities (compound, drug, gene, pathway, disease, side effect) to any other entity, using probabilistic weightings of graph edges derived from public experimental datasets. PLoS Computational Biology, in review.
BioLDA – find most highly associated entities to any other entity based on a complex topic model analysis of the literature (PubMed). PLoS One, 2011, 6 (3), e17243
See also: WENDI (J. Cheminf., 2010,2,6); Chemogenomic Explorer (BMC Bio. 2011,12,256), ChemLDA, ChemBioGrid (J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307)
All algorithms and tools available on http://djwild.info
Association Search
Identifies genes specifically involved in the relationship of drug Rosiglitazone and side-effect Myocardial Infarction. Shown paths are from public datasets but have support in the literature (via BioLDA)
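One way to picture an association search is as path enumeration between two entities in the linked graph. The sketch below is an assumption about the general idea, not the published algorithm: the node names, edge labels and `find_paths` function are hypothetical placeholders, and the literature-support step is left out.

```python
# Hypothetical labeled edges: (node, relation, node). Placeholder names only.
edges = [
    ("drug:D", "binds", "gene:G1"),
    ("gene:G1", "associated_with", "sideeffect:S"),
    ("drug:D", "binds", "gene:G2"),
    ("gene:G2", "member_of", "pathway:P"),
    ("pathway:P", "linked_to", "sideeffect:S"),
    ("gene:G3", "associated_with", "sideeffect:S"),  # not reachable from drug:D within the limit
]

def find_paths(edges, start, goal, max_edges=3):
    """Enumerate simple paths (no repeated nodes) of at most max_edges edges."""
    adjacency = {}
    for a, _label, b in edges:
        # treat edges as undirected for the purpose of finding associations
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    found = []

    def walk(path):
        node = path[-1]
        if node == goal:
            found.append(path)
            return
        if len(path) - 1 == max_edges:
            return
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return found

paths = find_paths(edges, "drug:D", "sideeffect:S")
genes_on_paths = sorted({n for p in paths for n in p if n.startswith("gene:")})
print(paths)
print(genes_on_paths)
```

Collecting the gene nodes that lie on the returned paths gives a candidate list of genes connecting the two entities, which could then be checked against the literature.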
Ibuprofen and Parkinson’s Disease
Identified 70 genes associated with Ibuprofen and Parkinson’s disease, 9 of which are related to inflammation (IL1A, IL1B, IL1RN, IL6, LTA, NFKB1, NFKBIA, PTGS2, TNF)
Clear direct association between PTGS2 (COX2) and Parkinson’s Disease via CTD (leading to literature)
Single gene, AMBP, differentially associated with Ibuprofen and Parkinson’s Disease but not with other NSAIDs (AMBP has shown potential as a Parkinson’s biomarker)
Thiazolidinediones and Myocardial Infarction
Gene/Drug | Rosiglitazone | Troglitazone | Pioglitazone
SAA2 | Strong, “Discussed” (PharmGKB) | V. weak | V. weak
APOE | Strong, “Discussed” (PharmGKB + Matador) | V. weak | V. weak
ADIPOQ | Strong, Positive (PharmGKB) | V. weak | Strong, Positive (PharmGKB)
CYP2C8 | Strong, Changes metabolism (CTD) | V. weak | Strong, Changes metabolism (CTD)
APOE, ADIPOQ, LDL, HDL, Rosi and Pio
Pioglitazone AND Rosiglitazone increase ADIPOQ, which results in increased HDL (good) cholesterol
Only Rosiglitazone interacts with APOE, which results in an increase in LDL (bad) cholesterol
Semantic Link Association Prediction (SLAP)
Predicts a probability of association of a compound and a target based on the network paths between them that involve drugs, targets, pathways, diseases, tissues, GO terms, chemical ontologies, substructure and drug side-effects
It can be primarily considered as a “missing link prediction”
Data source is a subset of the Chem2Bio2RDF network including 250,000 compounds with known bioactivities and the targets known to be associated with these drugs
Raw Score is a measure of the significance of a single path between a compound and target, based on topology and semantics of the path nodes and edges. Raw scores are normally distributed within a path pattern
Association Score is a sum of z-scores of raw scores relative to a distribution of random pair scores for different paths and path patterns. Association scores form a normal distribution
Association Significance is a significance p-value of an association score based on the normal distribution of association scores.
Example: Troglitazone and PPARG
Association score: 2385.9
Association significance: 9.06 x 10^-6 => missing link predicted
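The three scores described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not SLAP's implementation: the path pattern names, the random-pair means and standard deviations, and the choice of a one-sided upper-tail p-value are all hypothetical.

```python
import math

def z_score(raw, mean, std):
    """Standardise a raw path score against the random-pair distribution of its pattern."""
    return (raw - mean) / std

def association_score(path_scores, random_stats):
    """Sum the z-scores of all paths between one compound and one target.

    path_scores:  list of (pattern, raw_score)
    random_stats: pattern -> (mean, std) of raw scores for random pairs (assumed known)
    """
    return sum(z_score(raw, *random_stats[pattern]) for pattern, raw in path_scores)

def normal_upper_tail(x, mean, std):
    """P(X >= x) for X ~ Normal(mean, std), used here as the significance p-value."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))

# Hypothetical numbers for two path patterns.
random_stats = {"pattern_A": (1.0, 0.5), "pattern_B": (2.0, 2.0)}
path_scores = [("pattern_A", 2.0), ("pattern_B", 5.0)]

score = association_score(path_scores, random_stats)   # z-scores 2.0 and 1.5
p_value = normal_upper_tail(score, mean=0.0, std=1.0)   # assumed score distribution
print(score, p_value)
```

A small p-value means the pair is scored far above what random compound-target pairs achieve, which is how a missing link would be flagged.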
SLAP web tool http://chem2bio2rdf.org/slap
SLAP – target profile for Ibuprofen (vs acetaminophen and aspirin)
Figure labels: COX2 (main target); COX1; regulate neurotransmitter release; dopamine receptor; serotonin receptors; cannabinoid receptors; muscarinic receptor (motor control)
SLAP - Biologically Similar Drugs
Dopamine agonist, used in Parkinson’s
Dopamine agonist, used in Parkinson’s
Troglitazone vs Rosiglitazone
Chemogenomic Explorer
Uses WENDI (J. Cheminf., 2010, 2:6) web service to generate XML of related biological data for a compound using Chem2Bio2RDF
XML is converted to RDF with a WENDI ontology
Applies RDF inference engine and rule set to infer compound-disease relationships based on evidence paths (e.g. similar compound is active in an assay associated with a gene which is associated with a disease). These are represented as new RDF.
Facet browser allows clustering, filtering and exploration of evidence paths by disease association
For more information, see Zhu, Q. et al., BMC Bioinformatics, 2011, 12:256
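The evidence-path example above (similar compound, assay, gene, disease) can be written as a single hand-coded rule over triples. This is only a sketch of the idea: the predicate names, identifiers and `infer_compound_disease` function are hypothetical, and the actual tool uses an RDF inference engine with its own rule set.

```python
# Hypothetical triples with placeholder identifiers and predicate names.
triples = [
    ("compound:query", "similar_to", "compound:C2"),
    ("compound:C2", "active_in", "assay:A1"),
    ("assay:A1", "associated_with_gene", "gene:G1"),
    ("gene:G1", "associated_with_disease", "disease:D1"),
    ("compound:C3", "active_in", "assay:A1"),  # not linked to the query: no inference
]

def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def infer_compound_disease(triples):
    """Apply one rule: similar compound -> assay -> gene -> disease.

    Returns a dict mapping each inferred triple to its list of evidence paths.
    """
    inferred = {}
    for compound, predicate, neighbour in triples:
        if predicate != "similar_to":
            continue
        for assay in objects(triples, neighbour, "active_in"):
            for gene in objects(triples, assay, "associated_with_gene"):
                for disease in objects(triples, gene, "associated_with_disease"):
                    new_triple = (compound, "inferred_related_to", disease)
                    inferred.setdefault(new_triple, []).append([neighbour, assay, gene])
    return inferred

inferred = infer_compound_disease(triples)
print(inferred)
```

Keeping the evidence path alongside each inferred triple is what lets a facet browser group and filter inferences by how they were derived.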
Chemogenomic Explorer Interface
BioLDA – semantic Bioterm literature extraction
BioLDA Topic Model of PubMed Literature
Latent Dirichlet Allocation (LDA) identifies “latent topics” by word association: a kind of fuzzy clustering. Each word can be associated with multiple topics, each with a varying degree of strength
Term-topic edges are labeled with probability (i.e. strength of a relationship to a topic). Term-term edges are labeled with KL-divergence (measure of distance)
Considered BioTerms rather than free text, and applied to 336,899 MedLine abstracts on 50 topics published in 2009
Based on work done by Jie Tang on social networks (see www.arnetminer.com)
More information can be found in PLoS One, 2011, 6 (3), e17243
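As a rough illustration of a KL-divergence distance between two terms, the sketch below compares hypothetical term-topic probability vectors. The three-topic distributions are made up, and the symmetrised form is an assumption; the slides only state that term-term edges are labeled with KL-divergence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrised divergence, so the distance between two terms is order-independent."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical topic distributions for three bioterms over three topics.
term_topics = {
    "term_a": [0.7, 0.2, 0.1],
    "term_b": [0.6, 0.3, 0.1],
    "term_c": [0.1, 0.2, 0.7],
}

d_ab = symmetric_kl(term_topics["term_a"], term_topics["term_b"])
d_ac = symmetric_kl(term_topics["term_a"], term_topics["term_c"])
print(d_ab, d_ac)
```

Terms concentrated on the same topics get a small divergence (term_a and term_b here), while terms spread over different topics get a large one (term_a and term_c).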
Example: Topic 10
BLASC
Calculates KL-divergence score for any bioterm pairs (drugs, genes, side-effects, pathways, etc.)
Available from http://djwild.info
Applications in drug discovery processes
Integrative virtual screening
  SLAP / BLASC association with targets and/or known ligands
  Comparison with QSAR models, LBVS, Docking and Pharmacophore search
  Harmonic data fusion
  Applied to PXR antagonists (Univ. Cincinnati) and Mycobacterium tuberculosis inhibition (OSDD)
Polypharmacology
  Drug indication network based on SLAP target profiles
  Adverse effect network based on SLAP off-target profiles
Searching and exploring mechanisms of action
  Association search, BLASC, SLAP
  Examples tested: Thiazolidinediones and Myocardial Infarction; Ibuprofen and Parkinson’s Disease
Other work in progress
  Improvement of SLAP algorithms
  Mapping of patient and metagenomics data
Try the tools out – djwild.info
Cheminformatics Education at Indiana University
LuLu eBook - $29: http://slg.djwild.info
Free cheminformatics learning resources: http://icep.wikispaces.com
Residential Ph.D. program in Informatics with a Cheminformatics specialty
Distance Graduate Certificate program in Chemical Informatics
http://djwild.info/ed