Post on 01-Feb-2020
Ontologies and data integration in biomedicineSuccess stories and challenging issues
Olivier Bodenreider
Lister Hill National Centerfor Biomedical CommunicationsBethesda, Maryland - USA
Data Integration in the Life SciencesEvry, FranceJune 26, 2008
Lister Hill National Center for Biomedical Communications 3
Integration yields nice pictures![Yildirim, 2007]
Lister Hill National Center for Biomedical Communications 4
Motivation Translational research
“Bench to Bedside”Integration of clinical and research activities and resultsSupported by research programs
NIH RoadmapClinical and Translational Science Awards (CTSA)
Requires the effective integration and exchange and of information between
Basic researchClinical research
Lister Hill National Center for Biomedical Communications 6
Motivation Translational research
BasicResearch
ClinicalResearch
andPractice
Lister Hill National Center for Biomedical Communications 8
Terminology and translational research
CancerBasic
Research
EHRCancerPatients
NCI Thesaurus SNOMED CT
Lister Hill National Center for Biomedical Communications 9
Approaches to data integration
WarehousingSources to be integrated are transformed into a common format and converted to a common vocabulary
MediationLocal schema (of the sources)Global schema (in reference to which the queries are made)
Lister Hill National Center for Biomedical Communications 10
Ontologies and warehousing
RoleProvide a conceptualization of the domain
Help define the schemaInformation model vs. ontology
Provide value sets for data elementsEnable standardization and sharing of data
ExamplesAnnotations to the Gene OntologyRepositories for translational research (CTSA)Clinical information systems
Lister Hill National Center for Biomedical Communications 11
Ontologies and mediation
RoleReference for defining the global schemaMap between local and global schemas
ExamplesTAMBISBioMediatorOntoFusion
Lister Hill National Center for Biomedical Communications 13
Annotating data
Gene OntologyFunctional annotation of gene productsin several dozen model organisms
Various communities use the same controlled vocabulariesEnabling comparisons across model organismsAnnotations
Assigned manually by curatorsInferred automatically (e.g., from sequence similarity)
Lister Hill National Center for Biomedical Communications 14
GO Annotations for Aldh2 (mouse)
http:// www.informatics.jax.org/
Lister Hill National Center for Biomedical Communications 15
GO ALD4 in Yeast
http://db.yeastgenome.org/
Lister Hill National Center for Biomedical Communications 16
GO Annotations for ALDH2 (Human)
http://www.ebi.ac.uk/GOA/
Lister Hill National Center for Biomedical Communications 17
Integration applications
Based on shared annotationsEnrichment analysis (within/across species)Clustering (co-clustering with gene expression data)
Based on the structure of GOClosely related annotationsSemantic similarity
Based on associations between gene products and annotationsLeveraging reasoning
[Bodenreider, PSB 2005]
[Sahoo, Medinfo 2007]
[Lord, PSB 2003]
Lister Hill National Center for Biomedical Communications 18
Gene Ontology
Integration Entrez Gene + GO
gene
GO
PubMed
Gene name
OMIM
Sequence
InteractionsGlycosyltransferase
Congenital muscular dystrophy
Entrez Gene
[Sahoo, Medinfo 2007]
Lister Hill National Center for Biomedical Communications 19
From glycosyltransferaseto congenital muscular dystrophy
MIM:608840 Muscular dystrophy, congenital, type 1D
GO:0008375
has_associated_phenotype
has_molecular_function
EG:9215LARGE
acetylglucosaminyl-transferase
GO:0016757glycosyltransferase
GO:0008194isa
GO:0008375 acetylglucosaminyl-transferase
GO:0016758
Lister Hill National Center for Biomedical Communications 21
Cancer Biomedical Informatics Grid
US National Cancer InstituteCommon infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
Data and application services available on the gridSupported by ontological resources
Lister Hill National Center for Biomedical Communications 22
caBIG services
caArrayMicroarray data repository
caTissueBiospecimen repository
caFE (Cancer Function Express)Annotations on microarray data
…
caTRIPCancer Translational Research Informatics PlatformIntegrates data services
Lister Hill National Center for Biomedical Communications 23
Ontological resources
NCI ThesaurusReference terminology for the cancer domain~ 60,000 conceptsOWL Lite
Cancer Data Standards Repository (caDSR)Metadata repositoryUsed to bridge across UML models through Common Data ElementsLinks to concepts in ontologies
Success stories
Semantic Webfor Health Care and Life Sciences
http://www.w3.org/2001/sw/hcls/
Lister Hill National Center for Biomedical Communications 26
Biomedical Semantic Web
IntegrationData/InformationE.g., translational research
Hypothesis generationKnowledge discovery
[Ruttenberg, 2007]
Lister Hill National Center for Biomedical Communications 27
HCLS mashup of biomedical sources
NeuronDB
BAMS
NC Annotations
Homologene
SWAN
Entrez Gene
Gene Ontology
Mammalian Phenotype
PDSPki
BrainPharm
AlzGene
Antibodies
PubChem
MeSH
Reactome
Allen Brain Atlas
Publications
http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo
Lister Hill National Center for Biomedical Communications 29
HCLS mashup NeuronDB
Protein (channels/receptors)NeurotransmittersNeuroanatomyCellCompartmentsCurrents
BAMSProteinNeuroanatomyCellsMetabolites (channels)PubMedID
NC Annotations
Genes/ProteinsProcessesCells (maybe)PubMed ID
Allen Brain Atlas
GenesBrain imagesGross anatomy -> neuroanatomy
Homologene
GenesSpeciesOrthologiesProofs
SWAN
PubMedIDHypothesisQuestionsEvidence
Genes
Entrez GeneGenesProtein
GOPubMedID
Interaction (g/p)Chromosome
C. location
GO
Molecular functionCell components
Biological processAnnotation gene
PubMedID
Mammalian Phenotype
Genes Phenotypes
DiseasePubMedID
ProteinsChemicals
Neurotransmitters
PDSPki
BrainPharmDrug
Drug effectPathological agent
PhenotypeReceptorsChannelsCell typesPubMedIDDisease
AlzGene
Gene Polymorphism
PopulationAlz Diagnosis
AntibodiesGenes Antibodies
PubChem
NameStructurePropertiesMeSH term
MeSHDrugsAnatomyPhenotypesCompoundsChemicalsPubMedIDPubChem
Reactome
Genes/proteinsInteractionsCellular locationProcesses (GO)
Lister Hill National Center for Biomedical Communications 30
HCLS mashup NeuronDB
Protein (channels/receptors)NeurotransmittersNeuroanatomyCellCompartmentsCurrents
BAMSProteinNeuroanatomyCellsMetabolites (channels)PubMedID
NC Annotations
Genes/ProteinsProcessesCells (maybe)PubMed ID
Allen Brain Atlas
GenesBrain imagesGross anatomy -> neuroanatomy
Homologene
GenesSpeciesOrthologiesProofs
SWAN
PubMedIDHypothesisQuestionsEvidence
Genes
Entrez GeneGenesProtein
GOPubMedID
Interaction (g/p)Chromosome
C. location
GO
Molecular functionCell components
Biological processAnnotation gene
PubMedID
Mammalian Phenotype
GenesPhenotypes
DiseasePubMedID
ProteinsChemicals
Neurotransmitters
PDSPki
BrainPharmDrug
Drug effectPathological agent
PhenotypeReceptorsChannelsCell typesPubMedIDDisease
AlzGene
GenePolymorphism
PopulationAlz Diagnosis
AntibodiesGenesAntibodies
PubChem
NameStructurePropertiesMeSH term
MeSHDrugsAnatomyPhenotypesCompoundsChemicalsPubMedIDPubChem
Reactome
Genes/proteinsInteractionsCellular locationProcesses (GO)
Lister Hill National Center for Biomedical Communications 31
HCLS mashups
Based on RDF/OWLBased on shared identifiers
“Recombinant data” (E. Neumann)
Ontologies used in some casesSupport applications (SWAN, SenseLab, etc.)
Journal of Biomedical Informaticsspecial issue on Semantic Bio-mashups(forthcoming)
Lister Hill National Center for Biomedical Communications 33
Trans-namespace integration
Addison Disease(D000224)
Addison's disease (363732003)
Biomedicalliterature
MeSH
Clinicalrepositories
SNOMED CT
Primary adrenocortical insufficiency(E27.1)
ICD 10
Lister Hill National Center for Biomedical Communications 34
(Integrated) concept repositories
Unified Medical Language Systemhttp://umlsks.nlm.nih.govNCBO’s BioPortalhttp://www.bioontology.org/tools/portal/bioportal.htmlcaDSRhttp://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
Open Biomedical Ontologies (OBO)http://obofoundry.org/
Lister Hill National Center for Biomedical Communications 35
Integrating subdomains
Biomedicalliterature
MeSH
Genomeannotations
GOModelorganisms
NCBITaxonomy
Geneticknowledge bases
OMIM
Clinicalrepositories
SNOMED CTOthersubdomains
…
Anatomy
FMA
UMLS
Lister Hill National Center for Biomedical Communications 3636
Integrating subdomains
Biomedicalliterature
Genomeannotations
Modelorganisms
Geneticknowledge bases
Clinicalrepositories
Othersubdomains
Anatomy
Lister Hill National Center for Biomedical Communications 37
Trans-namespace integration
Genomeannotations
GOModelorganisms
NCBITaxonomy
Geneticknowledge bases
OMIMOther
subdomains
…
Anatomy
FMA
UMLSAddison Disease (D000224)
Addison's disease (363732003)
Biomedicalliterature
MeSH
Clinicalrepositories
SNOMED CT
UMLSC0001403
Lister Hill National Center for Biomedical Communications 38
Mappings
Created manually (e.g., UMLS)PurposeDirectionality
Created automatically (e.g., BioPortal)Lexically: ambiguity, normalizationSemantically: lack of / incomplete formal definitions
Key to enabling semantic interoperabilityEnabling resource for the Semantic Web
Lister Hill National Center for Biomedical Communications 40
Identifying biomedical entities
Multiple identifiers for the same entity in different ontologiesBarrier to data integration in general
Data annotated to different ontologies cannot “recombine”Need for mappings across ontologies
Barrier to data integration in the Semantic WebMultiple possible identifiers for the same entity
Depending on the underlying representational scheme (URI vs. LSID)Depending on who creates the URI
Lister Hill National Center for Biomedical Communications 41
Possible solutions
PURL http://purl.orgOne level of indirection between developers and usersIndependence from local constraints at the developer’s end
The institution creating a resource is also responsible for minting URIs
E.g., URI for genes in Entrez Gene
Guidelines: “URI note”W3C Health Care and Life Sciences Interest Group
Lister Hill National Center for Biomedical Communications 43
Availability
Many ontologies are freely availableThe UMLS is freely available for research purposes
Cost-free license requiredLicensing issues can be tricky
SNOMED CT is freely available in member countries of the IHTSDO
Being freely availableIs a requirement for the Open Biomedical Ontologies (OBO)Is a de facto prerequisite for Semantic Web applications
Lister Hill National Center for Biomedical Communications 44
Discoverability
Ontology repositoriesUMLS: 143 source vocabularies(biased towards healthcare applications)NCBO BioPortal: ~100 ontologies(biased towards biological applications)Limited overlap between the two repositories
Need for discovery services
Lister Hill National Center for Biomedical Communications 45
Formalism
Several major formalismWeb Ontology Language (OWL) – NCI ThesaurusOBO format – most OBO ontologiesUMLS Rich Release Format (RRF) – UMLS, RxNorm
Conversion mechanismsOBO to OWLLexGrid (import/export to LexGrid internal format)
Lister Hill National Center for Biomedical Communications 46
Ontology integration
Post hoc integration , form the bottom upUMLS approachIntegrates ontologies “as is”, including legacy ontologiesFacilitates the integration of the corresponding datasets
Coordinated development of ontologiesOBO Foundry approachEnsures consistency ab initioExcludes legacy ontologies
Lister Hill National Center for Biomedical Communications 47
Quality
Quality assurance in ontologies is still imperfectly defined
Difficult to define outside a use case or applicationSeveral approaches to evaluating quality
Collaboratively, by users (Web 2.0 approach)Marginal notes enabled by BioPortal
Centrally, by expertsOBO Foundry approach
Important factors besides qualityGovernanceInstalled base / Community of practice
Lister Hill National Center for Biomedical Communications 49
Integrating genomic and clinical data
No genomic data available for most patientsNo precise clinical data available associated with most genomic data (GWAS excepted)
Genomicdata
Clinicaldata
Lister Hill National Center for Biomedical Communications 50
Integrating genomic and clinical data
Genomicdata
Lister Hill National Center for Biomedical Communications 51
Integrating genomic and clinical data
Genomicdata
Upregulatedgenes
Diseases(extracted from text
+ MeSH terms)
Lister Hill National Center for Biomedical Communications 52
Integrating genomic and clinical data
Clinicaldata
Genomicdata
Codeddischarge
summaries
Laboratorydata
Upregulatedgenes
Diseases(extracted from text
+ MeSH terms)
Lister Hill National Center for Biomedical Communications 53
The Butte approach Methods
Courtesy of David Chen, Butte Lab
Lister Hill National Center for Biomedical Communications 54
The Butte approach Results
Courtesy of David Chen, Butte Lab
Lister Hill National Center for Biomedical Communications 55
The Butte approach
Extremely rough methodsNo pairing between genomic and clinical dataText miningMapping between SNOMED CT and ICD 9-CM through UMLSReuse of ICD 9-CM codes assigned for billing purposes
Extremely preliminary resultsRediscovery more than discovery
Extremely promising nonetheless
Lister Hill National Center for Biomedical Communications 56
The Butte approach References
Dudley J, Butte AJ "Enabling integrative genomic analysis of high-impact human diseases through text mining." Pac Symp Biocomput2008; 580-91Chen DP, Weber SC, Constantinou PS, Ferris TA, Lowe HJ, Butte AJ "Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation." Pac Symp Biocomput 2008; 243-54Butte AJ, "Medicine. The ultimate model organism." Science 2008; 320: 5874: 325-7
Lister Hill National Center for Biomedical Communications 57
Conclusions
Ontologies are enabling resources for data integrationStandardization works
Grass roots effort (GO)Regulatory context (ICD 9-CM)
Bridging across resources is crucialOntology integration resources / strategies(UMLS, BioPortal / OBO Foundry)
Massive amounts of imperfect data integrated with rough methods might still be useful