Semantic empowerment of Life Science Applications October 2006

Semantic empowerment of Life Science Applications

October 2006

Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia

Acknowledgement: NCRR-funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York), and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

http://lsdis.cs.uga.edu/~amit

http://lsdis.cs.uga.edu/projects/glycomics/

Computation, data and semantics in life sciences

• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999

• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000

• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins

• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb

We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

Semantic Web and Life Science

• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)

• How much is that?
  – Compare it to the estimate of the total words ever spoken by humans = 12 exabytes

• Death by data

• The need for
  – Search
  – Integration
  – Analysis, decision support
  – Discovery

Not data, but analysis and insight, leading to decisions and discovery


Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world.

We need more automated ways for integration and analysis leading to insight and discovery

- to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.

Benefits of Semantics

• Development of large domain-specific knowledge – for reference, common nomenclature, tagging

• Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases

• Semantic search, browsing, integration, analysis, and discovery

Faster and more reliable discovery leading to quality of life improvements

What is semantics & Semantic Web

• Meaning and use of data

• From syntax and structure to semantics (beyond formatting, organization, query interfaces, …)

• XML -> RDF -> OWL -> Rules -> Trust

• Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge

• (Automatic) Semantic annotation, reasoning, …

• Also, increasing use of Service Oriented Architecture -> semantic Web services

• W3C SW for Health Care and Life Sciences


This talk will demonstrate some of the efforts in:

• Building large (populated) life science ontologies (GlycO, ProPreO)

• Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry)

• Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition

• Ontology-driven applications developed

http://lsdis.cs.uga.edu/projects/glycomics/glyco/

http://lsdis.cs.uga.edu/projects/glycomics/propreo/

Semantic Applications

• Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support

• Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future instance) level relationships

• N-glycosylation process: an example of scientific workflow

• Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data

Others we will not discuss: SemBowser, SemDrug, ….

Let us start with a couple of simple applications

http://lsdis.cs.uga.edu/projects/asdoc/index.php?page=1


http://lsdis.cs.uga.edu/projects/semdis/SemanticBrowser/

Life Science Ontologies

• ProPreO
  – An ontology for capturing process and lifecycle information related to proteomic experiments
  – 398 classes, 32 relationships
  – 3.1 million instances
  – Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)

• GlycO
  – An ontology for structure and function of Glycopeptides
  – 573 classes, 113 relationships
  – Published through the National Center for Biomedical Ontology (NCBO)

N-Glycosylation metabolic pathway

GNT-I attaches GlcNAc at position 2

UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2

GNT-V attaches GlcNAc at position 6

UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021

N-acetyl-glucosaminyl_transferase_V

N-glycan_beta_GlcNAc_9

N-glycan_alpha_man_4

• Challenge – model hundreds of thousands of complex carbohydrate entities

• But, the differences between the entities are small (E.g. just one component)

• How to model all the concepts but preclude redundancy → ensure maintainability, scalability

GlycO ontology

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

[Figure: GlycoTree, a tree of D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages]

EnzyO

• The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms

• GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways, e.g. N-Glycan Biosynthesis

Pathway representation in GlycO

Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

Zooming in a little …

The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145.

The product of this reaction is the Glycan with KEGG ID 00020.

Reaction R05987 catalyzed by enzyme 2.4.1.145

adds_glycosyl_residue

N-glycan_b-D-GlcpNAc_13
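As a minimal sketch of how a pathway could be inferred from reaction descriptions alone, the Python below chains reactions whose product is the substrate of the next one. The `Reaction` record, the `infer_pathway` helper and the reaction ID `R_HYPOTHETICAL` are illustrative assumptions and not part of GlycO. Glycan IDs are written with the `G` prefix used on the pathway slide, and the sketch assumes at most one reaction per substrate, which a real ontology would not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    reaction_id: str
    substrate: str
    product: str
    enzyme: str


def infer_pathway(reactions, start):
    """Follow substrate -> product links from `start` until no reaction applies."""
    by_substrate = {r.substrate: r for r in reactions}
    pathway, current, visited = [], start, {start}
    while current in by_substrate:
        step = by_substrate[current]
        if step.product in visited:  # guard against cycles
            break
        pathway.append(step)
        visited.add(step.product)
        current = step.product
    return pathway


# Reactions are deliberately listed out of order: no pathway is stored explicitly.
REACTIONS = [
    # Hypothetical reaction ID; the slides only name the enzyme (GNT-V).
    Reaction("R_HYPOTHETICAL", "G00020", "G00021", "GNT-V"),
    Reaction("R05987", "G00015", "G00020", "EC 2.4.1.145"),
]

for step in infer_pathway(REACTIONS, "G00015"):
    print(f"{step.substrate} --[{step.reaction_id}, {step.enzyme}]--> {step.product}")
```

The ordering of the steps comes entirely from the substrate and product fields, which is the sense in which the pathway is inferred rather than defined.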

• Multiple data sources used in populating the ontology
  o KEGG - Kyoto Encyclopedia of Genes and Genomes
  o SWEETDB
  o CarbBank Database

• Each data source has different schema for storing data

• There is significant overlap of instances in the data sources

• Hence, entity disambiguation and a common representational format are needed

GlycO population

[Figure: ontology population workflow. Semagix Freedom knowledge extractor → Instance Data → Has CarbBank ID? (YES / NO) → IUPAC to LINUCS → LINUCS to GLYDE → Compare to Knowledge Base → Already in KB? (YES: next Instance; NO: Insert into KB)]

Ontology population workflow

LINUCS representation of a glycan (output of the IUPAC to LINUCS step):

[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}


GLYDE representation of the same glycan (output of the LINUCS to GLYDE step):

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
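The LINUCS to GLYDE step can be illustrated with a small recursive parser. This is a sketch under stated assumptions, not the converter used in the workflow: the function names are hypothetical, the element and attribute names are copied from the GLYDE example above, and only residues of the form `anomer-chirality-Xxxp...` with `(position+anomeric carbon)` links, as in the LINUCS example, are handled.

```python
import re
import xml.etree.ElementTree as ET

LINK = re.compile(r"^\((\d+)\+(\d+)\)$")                      # e.g. (4+1)
RESIDUE = re.compile(r"^([ab])-([DL])-([A-Z][a-z]{2})p(.*)$")  # e.g. b-D-GlcpNAc


def _read_bracket(text, pos):
    """Return the content of the '[...]' group at pos and the index just after it."""
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at offset {pos}")
    end = text.index("]", pos)
    return text[pos + 1:end], end + 1


def _parse_block(text, pos, parent):
    """Parse '{' children '}' and attach each child residue to parent."""
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at offset {pos}")
    pos += 1
    while pos < len(text) and text[pos] == "[":
        link, pos = _read_bracket(text, pos)
        name, pos = _read_bracket(text, pos)
        link_m, res_m = LINK.match(link), RESIDUE.match(name)
        if not link_m or not res_m:
            raise ValueError(f"unsupported node: [{link}][{name}]")
        anomer, chirality, stem, suffix = res_m.groups()
        node = ET.SubElement(parent, "residue", {
            "link": link_m.group(1),
            "anomeric_carbon": link_m.group(2),
            "anomer": anomer,
            "chirality": chirality,
            "monosaccharide": stem + suffix,  # drops the ring-form 'p'
        })
        pos = _parse_block(text, pos, node)
    if pos >= len(text) or text[pos] != "}":
        raise ValueError(f"expected '}}' at offset {pos}")
    return pos + 1


def linucs_to_glyde(text):
    """Convert a LINUCS string rooted at an aglycon into a GLYDE-like element tree."""
    glycan = ET.Element("Glycan")
    _, pos = _read_bracket(text, 0)           # the root carries an empty link
    aglycon, pos = _read_bracket(text, pos)
    ET.SubElement(glycan, "aglycon", {"name": aglycon})
    if _parse_block(text, pos, glycan) != len(text):
        raise ValueError("trailing characters after the root block")
    return glycan


LINUCS = ("[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
          "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
          "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}")

print(ET.tostring(linucs_to_glyde(LINUCS), encoding="unicode"))
```

Because both notations are trees, the conversion is a direct structural mapping: each `[link][residue]{...}` node becomes one `residue` element nested under its parent.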




• Two aspects of glycoproteomics:
  o What is it? → identification
  o How much of it is there? → quantification

• Heterogeneity in data generation process, instrumental parameters, formats

• Need data and process provenance → ontology-mediated provenance

• Hence, ProPreO models both the glycoproteomics experimental process and attendant data

ProPreO ontology

ProPreO population: transformation to RDF

[Figure: Scientific Data → Computational Methods → Ontology instances (RDF), with a protein path and a peptide path. Protein path: Protein Data (amino-acid sequence) → Calculate Chemical Mass, Calculate Monoisotopic Mass, Determine N-glycosylation Consensus → ChemicalMass RDF, MonoisotopicMass RDF, Amino-acid Sequence RDF → "Protein RDF" (chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus). Peptide path: Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence → the same computational methods → "Peptide RDF" (chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein).]
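The computational methods named in the population figure can be sketched as follows. This is an illustration under assumptions, not the ProPreO population code: the function names, the property names in the triples, and the example sequences are hypothetical; the residue masses are standard monoisotopic values; the consensus test uses the usual N-X-S/T sequon with X not proline; and chemical (average) mass is omitted for brevity.

```python
import re

# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def monoisotopic_mass(sequence):
    """Monoisotopic mass of an unmodified peptide or protein sequence."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """0-based positions of N in N-X-S/T sequons, where X is not proline."""
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", sequence)]


def to_triples(subject, sequence, parent_protein=None):
    """Describe a protein or peptide as (subject, property, value) triples."""
    triples = [
        (subject, "amino-acid_sequence", sequence),
        (subject, "monoisotopic_mass", round(monoisotopic_mass(sequence), 4)),
        (subject, "n-glycosylation_consensus", bool(n_glycosylation_sites(sequence))),
    ]
    if parent_protein is not None:  # only the peptide path records a parent
        triples.append((subject, "parent_protein", parent_protein))
    return triples


protein = "MKANCTNPSNASGR"   # hypothetical protein sequence
peptide = protein[2:12]      # hypothetical peptide extracted from the protein

for triple in to_triples("protein_1", protein) + to_triples("peptide_1", peptide, "protein_1"):
    print(triple)
```

The protein and peptide paths share the same calculations; the peptide only adds the link back to its parent protein, as in the "Peptide RDF" box of the figure.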



• building large life science ontologies (GlycO, an ontology for structure and function of Glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications

• entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data

• semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive

• semantic applications developed



Relationship extraction from unstructured data

(other related research: biological entity extraction)

Overview

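[Figure: three layers. UMLS (schema): Biologically active substance and Lipid are related to Disease or Syndrome by affects, causes and complicates. MeSH (instances): Fish Oils and Raynaud’s Disease, tied to the schema by instance_of, with the relationship between them marked "???????". PubMed: 9284 documents, 4733 documents, 5 documents.]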

About the data used

• UMLS
  – A high-level schema of the biomedical domain
  – 136 classes and 49 relationships
  – Synonyms of all relationships, using variant lookup (tools from NLM)

• MeSH
  – Terms already asserted as instance of one or more classes in UMLS

• PubMed
  – Abstracts annotated with one or more MeSH terms

T147—effect T147—induce T147—etiology T147—cause T147—effecting T147—induced

Example PubMed abstract (for the domain expert)

Abstract

Classification/Annotation

Method – Parse Sentences in PubMed

SS-Tagger (University of Tokyo)

SS-Parser (University of Tokyo)

(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

Method – Identify entities and Relationships in Parse Tree

Modifiers, Modified entities, Composite Entities


Method – Fact Extraction from Parse Tree
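A minimal sketch of fact extraction from the parse tree shown earlier: read the bracketed tree, take the subject noun phrase, the verb and the object noun phrase of the sentence, and map the verb to a relationship through its variants. The helper names are hypothetical, the prefix match against the T147 variants listed earlier stands in for the NLM variant lookup, and the handling of modifiers and composite entities is left out.

```python
TREE = ("(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
        "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
        "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
        "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )")

# Variants listed for T147 on the data slide.
T147_VARIANTS = ("effect", "induce", "etiology", "cause", "effecting", "induced")


def parse_tree(text):
    """Parse a bracketed parse tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume ')'
        return (label, children)

    return read()


def words(node):
    """All leaf words under a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def child(node, prefix):
    """First direct child whose label starts with prefix, or None."""
    return next((c for c in node[1]
                 if not isinstance(c, str) and c[0].startswith(prefix)), None)


def extract_fact(tree):
    """Return (subject, verb, object) for a simple NP-VP sentence."""
    sentence = child(tree, "S")
    verb_phrase = child(sentence, "VP")
    subject, verb, obj = child(sentence, "NP"), child(verb_phrase, "VB"), child(verb_phrase, "NP")
    return tuple(" ".join(words(n)) for n in (subject, verb, obj))


def relationship_id(verb):
    """Naive variant lookup: prefix match against the T147 variants."""
    return "T147" if verb.lower().startswith(T147_VARIANTS) else None


subject, verb, obj = extract_fact(parse_tree(TREE))
print(subject, "|", verb, relationship_id(verb), "|", obj)
```

The extracted phrases still need to be matched against MeSH terms (for example, estrogen as the modified entity inside the subject phrase) before they become ontology-level facts.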

Semantic annotation of scientific/experimental data

ms/ms peaklist data

parent ion m/z    parent ion abundance    parent ion charge
830.9570          194.9604                2

fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043

ProPreO: Ontology-mediated provenance

Mass Spectrometry (MS) Data

<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
</ms-ms_peak_list>

Ontological Concepts

ProPreO: Ontology-mediated provenance

Semantically Annotated MS Data
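A sketch of how a raw peak list could be turned into the annotated XML shown above, using Python's standard `xml.etree.ElementTree`. The function name `annotate_peaklist` is an assumption, the element and attribute names are copied from the slide, and the first line of the peak list is assumed to hold the parent ion (m/z, abundance, charge) with one fragment ion per following line.

```python
import xml.etree.ElementTree as ET

PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""


def annotate_peaklist(text, instrument, mode="ms-ms"):
    """Wrap each number of an ms/ms peak list in an element named after its concept."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    parent_mz, parent_abundance, charge = rows[0]
    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion",
                  {"m-z": parent_mz, "abundance": parent_abundance, "z": charge})
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root


doc = annotate_peaklist(
    PEAKLIST, "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer")
print(ET.tostring(doc, encoding="unicode"))
```

Values are kept as strings so that the reported precision of the instrument output is preserved in the annotated file.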






N-Glycosylation Process (NGP)

[Figure: NGP workflow. Cell Culture → extract → Glycoprotein Fraction → proteolysis → Glycopeptides Fraction → Separation technique I → Glycopeptides Fraction → PNGase → Peptide Fraction → Separation technique II → Peptide Fraction → Mass spectrometry → ms data, ms/ms data → Data reduction → ms peaklist, ms/ms peaklist → binning, Peptide identification → N-dimensional array, Peptide list → Signal integration, Data correlation → Glycopeptide identification and quantification. Fraction multiplicities shown in the figure: 1, n, n*m.]

Semantic Web Process to incorporate provenance

[Figure: agents run a chain of steps whose inputs (I) and outputs (O) are semantically annotated with biological information and kept in storage. Steps: Biological Sample Analysis by MS/MS → Raw Data to Standard Format → Data Pre-process → DB Search (Mascot/Sequest) → Results Post-process (ProValt). Data: Raw Data → Standard Format Data → Filtered Data → Search Results → Final Output. Labels: Semantic Annotation, Applications.]

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene

Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

Biomedical Knowledge Repository

[Figure: Entrez, …. → Biomedical Knowledge Repository]

Implementation

[Figure: Entrez Gene → Entrez Gene XML → XSLT → Entrez Gene RDF → Entrez Gene RDF graph → Web interface]

Connecting different genes

APP gene [Homo sapiens]

APP gene [Gallus gallus]

APP gene [Canis familiaris]

protease nexin-II

amyloid beta A4 protein

amyloid-beta protein

A4 amyloid protein

beta-amyloid peptide

amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)

cerebral vascular amyloid peptide

amyloid protein

eg:has_protein_reference_name_E

amyloid beta A4 protein

Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
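The question above can be approached by looking for genes that share a protein reference name in the RDF graph. The sketch below uses plain Python tuples in place of an RDF store. The predicate `eg:has_protein_reference_name_E` and the shared name "amyloid beta A4 protein" come from the slide; which of the other names belongs to which gene is an illustrative assumption, and a shared name is only a candidate signal for functional homology, not proof of it.

```python
from collections import defaultdict

PREDICATE = "eg:has_protein_reference_name_E"

# (subject, predicate, object) triples; name-to-gene assignments are illustrative.
TRIPLES = [
    ("APP gene [Homo sapiens]", PREDICATE, "amyloid beta A4 protein"),
    ("APP gene [Homo sapiens]", PREDICATE, "protease nexin-II"),
    ("APP gene [Gallus gallus]", PREDICATE, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", PREDICATE, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", PREDICATE, "amyloid protein"),
]


def genes_sharing_protein_name(triples, gene):
    """Map every other gene to the protein reference names it shares with `gene`."""
    names_by_gene = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == PREDICATE:
            names_by_gene[subject].add(obj)
    own_names = names_by_gene.get(gene, set())
    return {
        other: sorted(own_names & names)
        for other, names in names_by_gene.items()
        if other != gene and own_names & names
    }


for other, shared in genes_sharing_protein_name(TRIPLES, "APP gene [Homo sapiens]").items():
    print(other, "shares", shared)
```

Once the Entrez Gene records are in RDF, this kind of connection is a simple graph traversal instead of a join across separately formatted records.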

[Figure: proteomics workflow. Steps: Raw2mzXML → mzXML2Pkl → Pkl2pSplit → MASCOT Search → ProVault. Data: Raw → mzXML → Pkl → pSplit → MASCOT result → ProVault result. Other components: Experimental Data, Semantic Annotation Metadata File, Semantic Metadata Registry, ProteomeCommons, SPARQL query-based User Interface.]

Integrated Semantic Information and Knowledge System (ISIS)

ProPreO ontology

EXPERIMENTAL DATA

Have I performed an error? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.

Is the result erroneous? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.
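As a rough sketch of what such a request amounts to once result files carry semantic metadata, the Python below filters a registry of result-file descriptions by the four criteria named in the question. The `ResultFile` record, its field names, the file paths and the metadata values are all hypothetical, and exact equality is used as a stand-in for "similar"; an ontology-backed system could relax the match through class hierarchies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultFile:
    path: str
    organism: str
    cell: str
    preparation: str
    ms_conditions: str


MATCH_FIELDS = ("organism", "cell", "preparation", "ms_conditions")


def similar_results(registry, reference, min_matches=len(MATCH_FIELDS)):
    """Result files agreeing with `reference` on at least `min_matches` criteria."""
    def matches(candidate):
        return sum(getattr(candidate, f) == getattr(reference, f) for f in MATCH_FIELDS)
    return [c for c in registry if c != reference and matches(c) >= min_matches]


# Hypothetical registry entries.
mine = ResultFile("run_17.result", "organism_A", "cell_X", "prep_1", "ms_setup_1")
REGISTRY = [
    mine,
    ResultFile("run_03.result", "organism_A", "cell_X", "prep_1", "ms_setup_1"),
    ResultFile("run_09.result", "organism_A", "cell_X", "prep_2", "ms_setup_1"),
    ResultFile("run_11.result", "organism_B", "cell_Y", "prep_1", "ms_setup_2"),
]

print([r.path for r in similar_results(REGISTRY, mine)])
print([r.path for r in similar_results(REGISTRY, mine, min_matches=3)])
```

Lowering `min_matches` widens the comparison set, which is one simple way to trade strictness for coverage when few identical runs exist.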

Summary, Observations, Conclusions

• We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

• http://lsdis.cs.uga.edu

• http://knoesis.org

• http://lsdis.cs.uga.edu/projects/asdoc/

• http://lsdis.cs.uga.edu/projects/glycomics/
