Semantic empowerment of Life Science Applications October 2006

Semantic empowerment of Life Science Applications October 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia Acknowledgement: NCRR funded Bioinformatics of Glycan Expression , collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.


Page 1: Semantic empowerment of Life Science Applications October  2006

Semantic empowerment of Life Science Applications

October 2006

Amit Sheth LSDIS Lab, Department of Computer Science,

University of Georgia

Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York)

and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.

Page 2: Semantic empowerment of Life Science Applications October  2006

Computation, data and semantics in life sciences

• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999

• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000

• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins

• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb

We will show how semantics is a key enabler for achieving the above predictions and visions, in which information and process play a critical role.

Page 3: Semantic empowerment of Life Science Applications October  2006

Semantic Web and Life Science

• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)

• How much is that?
– Compare it to the estimate of the total words ever spoken by humans = 12 exabytes

• Death by data

• The need for
– Search
– Integration
– Analysis, decision support
– Discovery

Not data, but analysis and insight, leading to decisions and discovery

Page 4: Semantic empowerment of Life Science Applications October  2006

Semantic empowerment of Life Science Applications

Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world.

We need more automated ways for integration and analysis leading to insight and discovery

- to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.

Page 5: Semantic empowerment of Life Science Applications October  2006

Benefits of Semantics

• Development of large domain-specific knowledge – for reference, common nomenclature, tagging

• Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases

• Semantic search, browsing, integration analysis, and discovery

Faster and more reliable discovery leading to quality of life improvements

Page 6: Semantic empowerment of Life Science Applications October  2006

What is semantics & Semantic Web

• Meaning and use of data

• From syntax and structure to semantics (beyond formatting, organization, query interfaces, …)

• XML -> RDF -> OWL -> Rules -> Trust

• Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge

• (Automatic) Semantic annotation, reasoning, …

• Also, increasing use of Service-Oriented Architecture -> semantic Web services

• W3C SW for Health Care and Life Sciences

Page 7: Semantic empowerment of Life Science Applications October  2006

Semantic empowerment of Life Science Applications

This talk will demonstrate some of the efforts in:

• Building large (populated) life science ontologies (GlycO, ProPreO)

• Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry)

• Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition

• Ontology-driven applications developed

Page 8: Semantic empowerment of Life Science Applications October  2006

Semantic Applications

• Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule-based decision support

• Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future instance) level relationships

• N-glycosylation process: an example of scientific workflow

• Integrated Semantic Information & Knowledge System (ISIS): integrated access and analysis of structured databases, scientific literature and experimental data

Others we will not discuss: SemBowser, SemDrug, ….

Let us start with a couple of simple applications

Page 9: Semantic empowerment of Life Science Applications October  2006

Life Science Ontologies

• ProPreO
– An ontology for capturing process and lifecycle information related to proteomic experiments
– 398 classes, 32 relationships
– 3.1 million instances
– Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)

• GlycO
– An ontology for structure and function of Glycopeptides
– 573 classes, 113 relationships
– Published through the National Center for Biomedical Ontology (NCBO)

Page 10: Semantic empowerment of Life Science Applications October  2006

N-Glycosylation metabolic pathway

GNT-I attaches GlcNAc at position 2

UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2

GNT-V attaches GlcNAc at position 6

UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021

[Figure labels: N-acetyl-glucosaminyl_transferase_V; N-glycan_beta_GlcNAc_9; N-glycan_alpha_man_4]

Page 11: Semantic empowerment of Life Science Applications October  2006

• Challenge – model hundreds of thousands of complex carbohydrate entities

• But the differences between the entities are small (e.g., just one component)

• How to model all the concepts but preclude redundancy → ensure maintainability, scalability

GlycO ontology

Page 12: Semantic empowerment of Life Science Applications October  2006

[Figure: GlycoTree, a branched N-glycan structure built from D-GlcpNAc and D-Manp residues joined by (1-2), (1-3), (1-4) and (1-6) linkages. Source: N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251]

GlycoTree

Page 13: Semantic empowerment of Life Science Applications October  2006

EnzyO

• The enzyme ontology EnzyO is highly intertwined with GlycO. While its structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms

• GlycO together with EnzyO contains all the information that is needed for the description of metabolic pathways – e.g. N-Glycan Biosynthesis

Page 14: Semantic empowerment of Life Science Applications October  2006

Pathway representation in GlycO

Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.

Page 15: Semantic empowerment of Life Science Applications October  2006

Zooming in a little …

The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145.

The product of this reaction is the Glycan with KEGG ID 00020.

[Figure labels: Reaction R05987 catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue; N-glycan_b-D-GlcpNAc_13]
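The idea that pathways can be inferred from reaction descriptions, rather than stored explicitly, can be illustrated with a minimal Python sketch. It is not the GlycO reasoning mechanism, only a plain graph walk over reaction records. The reaction R05987 and its enzyme class are taken from the slide; the "G" prefix on the glycan IDs, the second reaction's identifier, and the treatment of reactions as one-directional are assumptions made for illustration.

```python
from collections import defaultdict

# Reaction records: (reaction_id, enzyme, substrate, product).
# R05987 and EC 2.4.1.145 come from the slide. The "G" prefix on glycan IDs
# and the identifier "R_HYPOTHETICAL" are assumptions for illustration.
REACTIONS = [
    ("R05987", "EC 2.4.1.145", "G00015", "G00020"),
    ("R_HYPOTHETICAL", "GNT-V", "G00020", "G00021"),
]


def infer_pathways(reactions, start):
    """Return every maximal chain of reactions reachable from `start`."""
    by_substrate = defaultdict(list)
    for reaction in reactions:
        by_substrate[reaction[2]].append(reaction)

    pathways = []

    def extend(glycan, path, seen):
        # Follow only reactions whose product has not been visited yet.
        candidates = [r for r in by_substrate[glycan] if r[3] not in seen]
        if not candidates:
            if path:
                pathways.append(path)
            return
        for reaction in candidates:
            extend(reaction[3], path + [reaction], seen | {reaction[3]})

    extend(start, [], {start})
    return pathways


if __name__ == "__main__":
    for pathway in infer_pathways(REACTIONS, "G00015"):
        print(" -> ".join(f"{r[0]} ({r[1]})" for r in pathway))
```

No pathway object is stored anywhere in this sketch: the chain is recovered purely from the substrate and product fields of each reaction.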

Page 16: Semantic empowerment of Life Science Applications October  2006

• Multiple data sources used in populating the ontology
o KEGG - Kyoto Encyclopedia of Genes and Genomes
o SWEETDB
o CARBANK Database

• Each data source has different schema for storing data

• There is significant overlap of instances in the data sources

• Hence, entity disambiguation and a common representational format are needed

GlycO population
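A rough Python sketch of why a common representation helps with overlapping instances: once every record is mapped to one canonical structure string, duplicates from different sources collapse onto the same key. The per-source field names below are hypothetical, and the sketch assumes the structures have already been converted to a single notation; whitespace-insensitive string equality stands in for real entity disambiguation.

```python
# Hypothetical field names: each source is assumed to store the structure
# string under a different key. Real source schemas differ from this.
STRUCTURE_FIELD = {"KEGG": "structure", "SWEETDB": "linucs", "CARBANK": "iupac"}


def canonical_key(record):
    """Map a source-specific record onto a whitespace-free structure string."""
    raw = record[STRUCTURE_FIELD[record["source"]]]
    return "".join(raw.split())


def populate(records):
    """Build a knowledge base keyed by canonical structure, merging duplicates."""
    kb = {}
    for record in records:
        key = canonical_key(record)
        entry = kb.setdefault(key, {"structure": key, "sources": []})
        entry["sources"].append(record["source"])
    return kb


if __name__ == "__main__":
    sample = [
        {"source": "KEGG", "structure": "[][Asn]{ }"},
        {"source": "SWEETDB", "linucs": "[][Asn]{}"},
    ]
    for entry in populate(sample).values():
        print(entry["structure"], entry["sources"])
```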

Page 17: Semantic empowerment of Life Science Applications October  2006

[Flowchart: Ontology population workflow. Process nodes: Semagix Freedom knowledge extractor; Instance Data; IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Insert into KB. Decision nodes: Has CarbBank ID?; Already in KB? Branch labels: YES; NO; YES: next Instance]

Page 18: Semantic empowerment of Life Science Applications October  2006

[Flowchart: Ontology population workflow, same diagram as on Page 17, shown with an example glycan in LINUCS notation:]

[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}

Page 19: Semantic empowerment of Life Science Applications October  2006

[Flowchart: Ontology population workflow, same diagram as on Page 17, shown with the example glycan in GLYDE XML:]

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
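The LINUCS to GLYDE step can be sketched in Python as a small recursive-descent parser that turns the bracketed LINUCS string into nested residue elements. This is an illustrative reconstruction based only on the two example strings shown in the slides, not the converter used in the project. The grammar it assumes (each node is `[link][name]{children}`, links look like `(4+1)`, names look like `b-D-GlcpNAc`) and the rule of dropping the ring-form letter to get the monosaccharide name are inferred from those examples.

```python
import re
import xml.etree.ElementTree as ET

# Grammar inferred from the slide example (an assumption, not a full LINUCS spec):
#   node := "[" link "]" "[" name "]" "{" node* "}"
NODE = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]\{")
LINK = re.compile(r"\((\d+)\+(\d+)\)")
NAME = re.compile(r"([ab])-([DL])-([A-Z][a-z]{2})[pf](.*)")

EXAMPLE = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]"
    "{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)


def parse_siblings(text, pos, parent):
    """Parse consecutive nodes starting at `pos`; return the offset after them."""
    while pos < len(text) and text[pos] == "[":
        match = NODE.match(text, pos)
        if match is None:
            raise ValueError(f"malformed node at offset {pos}")
        link, name = match.groups()
        if link == "":
            # Root entry: the aglycon becomes an empty element and its
            # children hang directly off <Glycan>, as in the slide's XML.
            ET.SubElement(parent, "aglycon", name=name)
            target = parent
        else:
            lm, nm = LINK.fullmatch(link), NAME.fullmatch(name)
            if lm is None or nm is None:
                raise ValueError(f"unsupported link or name: {link!r} {name!r}")
            target = ET.SubElement(
                parent, "residue",
                link=lm.group(1), anomeric_carbon=lm.group(2),
                anomer=nm.group(1), chirality=nm.group(2),
                monosaccharide=nm.group(3) + nm.group(4),
            )
        pos = parse_siblings(text, match.end(), target)
        if pos >= len(text) or text[pos] != "}":
            raise ValueError(f"missing closing brace at offset {pos}")
        pos += 1
    return pos


def linucs_to_glyde(linucs):
    """Convert a LINUCS string into a GLYDE-style <Glycan> element tree."""
    text = "".join(linucs.split())
    root = ET.Element("Glycan")
    if parse_siblings(text, 0, root) != len(text):
        raise ValueError("trailing characters after the root node")
    return root


if __name__ == "__main__":
    print(ET.tostring(linucs_to_glyde(EXAMPLE), encoding="unicode"))
```

Running the sketch on the example string yields the same nesting as the XML on the slide: one aglycon element followed by a residue chain that branches at the beta-mannose.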

Page 20: Semantic empowerment of Life Science Applications October  2006

[Flowchart: Ontology population workflow, same diagram as on Page 17]

Page 21: Semantic empowerment of Life Science Applications October  2006

• Two aspects of glycoproteomics:
o What is it? → identification
o How much of it is there? → quantification

• Heterogeneity in data generation process, instrumental parameters, formats

• Need data and process provenance → ontology-mediated provenance

• Hence, ProPreO models both the glycoproteomics experimental process and attendant data

ProPreO ontology

Page 22: Semantic empowerment of Life Science Applications October  2006

ProPreO population: transformation to RDF

Scientific Data

Computational Methods

Ontology instances

Page 23: Semantic empowerment of Life Science Applications October  2006

[Diagram: ProPreO population, transformation to RDF. Scientific data pass through computational methods to produce RDF.

Key: Protein Path; Peptide Path

Input: Protein Data (amino-acid sequence)

Computational methods: Calculate Chemical Mass; Calculate Monoisotopic Mass; Determine N-glycosylation Consensus; Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence

Intermediate outputs: ChemicalMass RDF; MonoisotopicMass RDF; Amino-acid Sequence RDF

“Protein RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus

“Peptide RDF”: chemical mass, monoisotopic mass, amino-acid sequence, n-glycosylation consensus, parent protein]

Page 24: Semantic empowerment of Life Science Applications October  2006

Semantic empowerment of Life Science Applications

This talk will demonstrate some of the efforts in:

• building large life science ontologies (GlycO, an ontology for structure and function of Glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications

• entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data

• semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive

• semantic applications developed

Page 25: Semantic empowerment of Life Science Applications October  2006

Relationship extraction from unstructured data

(other related research: biological entity extraction)

Page 26: Semantic empowerment of Life Science Applications October  2006

Overview

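[Figure: Overview of relationship extraction across three resources.

UMLS (schema level): classes Biologically active substance, Lipid, and Disease or Syndrome, connected by the relationships affects, causes and complicates

MeSH (instance level): Fish Oils and Raynaud’s Disease, each attached by instance_of; the link between the two instances is marked "???????"

PubMed: document counts of 9284, 4733 and 5]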

Page 27: Semantic empowerment of Life Science Applications October  2006

About the data used

• UMLS
– A high level schema of the biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships, using variant lookup (tools from NLM)

• MeSH
– Terms already asserted as instance of one or more classes in UMLS

• PubMed
– Abstracts annotated with one or more MeSH terms

Example relationship variants: T147: effect, induce, etiology, cause, effecting, induced

Page 28: Semantic empowerment of Life Science Applications October  2006

Example PubMed abstract (for the domain expert)

Abstract

Classification/Annotation

Page 29: Semantic empowerment of Life Science Applications October  2006

Method – Parse Sentences in PubMed

SS-Tagger (University of Tokyo)

SS-Parser (University of Tokyo)

(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

Page 30: Semantic empowerment of Life Science Applications October  2006

Method – Identify entities and Relationships in Parse Tree

Page 31: Semantic empowerment of Life Science Applications October  2006

[Figure labels: Modifiers; Modified entities; Composite Entities]

Method – Identify entities and Relationships in Parse Tree

Page 32: Semantic empowerment of Life Science Applications October  2006

Method – Fact Extraction from Parse Tree
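As a rough illustration of fact extraction from a parse tree, the Python sketch below reads the bracketed parse shown on Page 29 and pulls out a subject, verb, object triple from the top-level clause. It is a deliberately naive stand-in for the method in the talk: it only handles a simple S consisting of an NP followed by a VP, and the lookup that maps the verb to the relationship identifier T147 uses the variant list from Page 27 with a crude strip-the-final-"s" normalisation, which is an assumption made here.

```python
import re

PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)

# Variants listed on the slide for relationship T147.
VARIANTS = {v: "T147" for v in
            ("effect", "induce", "etiology", "cause", "effecting", "induced")}


def read_tree(tokens):
    """Consume one bracketed constituent and return (label, children)."""
    if tokens.pop(0) != "(":
        raise ValueError("expected '('")
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        children.append(read_tree(tokens) if tokens[0] == "(" else tokens.pop(0))
    tokens.pop(0)  # drop the closing parenthesis
    return label, children


def words(node):
    """Return the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def extract_fact(parse):
    """Return (subject, verb, object) for a simple 'S -> NP VP' sentence."""
    tokens = re.findall(r"[()]|[^\s()]+", parse)
    _, (sentence,) = read_tree(tokens)
    subject = verb = obj = None
    for child in sentence[1]:
        if child[0] == "NP" and subject is None:
            subject = " ".join(words(child))
        elif child[0] == "VP":
            for part in child[1]:
                if part[0].startswith("VB"):
                    verb = " ".join(words(part))
                elif part[0] == "NP":
                    obj = " ".join(words(part))
    return subject, verb, obj


if __name__ == "__main__":
    subject, verb, obj = extract_fact(PARSE)
    # Naive normalisation (an assumption): strip a trailing "s" before lookup.
    print(subject, "|", verb, VARIANTS.get(verb.rstrip("s")), "|", obj)
```

The modifiers and composite entities distinguished in the slides are not separated here; the whole noun phrase is kept as one string on each side of the verb.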

Page 33: Semantic empowerment of Life Science Applications October  2006

Semantic annotation of scientific/experimental data

Page 34: Semantic empowerment of Life Science Applications October  2006

830.9570 194.9604 2

580.2985 0.3592

688.3214 0.2526

779.4759 38.4939

784.3607 21.7736

1543.7476 1.3822

1544.7595 2.9977

1562.8113 37.4790

1660.7776 476.5043

ms/ms peaklist data

[Column labels: the first line gives parent ion m/z, parent ion abundance and parent ion charge; each following line gives fragment ion m/z and fragment ion abundance]

ProPreO: Ontology-mediated provenance

Mass Spectrometry (MS) Data

Page 35: Semantic empowerment of Life Science Applications October  2006

<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>

Ontological Concepts

ProPreO: Ontology-mediated provenance

Semantically Annotated MS Data
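The step from the raw peak list on Page 34 to the annotated form on Page 35 can be sketched in Python as follows. The element and attribute names are copied from the slide; the function name `annotate` and the choice of the standard-library ElementTree module are assumptions for illustration, and the sketch produces only the XML wrapper, not links to ProPreO concepts.

```python
import xml.etree.ElementTree as ET

# Raw ms/ms peak list from the slide: the first line is the parent ion
# (m/z, abundance, charge); the remaining lines are fragment ions.
PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate(peaklist, instrument, mode="ms-ms"):
    """Wrap a whitespace-separated peak list in labelled XML elements."""
    rows = [line.split() for line in peaklist.splitlines() if line.strip()]
    mz, abundance, charge = rows[0]
    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion",
                  {"m-z": mz, "abundance": abundance, "z": charge})
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root


if __name__ == "__main__":
    print(ET.tostring(annotate(PEAKLIST, INSTRUMENT), encoding="unicode"))
```

Values are kept as strings so that the trailing zeros in the instrument output are preserved exactly.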

Page 36: Semantic empowerment of Life Science Applications October  2006

Semantic empowerment of Life Science Applications

This talk will demonstrate some of the efforts in:

• building large life science ontologies (GlycO, an ontology for structure and function of Glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications

• entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data

• semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive

• semantic applications developed

Page 37: Semantic empowerment of Life Science Applications October  2006

N-Glycosylation Process (NGP)

[Workflow figure. Materials: Cell Culture; Glycoprotein Fraction; Glycopeptides Fraction; Peptide Fraction. Operations: extract; proteolysis; Separation technique I; PNGase; Separation technique II; Mass spectrometry; Data reduction; binning; Peptide identification; Signal integration; Data correlation. Data products: ms data; ms/ms data; ms peaklist; ms/ms peaklist; Peptide list; N-dimensional array. Result: Glycopeptide identification and quantification. Multiplicity labels: 1; n; n*m]

Page 38: Semantic empowerment of Life Science Applications October  2006

Semantic Web Process to incorporate provenance

[Figure. Process steps: Biological Sample Analysis by MS/MS; Raw Data to Standard Format; Data Pre-process; DB Search (Mascot/Sequest); Results Post-process (ProValt). Storage: Raw Data; Standard Format Data; Filtered Data; Search Results; Final Output. Other labels: Agent (four instances); alternating O and I markers between steps; Biological Information; Semantic Annotation Applications]

Page 39: Semantic empowerment of Life Science Applications October  2006

Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene

Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)

Page 40: Semantic empowerment of Life Science Applications October  2006

Biomedical Knowledge Repository

Entrez

BiomedicalKnowledgeRepository

….

Page 41: Semantic empowerment of Life Science Applications October  2006

Implementation

XSLT

Entrez Gene Entrez Gene XML

Entrez Gene RDF graph Entrez Gene RDF

Page 42: Semantic empowerment of Life Science Applications October  2006

Web interface

XSLT

ENTREZ GENE ENTREZ GENE XML

ENTREZ GENE RDF GRAPH ENTREZ GENE RDF….

Page 43: Semantic empowerment of Life Science Applications October  2006

Implementation

XSLT

Entrez Gene Entrez Gene XML

Entrez Gene RDF graph Entrez Gene RDF

Page 44: Semantic empowerment of Life Science Applications October  2006

Connecting different genes

APP gene [Homo sapiens]

APP gene [Gallus gallus]

APP gene [Canis familiaris ]

protease nexin-II

amyloid beta A4 protein

amyloid-beta protein

A4 amyloid protein

beta-amyloid peptide

amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)

cerebral vascular amyloid peptide

amyloid protein

eg:has_protein_reference_name_E

amyloid beta A4 protein

Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene?
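One way to read the question on this slide is as a graph lookup: find other genes that share a protein reference name with the human APP gene. The Python sketch below shows that lookup over plain tuples. The gene labels, the predicate name and the protein names are taken from the slide, but which gene carries which name is not recoverable from the flattened figure, so the triples below are hypothetical. Sharing a name is used here only as a simple proxy and does not establish functional homology.

```python
PREDICATE = "eg:has_protein_reference_name_E"

# Hypothetical assignments: labels and names appear on the slide, but the
# pairing of genes with names is assumed for illustration.
TRIPLES = [
    ("APP gene [Homo sapiens]", PREDICATE, "amyloid beta A4 protein"),
    ("APP gene [Homo sapiens]", PREDICATE, "protease nexin-II"),
    ("APP gene [Gallus gallus]", PREDICATE, "amyloid beta A4 protein"),
    ("APP gene [Canis familiaris]", PREDICATE, "amyloid beta A4 protein"),
]


def genes_sharing_names(triples, gene, predicate=PREDICATE):
    """Return the other genes that share at least one protein name with `gene`."""
    names = {o for s, p, o in triples if s == gene and p == predicate}
    return sorted({s for s, p, o in triples
                   if p == predicate and o in names and s != gene})


if __name__ == "__main__":
    for other in genes_sharing_names(TRIPLES, "APP gene [Homo sapiens]"):
        print(other)
```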

Page 45: Semantic empowerment of Life Science Applications October  2006

Integrated Semantic Information and Knowledge System (ISIS)

[Figure. PROTEOMICS WORKFLOW steps: Raw2mzXML; mzXML2Pkl; Pkl2pSplit; MASCOT Search; ProVault. EXPERIMENTAL DATA: Raw; mzXML; Pkl; pSplit; MASCOT result; ProVault result. Other components: Semantic Annotation; Metadata File; Semantic Metadata Registry; ProPreO ontology; PROTEOMECOMMONS; SPARQL query-based User Interface]

Example user questions:

Have I performed an error? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.

Is the result erroneous? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results.

Page 46: Semantic empowerment of Life Science Applications October  2006

Summary, Observations, Conclusions

• We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

Page 47: Semantic empowerment of Life Science Applications October  2006

• http://lsdis.cs.uga.edu

• http://knoesis.org

http://lsdis.cs.uga.edu/projects/asdoc/

http://lsdis.cs.uga.edu/projects/glycomics/