Life Sciences: a case study for the Semantic Web Professor Carole Goble Information Management Group University of Manchester UK

Page 1:

Life Sciences: a case study for the Semantic Web

Professor Carole Goble
Information Management Group
University of Manchester
UK

Page 2:

Pioneers and incubators

• The Web -> Physics
– well-organised microcosm of the general community.
– definite and clearly articulated information dissemination needs.
– smart motivated people prepared to co-operate, and with the means and desires to do so.

• The Semantic Web -> Life Sciences

Page 3:

Why Life Sciences?

• Knowledge-based discipline
– Collaborative history
– Publication shift: articles -> data -> knowledge
– Content with extensive metadata -> annotation & controlled vocabularies
– Highly contextual, unstable and fuzzy

• In silico experiments
– Information harvesting & PSE
– Orchestrating resources -> workflow
– Services that exploit enriched content
– Support for scientific/research method = SW issues
– Transparent collection of annotation

Page 4:

Why Life Sciences?

• Strong enthusiastic cohesive community
– I3C use cases
– Grass roots ontologies and annotation
– Distributed annotation services
– NEED for provenance, audit, security …
– A chance of concrete articulation
– Sanger, EBI & NCBI
– ISCB

Page 5:

[Diagram. Labels: Hypotheses; Design; Integration; Annotation / Knowledge Representation; Information Sources; Information Fusion; Clinical Resources; Individualised Medicine; Data Mining; Case-Base Reasoning; Data Capture; Clinical; Image/Signal; Genomic/Proteomic; Analysis; Knowledge Repositories; Model & Analysis Libraries; Disease Genetics & Pharmacogenomics]

Page 6:

Cows to Proteins

• Jim Hendler -> how many cows in Texas?

Q: What ATPase superfamily proteins are found in mouse?

A:
1. P21958 (from Swiss-Prot)
2. InterPro is a pattern database and could tell you
3. Attwood’s lab expertise is in nucleotide binding proteins ….

Page 7:

[Diagram. Sources: Drug formulary; Chemical database; High thro’put screening; Enzyme database; Receptor database; Expressn. database; Clinical trials database; SNPs database; Tissue database]

Which compounds interact with (alpha-adrenergic receptors) ((over expressed in (bladder epithelial cells)) but not (smooth muscle tissue)) of ((patients with urinary flow dysfunction) and a sensitivity to the (quinazoline family of compounds))?
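A rough sketch of how part of that question decomposes into operations over separate sources. Everything here is assumed for illustration: the three dictionaries stand in for extracts from a receptor database, an expression database and a chemical database, and every identifier is a placeholder. The patient-related clauses of the question are left out.

```python
# Hypothetical toy extracts; all identifiers below are placeholders, not real data.
RECEPTOR_FAMILY = {
    "alpha-adrenergic": {"receptor_1", "receptor_2", "receptor_3"},
}
OVEREXPRESSED_IN = {
    "bladder epithelial cells": {"receptor_1", "receptor_2"},
    "smooth muscle tissue": {"receptor_2", "receptor_3"},
}
INTERACTS_WITH = {
    "compound_A": {"receptor_1"},
    "compound_B": {"receptor_2"},
    "compound_C": {"receptor_1", "receptor_3"},
}

# (alpha-adrenergic receptors) over expressed in (bladder epithelial cells)
# but not (smooth muscle tissue): intersection, then set difference.
targets = (
    RECEPTOR_FAMILY["alpha-adrenergic"]
    & OVEREXPRESSED_IN["bladder epithelial cells"]
) - OVEREXPRESSED_IN["smooth muscle tissue"]

# Which compounds interact with at least one of those receptors?
compounds = sorted(
    name for name, receptors in INTERACTS_WITH.items() if receptors & targets
)
print(targets, compounds)
```

The set operations are the easy part. The join only works because all three toy sources happen to use the same receptor identifiers, which is exactly what independent databases do not guarantee.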

Page 8:

Webs of Knowledge

Page 9:

Interoperating e-Services

[Diagram: a set of separate boxes, each labelled "Service provider"]

Interoperation is by hand or Perl scripts

Page 10:

But surely this is just all about querying and linking (lots of) databases?

Isn’t the information all computationally accessible already?

The document publishing navigation interface legacy

Page 11:

Navigation-based interaction

Page 12:

Identity

Page 13:

“Inaccessible” Descriptions

• Evolving
• Non-predictive
• The structured part of the schema is open to change
• Hence flat file mark up’s prevalence
• XML is king.

Page 14:

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [6]
RP   STRUCTURE BY NMR OF 23-231.
RX   MEDLINE=97424376 [NCBI, ExPASy, Israel, Japan]; PubMed=9280298;
RA   Riek R., Hornemann S., Wider G., Glockshuber R., Wuethrich K.;
RT   "NMR characterization of the full-length recombinant murine prion protein, mPrP(23-231).";
RL   FEBS Lett. 413:282-288(1997).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC       EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC       NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC       FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC       SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC       CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC       (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC       PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.

Swiss-Prot flat file
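To make the flat-file point concrete, here is a minimal sketch of reading such a record by its two-letter line codes. It assumes one field per line, with the line code in the first two columns, and uses an abbreviated excerpt of the record. The function names are illustrative and do not come from any Swiss-Prot tool.

```python
from collections import defaultdict

# Abbreviated excerpt of the PRIO_HUMAN record (line code, spaces, content).
RECORD = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.
"""


def parse_flat_file(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip blank lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


def cross_references(fields):
    """Map each database named on a DR line to its primary identifier."""
    refs = {}
    for entry in fields.get("DR", []):
        parts = [part.strip() for part in entry.rstrip(".").split(";")]
        refs[parts[0]] = parts[1]
    return refs


fields = parse_flat_file(RECORD)
accession = fields["AC"][0].rstrip(";")
xrefs = cross_references(fields)
keywords = [k.strip() for k in fields["KW"][0].rstrip(".").split(";")]
print(accession, xrefs, keywords)
```

The structured fields (AC, DR, KW) split cleanly. The CC comment lines, where most of the biological knowledge sits, are free text and would need information extraction rather than string splitting.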

Page 15:

Literature holds knowledge

Consequence -> information extraction big business & metadata is required.

Page 16:

Community-wide markup

Annotation and Curation

“the elucidation and description of biologically relevant features”

Computationally formed – e.g. cross references to other database entries, date collected;

Intellectually formed – the accumulated knowledge of an expert distilling the aggregated information drawn from multiple data sources and analyses, and the annotator's knowledge.

Expressed Sequence Tags: millions
nrdb: 503,479
TrEMBL: 234,059
Swiss-Prot: 85,661
InterPro: 2990
PRINTS: 1310

Page 17:

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H., Prusiner S.B., Dearmond S.J.
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [6]
RP   STRUCTURE BY NMR OF 23-231.
RX   MEDLINE=97424376 [NCBI, ExPASy, Israel, Japan]; PubMed=9280298;
RA   Riek R., Hornemann S., Wider G., Glockshuber R., Wuethrich K.;
RT   "NMR characterization of the full-length recombinant murine prion protein, mPrP(23-231).";
RL   FEBS Lett. 413:282-288(1997).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE HOST GENOME AND IS
CC       EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND ANIMALS INFECTED WITH
CC       NEURODEGENERATIVE DISEASES KNOWN AS TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES, LIKE: CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME (GSS),
CC       FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN SHEEP AND GOAT; BOVINE
CC       SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME);
CC       CHRONIC WASTING DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY
CC       (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND GREATER KUDU. THE
CC       PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism; Disease mutation.

Swiss-Prot Annotation

Page 18:

gc; PRION
gx; PR00341
gt; Prion protein signature
gp; INTERPRO; IPR000817
gp; PROSITE; PS00291 PRION_1; PS00706 PRION_2
gp; BLOCKS; BL00291
gp; PFAM; PF00377 prion
bb;
gr; 1. STAHL, N. AND PRUSINER, S.B.
gr; Prions and prion proteins.
gr; FASEB J. 5 2799-2807 (1991).
gr;
gr; 2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M.
gr; The scrapie agent and the prion hypothesis.
gr; TRENDS BIOCHEM.SCI. 13 309-313 (1988).
gr;
gr; 3. PRUSINER, S.B.
gr; Scrapie prions.
gr; ANNU.REV.MICROBIOL. 43 345-374 (1989).
bb;
gd; Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with
gd; certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy (BSE),
gd; and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP is
gd; encoded in the host genome and is expressed both in normal and infected cells. During infection, however, the
gd; PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.
gd;
gd; PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are
gd; anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also
gd; expressed in other tissues, indicating that it may have different functions depending on its location.
gd;
gd; The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain
gd; containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an
gd; essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken
gd; glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been
gd; suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is
gd; not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate susceptibility.
gd;
gd; PRION is an 8-element fingerprint that provides a signature for the prion proteins. The fingerprint was
gd; derived from an initial alignment of 5 sequences: the motifs were drawn from conserved regions spanning
gd; virtually the full alignment length, including the 3 hydrophobic domains and the octapeptide repeats
gd; (WGQPHGGG). Two iterations on OWL18.0 were required to reach convergence, at which point a true set comprising
gd; 9 sequences was identified. Several partial matches were also found: these include a fragment (PRIO_RAT)
gd; lacking part of the sequence bearing the first motif, and the PrP homologue found in chicken - this matches
gd; well with only 2 of the 3 hydrophobic motifs (1 and 5) and one of the other conserved regions (6), but has an
gd; N-terminal signature based on a sextapeptide repeat (YPHNPG) rather than the characteristic PrP octapeptide.

PRINTS Annotation

Page 19:

The “Annotation Workflow”

[Diagram. Databases: EMBL; Swiss-Prot; PRINTS; GPCRDB; TrEMBL. Four boxes labelled "Analysis"]

Page 20:

In silico experiments

Nicola: Domain; Task; Events ontologies

Simon: Support of research itself

Page 21:

In silico experiments

• Resource discovery, interoperation, fusion, sharing, finding, filtering
• Work flows
• Science is dynamic – change propagation
• Problem Solving Environments
• Collaborative and dynamic virtual organisations

Page 22:

Annotating the annotations

• Transparent annotation by side effect
• Provenance, Trust, Authentication
• Audit
• Versioning, roll-backs and snap shots
• Confidentiality
• Credit – digital signatures
• Authorisation & security …
• Automated side effects as part of the PSE
• All potentials for Semantic Web Markup
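One way to picture "annotation by side effect" is a wrapper that records provenance each time an analysis step runs, without the scientist doing anything extra. The sketch below is only an illustration under assumed names: `PROVENANCE_LOG`, `with_provenance` and `count_residues` are hypothetical, and a real problem solving environment would persist and secure these records rather than keep them in a list.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store, for illustration only.
PROVENANCE_LOG = []


def with_provenance(tool_version):
    """Decorator that logs what ran, on which inputs, and when."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The provenance record is written as a side effect of the call.
            PROVENANCE_LOG.append({
                "step": func.__name__,
                "tool_version": tool_version,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorate


@with_provenance(tool_version="0.1")
def count_residues(sequence, residue):
    """Toy analysis step: count occurrences of one residue letter."""
    return sequence.count(residue)


# The octapeptide repeat quoted in the PRINTS entry serves as toy input.
glycine_count = count_residues("WGQPHGGG", "G")
print(glycine_count, PROVENANCE_LOG[0]["step"])
```

The caller only sees the analysis result. Audit, versioning and credit can later be derived from the accumulated records.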

Page 23:

Not just data and tools…

Laboratories

Teams

Repositories

People

Page 24:

Problem Space

• Ability to store and retrieve huge volumes of information
• Ability to capture, enrich, classify, publish and structure knowledge about
  • Domains
  • Organisations
  • Individuals
  • Research Collaborations
  • Experiments
  • Results
  • Services

Page 25:

Share info -> share meaning

[Diagram: a set of separate boxes, each labelled "Service provider"]

Page 26:

Ontologies are big news

• Gene Ontology
– Marking up annotation of major databases
– Identity, Linking databases together
– Classification/index framework for instances & results
– It is sloppy but it is used by everybody!
– Gene Ontology -> DAML+OIL -> inference!

• http://www.geneontology.org
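As a rough illustration of the "inference" step, the simplest kind of reasoning over an ontology is following is-a links upward: an entry annotated with a specific term is implicitly annotated with every more general term. The sketch below uses plain Python instead of DAML+OIL, and the term hierarchy is a hypothetical, simplified stand-in and not real Gene Ontology content.

```python
# Hypothetical, simplified is-a edges (child -> parents); not real GO content.
IS_A = {
    "ATPase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "nucleotide binding": ["binding"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
}


def ancestors(term, is_a=IS_A):
    """All terms reachable from `term` by following is-a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_annotations(direct_terms):
    """Direct annotations plus everything they imply via is-a."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term)
    return implied


# Hypothetical direct annotation of one protein.
direct = {"ATPase activity", "nucleotide binding"}
implied = implied_annotations(direct)
print(sorted(implied))
```

A query for everything with "catalytic activity" would then also find entries annotated only with the more specific term. A description logic such as DAML+OIL supports richer inference than this transitive closure, for example classifying terms from their definitions.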

Page 27:

BioOntology Consortium

• 150 people attended the last BOC meeting
• GSK and BOC mandated DAML+OIL
• Plethora of other ontologies
– Bioinformatics
  • Many ontologies but under control
– Medical informatics
  • Tons of ontologies, out of control
• Representing the natural world is tough!!
– Sufficiency conditions …

Page 28:

[Diagram. Labels: Structural Genomics; Population Genetics; Genome sequence; Functional genomics; Tissue; Clinical trial; Disease; Clinical Data]

• Data resources have been built introspectively for human researchers

• Information is machine readable not machine understandable

• Sharing vocabulary is a step towards unification

Page 29:

“The technical advantages of knowledge modeling are obvious. Knowledge bases can be automatically checked for consistency; they support inference mechanisms which derive data which have not been explicitly stored; they also offer extensive request and navigation facilities. However, the most immediate benefit of knowledge base design lies in the modeling process itself, through the effort of explication, organization and structuration [sic] of the knowledge it requires.”

Editorial: Bioinformatics, July 2000

Page 30:

Quality & Stability

• Open Knowledge & transparency
• Data quality
• Inconsistency, incompleteness
• Provenance
• Contamination, noise, experimental rigour
• Data irregularity
• Evolution, Audit, Versioning

“ … the problem in the field is not a lack of good integrating software, Smith says. The packages usually end up leading back to public databases. "The problem is: the databases are God-awful," he told BioMedNet.

If the data is still fundamentally flawed, then better algorithms add little”

Temple Smith, director of the Molecular Engineering Research Center at Boston University, BioMedNet 2000

Page 31:

Supporting Science

• All the great stuff Simon talked about
• Information is contextual
• Personalisation
– My view of a metabolic pathway
– My experimental process flows
• Science is not linear
– What did we know then
– What do we know now
• Longevity of data
– It has to be available in 50 years' time.

Page 32:

The Grid

• Large scale distributed data management
• Large scale distributed computation
• High speed communications
• Dynamic collaborative virtual organisations
• UK Govt £120 million
• http://www.gridform.org

Page 33:

Eating our own dog food

myGrid

• UK research council funded e-Science Project
• Start 1st October for 36-42 months
• £3.4 million
• 6 academic partners, 8 commercial
• 19 FTEs

• Web Services + Semantic Web + Grid

• http://www.mygrid.org.uk

Page 34:

myGrid Objectives

• Straightforward discovery, interoperation, sharing
– information AND processes AND best practice
• Improving quality of both experiments and data
– provenance through information <-> process linkage
– propagating change
• Individual creativity & collaborative working
• Enabling genomic level bioinformatics

Page 35:

myGrid Technologies

• Database access from the Grid
• Process enactment on the Grid
• Personalisation services
• Metadata services & Ontologies
– DAML+OIL !!
• Laying the foundations for
– Agent Services
– Collaboration Environments
– Service composition
• Ontologies, Protocols & APIs

Grid + Services + Semantic Web

Page 36:

“Bioinformatics is a knowledge-based discipline. Many predictions, and interpretations, of data in biology are made by comparing the data in hand against existing knowledge”

Dr. Andy Brass, ad nauseam

• Analogy/knowledge-based rather than axiom-based

Page 37:

Remarks

• Semantic Web literacy in biology weak
• Grid literacy in biology strong
• Biology loves XML and ignores RDF
– Annotations sit in other (non RDF) databases.
• Role of (legacy) databases and semantic web markup
– Lots of metadata already in databases
– Will we really mark up every database instance?
– Exporting results as RDF
– Using inference over results of queries
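"Exporting results as RDF" can be sketched as turning each row of a query result into triples, leaving the legacy database itself untouched. The example below writes N-Triples by hand. The `http://example.org/` namespace, the property names and the function names are all assumptions made for illustration, not an agreed vocabulary; the row values are taken from the PRIO_HUMAN record.

```python
BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def escape_literal(value):
    """Escape backslashes and double quotes (other escapes are not handled)."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(rows, id_column):
    """Serialise query-result rows as N-Triples, one triple per non-key column."""
    lines = []
    for row in rows:
        subject = f"<{BASE}entry/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column:
                continue  # the key column becomes the subject URI
            predicate = f"<{BASE}property/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return "\n".join(lines)


# One result row, as a relational query might return it.
rows = [{"accession": "P04156", "organism": "Homo sapiens", "length": 253}]
ntriples = rows_to_ntriples(rows, "accession")
print(ntriples)
```

Marking up only the results of queries sidesteps the question of whether every database instance has to be annotated in advance, and the exported triples are what an inference step would then work over.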

Page 38:

Remarks

• Change management
– What did we know then?
• Custodianship, guardianship, longevity…
• Performance, robustness, scale.
• Tools & easy to use environments
• Demonstrators

Page 39:

How does this bit fit??