Exploiting scientific data in the domain of ‘omics 'Genomics Standards Consortium Ontology requirements and experiences' Dawn Field Oxford Centre for Ecology and Hydrology


Exploiting scientific data in the domain of ‘omics

'Genomics Standards Consortium

Ontology requirements and experiences'

Dawn Field

Oxford Centre for Ecology and Hydrology


Overview

Goal of this Workshop: to explore what's been achieved to date with RDF, meta-data and ontologies in exploiting scientific data, particularly data integration, discovery and sharing

• what we have achieved
• the challenges we face
• what we hope to achieve in the near future
• what are the major issues requiring further research


Challenges and Opportunities

• Rapidly growing collection of genomes

• Increasing need for researchers to access, combine and analyze data sets containing genomic, taxonomic, ecological and environmental data

• Increasing number of initiatives capturing metadata

• Additional information about complete genome sequences would be beneficial


De novo DNA sequencing

Continues to grow exponentially

[Figure: growth of de novo DNA sequencing; credit: SymBio Corporation]


Data scope of genome resources at NCBI

[Figure: organism groups covered by NCBI genome resources]

• Viruses
• Microbes
• Fungi/small eukaryotes
• Nematoda: C. elegans, C. briggsae
• Insects: D. melanogaster, A. gambiae, D. pseudoobscura, honey bee
• Fishes
• Chicken
• Mouse/Rat
• Dog
• Pig, cow
• Chimpanzee
• Human
• Plants: A. thaliana, barley, corn, oat, rice, soybean, tomato, wheat

Organisms Environmental samples?


The Promise of Metagenomics


Features of GBMF marine microbial genome sequencing project webpage

Acts as portal to primary investigator webpage

Provides basic information about the organism

1) Phylogeny of organism

2) Physiology, if known

3) Habitat

4) Geographic location

5) Isolation technique

6) Primary citation

7) Culture collection

www.moore.org/microgenome


Problems

DATA INTEGRATION!!!!!

INSUFFICIENT DATA REGARDING PHYSIOLOGY OF ORGANISMS!!!!


Morphology and Growth

• Haemophilus influenzae is a non-motile, Gram-negative, rod-shaped bacterium. Optimal growth temperature is 37 °C and doubling time in culture is 26 minutes.


Interactions and Ecology

• H. influenzae is an obligate commensal with the ability to cause disease, including meningitis and otitis media. The primary habitat of this species is the human nasopharynx. This bacterium is facultatively anaerobic and uses organic matter as its source of both carbon and energy.
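Free-text descriptions like the two H. influenzae paragraphs are hard to compare across genomes. As a minimal sketch, the same statements could be held as named fields that can be queried. The field names are hypothetical and are not taken from the MIGS checklist, and the temperature unit is assumed to be degrees Celsius.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OrganismDescription:
    # All field names are illustrative assumptions, not MIGS terms.
    species: str
    motility: str
    gram_stain: str
    cell_shape: str
    optimal_growth_temp_c: float  # unit assumed to be degrees Celsius
    doubling_time_min: float
    primary_habitat: str
    oxygen_requirement: str
    diseases: tuple


# Values transcribed from the two free-text slides.
h_influenzae = OrganismDescription(
    species="Haemophilus influenzae",
    motility="non-motile",
    gram_stain="negative",
    cell_shape="rod",
    optimal_growth_temp_c=37.0,
    doubling_time_min=26.0,
    primary_habitat="human nasopharynx",
    oxygen_requirement="facultative anaerobe",
    diseases=("meningitis", "otitis media"),
)


def matches(record, **criteria):
    """Return True if every requested field equals the requested value."""
    fields = asdict(record)
    return all(fields.get(name) == value for name, value in criteria.items())
```

With records in this form, a question such as "which sequenced organisms are Gram-negative rods?" becomes a simple filter rather than a text search.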


What we have achieved


Cataloguing our Complete Genome Collection

• Proposal: Field D & Hughes J (2005). Cataloguing our current genome collection. Microbiology 151: 1016-1019

• Analysis: Hughes J & Field D (2005). Ecological perspectives on our complete genome collection. Ecology Letters 8: 1334-1345

• Workshop: “Cataloguing our current genome collection”, Sept 7-9, 2005, NIEeS, Cambridge, UK; D. Field, G. Garrity, N. Morrison, J. Selengut, P. Sterk, N. Thomson, T. Tatusova. Meeting report, Comp. Func. Genomics.

• Genomic Standards Consortium (GSC): http://gensc.sourceforge.net

• Funding: “Cataloguing our current genome collection” (NERC International Opportunities Fund Award: NE/3521773/1)


Cataloguing our Complete Genome Collection

• Workshop: “Cataloguing our current genome collection II”, Nov 10-11, 2005, EBI, Cambridge, UK; D. Field, N. Morrison, J. Selengut, P. Sterk. Meeting report, OMICS (in press)

• Special issue of OMICS on data standards: guest editors Dawn Field and Susanna Sansone; organized around first two GSC workshops

• Funding: “Cataloguing our current genome collection” funding from NIEeS for two more workshops in June 2006 and 2007

• Workshop: 3rd GSC workshop Sept 11-13, 2006 NIEeS, Cambridge UK. Co-organizers Dawn Field and Tatiana Tatusova

• Genome Catalogue: Launch of implementation of MIGS checklist as a database ready to accept case study genomes


Overview of GSC activities

The aim of the Genomic Standards Consortium (GSC) is to support the community-based development of a genomic standard that captures a richer set of information about complete genomes and metagenomic datasets.

• Checklist

• Implementation

• Ontology development

• Metadata exchange


Overview of GSC activities

• Checklist: GSC members are currently working together towards the "Minimal Information about a Genome Sequence" (MIGS) specification.

• Implementation: To promote discussion and support the capture of preliminary data, an XML schema has been built from the checklist and implemented as the Genome Catalogue database.
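To make the step from checklist to XML concrete, here is a minimal sketch of serializing one checklist record as XML. It is not the GSC schema: the element names and the list of required fields are hypothetical, and where a real XML schema would declare the constraints, this sketch checks them in code.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields; the real MIGS schema is not reproduced here.
REQUIRED_FIELDS = ("organism", "habitat", "geographic_location")


def record_to_xml(record):
    """Serialize a flat dict of checklist fields to an XML string."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    root = ET.Element("genome_report")  # element name is an assumption
    for name, value in record.items():
        child = ET.SubElement(root, name)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = record_to_xml({
    "organism": "Haemophilus influenzae",
    "habitat": "human nasopharynx",
    "geographic_location": "unknown",
})
```

Rejecting incomplete records at submission time is one way a database built on a checklist can keep entries comparable.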


Overview of GSC activities

• Ontology development: The GSC is also developing controlled vocabularies for describing genomes; this work feeds into the FuGO project (A Functional Genomics Investigation Ontology).

• Metadata exchange: GFF3 and GnoME


The challenges we face


Challenges

• Defining the standard

• Collecting the data

• Fields can be calculated in a variety of ways; separate curated and calculated fields

• We don’t know enough about many of these genomes with respect to ‘lifestyle’

• Relationships between genomes

• Completeness of data
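The point about separating curated and calculated fields can be sketched as follows. This is an illustration under stated assumptions, not the Genome Catalogue design: the class, the field names and the choice of GC content as the calculated value are all hypothetical.

```python
from dataclasses import dataclass, field


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty one)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


@dataclass
class GenomeEntry:
    # Curated values are entered by a person; the keys are hypothetical.
    curated: dict = field(default_factory=dict)
    # Calculated values can always be re-derived from the sequence.
    calculated: dict = field(default_factory=dict)

    def recalculate(self, sequence):
        """Overwrite every calculated field; curated fields are untouched."""
        self.calculated = {
            "length_bp": len(sequence),
            "gc_content": gc_content(sequence),
        }


entry = GenomeEntry(curated={"habitat": "human nasopharynx"})
entry.recalculate("ATGCGC")
```

Keeping the two groups apart means a change in calculation method can be applied to every genome without overwriting anything a curator entered.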


Defining the Checklist

Taxonomic Groups

Eukaryotes, Bacteria/Archaea, Plasmids, Viruses, Organelles, Metagenomes

Concepts

Organism, Phenotype, Environment, Sample Processing, Data Processing

Implementation Working Group

Metadata Exchange Working Group


Proliferation of MI Checklists

• Upcoming special issue of OMICS: a journal of integrative biology on data standards includes descriptions of 7 checklists

• Upcoming issue of Nature Biotechnology expected to include more


Proteomics Standards Initiative (June 2006)

Special session: The proliferation of “MI” checklists: opportunities and challenges

Chris Taylor (EBI) Minimum Information About a Proteomics Experiment (MIAPE) and “MIxxx and the need for a central registry”

Dawn Field (CEH Oxford) Minimal Information about a Genome Sequence (MIGS)

Don Robertson (Pfizer Global R&D, Ann Arbor MI) MSI -- Metabolomics Standards Initiative.

Graeme Grimes (Scottish Centre for Genomic Technology and Information, Edinburgh, UK) Minimum Information About a RNAi Experiment (MIARE)

Stefan Wiemann (DKFZ, Heidelberg, Germany) Minimum Information About a Cellular Assay (MIACA)

Ryan Brinkman (UBC, Canada) presented by Chris Taylor (EBI) Minimum Information for a Fluorescence Activated Cell Experiment (MIFACE)


MICheck: A Minimum Information Checklist Portal

Chris Taylor, Dawn Field, Susanna-Assunta Sansone, Rolf Apweiler, Michael Ashburner, Cathy Ball, Pierre-Alain Binz, Alvis Brazma, Ryan Brinkman, Eric Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Graeme Grimes, Nigel Hardy, Henning Hermjakob, Randall Julian, Martin Kuiper, Nicholas Le Novère, Jim Leebens-Mack, Suzi Lewis, Ruth McNally, Norman Morrison, Norman Paton, John Quackenbush, Donald Robertson, Philippe Rocca-Serra, Barry Smith, Jason Snape, Stefan Wiemann


micheck.sourceforge.net

The MICheck website will provide
• a comprehensive list of MI checklists
• ‘convenience’ links to relevant resources: appropriate tools, data formats, ontologies
• links to relevant policy statements from various external bodies (such as funders’ data sharing policies, journals’ publication guidelines and so forth)
• contact(s) for submitting feedback
• where possible, most recent versions of checklists (either as a local copy or a link)
• charter for the group
• guidelines for registering a checklist
• sign-up details for the mailing list


micheck.sourceforge.net

The MICheck website will provide
• Minimal Information about a Minimal Information Checklist (MIMI)
• Searchable database of terms from all checklists
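A searchable database of terms from all checklists could, in its simplest form, be an index from term to the checklists that use it. The sketch below is an assumption about how that might look, not a description of MICheck. The MIGS terms are the concepts listed on the 'Defining the Checklist' slide; the second checklist and its terms are invented.

```python
from collections import defaultdict


def build_term_index(checklists):
    """Map each normalized term to the set of checklists that use it."""
    index = defaultdict(set)
    for checklist_name, terms in checklists.items():
        for term in terms:
            index[term.strip().lower()].add(checklist_name)
    return index


def shared_terms(index):
    """Terms that appear in more than one checklist (possible overlaps)."""
    return sorted(term for term, names in index.items() if len(names) > 1)


# "EXAMPLE-MI" is a made-up checklist used only to show an overlap.
checklists = {
    "MIGS": ["Organism", "Phenotype", "Environment",
             "Sample Processing", "Data Processing"],
    "EXAMPLE-MI": ["organism", "instrument", "data processing"],
}
index = build_term_index(checklists)
```

Matching on exact lower-cased strings is the crudest possible comparison; real overlap detection would need synonyms and definitions, which is where ontology work comes in.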


We propose that the MICheck play two primary roles:

• The first is to provide a ‘one-stop shop’ for researchers, journal editors and reviewers, and funders; providing a quick and simple way to discover (whether there are) guidelines for a particular domain.

• The second is to facilitate investigation of the boundaries, overlaps and gaps between projects, minimally by raising awareness of the scope and progress of extant efforts.


These two roles translate into two distinct parts of MICheck

• Portal: exists simply to raise awareness of, and afford simple access to a wide range of checklists; registering for the portal implies no commitment to integrate by the registrant.

• Foundry: communities can, if motivated, sign up to the foundry to jointly examine ways to refactor the checklists over which they have control and begin to produce the first components of a suite of self-consistent, clearly bounded, orthogonal, integrable checklist modules.


Registering a project

Domain: Genomics and metagenomics
Checklist type: Primary guidelines
Community Name: The Genomic Standards Consortium
Main website: http://gensc.sourceforge.org/
MI Checklist Name: Minimal Information about a Genomic Sequence
MI Checklist Acronym: MIGS
Current Version Number: 0.1
Release Date for current version: 2006-01-01
Primary Contact Person: Dr Jane Doe
Comments: Early draft based on first two exploratory workshops; public distribution for comment
Key concepts: eukaryotes, bacteria/archaea, plasmids, organelles, viruses, metagenomes, organism, phenotype, environment, sample processing, data processing
Bibliography: Publications to be reposited where possible
Location of document(s): http://sourceforge.net/project/showfiles.php?group_id=153365
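The registration form maps naturally onto a typed record. The sketch below copies a subset of the values from the slide; the class name, attribute names and the two validation rules are assumptions made for illustration, since the slide does not say how MICheck validates a submission.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChecklistRegistration:
    # Attribute names paraphrase the labels on the registration slide.
    domain: str
    checklist_type: str
    community_name: str
    main_website: str
    checklist_name: str
    acronym: str
    version: str
    release_date: date
    primary_contact: str

    def problems(self):
        """Return a list of validation problems (rules are hypothetical)."""
        found = []
        for name, value in vars(self).items():
            if isinstance(value, str) and not value.strip():
                found.append(f"{name} is empty")
        if not self.main_website.startswith(("http://", "https://")):
            found.append("main_website is not an http(s) URL")
        return found


migs = ChecklistRegistration(
    domain="Genomics and metagenomics",
    checklist_type="Primary guidelines",
    community_name="The Genomic Standards Consortium",
    main_website="http://gensc.sourceforge.org/",
    checklist_name="Minimal Information about a Genomic Sequence",
    acronym="MIGS",
    version="0.1",
    release_date=date.fromisoformat("2006-01-01"),
    primary_contact="Dr Jane Doe",
)
```

Storing the release date as a date rather than a string lets a portal sort registrations by recency or flag stale ones.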


Proteomics: three main efforts

• The Minimum Information About a Proteomics Experiment (MIAPE): HUPO Proteomics Standards Initiative

• The ‘Paris Guidelines’: sponsored by MCP

• Guidelines for the Next Ten Years of Proteomics: published by Proteomics

[Figure: diagram comparing the three efforts, labelled MCP 31, Proteomics 16, PSI 22+, with overlap counts 2, 1, 0, 1]

              MIAPE          MCP (‘Paris’)   Proteomics
Requirement   Facts          Facts + Q.A.    Facts + Q.A.
Breadth       Complete       MS / MSI (+)    Complete
Depth         Significant    Significant     Moderate
Drafting      Committees     Committee       Committee
Revision      PSI Meetings   Ad hoc          Not specified


Integrative Activities

Project [URL] Products

RSBI [www.mged.org/Workgroups/rsbi]

Cross-domain analysis of project structures; development of well-characterized generic concepts to facilitate integrative activities

FuGE [fuge.sf.net]

Object model (and markup language) to support the description of diverse experiments and development of new formats

FuGO [fugo.sf.net]

Ontology providing descriptors for a wide range of experimental workflows, equipment and data types

OBO Foundry [obofoundry.org]

Collaborative management of orthogonal (i.e. non-overlapping) ontologies covering diverse domains


Defining the Checklist

Taxonomic Groups

Eukaryotes, Bacteria/Archaea, Plasmids, Viruses, Organelles, Metagenomes

Concepts

Organism, Phenotype, Environment, Sample Processing, Data Processing

Implementation Working Group

Metadata Exchange Working Group

Investigation

‘Study’

‘Assay’


What we hope to achieve in the near future


FuGO

An Ontology for Functional Genomics Investigation

Susanna-Assunta Sansone (EBI): Overview

Trish Whetzel (Univ. of Pennsylvania): Microarray

Daniel Schober (EBI): Metabolomics

Chris Taylor (EBI): Proteomics

On behalf of the FuGO working group

http://fugo.sourceforge.net


FuGO - Rationale

Standardization activities in (single) domains
• Reporting structures, CVs/ontology and exchange formats

Pieces of a puzzle
• Standards should stand alone BUT also function together
- Build it in a modular way, maximizing interactions

Capitalize on synergies, where commonality exists

Develop a common terminology for those parts of an investigation that are common across technological and biological domains

[Figure: components of an investigation: Investigation Design; Source and Characteristics; Treatments; Collection; Sample Preparation; Instrumental Analysis (MS, NMR, array, etc.); Data Pre-Processing; Computational Analysis]


FuGO - Overview

Purpose
• NOT model biology, NOR the laboratory workflow
• BUT provide core of ‘universal’ descriptors for its components
- To be ‘extended’ by biological and technological domain-specific WGs
• No dependency on any Object Model
- Can be mapped to any object model, e.g. FuGE OM

Open source approach
• Protégé tool and Web Ontology Language (OWL)

[Figure: components of an investigation: Investigation Design; Source and Characteristics; Treatments; Collection; Sample Preparation; Instrumental Analysis (MS, NMR, array, etc.); Data Pre-Processing; Computational Analysis]


FuGO – Communities and Funds

List of current communities
• Omics technologies
- HUPO - Proteomics Standards Initiative (PSI)
- Microarray Gene Expression Data (MGED) Society
- Metabolomics Society – Metabolomics Standards Initiative (MSI)
• Other technologies
- Flow cytometry
- Polymorphism
• Specific domains of application
- Environmental groups (crop science and environmental genomics)
- Nutrition group
- Toxicology group
- Immunology groups

List of current funds
• NIH-NHGRI grant (C. Stoeckert, Univ. of Pennsylvania) for workshops and ontologist
• BBSRC grant (S.A. Sansone, EBI) for ontologist


FuGO – Processes

Coordination Committee
• Representatives of technological and biological communities
- Monthly conference calls

Developers WG
• Representatives and members of these communities
- Weekly conference calls

Documentation
• http://fugo.sourceforge.net

Advisory Board
• Advise on high level design and best practices
• Provide links to other key efforts
• Barry Smith, Buffalo Un and IFOMIS
• Frank Hartel, NIH-NCI
• Mark Musen, Stanford Un and Protégé Team
• Robert Stevens, Manchester Un
• Steve Oliver, Manchester Un
• Suzi Lewis, Berkeley Un and GO

-> cBiO will also oversee the Open BioMedical Ontology (OBO) initiative


FuGO – Strategy

Use cases -> within community activity
• Collect real examples

Bottom up approach -> within community activity
• Gather terms and definitions
- Each community in its own domain

Top down approach -> collaborative activity
• Develop a ‘naming convention’
• Build a top level ontology structure, is_a relationships
• Other foreseen relationships
- part_of (currently expressed in the taxonomy as cardinal_part_of)
- participate_in (input) and derive_from (output)
- describe or qualify
- located_in and contained_in

Binning terms in the top level ontology structure
• The higher semantics helps for faster ‘binning’
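The role of typed relationships in 'binning' can be illustrated with a small graph of terms. This is a sketch only: FuGO itself is built in Protégé/OWL, not in Python, and the term names below are invented. Only the relation names is_a and part_of come from the slide.

```python
from collections import defaultdict


class TermGraph:
    """Tiny store of typed relationships between term names."""

    def __init__(self):
        # relation name -> subject term -> set of target terms
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, subject, relation, target):
        self._edges[relation][subject].add(target)

    def ancestors(self, term, relation="is_a"):
        """All terms reachable from `term` by following one relation type."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self._edges[relation].get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical terms, chosen only to show two relation types side by side.
graph = TermGraph()
graph.add("mass_spectrometer", "is_a", "instrument")
graph.add("instrument", "is_a", "material_entity")
graph.add("ion_source", "part_of", "mass_spectrometer")
```

Following is_a upward from a newly gathered term shows which top-level bin it falls into, while part_of is kept separate so that a component is never mistaken for a kind of its parent.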


FuGO – Status and Plans

Binning process - ongoing
• Reconciliations into one canonical version
• Iterative process

Common working practices - established
• Each class consists of: term ID, preferred term, synonyms, definition and comments
• Sourceforge tracker to send comments on terms, definitions, relationships

Timeline for completion of core omics technologies
• Two years and several intermediate milestones
• Interim solution
- Community-specific CVs posted under the OBO

Ultimately FuGO will be part of the OBO Foundry (Core) Ontology

Overview paper – “Special Issue on Data Standards” OMICS journal
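The five parts of a class named under common working practices (term ID, preferred term, synonyms, definition, comments) can be sketched as a record with a synonym-aware lookup. The ID format and the example entry are invented for illustration and do not come from FuGO.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    # The five parts named on the slide.
    term_id: str
    preferred_term: str
    synonyms: list = field(default_factory=list)
    definition: str = ""
    comments: str = ""

    def labels(self):
        """Preferred term plus synonyms, lower-cased for matching."""
        return {self.preferred_term.lower(), *(s.lower() for s in self.synonyms)}


def find_by_label(classes, label):
    """Return the classes whose preferred term or a synonym matches."""
    wanted = label.strip().lower()
    return [c for c in classes if wanted in c.labels()]


# Hypothetical entry; "EX:0000001" is not a real identifier.
classes = [
    OntologyClass(
        term_id="EX:0000001",
        preferred_term="sample preparation",
        synonyms=["sample prep"],
        definition="Hypothetical definition text.",
    ),
]
```

Recording synonyms alongside the preferred term is what lets terms gathered independently by different communities be reconciled into one canonical version.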


Areas requiring significant research


Summary: gensc.sf.net

The GSC is tackling the issue of describing our complete genome collections in greater detail through:
• MIGS
• Genome Catalogue
• Ontology Development
• Metadata Exchange

In co-ordination with:
• MICheck micheck.sf.net
• FuGO fugo.sf.net


Acknowledgements

GSC: Coordinators
• Dawn Field (CEH Oxford)
• George Garrity (Bergey’s Trust)
• Norman Morrison (NEBC)
• Jeremy Selengut (TIGR)
• Peter Sterk (EBI)
• Tatiana Tatusova (NCBI)
• Nick Thomson (Sanger)

Working Groups

General Members of the GSC

Participants of all meetings

gensc.sf.net


MICheck: A Minimum Information Checklist Portal

Chris Taylor, Dawn Field, Susanna-Assunta Sansone, Rolf Apweiler, Michael Ashburner, Cathy Ball, Pierre-Alain Binz, Alvis Brazma, Ryan Brinkman, Eric Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Graeme Grimes, Nigel Hardy, Henning Hermjakob, Randall Julian, Martin Kuiper, Nicholas Le Novère, Jim Leebens-Mack, Suzi Lewis, Ruth McNally, Norman Morrison, Norman Paton, John Quackenbush, Donald Robertson, Philippe Rocca-Serra, Barry Smith, Jason Snape, Stefan Wiemann

Acknowledgements


FuGO

An Ontology for Functional Genomics Investigation

Susanna-Assunta Sansone (EBI): Overview

Trish Whetzel (Univ. of Pennsylvania): Microarray

Daniel Schober (EBI): Metabolomics

Chris Taylor (EBI): Proteomics

On behalf of the FuGO working group

http://fugo.sourceforge.net