bioinformatics enabling knowledge generation from agricultural omics data

Fiona McCarthy, AgBase


Page 1: bioinformatics enabling knowledge generation from agricultural omics data

AgBase: bioinformatics enabling knowledge generation from agricultural omics data

Fiona McCarthy

Page 2: bioinformatics enabling knowledge generation from agricultural omics data

Summary
‘omics’ technologies: the ‘data deluge’
organising data: bioinformatics and biocuration
data sharing and analysis: bio-ontologies
from data to knowledge
making sense of agricultural data

Page 3: bioinformatics enabling knowledge generation from agricultural omics data

Databases and Biological Data
The number of databases has increased
  Sequence repositories: NCBI, EMBL, DDBJ
  Model Organism Databases (MODs)
  Specialist biological databases or ‘knowledge databases’ (e.g. InterPro, interaction databases, gene expression data)
Need to connect information in different databases
Databases are increasing in size and complexity

Page 4: bioinformatics enabling knowledge generation from agricultural omics data

[Figure: two growth charts. Chart 1: No. (0 to 25,000) per year, ‘00 to ‘09. Chart 2: No. x 10^6 (0 to 18) per year, 70 to 05.]

Page 5: bioinformatics enabling knowledge generation from agricultural omics data

Generating Biological Data
Amount of biological data is increasing exponentially
Completed and ongoing genome sequencing projects
High throughput “omics” technologies
  New sequencing technologies
  Existing microarrays
  Proteomics

Page 6: bioinformatics enabling knowledge generation from agricultural omics data
Page 7: bioinformatics enabling knowledge generation from agricultural omics data

Biocomputing
Technologies enable ‘omics’ technologies to move from large databases/consortia into individual laboratories
Managing this data: acquire, store, access, analyze, visualize, share

Page 8: bioinformatics enabling knowledge generation from agricultural omics data

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Page 9: bioinformatics enabling knowledge generation from agricultural omics data

Bioinformatics
Managing data
  different file formats
  linking between different databases
Adding value
  multiple levels of information from one ‘omics’ data set
  re-analysis
  linking data sets
Organizing
  annotating data
  biocuration - annotation

Page 10: bioinformatics enabling knowledge generation from agricultural omics data

Annotation
ANNOTATE: to denote or demarcate
Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:

1. identifying functional elements in the genome: “structural annotation”

2. attaching biological information to these elements: “functional annotation”

Page 11: bioinformatics enabling knowledge generation from agricultural omics data

Community Annotation
Researchers are the domain experts, but relatively few contribute to annotation
  time
  'reward' & 'employer/funding agency recognition'
  training: easy to use tools, clear instructions
Required submission
Community annotation
  Groups with special interest do focused annotation or ontology development
  As part of a meeting/conference or distributed (e.g. wikis)
  Students!

Page 12: bioinformatics enabling knowledge generation from agricultural omics data

Biocuration
biocurators are biologists who are trained to annotate biological data (using database structures, bio-ontologies, etc.)
databases use biocuration to enhance the value of biological data: “knowledge databases”
but how to ensure data consistency between databases?

Page 13: bioinformatics enabling knowledge generation from agricultural omics data

What Are Ontologies?
“An ontology is a controlled vocabulary of well defined terms with specified relationships between those terms, capable of interpretation by both humans and computers.”
Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers
  annotate data in a consistent way
  allows data sharing across databases
  allows computational analysis of high-throughput “omics” datasets
Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.
The ontology shows how the objects relate to each other

Page 14: bioinformatics enabling knowledge generation from agricultural omics data

Ontologies
digital identifier (computers)
description (humans)
relationships between terms

Gene Ontology version 1.1348 (27/07/2010):
32,091 terms, 99.3% defined
  19,169 biological process
  2,745 cellular component
  8,736 molecular function
1,441 obsolete terms (not included in figures above)
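The three parts of an ontology term named on this slide (identifier, description, relationships) can be sketched as a small record type. This is only an illustrative sketch: the `Term` class and its field names are assumptions, not the GO file format. The three root identifiers are the standard GO root terms, and the counts are the slide's figures. The check at the end shows that the three per-ontology counts plus the obsolete terms add up to the reported total, which suggests that "figures above" refers to the per-ontology counts.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical record for one ontology term."""
    term_id: str                                  # digital identifier, for computers
    name: str                                     # description, for humans
    parents: list = field(default_factory=list)   # relationships to other terms


# The three GO root terms, one per ontology.
roots = [
    Term("GO:0008150", "biological_process"),
    Term("GO:0005575", "cellular_component"),
    Term("GO:0003674", "molecular_function"),
]

# Term counts per ontology as given on the slide (GO version 1.1348).
slide_counts = {"GO:0008150": 19_169, "GO:0005575": 2_745, "GO:0003674": 8_736}
obsolete = 1_441
reported_total = 32_091

active_total = sum(slide_counts[t.term_id] for t in roots)
totals_agree = active_total + obsolete == reported_total
print(active_total, totals_agree)
```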

Page 15: bioinformatics enabling knowledge generation from agricultural omics data
Page 16: bioinformatics enabling knowledge generation from agricultural omics data

Relationships: the True Path Rule
Why are relationships between terms important?
TRUE PATH RULE: all attributes of children must hold for all parents
  so if a protein is annotated to a term, it must also be true for all the parent terms
  this enables us to move up the ontology structure from a granular term to a broader term
Premise of many GO analysis tools
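One way to picture the True Path Rule is as upward propagation through the ontology graph: an annotation to a term implies annotation to every ancestor of that term. The sketch below assumes a toy ontology stored as a child-to-parents mapping; the term names, protein names, and function names are all hypothetical and do not come from GO or from any particular analysis tool.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as child -> set of parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Expand direct annotations so each protein also carries all ancestor terms."""
    expanded = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


# Hypothetical toy ontology: one granular term with two parents and a shared root.
toy_parents = {
    "T:leaf": {"T:mid_a", "T:mid_b"},
    "T:mid_a": {"T:root"},
    "T:mid_b": {"T:root"},
}
toy_annotations = {"protein_1": {"T:leaf"}, "protein_2": {"T:mid_a"}}

expanded = propagate(toy_annotations, toy_parents)
print(sorted(expanded["protein_1"]))
```

After propagation, counting proteins at a broad term includes everything annotated to its more granular descendants, which is the behaviour that grouping and enrichment analyses depend on.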

Page 17: bioinformatics enabling knowledge generation from agricultural omics data

Genomic Annotation
Structural Annotation:
  Open reading frames (ORFs) predicted during genome assembly
  predicted ORFs require experimental confirmation
Functional Annotation:
  annotation of gene products = Gene Ontology (GO) annotation
  initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
  functional literature exists for many genes/proteins prior to genome sequencing
  Gene Ontology annotation does not rely on a completed genome sequence
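To make "predicted ORFs" concrete, here is a deliberately naive sketch of ORF prediction: scan the three forward reading frames for an ATG followed by an in-frame stop codon. It is an assumption-laden toy, not the method used by any genome project. It ignores the reverse strand, introns, and alternative start codons, and real gene predictors are considerably more sophisticated. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame stop.
    `end` is exclusive and includes the stop codon; `min_codons` counts the
    start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


example_seq = "CCATGAAATAGGG"
example_orfs = find_orfs(example_seq)
print(example_orfs)  # [(2, 11)], i.e. the substring ATGAAATAG
```

A prediction like this is only a candidate, which is why the slide notes that predicted ORFs still require experimental confirmation.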

Page 18: bioinformatics enabling knowledge generation from agricultural omics data

[Diagram: components of Genomic Annotation]
Structural Annotation, including Sequence Ontology
Functional annotation using Gene Ontology
Nomenclature (species’ genome nomenclature committees)
Other annotations using other bio-ontologies, e.g. Anatomy Ontology

Page 19: bioinformatics enabling knowledge generation from agricultural omics data

http://obo.sourceforge.net/

Gene Ontology
Plant Ontology
Sequence Ontology
Trait Ontology
Expression/Tissue Ontologies
Infectious Disease Ontology
Cell Ontology

Page 20: bioinformatics enabling knowledge generation from agricultural omics data

Bio-ontology requirements
bio-ontologies (Open Biomedical Ontologies)
computational pipelines (‘breadth’)
  for computational annotations
  useful for gene products without published information
manual biocuration (‘depth’)
  requires trained biocurators
  community annotation efforts
  each species has its own body of literature
biocuration co-ordination
  MODs? Consortium? Community?
  biocuration prioritization
  co-ordination with existing Dbs, annotation, nomenclature initiatives
data updates

Page 21: bioinformatics enabling knowledge generation from agricultural omics data

Gene Ontology (GO)
de facto method for functional annotation
Assigns functions based upon Biological Process, Molecular Function, Cellular Component

Widely used for functional genomics (high throughput)

Many tools available for gene expression analysis using GO

http://www.geneontology.org

Page 22: bioinformatics enabling knowledge generation from agricultural omics data

Plant Ontology (PO)
describes plant structures and growth and developmental stages
Currently used for Arabidopsis, maize, rice; more being added (soybean, tomato, cotton, etc.)
Plant Structure: describes morphological and anatomical structures representing organ, tissue and cell types
Growth and developmental stages: describes (i) whole plant growth stages and (ii) plant structure developmental stages

http://www.plantontology.org/

Page 23: bioinformatics enabling knowledge generation from agricultural omics data

Use GO for…
1. Determining which classes of gene products are over-represented or under-represented.
2. Grouping gene products.
3. Relating a protein’s location to its function.
4. Focusing on particular biological pathways and functions (hypothesis-testing).
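Over-representation (item 1) is commonly assessed with a hypergeometric test: given how many gene products in the whole population carry a GO term, how surprising is the number seen in the list of interest? The sketch below is one such test using only the Python standard library. It is an assumption that this particular test is appropriate; individual GO analysis tools differ in the statistic and in the multiple-testing correction they apply, and the function and parameter names here are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: total gene products considered
    annotated:  gene products in the population annotated to the GO term
    sample:     size of the list of interest (drawn without replacement)
    hits:       gene products in the list annotated to the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 gene products, 3 carry the term, and all 3 appear in a list of 3.
p = enrichment_p_value(population=10, annotated=3, sample=3, hits=3)
print(p)  # 1/120, about 0.0083
```

Under-representation would use the lower tail instead, and in practice the p-values for many terms need correcting for multiple testing.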

Page 24: bioinformatics enabling knowledge generation from agricultural omics data

Ontologies

Pathways & Networks

GO Cellular Component

GO Biological Process

GO Molecular Function

BRENDA

Pathway Studio 5.0

Ingenuity Pathway Analyses

Cytoscape

Interactome Databases

Functional Understanding

Page 25: bioinformatics enabling knowledge generation from agricultural omics data

http://www.agbase.msstate.edu/

Page 26: bioinformatics enabling knowledge generation from agricultural omics data

1. Provides structural annotation for agriculturally important genomes
2. Provides functional annotation (GO)
3. Provides tools for functional modeling
4. Provides bioinformatics & modeling support for research community

Page 27: bioinformatics enabling knowledge generation from agricultural omics data

Avian Gene Nomenclature

Page 28: bioinformatics enabling knowledge generation from agricultural omics data
Page 29: bioinformatics enabling knowledge generation from agricultural omics data

GO & PO: literature annotation for rice, computational annotation for rice, maize, sorghum, Brachypodia

1. Literature annotation for Agrobacterium tumefaciens, Dickeya dadantii, Magnaporthe grisea, Oomycetes

2. Computational annotation for Pseudomonas syringae pv tomato, Phytophthora spp and the nematode Meloidogyne hapla.

Literature annotation for chicken, cow, maize, cotton;

Computational annotation for agricultural species & pathogens.

literature annotation for human; computational annotation for UniProtKB entries (237,201 taxa).

Page 30: bioinformatics enabling knowledge generation from agricultural omics data
Page 31: bioinformatics enabling knowledge generation from agricultural omics data

Comparing AgBase & EBI-GOA Annotations

[Figure: bar chart of gene products annotated (0 to 14,000) per project: AgBase Chick, EBI-GOA Chick, AgBase Cow, EBI-GOA Cow. Legend: computational; manual - sequence; manual - literature.]

Complementary to EBI-GOA: Genbank proteins not represented in UniProt & EST sequences on arrays

Page 32: bioinformatics enabling knowledge generation from agricultural omics data

Contribution to GO Literature Biocuration

[Figure: contributions for Chicken and Cow. Groups shown: AgBase, EBI GOA, EBI-IntAct, Roslin, HGNC, UCL-Heart project, MGI, Reactome. Values shown: < 0.50%, < 1.50%, 97.82%, 88.78%.]

Page 33: bioinformatics enabling knowledge generation from agricultural omics data

AgBase Quality Checks & Releases

[Diagram: annotation release flow, labels as shown on the slide]
AgBase Biocurators
AgBase biocuration interface
AgBase database
EBI GOA Project
GO Consortium database
‘sanity’ check
‘sanity’ check & GOC QC
GO analysis tools, Microarray developers
UniProt db, QuickGO browser, GO analysis tools, Microarray developers
Public databases, AmiGO browser, GO analysis tools, Microarray developers

‘sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
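A 'sanity' check of this kind can be sketched as a function that inspects one annotation record and reports problems. Everything specific here is an assumption for illustration: the record fields, the function name, and the placeholder identifiers are hypothetical and are not AgBase's actual schema or checks. The only fixed convention relied on is that GO identifiers have the form `GO:` followed by seven digits.

```python
import re

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

# Hypothetical set of fields a complete annotation record should carry.
REQUIRED_FIELDS = ("gene_product", "go_id", "evidence_code", "reference")


def sanity_check(record, obsolete_ids):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing {name}")
    go_id = record.get("go_id", "")
    if go_id and not GO_ID_PATTERN.match(go_id):
        problems.append("malformed GO ID")
    elif go_id in obsolete_ids:
        problems.append("obsolete GO ID")
    return problems


# Placeholder identifiers, invented for this example only.
obsolete_ids = {"GO:9999999"}
good = {"gene_product": "P1", "go_id": "GO:9999998",
        "evidence_code": "IDA", "reference": "PMID:0000000"}
stale = dict(good, go_id="GO:9999999")
broken = {"gene_product": "P2", "go_id": "GO:12", "evidence_code": "IEA"}

print(sanity_check(good, obsolete_ids))
print(sanity_check(stale, obsolete_ids))
print(sanity_check(broken, obsolete_ids))
```

Returning a list of problems rather than raising on the first one lets a curator see everything wrong with a record in a single pass.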

Page 34: bioinformatics enabling knowledge generation from agricultural omics data

Quality improvement Microarray annotations

Page 35: bioinformatics enabling knowledge generation from agricultural omics data
Page 36: bioinformatics enabling knowledge generation from agricultural omics data

IITA Crops
cowpea: “reduced representation” sequencing underway
soybean: preliminary assembly
banana: sequencing in progress
yam: genome sequencing for Dioscorea alata; EST development (IITA & VSU)
cassava: genome sequencing in progress
maize: genome sequencing completed; other subspecies being sequenced

Page 37: bioinformatics enabling knowledge generation from agricultural omics data

Cowpea
54,123 genome sequences
187,483 ESTs
Annotated via homology to Arabidopsis & other plants
GO annotation via homology: availability?

Page 38: bioinformatics enabling knowledge generation from agricultural omics data

Soybean
NCBI: 1,459,639 ESTs, 34,946 proteins, 2,882 genes
UniProt: 12,837 proteins (EBI GOA automatic GO annotation)
UniGene assemblies available
multiple microarrays available

Page 39: bioinformatics enabling knowledge generation from agricultural omics data
Page 40: bioinformatics enabling knowledge generation from agricultural omics data
Page 41: bioinformatics enabling knowledge generation from agricultural omics data

Banana
7,102 genome sequences
14,864 ESTs
1,399 NCBI proteins; 680 UniProt
Musa acuminata (sweet banana): 3,898 GO annotations to 491 proteins
Musa acuminata AAA Group (Cavendish banana): 579 annotations to 96 proteins

Page 42: bioinformatics enabling knowledge generation from agricultural omics data

Plantain
Musa ABB Group (taxon:214693): cooking banana or plantain
11,070 ESTs, 112 proteins
173 GO annotations to 53 proteins
functional genomics based on banana?

Page 43: bioinformatics enabling knowledge generation from agricultural omics data

Yams
55577 Dioscorea rotundata (white yam)
55571 Dioscorea alata (water yam)
29710 Dioscorea cayenensis (yellow yam)
Dioscorea (taxon:4672) & subspecies
NCBI: 31 ESTs, 623 proteins
Genome sequencing for Dioscorea alata; EST development (IITA & VSU)
183 GO annotations to 25 proteins

Page 44: bioinformatics enabling knowledge generation from agricultural omics data

Cassava
ESTs: 80,631
NCBI proteins: 568, UniProt: 253
2,251 GO annotations assigned to 218 proteins
2 Euphorbia esula (leafy spurge)/cassava arrays

Page 45: bioinformatics enabling knowledge generation from agricultural omics data

Maize
Zea mays (taxon:4577)
Genome sequencing completed by Washington University; other subspecies being sequenced
Active GO annotation project: 131,925 GO annotations to 20,288 proteins

Page 46: bioinformatics enabling knowledge generation from agricultural omics data
Page 47: bioinformatics enabling knowledge generation from agricultural omics data
Page 48: bioinformatics enabling knowledge generation from agricultural omics data
Page 49: bioinformatics enabling knowledge generation from agricultural omics data

AgBase Collaborative Model
How can we help you?
Can make GO annotations public via the GO Consortium
Have computational pipelines to do rapid, first pass GO annotation (including transcript/EST sequences)
Provide bioinformatics support for collaborators
Developing new tools
Training/support for modeling data

Page 50: bioinformatics enabling knowledge generation from agricultural omics data

Dr Susan Bridges

Divya Pedinti

Dr Teresia Buza

Philippe Chouvarine

Cathy Grisham

Lakshmi Pillai

Hui Wang

Seval Ozkan