Introduction to the Functional Annotation · 2010. 6. 24. · • Annotation contained in the GO...


Lecture 4 – Introduction to Functional Annotation

José Luis Mosquera

Computational Lab on Microarrays Data Analysis
Special Topics in Computer Science

Institute of Bioinformatics – Johannes Kepler University
June 2010

Outline

1. Introduction
   1. Biological significance
   2. The Gene Ontology

2. Methods
   1. Some approaches to find biological meaning
   2. Hypergeometric and related approaches

3. Tools
   1. Evolution of the GO Tools
   2. SerbGO: Searching for the best GO Tool
   3. FatiGO

Biological Significance (1/5)

• With the advent of genomic technologies it has become possible to perform high-throughput biological experiments routinely.

• This has highlighted different challenges:

1. The experiment itself
2. Statistical analysis of the results
3. Biological interpretation

• These experiments often yield lists of identifiers (genes, peptides,...) which are selected using some specific criteria to assign them statistical significance.

High throughput experiments

Biological Significance (2/5)

• Sometimes either

1. the number of items selected as statistically significant is very high, or
2. the selected items do not show any statistical significance

• Whatever the reason it is expected they “mean something” biologically.

High throughput experiments

Synthesis

What the list means from the biological point of view.

Biological Significance (3/5)

• The usual (reasonable?) way to proceed is to shift the focus from “statistical” to “biological” significance.

• Whereas there is clear agreement about what statistical significance means...

• There is no consensus definition of biological significance,

• Although everyone talks about it…

If biological significance is the answer, what was the question?

Biological Significance (4/5)

• Interestingly, biological significance is often recast in terms of statistical significance

Biological significance means Statistical significance

R. Díaz-Uriarte, CAMDA 2002

...to understand the biological relevance of statistical differences in gene expression data...by examining significant differences in the distribution of (GO) terms related to biological processes or molecular function.

Biological Significance (5/5)

• Although it is not necessarily so

Biological significance does not mean Statistical significance

GeneSifter website

... to characterize the biology involved in a particular experiment, and to identify particular genes of interest...combining the identification of broad biological themes with the ability to focus on a particular gene...

The Gene Ontology (1/12)

Let it be clear what is…

The Challenge

How can we attribute a biological interpretation to large lists of genes (identifiers)?

The Gene Ontology (2/12)

• Looking for the existing annotations contained in databases that help to relate the selected genes with the biological knowledge.

• Bioinformatic resources often store data in a scientific natural language.

Rationale

Drawback

Annotation in this way is human readable and understandable, but it is difficult to interpret computationally.

The Gene Ontology (3/12)

What’s a cell?

• The same name can be used to describe different concepts.

• A concept can be described using different names.

• Comparison is difficult, especially across species or databases

The Gene Ontology (4/12)

• The most important thing you want to know is what the gene products are concerned with, i.e. their function.

• The best functional annotation systems use human beings who read the literature before assigning a function to a gene.

Functional Annotation

Some difficulties

• Different people use different words for the same function.

• They mean different things by the same word.

• The context in which a gene was found may not be associated with its function.

• Inference of a function from sequence alone is error-prone and sometimes unreliable.

The Gene Ontology (5/12)

Functional Annotation

What can we do?

Ontologies are useful annotation systems that attempt to overcome some of these drawbacks.

The Gene Ontology (6/12)

What’s an ontology?

Definition

An ontology is an entity that provides a set of vocabulary terms covering a conceptual domain.

• These terms must

1. have an exhaustive and rigorous definition
2. be placed within a (hierarchical data) structure of relationships.

• The terms may be linked with two kinds of relationships

1. “is-a” between parent and child.
2. “part-of” between part and the whole.

• They may have one or more
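As a rough illustration of terms placed in a structure of “is-a” and “part-of” relationships, the sketch below stores a handful of relationships as triples and walks up from a term to everything above it. The term names and relationships are simplified toy examples chosen for this sketch, not actual GO entries, and the function names are hypothetical.

```python
from collections import defaultdict

# (child, relationship, parent) triples.
# Toy terms for illustration only; these are NOT real GO entries.
RELATIONS = [
    ("nucleus", "is-a", "organelle"),
    ("mitochondrion", "is-a", "organelle"),
    ("organelle", "part-of", "cell"),
    ("nuclear membrane", "part-of", "nucleus"),
    ("nuclear membrane", "is-a", "membrane"),
    ("membrane", "part-of", "cell"),
]


def build_parents(relations):
    """Map each term to the list of (relationship, parent) pairs above it."""
    parents = defaultdict(list)
    for child, relationship, parent in relations:
        parents[child].append((relationship, parent))
    return parents


def ancestors(term, parents):
    """Collect every term reachable by following is-a / part-of links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(RELATIONS)
print(sorted(ancestors("nuclear membrane", PARENTS)))
```

In this toy graph "nuclear membrane" is linked to two different terms, so the structure is a graph rather than a strict tree, which is why the traversal keeps a set of visited terms.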

The Gene Ontology (7/12)

What’s an ontology?

But… What about the biological field?

A powerful ontology to perform biological interpretation of “our” experiments is the Gene Ontology (usually named GO)

The Gene Ontology (8/12)

• The GO project is a cooperative effort to address the need for consistent descriptions of gene products in different databases.

• It is developed and maintained by the Gene Ontology Consortium.

• The GO is organized around three basic ontologies

GO ontologies

Ontology                   Number of terms (May 2005)
Molecular Function (MF)    7220
Biological Process (BP)    9529
Cellular Component (CC)    1536
Total GO terms             18235

Gene Ontology: Molecular Function, Biological Process, Cellular Component

The Gene Ontology (9/12)

GO graph

The Gene Ontology (10/12)

• The annotation contained in the GO database consists of two essential parts:

1. The ontologies, which provide a structured vocabulary.
2. The annotations, which link the gene products to the associated terms that define their function.

• The GO database assigns annotations in a species-independent way.

• Most important databases have cross-references with the GO database.

GO database

The Gene Ontology (11/12)

A given gene product may

• represent one or more molecular functions,

• be used in one or more biological processes and

• appear in one or more cellular components.

GO database

The Gene Ontology (12/12)

GO Annotations

Evidence Codes

IEA Inferred from Electronic Annotation

ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non-traceable Author Statement

IC Inferred by Curator

ND No biological Data available
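To make the link between gene products, terms and evidence codes concrete, here is a minimal sketch of annotations held as records and filtered by evidence code. The gene product names, term identifiers and the `Annotation` record layout are placeholders invented for this sketch and do not reflect the layout of the GO database itself. Excluding IEA and ND is one possible filtering choice, not a recommendation from the slides.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout: one gene product linked to one term.
Annotation = namedtuple("Annotation", "gene_product term ontology evidence")

# Placeholder data; gene and term names are made up for illustration.
ANNOTATIONS = [
    Annotation("gene1", "term_a", "MF", "IDA"),
    Annotation("gene1", "term_b", "BP", "IEA"),
    Annotation("gene1", "term_c", "CC", "TAS"),
    Annotation("gene2", "term_b", "BP", "IMP"),
    Annotation("gene2", "term_d", "BP", "ND"),
]


def filter_by_evidence(annotations, excluded=("IEA", "ND")):
    """Drop annotations whose evidence code is in `excluded`."""
    return [a for a in annotations if a.evidence not in excluded]


def terms_by_gene(annotations):
    """Group term identifiers by gene product and ontology (MF, BP, CC)."""
    grouped = defaultdict(lambda: defaultdict(set))
    for a in annotations:
        grouped[a.gene_product][a.ontology].add(a.term)
    return grouped


kept = filter_by_evidence(ANNOTATIONS)
grouped = terms_by_gene(kept)
for gene, per_ontology in sorted(grouped.items()):
    for ontology, terms in sorted(per_ontology.items()):
        print(gene, ontology, sorted(terms))
```

A single gene product can appear under several ontologies at once, as "gene1" does here.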

Some Approaches To Find Biological Meaning

• Annotating results using appropriate biological databases

• Rely on some form of grouping method

1. Gene Set Enrichment: Hypergeometric tests, Fisher's Exact, GSEA,...
2. Holistic Approaches: Category, globaltest, GlobalAncova,...
3. Minimal Acceptance Strength
4. ...

• Take a more global approach, relying on some type of

1. Graph-theoretic analysis, or
2. Pathway analysis.

• Or the most (up-to-date) global approach: Systems biology

Quick overview

Gene Set Enrichment (1)

Consider the following frame

• N genes on a microarray

• M genes belong to a given GO term category (A),

• N − M do not belong to it (category Ac, the complement of A)

• K of these N genes are selected and assigned to a given class (e.g. regulated genes)

• x of these K genes will be in A

Hypergeometric test

Statistical Hypotheses

H0 : GO category A is equally represented on the microarray and in the class of differentially regulated genes

H1 : GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray

Example

Gene Set Enrichment (2)

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modeled by a hypergeometric distribution with parameters (N, M, K).

Hypergeometric distribution

Question

Assuming sampling without replacement, what is the probability of having exactly x genes of category A?
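With the notation used so far, the standard hypergeometric probability is P(X = x) = C(M, x) · C(N − M, K − x) / C(N, K), where C(n, k) is the binomial coefficient. The sketch below evaluates it with the Python standard library, using the values from the Appendix example (N = 10000, M = 500, K = 200, x = 25). The function name is a hypothetical helper, not part of any tool discussed in this lecture.

```python
from math import comb


def hypergeom_pmf(x, N, M, K):
    """P(X = x): exactly x of the K selected genes fall in category A.

    N: genes on the microarray, M: genes in category A, K: selected genes.
    """
    # Values outside the support have probability zero.
    if x < max(0, K - (N - M)) or x > min(K, M):
        return 0.0
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


N, M, K, x = 10000, 500, 200, 25
p_exact = hypergeom_pmf(x, N, M, K)
expected = K * M / N  # number of category-A genes expected by chance

print(f"expected by chance: {expected:.1f} genes")
print(f"P(X = {x}) = {p_exact:.3e}")
```

Under the null hypothesis about 10 of the 200 selected genes would be expected in A, so observing 25 is well above the chance expectation.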

Gene Set Enrichment (3)

• The preceding model allows a user to compute a p-value for the test in which rejecting the null hypothesis corresponds to deciding that the category which is being tested is over-represented

• To test for under-representation we would use the opposite tail of the distribution (approximately 1 − p-value).
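A p-value for over-representation can be obtained by summing the hypergeometric probabilities from the observed count upwards, and one for under-representation by summing from zero up to the observed count. The sketch below does this for the Appendix example. The helper names are hypothetical, and the choice of including the observed count in both tails is an assumption of this sketch.

```python
from math import comb


def hypergeom_pmf(k, N, M, K):
    """P(X = k) for a hypergeometric distribution with parameters (N, M, K)."""
    if k < max(0, K - (N - M)) or k > min(K, M):
        return 0.0
    return comb(M, k) * comb(N - M, K - k) / comb(N, K)


def p_over(x, N, M, K):
    """Upper tail P(X >= x): evidence for over-representation."""
    return sum(hypergeom_pmf(k, N, M, K) for k in range(x, min(K, M) + 1))


def p_under(x, N, M, K):
    """Lower tail P(X <= x): evidence for under-representation."""
    return sum(hypergeom_pmf(k, N, M, K) for k in range(0, x + 1))


N, M, K, x = 10000, 500, 200, 25
print(f"over-representation  p = {p_over(x, N, M, K):.3e}")
print(f"under-representation p = {p_under(x, N, M, K):.6f}")
```

Because both tails include the observed count x, the two values add up to 1 + P(X = x), so the lower tail is only approximately one minus the upper tail.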

Hypergeometric distribution

Gene Set Enrichment (5)

Hypergeometric distribution

Some considerations…

• Different programs use slightly different approaches, most of which are equivalent in some sense

1. Fisher's exact test
2. Chi-squared test
3. Binomial test
4. …

• It is necessary to make some form of multiple testing adjustment, and most programs do, because one may be doing dozens or more tests simultaneously.
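The slides do not say which adjustment is used. As one common possibility, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a list of per-category p-values. The choice of method, the function name and the example p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, one per tested GO category.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

Each adjusted value is at least as large as the raw p-value, so fewer categories pass a fixed threshold after adjustment.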

Gene Set Enrichment (6)

• Proposed by Subramanian et al. (2005)

STEP 1

● Compute a gene-wise measure (e.g. absolute t-statistics)

● Rank genes according to this measure

STEP 2

● Assign labels A to genes belonging to a gene group of interest and B to all the other genes

● If group A is enriched with interesting genes, many of its genes will have high ranks and we will observe a separation in the ordered list

A B A A B A A A B A B B B A B B B B A B B B

Gene Set Enrichment Analysis (GSEA)

• Assign score n_B to all genes in A and −n_A to all genes in B (n_A and n_B being the number of genes in A and in B)

• Draw the cumulative sum of these scores

• Is the maximum M of the cumulative sum unusually high? (Kolmogorov-Smirnov test)
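The scoring steps above can be sketched on the ranked A/B label sequence shown earlier. This is only the simplified running-sum statistic described on these slides, assuming n_A and n_B are the group sizes. It omits the significance assessment, and the function name is hypothetical.

```python
def running_sum(labels):
    """Cumulative sum with score +n_B for each A gene and -n_A for each B gene."""
    n_a = labels.count("A")
    n_b = labels.count("B")
    total = 0
    sums = []
    for label in labels:
        total += n_b if label == "A" else -n_a
        sums.append(total)
    return sums


# Ranked list from the slide: genes of group A tend to sit near the top.
RANKED_LABELS = "A B A A B A A A B A B B B A B B B B A B B B".split()

sums = running_sum(RANKED_LABELS)
max_sum = max(sums)
print("cumulative sums:", sums)
print("maximum:", max_sum, "at rank", sums.index(max_sum) + 1)
```

For this sequence there are 9 genes in A and 13 in B. The cumulative sum peaks at 64 at the tenth gene and returns to 0 at the end of the list, because the positive and negative scores cancel exactly.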

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment (7)

Evolution of the GO Tools

• In recent years many similar tools to analyze biological significance using the GO have been published and made available.

• Draghici (2005) reviews 15 of them.

• Huang (2008) classifies 68 tools.

From the initial gap to the crowd

SerbGO: Searching for the best GO Tool (1/4)

• There are many tools to do not-so-many things.

• It is a bidirectional application. The user can...

1. ask for some features to get the appropriate tools for their interests

2. compare tools to check which capabilities are implemented in each one.

The project…

SerbGO Tool

It is intended to help users determine which microarray gene expression analysis tools that make use of the GO ontologies are best suited to their projects.

SerbGO: Searching for the best GO Tool (2/4)

http://estbioinfo.stat.ub.es/apli/serbgo

SerbGO: Searching for the best GO Tool (3/4)

Which tools perform what tasks?

Many functionalities are available

Check your options in the form and move forward

SerbGO: Searching for the best GO Tool (4/4)

Comparing GO tools by their capabilities

Tick the tools that you want to compare

Appendix

Example

• N = 10000 genes on the microarray

• M = 500 genes in the GO category A

• N − M = 9500 genes belong to Ac

• K = 200 genes differentially expressed

• x = 25 of these genes are in A
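The counts in this example can be arranged as a 2×2 table (selected or not selected, in A or in Ac), which is the form that Fisher's exact test and the chi-squared test take as input. The sketch below derives the four cells and the number of category-A genes expected by chance. The function name is a hypothetical helper.

```python
def contingency_table(N, M, K, x):
    """2x2 table: rows = selected / not selected, columns = in A / in Ac."""
    return [
        [x, K - x],              # selected genes
        [M - x, N - M - K + x],  # genes not selected
    ]


N, M, K, x = 10000, 500, 200, 25
table = contingency_table(N, M, K, x)
expected_in_a = K * M / N   # expected count in A among the selected genes
fold = x / expected_in_a    # observed / expected

print("selected:    ", table[0])
print("not selected:", table[1])
print(f"expected in A by chance: {expected_in_a:.0f}, observed: {x}, ratio: {fold:.1f}")
```

The table reads 25 and 175 for the selected genes and 475 and 9325 for the rest, so category A holds 25 selected genes where about 10 would be expected by chance.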
