BITS - Comparative genomics: gene family analysis

Post on 10-May-2015


This is the second presentation of the BITS training on 'Comparative genomics'. It reviews the different methods of investigating sequence homology at the gene family level. Thanks to Klaas Vandepoele of the PSB department.

Transcript of BITS - Comparative genomics: gene family analysis

Comparative genomics in eukaryotes

Gene family analysis

Klaas Vandepoele, PhD

Professor, Ghent University
Comparative & Integrative Genomics
VIB – Ghent University, Belgium

http://www.bits.vib.be

2

Workflow

3

Applications of clustering the proteome(s)

Gene families form the basis for the evolutionary (or phylogenetic) analysis of:
• Detection of orthologs and paralogs
• Gene duplication, family expansions, pseudogene formation and gene loss
• Species taxonomies
• Horizontal Gene Transfer (HGT)
• Evolution of gene structure
  • Introns
  • Protein domain organisation & (re)arrangements
• Base composition and codon usage

4

I. Structural annotation: genome-wide versus family-wise

Rationale for family-wise annotation: since every gene has different (sequence) characteristics and different genes evolve at different rates, using these characteristics to determine homologous gene models will improve the overall quality of the structural annotation.

Properties:
• Slow & nearly-manual procedure
• High-quality gene models revealing novel biological findings

5

Workflow family-wise annotation procedure

[Flowchart; box labels in approximate reading order]

Collecting experimental representatives
MSA of experimental representatives → HMMbuild → Family HMM profile
Family HMM profile + Species X proteome → HMMsearch / BLAST → Putative homologs
Correction of gene model (inputs: protein motifs, ab initio gene prediction, EST/cDNA)
Classification using phylogenetic trees
Detailed characterization

http://hmmer.janelia.org/

6

Experimental representatives

InterProScan

Clustalw + JalView

PFAM HMM logo

7

BLAST / HMMsearch

1. Use multiple sequence alignment to create HMM profile

2. Use HMM profile to search for similar proteins
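
The two steps above map onto the HMMER command-line tools `hmmbuild` and `hmmsearch`. The sketch below only assembles and (optionally) runs those two commands from Python. It assumes HMMER 3 is installed and on the PATH; the file names (`cyclins.sto`, `speciesX.fasta`) and the `prefix` parameter are hypothetical placeholders, not files from the training.

```python
import subprocess


def hmmer_commands(msa_path, proteome_path, prefix="family"):
    """Build the hmmbuild and hmmsearch command lines for one gene family.

    msa_path      -- multiple sequence alignment of the experimental representatives
    proteome_path -- FASTA file with the predicted proteome of species X
    prefix        -- hypothetical base name for the output files
    """
    hmm_path = f"{prefix}.hmm"    # profile written by hmmbuild
    hits_path = f"{prefix}.tbl"   # per-sequence hit table written by hmmsearch
    build = ["hmmbuild", hmm_path, msa_path]
    search = ["hmmsearch", "--tblout", hits_path, hmm_path, proteome_path]
    return build, search


def run_family_search(msa_path, proteome_path, prefix="family"):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in hmmer_commands(msa_path, proteome_path, prefix):
        subprocess.run(cmd, check=True)


# Example with placeholder file names (nothing is executed here).
build_cmd, search_cmd = hmmer_commands("cyclins.sto", "speciesX.fasta", prefix="cyclin")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

Calling `run_family_search(...)` with real files would leave the putative homologs in the `--tblout` table, ready for the gene model correction step.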

8

Representatives + putative homologs

Multiple sequence alignments assist in the detection and correction of errors in the structural annotation (missed exon)

The suffix finalcds indicates a corrected gene model, as opposed to the original gene model generated by the ab initio gene prediction

BioEdit Sequence Editor

9

Representatives + putative homologs

Multiple sequence alignments assist in the detection of errors in the structural annotation (false first exon)

The suffix finalcds indicates a corrected gene model, as opposed to the original gene model generated by the ab initio gene prediction

10

Examples of family-specific protein motifs

• B-type cyclins have HxKF signature
• Cyclin destruction boxes (B1-type cyclin R-[AV]LGDIGN)

11

Examples of family-specific protein motifs

• D-type cyclins contain LxCxE Rb-binding motif
• Low conservation of phylogenetic signal at primary sequence level
• General rules are rarely general: exceptions (i.e. missing protein motifs) are frequent and might indicate functional divergence
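
Short motifs such as HxKF and LxCxE can be screened for with regular expressions. The sketch below is a minimal illustration: translating 'x' into the regex wildcard '.' is an assumption about how the motifs are meant to be read, and the toy sequence is invented for the example, not a real cyclin.

```python
import re

# Motifs from the slides, with 'x' (any residue) written as '.'.
# The regex translation is an assumption made for this sketch.
MOTIFS = {
    "HxKF": "H.KF",    # B-type cyclin signature
    "LxCxE": "L.C.E",  # Rb-binding motif of D-type cyclins
}


def find_motifs(seq):
    """Return {motif name: [0-based start positions]} for motifs found in seq."""
    seq = seq.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # A lookahead reports every start position, including overlapping ones.
        positions = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
        if positions:
            hits[name] = positions
    return hits


toy_sequence = "MSTLLCDEAAHLKFQQ"  # made-up peptide, for illustration only
print(find_motifs(toy_sequence))
```

A family member for which `find_motifs` returns no hit is not necessarily misannotated: as noted above, missing motifs are frequent and may point to functional divergence.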

Arabidopsis

Rice

12

Classification using phylogenetic tree construction

D-type cyclins are G1-specific

A- and B-type cyclins are mitotic cyclins

H-type cyclins regulate activity of CDK-activating kinases

• The complexity of the cyclin gene family appears to be higher in plants than in mammals
• Whether there is functional redundancy within A- and B-type cyclins or different regulation (and expression) of some cyclin subclasses remains to be analyzed

13

Unraveling functional divergence using large-scale expression compendia

Plant tissues

Genes

14

Unraveling functional divergence using large-scale expression compendia

A-type cyclin

B-type cyclin

D-type cyclin

Plant tissues

Genes

Genevestigator

15

II. Orthology & paralogy

A major goal of sequence analysis is evolutionary reconstruction. It is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications.

Orthologs are homologous genes that evolved through speciation (i.e. evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two species being compared).

Paralogs are homologous genes that evolved through duplication within the same (perhaps ancestral) genome.

These definitions were first introduced by Fitch (1970).

16

Orthology & paralogy inference

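[Figure, panels a) and b): gene phylogenies of the genes a1, a2, b1, b2, c1 and c2, shown next to the organism phylogeny (species tree) of species A, B and C. Gene duplication and speciation events are marked on the trees. Labels: Outparalogs, Inparalogs.]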

17

In- and outparalogy

Sonnhammer & Koonin: Orthology, paralogy and proposed classification for paralog subtypes

Tree reconciliation

The automatic detection of speciation and duplication events using a species tree and gene family tree
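
One common way to automate this is the last-common-ancestor (LCA) mapping rule: each node of the gene tree is mapped to the smallest species-tree clade containing all species below it, and a node is called a duplication when it maps to the same clade as one of its children. The sketch below is a minimal version of that rule, not the implementation of any particular tool. Trees are nested tuples, and deriving the species from the first letter of the gene name (a1 → A) is an assumption that mirrors the labels in the figure.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def lca(species_tree, species):
    """Return the smallest subtree of species_tree that contains all given species."""
    if isinstance(species_tree, str):
        return species_tree
    for child in species_tree:
        if species <= leaf_set(child):
            return lca(child, species)
    return species_tree


def reconcile(gene_tree, species_tree, species_of):
    """Label every internal gene-tree node as 'speciation' or 'duplication'.

    Returns a list of (sorted gene names below the node, event) in post-order.
    """
    events = []

    def visit(node):
        if isinstance(node, str):
            return lca(species_tree, {species_of(node)})
        child_maps = [visit(child) for child in node]
        genes = leaf_set(node)
        here = lca(species_tree, {species_of(g) for g in genes})
        # Duplication: the node maps to the same clade as one of its children.
        event = "duplication" if here in child_maps else "speciation"
        events.append((sorted(genes), event))
        return here

    visit(gene_tree)
    return events


SPECIES_TREE = (("A", "B"), "C")
# A duplication before the A/B/C speciations gives two copies per species.
GENE_TREE = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))

for genes, event in reconcile(GENE_TREE, SPECIES_TREE, lambda g: g[0].upper()):
    print(event, genes)
```

In this toy tree only the root is a duplication, so a1 and a2 are outparalogs with respect to the A/B speciation. A duplication node below the speciation, for example `("a1", "a2")`, would instead produce inparalogs.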

18

19

III. Types of proteome analysis

20

The evolution of multi-domain proteins

21

Interpreting the output of an all-against-all similarity search

Metrics for sequence similarity:
• E-value, Bit score or percent identity
• alignment coverage
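
As an illustration of how these metrics are typically combined, the sketch below filters hits on both E-value and alignment coverage. The `Hit` record, its field names and the two thresholds are assumptions chosen for the example, not values recommended in the slides.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise hit from an all-against-all search (hypothetical record)."""
    query: str
    subject: str
    evalue: float
    bitscore: float
    q_start: int  # 1-based, inclusive alignment start on the query
    q_end: int    # 1-based, inclusive alignment end on the query
    q_len: int    # full length of the query protein


def query_coverage(hit):
    """Fraction of the query protein covered by the alignment."""
    return (hit.q_end - hit.q_start + 1) / hit.q_len


def keep(hit, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits that are both significant and cover enough of the query."""
    return hit.evalue <= max_evalue and query_coverage(hit) >= min_coverage


full = Hit("p1", "p2", 1e-30, 250.0, q_start=11, q_end=110, q_len=200)
domain_only = Hit("p1", "p3", 1e-12, 80.0, q_start=1, q_end=40, q_len=200)
print(keep(full), keep(domain_only))
```

The second hit is statistically significant but covers only a fifth of the query, which is the typical signature of a shared domain rather than homology over the full length.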

22

Clustering of similar sequences

Proteins = vertices ~ nodes
Sequence similarity relationship = edges
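
The simplest clustering on such a graph takes its connected components (single-linkage clustering). The sketch below shows only that baseline; it is not claimed to be the method behind the figures, and the protein names are placeholders.

```python
from collections import defaultdict


def cluster(edges):
    """Group proteins into families as connected components of the similarity graph.

    edges -- iterable of (protein, protein) pairs that passed the similarity filter.
    Proteins without any edge do not appear in the result.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen, families = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        families.append(sorted(members))
    return families


edges = [("a1", "a2"), ("a2", "b1"), ("c1", "c2")]
print(cluster(edges))
```

Note that a1 and b1 end up in the same family without a direct edge. This chaining is why a single multi-domain protein can merge otherwise unrelated families under single linkage.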

23

Clustering of similar sequences

24

Advanced methods for protein (orthology) clustering

Sequence similarity-based
• COG (RBH) [Tatusov 1997]
• InParanoid [Remm et al., 2001]
• Tribe-MCL [Van Dongen 2000]
• OrthoMCL [Li et al., 2003]

Phylogenetic tree-based
• PhylomeDB [Huerta-Cepas et al., 2007]
• Ensembl Compara [Vilella et al., 2008]
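
The reciprocal best hit (RBH) criterion underlying several of the similarity-based methods can be sketched in a few lines. This is a bare-bones illustration between two species only, with invented bit scores; the listed tools add further steps on top of it.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores -- {(query, subject): bit score} for one search direction.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented scores: species A proteins searched against species B and vice versa.
a_vs_b = {("a1", "b1"): 300, ("a1", "b2"): 120, ("a2", "b1"): 200, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 300, ("b1", "a2"): 200, ("b2", "a1"): 120, ("b2", "a2"): 90}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Only a1 and b1 are reported. The copies a2 and b2 are left unpaired, which shows how plain RBH misses many-to-many relationships created by duplication.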

Overview methodologies (Gabaldon, 2008)

[Figure labels: BBH, COG, Inparanoid, reconciliation, species overlap]

25

IV. Resources

26

Resources (bis)

• Ensembl (Vertebrates)
• EnsemblGenomes (Metazoa, Protists, Fungi, Plants & Bacteria)
• OrthoMCLDB 5 (150 genomes)
• YGOB (>15 Fungi)

27

Hands-on

Goal: identify and characterize the members of the gene family encoding talin 2 (TLN2)

1. Select Query gene

2. Retrieve homo/orthologs

3. Create multiple sequence alignment

4. Identify conserved positions

5. Create phylogenetic tree and identify ortho/paralogous genes
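
Step 4 can be approximated outside an alignment viewer by scanning the alignment column by column. The sketch below assumes the alignment is already loaded as equal-length strings with '-' for gaps; the three toy rows are invented and are not talin sequences.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=1.0):
    """Return (0-based column, residue) for columns dominated by one residue.

    alignment    -- list of aligned sequences of equal length, '-' marks a gap
    min_fraction -- minimum fraction of rows that must share the residue
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    conserved = []
    for i in range(length):
        column = [row[i] for row in alignment]
        residue, count = Counter(column).most_common(1)[0]
        # Columns dominated by gaps are never reported as conserved.
        if residue != "-" and count / len(column) >= min_fraction:
            conserved.append((i, residue))
    return conserved


toy_alignment = [
    "MKLHAKF",
    "MRLHSKF",
    "M-LHTKF",
]
print(conserved_columns(toy_alignment))
```

Lowering `min_fraction` (for example to 0.8) tolerates a few divergent or misannotated family members while still highlighting the conserved core.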

28