Post on 10-May-2015
Comparative genomics in eukaryotes
Gene family analysis
Klaas Vandepoele, PhD
Professor, Ghent University
Comparative & Integrative Genomics
VIB – Ghent University, Belgium
http://www.bits.vib.be
Workflow
Applications of clustering the proteome(s)
Gene families form the basis for evolutionary (or phylogenetic) analysis:
• Detection of orthologs and paralogs
• Gene duplication, family expansions, pseudogene formation and gene loss
• Species taxonomies
• Horizontal Gene Transfer (HGT)
• Evolution of gene structure
  • Introns
  • Protein domain organisation & (re)arrangements
• Base composition and codon usage
I. Structural annotation: genome-wide versus family-wise
Rationale family-wise annotation
• Every gene has different (sequence) characteristics and different genes evolve at different rates; using these family-specific characteristics to determine homologous gene models improves the overall structural annotation quality
Properties:
• Slow & nearly-manual procedure
• High-quality gene models revealing novel biological findings
Workflow family-wise annotation procedure
[Figure: flowchart of the family-wise annotation procedure. Main path: collecting experimental representatives → MSA of experimental representatives → HMMbuild → family HMM profile → HMMsearch against the species X proteome → putative homologs → protein motifs and correction of gene model → classification using phylogenetic trees → detailed characterization. Also shown: ab initio gene prediction, EST/cDNA, BLAST.]
HMMER: http://hmmer.janelia.org/
Experimental representatives
InterProScan
ClustalW + Jalview
PFAM HMM logo
BLAST / HMMsearch
1. Use multiple sequence alignment to create HMM profile
2. Use HMM profile to search for similar proteins
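The two steps above can be illustrated with a deliberately simplified sketch. It does not use HMMER: instead of a profile HMM (which also models insertions and deletions), it builds a gap-free position-specific scoring matrix from a toy alignment and slides it along a candidate sequence. The function names, the pseudocount, the uniform background and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Step 1: turn an ungapped alignment into per-column log-odds scores.

    Simplification: a real profile HMM also has insert/delete states.
    The background is assumed uniform over the 20 amino acids.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_window(profile, sequence):
    """Step 2: slide the profile along a sequence; return (best score, 0-based start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        best = max(best, (score, start))
    return best

# Toy data (hypothetical sequences, not real cyclins)
toy_alignment = ["HAKF", "HSKF", "HTKF"]
toy_profile = build_profile(toy_alignment)
hit_score, hit_start = best_window(toy_profile, "MMHAKFMM")
background_score, _ = best_window(toy_profile, "MMMMMMMM")
```

In practice the profile would be built and searched with dedicated tools such as HMMER; the sketch only shows why a family-specific profile scores true family members higher than unrelated sequence.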
Representatives + putative homologs
Multiple sequence alignments assist in the detection and correction of errors in the structural annotation (missed exon)
The suffix finalcds indicates a corrected gene model, compared to the original gene model generated by ab initio gene prediction
BioEdit Sequence Editor
Representatives + putative homologs
Multiple sequence alignments assist in the detection of errors in the structural annotation (false first exon)
The suffix finalcds indicates a corrected gene model, compared to the original gene model generated by ab initio gene prediction
Examples of family-specific protein motifs
• B-type cyclins have HxKF signature
• Cyclin destruction boxes (B1-type cyclin R-[AV]LGDIGN)
Examples of family-specific protein motifs
• D-type cyclins contain LxCxE Rb-binding motif
• Low conservation of phylogenetic signal at primary sequence level
• General rules are rarely general: exceptions (i.e. missing protein motifs) are frequent and might indicate functional divergence
[Figure panels: Arabidopsis, Rice]
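The family-specific motifs named on these slides can be scanned for with regular expressions. The sketch below assumes that "x" stands for any single residue and that R-[AV]LGDIGN means R, then A or V, then LGDIGN; the function name and the toy sequences are hypothetical and are not real cyclin sequences.

```python
import re

# Motifs from the slides, translated to regular expressions (assumed reading)
MOTIFS = {
    "HxKF signature (B-type cyclins)": re.compile(r"H.KF"),
    "Destruction box (B1-type cyclins)": re.compile(r"R[AV]LGDIGN"),
    "LxCxE Rb-binding motif (D-type cyclins)": re.compile(r"L.C.E"),
}

def scan_motifs(sequence):
    """Return {motif name: [(1-based position, matched residues), ...]}."""
    return {
        name: [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]
        for name, pattern in MOTIFS.items()
    }

# Toy sequences for illustration only
b_like = scan_motifs("MSRALGDIGNAAHPKFTT")
d_like = scan_motifs("MALLCTEDSS")
```

A sequence without any match is not necessarily outside the family: as noted above, missing motifs are frequent and may point to functional divergence.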
Classification using phylogenetic tree construction
D-type cyclins are G1-specific
A- and B-type cyclins are mitotic cyclins
H-type cyclins regulate activity of CDK-activating kinases
• The complexity of the cyclin gene family appears to be higher in plants than in mammals
• Whether there is functional redundancy within A- and B-type cyclins or different regulation (and expression) of some cyclin subclasses remains to be analyzed
Unraveling functional divergence using large-scale expression compendia
[Figure: expression compendium; axes: plant tissues, genes]
Unraveling functional divergence using large-scale expression compendia
[Figure: expression of A-type, B-type and D-type cyclins; axes: plant tissues, genes. Source: Genevestigator]
II. Orthology & paralogy
A major goal of sequence analysis is evolutionary reconstruction. It is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications.
Orthologs are homologous genes that evolved through speciation (i.e. evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the two species considered).
Paralogs are homologous genes that evolved through duplication within the same (perhaps ancestral) genome.
These definitions were first introduced by Fitch (1970).
Orthology & paralogy inference
[Figure: gene phylogenies (genes a1, a2, b1, b2, c1, c2; panels a and b) next to the organism phylogeny (species tree) for species A, B and C, with gene duplication and speciation events marked and outparalogs and inparalogs indicated]
In- and outparalogy
Sonnhammer & Koonin: Orthology, paralogy and proposed classification for paralog subtypes
Tree reconciliation
The automatic detection of speciation and duplication events using a species tree and gene family tree
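As a minimal sketch of tree reconciliation, the code below labels each internal node of a gene tree as a speciation or a duplication by mapping it to the last common ancestor in the species tree (one common textbook approach; the slides do not say which algorithm is meant). Trees are written as nested tuples, and the helper names and the toy trees, which reuse the gene and species labels a1..c2 and A, B, C, are assumptions for illustration.

```python
def clades(tree):
    """List the leaf set of every node in a nested-tuple species tree (node's own set last)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    collected, leaves = [], frozenset()
    for child in tree:
        sub = clades(child)
        collected.extend(sub)
        leaves |= sub[-1]          # last entry is the child's own clade
    collected.append(leaves)
    return collected

def leaf_names(tree):
    """Return the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]

def reconcile(gene_tree, gene_to_species, species_clades, events):
    """Map a gene-tree node to a species-tree clade and record its event type."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]])
    child_maps = [reconcile(c, gene_to_species, species_clades, events)
                  for c in gene_tree]
    covered = frozenset().union(*child_maps)
    # Last common ancestor = smallest species clade containing all covered species
    mapped = min((c for c in species_clades if covered <= c), key=len)
    # Duplication if the node maps to the same species-tree node as one of its children
    event = "duplication" if mapped in child_maps else "speciation"
    events.append((tuple(leaf_names(gene_tree)), event))
    return mapped

# Toy example: a duplication that predates the divergence of species A, B and C
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
gene_to_species = {gene: gene[0].upper() for gene in leaf_names(gene_tree)}
events = []
reconcile(gene_tree, gene_to_species, clades(species_tree), events)
```

In this toy tree the root is called a duplication and all other internal nodes speciations, so a1/b1/c1 are orthologs of each other, while a1 and a2 are paralogs that arose before the speciation events.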
III. Types of proteome analysis
The evolution of multi-domain proteins
Interpreting the output of an all-against-all similarity search
Metrics for sequence similarity:
• E-value, bit score or percent identity
• Alignment coverage
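The metrics listed above can be read from BLAST tabular output. The sketch below assumes the default 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and a separately supplied dictionary of query lengths for the coverage calculation. The thresholds, function names and the two example lines are illustrative assumptions, not recommended values.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity
    evalue: float
    bitscore: float
    coverage: float   # fraction of the query covered by the alignment

def parse_hits(lines, query_lengths):
    """Parse BLAST tabular lines (default 12 columns) into Hit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        q_start, q_end = int(f[6]), int(f[7])
        coverage = (abs(q_end - q_start) + 1) / query_lengths[f[0]]
        hits.append(Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]), coverage))
    return hits

def filter_hits(hits, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits that are both significant and cover enough of the query."""
    return [h for h in hits if h.evalue <= max_evalue and h.coverage >= min_coverage]

# Made-up example lines for a hypothetical query p1 of length 120
example = [
    "p1\tp2\t80.0\t100\t20\t0\t1\t100\t5\t104\t1e-50\t200",
    "p1\tp3\t30.0\t20\t14\t0\t50\t69\t1\t20\t0.5\t25",
]
all_hits = parse_hits(example, {"p1": 120})
kept = filter_hits(all_hits)
```

Combining an E-value cut-off with a coverage cut-off helps to avoid linking proteins that only share a single short domain.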
Clustering of similar sequences
Proteins = vertices (nodes)
Sequence similarity relationships = edges
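With proteins as nodes and similarity relationships as edges, the simplest way to form clusters is to take the connected components of the graph (single-linkage clustering). The sketch below does this with a small union-find structure; it is one basic option and not the algorithm of any specific tool, and the function name and toy data are assumptions.

```python
def cluster_proteins(proteins, edges):
    """Group proteins into connected components of the similarity graph."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the representative, shortening the path on the way
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in edges:
        parent[find(a)] = find(b)   # merge the two components

    groups = {}
    for p in proteins:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(group) for group in groups.values())

# Toy graph: p1-p2-p3 are linked, p4-p5 are linked, p6 has no similar sequence
toy_proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
toy_edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
families = cluster_proteins(toy_proteins, toy_edges)
```

Note that p1 and p3 end up in the same family without being directly similar; this chaining effect is why single linkage can merge unrelated multi-domain proteins and why more advanced clustering methods exist.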
Clustering of similar sequences
Advanced methods for protein (orthology) clustering
Sequence similarity-based
• COG (RBH) [Tatusov 1997]
• InParanoid [Remm et al., 2001]
• Tribe-MCL [Van Dongen 2000]
• OrthoMCL [Li et al., 2003]
Phylogenetic tree-based
• PhylomeDB [Huerta-Cepas et al., 2007]
• Ensembl Compara [Vilella et al., 2008]
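The reciprocal best hit (RBH) criterion underlying the simplest similarity-based methods can be sketched as follows. The input format, a list of (query, subject, bit score) tuples per search direction, and the function names are assumptions; the published methods listed above add further steps that are not shown here.

```python
def best_hits(hits):
    """Return {query: best-scoring subject} from (query, subject, bitscore) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Toy similarity search results between species A and species B
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 180), ("a3", "b1", 90)]
b_vs_a = [("b1", "a1", 195), ("b2", "a2", 170), ("b2", "a1", 160)]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a3 is left without a partner because its best hit b1 prefers a1. This shows the main limitation of plain RBH: it reports one-to-one pairs only and misses inparalogs.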
Overview methodologies
[Figure: overview of orthology inference methodologies (BBH, COG, Inparanoid, reconciliation, species overlap); Gabaldon, 2008]
IV. Resources
Resources (bis)
• Ensembl (Vertebrates)
• Ensembl Genomes (Metazoa, Protists, Fungi, Plants & Bacteria)
• OrthoMCLDB 5 (150 genomes)
• YGOB (>15 Fungi)
Hands-on
Goal: identify and characterize gene family members encoding talin 2 (TLN2)
1. Select Query gene
2. Retrieve homo/orthologs
3. Create multiple sequence alignment
4. Identify conserved positions
5. Create phylogenetic tree and identify ortho/paralogous genes
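Step 4 can be sketched in a few lines once an alignment is available. The code below reports the columns in which the most common residue reaches a chosen fraction of all sequences, counting gaps against conservation. The function name, the threshold and the toy alignment are assumptions; the sequences are invented and are not talin sequences.

```python
from collections import Counter

def conserved_columns(alignment, min_fraction=1.0):
    """Return [(1-based column, residue, fraction)] for conserved alignment columns.

    A column counts as conserved when its most common non-gap residue occurs
    in at least min_fraction of all sequences.
    """
    n_sequences = len(alignment)
    conserved = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        if not counts:
            continue  # column made of gaps only
        residue, count = counts.most_common(1)[0]
        fraction = count / n_sequences
        if fraction >= min_fraction:
            conserved.append((col + 1, residue, fraction))
    return conserved

# Toy alignment (made-up sequences of equal length, "-" marks a gap)
toy_msa = ["MKLV-A", "MRLVGA", "MKLIGA"]
strict = conserved_columns(toy_msa)
relaxed = conserved_columns(toy_msa, min_fraction=0.6)
```

With the default threshold only fully identical columns are reported; lowering it also lists columns with a clear majority residue, which can be a more useful view for divergent family members.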