Ortholog assignment
Computational Prediction of Orthologs
Melvin Zhang
School of Computing, National University of Singapore
May 4, 2011
A gene is a unit of heredity in a living organism
One gene may encode multiple proteins
Two genes are homologous if they descended from a common ancestral gene [1]
In practice, homology is determined using sequence alignment.
Figure: A sequence alignment of two proteins
Have you seen phrases like “high homology”, “significant homology”, or “35% homology”?
[1] With respect to a specific speciation event
Orthologs are due to speciation, paralogs are due
to duplication
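Figure: Gene tree rooted at the MRCA of G and H, with one speciation node and one duplication node leading to genes g, h, and h′; the pairs are labelled main orthologs, orthologs, and paralogs.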
Orthologs maintain their function
Annotate genes with unknown functions.
Infer protein-protein interactions.
Orthologs are not one-to-one due to lineage-specific gene duplications
Main orthologs are orthologs that have retained their ancestral position. [2]
Figure: The gene tree of g, h, and h′ shown earlier, with main orthologs, orthologs, and paralogs labelled.
[2] Burgetz et al., Evolutionary Bioinformatics 2006
Problem of identifying main orthologs
Input: Position and sequences of genes in 2 genomes
Output: For each gene in their common ancestor, find its direct descendant in G and H
Complications
- gene duplication
- gene loss
- horizontal gene transfer
- gene fusion, fission
Three main approaches for finding orthologs
Graph based, tree based, rearrangement based
Bidirectional Best Hit and variants
Bidirectional best hit: most popular approach. High level of functional relatedness. [a]
Reciprocal smallest distance: use an evolutionary distance estimate instead of BLAST scores
OMA stable pairs: introduce a tolerance interval and stable matching
[a] Altenhoff et al., PLoS CB 2009
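The bidirectional best hit idea can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the tools named above: the function name, the gene identifiers, and the scores are made up, and the scores are assumed to be precomputed similarity values where higher is better (for example BLAST bit scores). Ties go to whichever pair is seen first.

```python
def bidirectional_best_hits(scores):
    """Return pairs (g, h) where h is g's best hit and g is h's best hit.

    scores maps (gene_in_G, gene_in_H) -> similarity score (higher is better).
    """
    best_for_g = {}  # gene in G -> its best-scoring gene in H
    best_for_h = {}  # gene in H -> its best-scoring gene in G
    for (g, h), s in scores.items():
        if g not in best_for_g or s > scores[(g, best_for_g[g])]:
            best_for_g[g] = h
        if h not in best_for_h or s > scores[(best_for_h[h], h)]:
            best_for_h[h] = g
    # Keep only the pairs that are best hits in both directions.
    return sorted((g, h) for g, h in best_for_g.items() if best_for_h[h] == g)


# Hypothetical scores: g2's best hit is h2, but h2 prefers g1,
# so (g2, h2) is not reported.
scores = {
    ("g1", "h1"): 900,
    ("g1", "h2"): 850,
    ("g2", "h1"): 300,
    ("g2", "h2"): 400,
}
print(bidirectional_best_hits(scores))
```

The variants on this slide keep the same reciprocal structure and change what goes into `scores` or how strictly "best" is interpreted.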
EnsemblCompara GeneTrees [3]
Figure: Species tree for 4 species on top gene tree for gene A
Based on reconciliation of gene trees with species tree.
1. Partition genes into families and construct gene trees
2. Reconcile each gene tree and species tree
[3] Vilella et al., Genome Res 2009
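To make step 2 concrete, here is a small sketch of one standard way to reconcile a gene tree with a species tree: map every gene-tree node to the lowest common ancestor (LCA) of the species below it, and call the node a duplication if it maps to the same species-tree node as one of its children. This is not claimed to be the EnsemblCompara procedure. The nested-tuple tree encoding, the "species:gene" leaf names, and all function names are assumptions made for the example.

```python
def clades(tree):
    """Leaf sets of every node of a nested-tuple species tree (root last)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, leaves = [], frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]  # the last entry is the child's own clade
    result.append(leaves)
    return result


def lca(species, all_clades):
    """Smallest species-tree clade that contains every species in the set."""
    return min((c for c in all_clades if species <= c), key=len)


def label_events(gene_tree, all_clades, events):
    """Map gene_tree to a species-tree node and record one event per internal node."""
    if isinstance(gene_tree, str):  # leaf written as "species:gene"
        return frozenset([gene_tree.split(":")[0]]), [gene_tree]
    child_maps, leaves, species = [], [], frozenset()
    for child in gene_tree:
        child_map, child_leaves = label_events(child, all_clades, events)
        child_maps.append(child_map)
        leaves.extend(child_leaves)
        species |= child_map
    node_map = lca(species, all_clades)
    # Same mapping as a child means the split happened inside one lineage.
    kind = "duplication" if node_map in child_maps else "speciation"
    events.append((tuple(leaves), kind))
    return node_map, leaves


# Toy input mirroring the g, h, h' picture: a duplication in H after speciation.
species_tree = ("G", "H")
gene_tree = ("G:g", ("H:h", "H:h'"))
events = []
label_events(gene_tree, clades(species_tree), events)
for leaves, kind in events:
    print(kind, leaves)
```

Pairs of genes whose most recent common node is a speciation are then read off as orthologs, and pairs separated by a duplication as paralogs.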
MSOAR2 [4]
Figure: Rearrangement scenario between human and mouse
1. Partition genes into families and assign a unique symbol
2. Reconstruct the most parsimonious rearrangement (inversion, translocation, fusion, fission, duplication)
3. Extract the corresponding orthologs
[4] Fu et al., JCB 2007
Can conserved gene neighborhood improve ortholog predictions?
Human-mouse synteny blocks
Conserved synteny blocks between human and mouse genome generated by the Cinteny web server [5]
[5] Sinha and Meller, BMC Bioinformatics 2007
Local synteny criteria [6]
Figure: Local synteny: more than one unique match within +/- 3 genes. Homology defined as BLASTP E-value < 1e-5
94% of sampled inter-species pairs are identified as orthologs by Inparanoid (based on BBH) and local synteny criteria.
[6] Jin Jun et al., BMC Genomics 2009
Local synteny score (LC)
Figure: Genes g (in G) and h (in H) with their neighbouring genes and the edges between them.
The local synteny score of g and h is 4 since there are 4 edges in the maximum matching.
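The score can be computed as the size of a maximum bipartite matching between the neighbours of g and the neighbours of h. The sketch below uses a simple augmenting-path search. Several details are assumptions not stated on the slide: the neighbourhood is taken to be the genes within 3 positions on either side (excluding g and h themselves), an edge joins two neighbours when the pair appears in a precomputed `homologous` set, and gene names are unique within a genome.

```python
def max_matching_size(adj):
    """Size of a maximum bipartite matching; adj maps left node -> right neighbours."""
    match_right = {}  # right node -> left node currently matched to it

    def try_augment(u, seen):
        for v in adj.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in adj)


def local_synteny_score(genome_g, i, genome_h, j, homologous, window=3):
    """Matching size between neighbours of genome_g[i] and genome_h[j]."""
    near_g = [x for k, x in enumerate(genome_g) if k != i and abs(k - i) <= window]
    near_h = [y for k, y in enumerate(genome_h) if k != j and abs(k - j) <= window]
    adj = {x: [y for y in near_h if (x, y) in homologous] for x in near_g}
    return max_matching_size(adj)


# Hypothetical gene orders around g and h.
genome_g = ["a1", "b1", "g", "c1", "d1"]
genome_h = ["a2", "b2", "h", "c2", "d2"]
homologous = {("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2"), ("a1", "b2")}
print(local_synteny_score(genome_g, 2, genome_h, 2, homologous))

# Two neighbours competing for the same partner count only once.
print(local_synteny_score(genome_g, 2, genome_h, 2, {("a1", "a2"), ("b1", "a2")}))
```

Using a matching rather than a raw edge count keeps a single gene with many homologous neighbours from inflating the score.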
Smith-Waterman alignment score (SW)
BBH-LS: bidirectional best hits based on linear
combination of SW and LC
sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
where f weights local synteny (LC) against sequence similarity (SW).
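A small numeric sketch shows how the weight f can change which candidate scores highest. The values are invented, and both SW and LC are assumed to have been rescaled to the range 0 to 1 beforehand; the slides do not say how the two scores are normalised, so that step is left out here.

```python
def sim(sw, lc, f):
    """Linear combination of sequence similarity (sw) and local synteny (lc)."""
    return (1 - f) * sw + f * lc


def best_partner(g, candidates, sw, lc, f):
    """Candidate h with the highest combined similarity to g."""
    return max(candidates, key=lambda h: sim(sw[(g, h)], lc[(g, h)], f))


# Hypothetical, pre-normalised scores: h1 aligns better, h2 has the better neighbourhood.
sw = {("g", "h1"): 0.9, ("g", "h2"): 0.7}
lc = {("g", "h1"): 0.2, ("g", "h2"): 1.0}

print(best_partner("g", ["h1", "h2"], sw, lc, f=0.0))  # sequence only
print(best_partner("g", ["h1", "h2"], sw, lc, f=0.5))  # equal weight
```

With f = 0 the method reduces to plain BBH on SW; raising f lets a well-conserved neighbourhood outweigh a somewhat lower alignment score.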
Human-Mouse-Rat dataset
Input: Human, mouse, and rat genes downloaded from Ensembl.
Benchmark: No “golden” benchmark for true orthology. Assume that orthologs are assigned the same gene symbol.
Tuning the BBH-LS method
sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)
Figure: Performance of BBH-LS for different ratios of spatial similarity to sequence similarity on the human-mouse dataset.
Results for various methods on Human-Mouse
Figure: TP: same gene symbols, FP: different gene symbols
More true positives and fewer false positives than MSOAR2.
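The gene-symbol benchmark amounts to a simple count over predicted pairs. The sketch below is an illustration under assumptions: the gene identifiers are placeholders, symbols are compared case-insensitively (human and mouse symbols often differ only in capitalisation), and pairs where either gene lacks a symbol are skipped. None of these choices is confirmed by the slides.

```python
def count_tp_fp(predicted_pairs, symbol_g, symbol_h):
    """Count predicted pairs with matching (TP) and differing (FP) gene symbols."""
    tp = fp = 0
    for g, h in predicted_pairs:
        sg, sh = symbol_g.get(g), symbol_h.get(h)
        if sg is None or sh is None:
            continue  # assumption: unnamed genes cannot be scored
        if sg.upper() == sh.upper():
            tp += 1
        else:
            fp += 1
    return tp, fp


# Placeholder identifiers mapped to gene symbols.
human_symbols = {"hs1": "RASGRF2", "hs2": "RASGRF1", "hs3": "CTSH"}
mouse_symbols = {"mm1": "Rasgrf2", "mm3": "Msh3"}
predicted = [("hs1", "mm1"), ("hs2", "mm3"), ("hs3", "mm9")]

print(count_tp_fp(predicted, human_symbols, mouse_symbols))
```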
Results for various methods on Human-Rat
Figure: TP: same gene symbols, FP: different gene symbols
Results for various methods on Mouse-Rat
Figure: TP: same gene symbols, FP: different gene symbols
How local synteny helps
Figure: Gene neighbourhoods on human chr 15 and chr 5 and mouse chr 9 and chr 13, containing the genes CTSH, MSH3, CKMT2, RASGRF2, RASGRF1, and ANKRD34C. Edges are labelled sw = 5265, ls = 1; sw = 2003, ls = 5; and sw = 2466, ls = 5.
Bold edges are the pairing from BBH-LS, thin edges are the pairing from BBH.
BBH paired RASGRF2 (human) to RASGRF1 (mouse) due to high SW, corrected by BBH-LS with LC.
Summary: Identifying main orthologs
For each gene in their common ancestor, find its direct descendant in G and H
Summary: Three approaches
Graph based, tree based, rearrangement based
BBH-LS: bidirectional best hits based on linear
combination of SW and LC