Identification of Ortholog Groups by OrthoMCL
Protein sequences from organisms of interest
All-against-all BLASTP
Between species: reciprocal best similarity pairs (putative orthologs)
Within species: reciprocal better similarity pairs ((recent) paralogs)
Similarity cutoff: P-value, % overlap
Similarity matrix
Markov clustering (cluster tightness: inflation value I)
Ortholog groups with (recent) paralogs
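The pair-selection step above can be sketched in Python as follows. This is a minimal illustration, not the OrthoMCL implementation: the `species|name` ID convention, the function name `find_pairs`, and the raw scores (including the weak A2-B1 and A1-B2 hits) are assumptions made for the example, and a higher score is taken to mean greater similarity.

```python
from collections import defaultdict

def species_of(protein_id):
    # Assumed ID convention for this sketch: "<species>|<protein name>".
    return protein_id.split("|")[0]

def find_pairs(scores):
    """scores maps (query, hit) -> similarity score, in both directions."""
    best_per_species = defaultdict(float)  # best hit of a query in one other species
    best_any_cross = defaultdict(float)    # best hit of a query in any other species
    for (q, h), s in scores.items():
        if species_of(q) != species_of(h):
            key = (q, species_of(h))
            best_per_species[key] = max(best_per_species[key], s)
            best_any_cross[q] = max(best_any_cross[q], s)

    orthologs, paralogs = set(), set()
    for (q, h), s in scores.items():
        back = scores.get((h, q))
        if back is None or q >= h:
            continue  # need both directions; visit each unordered pair once
        if species_of(q) != species_of(h):
            # Reciprocal best: each protein is the other's top hit in that species.
            if (s == best_per_species[(q, species_of(h))]
                    and back == best_per_species[(h, species_of(q))]):
                orthologs.add((q, h))
        else:
            # Reciprocal better: each is closer to the other than to any
            # protein from a different species.
            if s > best_any_cross.get(q, 0.0) and back > best_any_cross.get(h, 0.0):
                paralogs.add((q, h))
    return orthologs, paralogs

# Hypothetical raw scores for two species, A and B.
raw = {
    ("A|A1", "B|B1"): 150, ("A|A1", "A|A2"): 200, ("B|B1", "B|B2"): 220,
    ("A|A2", "B|B1"): 90, ("A|A1", "B|B2"): 80,
}
scores = {}
for (a, b), s in raw.items():
    scores[(a, b)] = s
    scores[(b, a)] = s

orthologs, paralogs = find_pairs(scores)
print(orthologs)
print(paralogs)
```

Only the pairs that pass these tests would contribute non-zero entries to the similarity matrix; the P-value and overlap cutoffs are left out of the sketch.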
Similarity Matrix (entries are similarity scores)

       A1    A2    B1    B2
A1      ─   200   150     0
A2    200     ─     0     0
B1    150     0     ─   220
B2      0     0   220     ─

Species A: A1 and A2 are paralogs (score 200)
Species B: B1 and B2 are paralogs (score 220)
A1 and B1 are orthologs (score 150)
Markov Clustering (MCL) Algorithm
Similarity matrix
Markov matrix (transition probability matrix)
Matrix expansion (matrix powering)
Matrix inflation (entry powering)
Repeat expansion and inflation; terminate when there is no further change
Final matrix as clustering
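The loop above can be sketched with NumPy as follows. This is a toy illustration under stated assumptions, not the MCL program used by OrthoMCL: the function name `mcl`, the self-loop weighting, the convergence tolerance, and the way clusters are read from the final matrix are choices made for this example.

```python
import numpy as np

def mcl(similarity, inflation=1.5, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a symmetric, non-negative, non-zero matrix."""
    m = np.array(similarity, dtype=float)
    # Assumption: add self-loops equal to the largest score.
    np.fill_diagonal(m, m.max())
    # Markov matrix: scale each column to sum to 1 (transition probabilities).
    m /= m.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        previous = m
        m = m @ m            # expansion: matrix powering
        m = m ** inflation   # inflation: entry-wise powering
        m /= m.sum(axis=0, keepdims=True)
        if np.allclose(m, previous, atol=tol):
            break            # no further change
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(np.flatnonzero(row > tol).tolist()) for row in m}
    clusters.discard(frozenset())
    return m, sorted(sorted(c) for c in clusters)

# The four-protein example matrix, in the order A1, A2, B1, B2.
labels = ["A1", "A2", "B1", "B2"]
toy = [
    [0, 200, 150, 0],
    [200, 0, 0, 0],
    [150, 0, 0, 220],
    [0, 0, 220, 0],
]
_, found = mcl(toy, inflation=1.5)
print([[labels[i] for i in cluster] for cluster in found])
```

Raising `inflation` sharpens the contrast between strong and weak transitions at every round, which is why it acts as the cluster-tightness parameter.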
Application of OrthoMCL to Plasmodium, human and other model organisms
Plasmodium falciparum, Human, Arabidopsis, Worm, Fly, Yeast, E. coli
6241 ortholog groups
160 all included
551 only Eukaryotes
1182 only Metazoa
24 only Plasmodium & Arabidopsis
114 Plasmodium, not human
…
An Example of Gamma-tubulin Ortholog Group
Comparing OrthoMCL with INPARANOID (two species)
• INPARANOID clusters both orthologs and in-paralogs from two species by pairwise similarity
– Find two-way best hits from pairwise similarity scores as the main ortholog pair
– Add additional orthologs (in-paralogs) from the same species for each main ortholog by comparing the similarity score between the main ortholog and each putative in-paralog with the score between the main ortholog pair
– Resolve overlapping groups by merging, deleting, or dividing them based on a set of rules
• OrthoMCL can cluster orthologs and in-paralogs from multiple species
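The in-paralog step described for INPARANOID can be illustrated with a short Python sketch. It is a simplification under assumptions: the function name, the protein names, the scores, and the use of `>=` as the comparison are hypothetical, and the rules for resolving overlapping groups are not modelled.

```python
def group_with_inparalogs(main_a, main_b, main_score, within_a, within_b):
    """Build one group from a main ortholog pair.

    within_a / within_b map each candidate in-paralog to its similarity
    score against the main ortholog of the same species.
    """
    group = {main_a, main_b}
    for candidates in (within_a, within_b):
        for protein, score in candidates.items():
            # Assumption: a candidate joins the group when it is at least as
            # similar to its main ortholog as the two main orthologs are to
            # each other.
            if score >= main_score:
                group.add(protein)
    return group

# Hypothetical yeast (y*) and worm (w*) proteins and scores.
group = group_with_inparalogs(
    main_a="yA", main_b="wA", main_score=300,
    within_a={"yB": 350, "yC": 120},
    within_b={"wB": 310},
)
print(sorted(group))
```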
I. Yeast – Worm dataset (estimation)
Input: Yeast: 6358 proteins; Worm: 19774 proteins
INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702) in 1805 groups
OrthoMCL (I = ?): 4428 proteins (Yeast: 2158, Worm: 2270); groups: ? (paralog groups?)
3931 proteins the same from both methods
? Coherent grouping
Contained groups: an OrthoMCL group contained in an INPARANOID group, or an INPARANOID group contained in an OrthoMCL group
Coherent groups = same groups + contained groups
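The same / contained / coherent bookkeeping can be sketched in Python as follows. This is one possible reading, not the authors' procedure: the function name, the toy groups, and the choice to compare groups only over sequences assigned by both methods are assumptions made for the example.

```python
def classify_groupings(groups_a, groups_b):
    """Label each sequence grouped by both methods as same, contained, or other."""
    member_a = {seq: frozenset(g) for g in groups_a for seq in g}
    member_b = {seq: frozenset(g) for g in groups_b for seq in g}
    shared = set(member_a) & set(member_b)
    labels = {}
    for seq in shared:
        # Assumption: compare groups restricted to the shared sequences.
        ga = member_a[seq] & shared
        gb = member_b[seq] & shared
        if ga == gb:
            labels[seq] = "same"
        elif ga < gb or gb < ga:   # one group is a proper subset of the other
            labels[seq] = "contained"
        else:
            labels[seq] = "other"
    return labels

# Hypothetical groups from the two methods.
orthomcl_groups = [{"y1", "w1", "w2"}, {"y2", "w3"}, {"y3", "w4"}]
inparanoid_groups = [{"y1", "w1"}, {"y2", "w3"}, {"y3", "w2"}]

labels = classify_groupings(orthomcl_groups, inparanoid_groups)
coherent = sum(v in ("same", "contained") for v in labels.values()) / len(labels)
print(labels)
print(f"coherent fraction: {coherent:.3f}")
```

In this toy input, w2 is grouped with different partners by the two methods, so it is the only sequence that falls outside the coherent set.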
Inflation (I)   # groups   # groups of paralogs   % seqs same*   % seqs contained*   % seqs coherent*
2               1892       159                    80.2           16.9                97.1
1.5             1857       89                     82.4           14.8                97.2
1.2             1814       7                      85.4           11.7                97.1
1.1             1811       2                      85.4           11.9                97.3
* Percentage of the 3931 sequences identified by both OrthoMCL and INPARANOID, by type of grouping
Inflation value (I) regulates cluster tightness: higher values give tighter clusters, lower values give looser clusters
So, choose I = 1.1 as the optimal inflation value
Possible reasons for including different sequences

                         OrthoMCL                          INPARANOID
BLAST version            WU-BLAST                          NCBI-BLAST
BLAST search             All-against-all, SEG filtered,    Pairwise
                         fixed database size
Similarity cutoff        P < 1e-5                          Score >= 50 bits, overlap > 50%
Reciprocal "best" hits   P-value, percent identity         Score
Recent paralogs          Bi-directional better             One-way better within-species
                         within-species similarity         similarity from orthologs
Input: Yeast: 6358 proteins; Worm: 19774 proteins
INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702) in 1805 groups
OrthoMCL (I = 1.1): 3949 proteins (Yeast: 1927, Worm: 2022) in 1614 groups
3765 proteins the same from both methods
86.3% same groups, 98.1% coherent groups
Default parameters:
Similarity cutoff: P-value < 1e-5, overlap > 50%
Cluster tightness: inflation value I = 1.1
II. Worm – Fly dataset (test)
Input: Worm: 19774 proteins; Fly: 13288 proteins
INPARANOID: 10100 proteins (Worm: 5399, Fly: 4761) in 3988 groups
OrthoMCL (I = 1.1): 9623 proteins (Worm: 4997, Fly: 4626) in 3764 groups
8856 proteins the same from both methods
86% same groups, 98% coherent groups
In conclusion: OrthoMCL and INPARANOID have similar clustering behavior when comparing two species
Comparison of OrthoMCL with EGO (multiple species)
III. Yeast – Worm – Fly dataset
EGO: TC/NP, protein sequences, BLASTP
10260 seqs, remove redundancy: 4776 proteins
4776 unique proteins formed 3125 unique groups
OrthoMCL: 12459 proteins formed 4033 groups
4392 same proteins from both
44.2% same groups
2.3% OrthoMCL contained in EGO
62% EGO contained in OrthoMCL
93.8% coherent groups
An Example: EGO Groups contained by OrthoMCL Groups
Fly: Hsc70-1, Hsc70-4
Worm: Hsp-1
Yeast: SSA1, SSA2, SSA3, SSA4
EGO: Hsp-1, Hsc70-4, SSA2
OrthoMCL: Hsp-1, Hsc70-1, Hsc70-4, SSA1, SSA2, SSA3, SSA4
Back to Apicomplexa …
5333 Proteins
1846 orthologous to the other 6 organisms
1693 orthologous to Arabidopsis
483 orthologous to E. coli
1421 orthologous to yeast
1771 orthologous to fly, worm or human
1824 non-orthologous to human
Summary
• OrthoMCL automatically delineates many-to-many orthologous relationships across multiple eukaryotic genomes
• When applied to pairwise comparison of two species, the performance of OrthoMCL is comparable to that of INPARANOID, which was designed for comparing two species
• When applied to multiple species and compared with the EGO database, OrthoMCL tends to identify more orthologous genes
• The underlying object-based relational storage model permits integration with organismal data, and queries based on a user-defined species distribution provide a snapshot of shared/diversified biological processes across species
Related Posters and References
• 114A. Web-Based Biological Discovery using an Integrated Database.
• 146A. The Genomics Unified Schema (GUS).
• 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
• Remm et al. Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. J. Mol. Biol. (2001) 314
• Lee et al. Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. (2002) 12
• Enright et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30