Identification of Ortholog Groups by OrthoMCL Protein sequences from organisms of interest...

Page 1

Identification of Ortholog Groups by OrthoMCL

Protein sequences from organisms of interest

All-against-all BLASTP

Similarity cutoff: P-value, % overlap

Between species: reciprocal best similarity pairs → putative orthologs

Within species: reciprocal better similarity pairs → (recent) paralogs
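The pair-selection rules on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not OrthoMCL's implementation: the function names, the layout of the `hits` dictionary, and the toy scores are hypothetical, and "reciprocal better" is read here as each protein scoring higher against its same-species partner than against any protein from another species.

```python
def best_cross_hits(hits, species):
    """For each (query, other species), keep the top-scoring subject."""
    best = {}
    for (q, s), score in hits.items():
        if species[q] == species[s]:
            continue
        key = (q, species[s])
        if key not in best or score > best[key][1]:
            best[key] = (s, score)
    return best


def putative_orthologs(hits, species):
    """Between species: pairs that are each other's best hit."""
    best = best_cross_hits(hits, species)
    pairs = set()
    for (q, _), (s, _) in best.items():
        back = best.get((s, species[q]))
        if back is not None and back[0] == q:
            pairs.add(frozenset((q, s)))
    return pairs


def recent_paralogs(hits, species):
    """Within species: pairs that beat both members' best cross-species score."""
    best = best_cross_hits(hits, species)
    ceiling = {}
    for (q, _), (_, score) in best.items():
        ceiling[q] = max(ceiling.get(q, 0.0), score)
    pairs = set()
    for (q, s), score in hits.items():
        if q != s and species[q] == species[s]:
            back = hits.get((s, q), 0.0)
            if score > ceiling.get(q, 0.0) and back > ceiling.get(s, 0.0):
                pairs.add(frozenset((q, s)))
    return pairs


# Toy symmetric scores (hypothetical values, stored in both directions).
species = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
undirected = {("A1", "A2"): 200, ("A1", "B1"): 150, ("B1", "B2"): 220}
hits = {}
for (x, y), score in undirected.items():
    hits[(x, y)] = score
    hits[(y, x)] = score

orthologs = putative_orthologs(hits, species)
paralogs = recent_paralogs(hits, species)
print(sorted(sorted(p) for p in orthologs))
print(sorted(sorted(p) for p in paralogs))
```

Keying the best hit by (query, other species) lets the same sketch handle more than two species, since the reciprocal test is then made separately for each species pair.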

Page 2

Similarity Matrix

Markov Clustering

Ortholog groups with (recent) paralogs

Cluster tightness: Inflation values (I)

Page 3

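Similarity Matrix (entries are similarity scores)

        A1    A2    B1    B2
A1       ─   200   150     0
A2     200     ─     0     0
B1     150     0     ─   220
B2       0     0   220     ─

Species A: A1 and A2 are paralogs (similarity score 200)

Species B: B1 and B2 are paralogs (similarity score 220)

A1 and B1 are orthologs (similarity score 150)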
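As a small illustration of how the scored pairs on this slide map onto a symmetric similarity matrix, here is a sketch in Python. The function name and the edge-dictionary layout are assumptions made for the example; the diagonal, shown as "─" on the slide, is stored as 0.

```python
def similarity_matrix(labels, edges):
    """Build a symmetric matrix from undirected weighted pairs."""
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for (x, y), score in edges.items():
        i, j = index[x], index[y]
        matrix[i][j] = score
        matrix[j][i] = score  # similarity is treated as undirected
    return matrix


labels = ["A1", "A2", "B1", "B2"]
edges = {("A1", "A2"): 200, ("A1", "B1"): 150, ("B1", "B2"): 220}
matrix = similarity_matrix(labels, edges)
for name, row in zip(labels, matrix):
    print(name, row)
```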

Page 4

Markov Clustering (MCL) Algorithm

Similarity Matrix

Markov Matrix (transition probability matrix)

Matrix Expansion (matrix powering)

Matrix Inflation (entry powering)

Terminate when no further change

Final matrix as clustering
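The loop on this slide can be sketched in pure Python as below. This is a simplified illustration, not the MCL program itself: the self-loop weighting (largest off-diagonal entry of each column), the convergence tolerance, and the way clusters are read off the final matrix are assumptions chosen to keep the example short.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: matrix powering (here, squaring)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(similarity)
    if n == 0:
        return []
    m = [[float(x) for x in row] for row in similarity]
    for j in range(n):
        # Assumption: self-loop weight = largest off-diagonal entry in the column.
        off_diag = [m[i][j] for i in range(n) if i != j]
        m[j][j] = max(off_diag, default=0.0) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = nxt
        if change < tol:  # terminate when no further change
            break
    # Each row with non-zero entries is an attractor; its non-zero
    # columns are the members of that cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Two unconnected pairs (hypothetical scores) stay in two separate clusters.
two_pairs = [[0, 200, 0, 0], [200, 0, 0, 0], [0, 0, 0, 220], [0, 0, 220, 0]]
print(mcl(two_pairs))
```

Raising the `inflation` argument makes the inflation step favor strong transitions more sharply, which is how the parameter I controls cluster tightness.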

Page 5

Application of OrthoMCL to Plasmodium, human and other model organisms

Plasmodium falciparum, Human, Arabidopsis, Worm, Fly, Yeast, E. coli

6241 ortholog groups

160 all included

551 only Eukaryotes

1182 only Metazoa

24 only Plasmodium & Arabidopsis

114 Plasmodium, not human
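Counts like these come from filtering ortholog groups by which species they contain. A minimal sketch of such a species-distribution query is shown below; the group IDs, protein names, and the `groups_matching` helper are hypothetical and only illustrate the idea.

```python
def groups_matching(groups, required=(), excluded=()):
    """Return IDs of groups containing every required species and no excluded one."""
    required, excluded = set(required), set(excluded)
    matches = []
    for gid, members in groups.items():
        present = {sp for sp, _ in members}
        if required <= present and not (excluded & present):
            matches.append(gid)
    return sorted(matches)


# Hypothetical groups: each member is a (species, protein) pair.
groups = {
    "G1": [("plasmodium", "p1"), ("human", "h1"), ("yeast", "y1")],
    "G2": [("plasmodium", "p2"), ("arabidopsis", "a1")],
    "G3": [("human", "h2"), ("fly", "f1"), ("worm", "w1")],
}

# Groups with a Plasmodium member but no human member.
no_human = groups_matching(groups, required=["plasmodium"], excluded=["human"])
print(no_human)
```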

Page 6

An Example of Gamma-tubulin Ortholog Group

Page 7

Page 8

Comparing OrthoMCL with INPARANOID (two species)

• INPARANOID clusters both orthologs and in-paralogs from two species by pairwise similarity

– Find two-way best hits from pairwise similarity scores as the main ortholog pair

– Add additional orthologs (in-paralogs) from the same species for each main ortholog by comparing the similarity score between the main ortholog and each putative in-paralog with the score between the main ortholog pair

– Resolve overlapping groups by merging, deleting, or dividing them based on a set of rules

• OrthoMCL can cluster orthologs and in-paralogs from multiple species
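The first two INPARANOID steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not INPARANOID's code: the function names and scores are hypothetical, an in-paralog is taken to be a same-species protein that scores strictly higher against the main ortholog than the main ortholog pair scores, and the rule-based resolution of overlapping groups is omitted.

```python
def two_way_best_hits(cross_scores):
    """Main ortholog pairs: (a, b) where each is the other's best hit."""
    best_for_a, best_for_b = {}, {}
    for (a, b), score in cross_scores.items():
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    return [(a, b, score) for a, (b, score) in best_for_a.items()
            if best_for_b[b][0] == a]


def in_paralogs(main, within_scores, pair_score):
    """Same-species proteins more similar to `main` than its ortholog is."""
    found = []
    for (x, y), score in within_scores.items():
        if score > pair_score and main in (x, y):
            found.append(y if x == main else x)
    return sorted(found)


# Hypothetical scores: species A vs. species B, then within each species.
cross = {("A1", "B1"): 150, ("A2", "B1"): 90}
within_a = {("A1", "A2"): 200}
within_b = {("B1", "B2"): 220}

main_pairs = two_way_best_hits(cross)
for a, b, score in main_pairs:
    print(a, b, in_paralogs(a, within_a, score), in_paralogs(b, within_b, score))
```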

Page 9

I. Yeast – Worm dataset (estimation)

Yeast: 6358 proteins; Worm: 19774 proteins

INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702), 1805 groups

OrthoMCL (I = ?): 4428 proteins (Yeast: 2158, Worm: 2270), ? groups (paralog groups?)

3931 same from both methods

? Coherent grouping

Page 10

Contained groups: an OrthoMCL group contained in an INPARANOID group, or an INPARANOID group contained in an OrthoMCL group

Coherent groups = same groups + contained groups
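The definition above can be expressed as a small classification routine. The sketch below assumes both methods' groups have already been restricted to the sequences reported by both; the function name and the toy groups are hypothetical.

```python
def classify(group, other_groups):
    """Label one group against another method's groups."""
    group = frozenset(group)
    others = [frozenset(o) for o in other_groups]
    if any(group == other for other in others):
        return "same"
    # Proper subset in either direction counts as a contained group.
    if any(group < other or other < group for other in others):
        return "contained"
    return "different"


# Hypothetical groups from two methods.
method_a = [{"a", "b", "c"}, {"d", "e"}, {"f", "g"}]
method_b = [{"a", "b"}, {"d", "e"}, {"f", "h"}]

labels = [classify(g, method_b) for g in method_a]
coherent = sum(label in ("same", "contained") for label in labels)
print(labels, coherent)
```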

Page 11

Inflation (I)   # groups   # groups of paralogs   % seqs with same grouping*   % seqs with contained grouping*   % seqs with coherent grouping*
2 (tight)       1892       159                    80.2                         16.9                              97.1
1.5             1857       89                     82.4                         14.8                              97.2
1.2             1814       7                      85.4                         11.7                              97.1
1.1 (loose)     1811       2                      85.4                         11.9                              97.3

* Percentage of 3931 sequences identified by both OrthoMCL and INPARANOID

Inflation value (I) regulates cluster tightness

So, choose I = 1.1 as the optimal inflation value
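The table's arithmetic can be checked with a few lines of Python: coherent grouping is the sum of same and contained grouping. Selecting the row with the highest coherent percentage is an assumption about how the choice was made; the low number of paralog-only groups at I = 1.1 may also have played a part.

```python
rows = [
    # (inflation, groups, paralog groups, % same, % contained)
    (2.0, 1892, 159, 80.2, 16.9),
    (1.5, 1857, 89, 82.4, 14.8),
    (1.2, 1814, 7, 85.4, 11.7),
    (1.1, 1811, 2, 85.4, 11.9),
]

# Coherent = same + contained, rounded to one decimal as in the table.
coherent = {i: round(same + contained, 1) for i, _, _, same, contained in rows}
best_inflation = max(coherent, key=coherent.get)
print(coherent, best_inflation)
```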

Page 12

Possible reasons for including different sequences

OrthoMCL vs. INPARANOID (each row lists OrthoMCL first)

BLAST version: WU-BLAST vs. NCBI-BLAST

BLAST search: all-against-all, SEG filtered, fixed database size vs. pairwise

Similarity cutoff: P < 1e-5 vs. score >= 50 bits; overlap > 50%

Reciprocal “best” hits: P-value, percent identity vs. score

Recent paralogs: bi-directional better within-species similarity vs. one-way better within-species similarity from orthologs

Page 13

Yeast: 6358 proteins; Worm: 19774 proteins

INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702), 1805 groups

OrthoMCL (I = 1.1): 3949 proteins (Yeast: 1927, Worm: 2022), 1614 groups

3765 same from both methods

86.3% same groups, 98.1% coherent groups

Default parameters: Similarity cutoff: P-value < 1e-5, overlap > 50%; Cluster tightness: Inflation value I = 1.1
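The default similarity cutoff can be read as a simple filter over BLAST hits. The sketch below is illustrative only: the `Hit` record, its field names, and the sample hits are hypothetical, and how percent overlap is computed from an alignment is not specified on the slide.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    p_value: float
    percent_overlap: float


def passes_cutoff(hit, max_p=1e-5, min_overlap=50.0):
    """Keep hits with P-value < 1e-5 and overlap > 50% (the stated defaults)."""
    return hit.p_value < max_p and hit.percent_overlap > min_overlap


# Hypothetical hits for illustration.
hits = [
    Hit("y1", "w1", 1e-40, 92.0),  # strong, long alignment: kept
    Hit("y1", "w2", 1e-3, 80.0),   # P-value too high: dropped
    Hit("y2", "w3", 1e-12, 35.0),  # overlap too short: dropped
]
kept = [h for h in hits if passes_cutoff(h)]
print([(h.query, h.subject) for h in kept])
```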

Page 14

II. Worm – Fly dataset (test)

Worm: 19774 proteins; Fly: 13288 proteins

INPARANOID: 10100 proteins (Worm: 5399, Fly: 4761), 3988 groups

OrthoMCL (I = 1.1): 9623 proteins (Worm: 4997, Fly: 4626), 3764 groups

8856 same from both methods

86% same groups, 98% coherent groups

In conclusion: OrthoMCL and INPARANOID have similar clustering behavior when comparing two species

Page 15

Comparison of OrthoMCL with EGO (multiple species)

III. Yeast – Worm – Fly dataset

EGO: TC/NP protein sequences, BLASTP

10260 seqs → remove redundancy → 4776 proteins

4776 unique proteins formed 3125 unique groups

OrthoMCL: 12459 proteins formed 4033 groups

Page 16

4392 same proteins from both

44.2% same groups

2.3% OrthoMCL contained in EGO

62% EGO contained in OrthoMCL

93.8% coherent groups

Page 17

An Example: EGO Groups contained by OrthoMCL Groups

Fly: Hsc70-1, Hsc70-4

Yeast: SSA1, SSA2, SSA3, SSA4

Worm: Hsp-1

EGO: Hsp-1, Hsc70-4, SSA2

OrthoMCL: Hsp-1, Hsc70-1, Hsc70-4, SSA1, SSA2, SSA3, SSA4

Page 18

Back to Apicomplexa …

5333 Proteins

1846 orthologous to the other 6 organisms

1693 orthologous to Arabidopsis

483 orthologous to E. coli

1421 orthologous to yeast

1771 orthologous to fly, worm or human

1824 non-orthologous to human

Page 19

Summary

• OrthoMCL automatically delineates the many-to-many orthologous relationship across multiple eukaryotic genomes

• When applied to pairwise comparison of two species, the performance of OrthoMCL is comparable to INPARANOID, which was designed for comparing two species

• When applied to multiple species and compared with the EGO database, OrthoMCL tends to identify more orthologous genes

• The underlying object-based relational storage model permits integration with organismal data, and queries based on user-defined species distribution provide a snapshot of shared/diversified biological processes across species

Page 20

Related Posters and Reference

• 114A. Web-Based Biological Discovery using an Integrated Database.

• 146A. The Genomics Unified Schema (GUS).

• 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.

• Remm et al. Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. J. Mol. Biol. (2001) 314

• Lee et al. Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. (2002) 12

• Enright et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30