ARPAnno: a dedicated web tool for Annotation of Actin Related Proteins



Acknowledgments: Ministère de la Culture, de l’Enseignement Supérieur et de la Recherche du Luxembourg, Fonds National de la recherche du Luxembourg,CNRS, INSERM, France


Jean Muller1,3, Yukako Oma2, Laurent Vallar3, Evelyne Friederich3, Olivier Poch1 and Barbara Winsor2

1 Laboratoire de Biologie et Génomique Structurales, IGBMC, CNRS/INSERM/ULP, BP 163, 67404 Illkirch cedex, France.

2 Laboratoire Modèles Levure de Pathologies Humaines, FRE2375, IPCB, CNRS, 21 rue Descartes, 67084 Strasbourg, France.

3 Laboratoire de Biologie Moléculaire, d'Analyse Génique et de Modélisation, CRP-Santé, 42, rue du Laboratoire, L-1911, Luxembourg.

[email protected]

Introduction

Initial set ARP families characterisation

ARPAnno web server

Actin Related Proteins (ARPs) are key players in major biological processes essential to cell life. In the cytoskeleton, the ARP2/3 complex is essential for actin dynamics, and ARP1 and ARP11 are involved in microtubule-based vesicle trafficking. In the nucleus, where functions include transcriptional activation and tumor suppression, ARP4-ARP9 are components of many chromatin modulation complexes (SWI2/SNF2, SWR1, HAT). Conventional actins and ARPs together define a large family of homologous proteins, the actin superfamily, which shares a tertiary structure known as the “actin fold”. Since 1997 (Poch and Winsor), the unified classification of ARPs has comprised 11 families, ordered primarily by decreasing sequence similarity to conventional actin: ARP1 is the most similar and ARP11 the least. Because ARP and actin sequences are closely related, it is often difficult to annotate ARP sequences unambiguously using classical database searches. Discriminative tools that distinguish ARPs from actin are therefore of great interest for understanding the mechanisms in which these proteins are involved. We defined an initial dataset that forms the basis of a multiple alignment of all ARP sequences. This set allows us to characterise each ARP family (sequence identity, specific residues and insertions, phylogenetic distribution) and to implement ARPAnno (http://bips.u-strasbg.fr/ARPAnno), a web server dedicated to ARP sequence annotation.


The number of ARP sequences in the protein database (Uniprot) increased from 29 (1997) to 146 (July 2004). The families fall into 3 groups: >19 sequences each for ARP1-4; >10 for ARP5, ARP6 and ARP8; and ≤ 10 for ARP7, ARP9, ARP10 and ARP11.

The presence and absence distribution across eukaryotes is cross-validated using proteome searches (blastp in Uniprot) and genome exploration (tblastn) in 19 different organisms ranging from T. pseudonana (algae) to H. sapiens (mammals).

GID: global percent identity

pCover: percent sequence coverage

pDR: percent of specific residues

pDI: percent of specific insertions

ARP4 and ARP6 are present in all organisms tested. The nuclear ARPs thus form the minimum ARP package for eukaryotic organisms.

Conclusions and perspectives

•ARPAnno, a new web server for the unambiguous identification of ARP sequences, is available.

•The development of a high-quality multiple alignment of ARP sequences permits the validation of the ARP classification and the definition of family features (residues and insertions).

•In future: keep the ARP MACS up to date and add structural features to ARPAnno.

•Extend the genome exploration.

Validation

All 146 available ARP sequences were correctly annotated.

68 new sequences from a recent version of Uniprot: 36 conventional actins, 3 orphans, 6 ARP1, 7 ARP2, 6 ARP3, 8 ARP4, 1 ARP9 and 1 ARP10, from diverse organisms such as Y. lipolytica, D. hansenii, P. tetraurelia, X. tropicalis and G. gallus.

High quality ARP Multiple Alignment of Complete Sequences (MACS) containing 692 sequences and 146 ARPs.

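Identity measures and family score:

\[ \mathrm{RefID} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{ID}(S_i, S_{REF}) \]

\[ \mathrm{FamID} = \frac{1}{\frac{1}{2}\, n (n-1)} \sum_{i<j}^{n} \mathrm{ID}(S_i, S_j) \]

\[ S_{ARP_i} = 0.2\,\mathrm{GID}_{ARP_i} + 0.1\,\mathrm{pCover}_{ARP_i} + 0.4\,\mathrm{pDR}_{ARP_i} + 0.3\,\mathrm{pDI}_{ARP_i} \]

Here n is the number of sequences in a family, S_i and S_j are sequences of that family, S_REF is the reference actin, and ID is the percent identity between two sequences.

IniID: initial percent identity used to classify ARP families.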
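As a rough illustration of the RefID and FamID measures (mean percent identity of a family to the reference actin, and mean pairwise percent identity inside a family), the following Python sketch computes both from sequences taken from one alignment. The function names, the toy sequences and the choice to skip gapped columns are assumptions made for this example; they are not taken from the ARPAnno implementation.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: columns where either row has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def ref_id(family: list[str], reference: str) -> float:
    """Mean percent identity of the family members to the reference."""
    return sum(percent_identity(s, reference) for s in family) / len(family)


def fam_id(family: list[str]) -> float:
    """Mean percent identity over all n(n-1)/2 unordered pairs."""
    n = len(family)
    if n < 2:
        raise ValueError("need at least two sequences")
    total = sum(percent_identity(a, b) for a, b in combinations(family, 2))
    return total / (n * (n - 1) / 2)


# Toy aligned sequences, invented for illustration only.
toy_family = ["ACDE", "ACDF", "ACGF"]
toy_reference = "ACDE"

if __name__ == "__main__":
    print("RefID:", ref_id(toy_family, toy_reference))
    print("FamID:", fam_id(toy_family))
```

With real data, the rows would come from the multiple alignment rather than from hand-written strings.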

Basic sequence analysis

Decreasing percent identity to reference actin (RefID) for ARP1 to ARP11.

Definition of ARP family features

Legend: insertion; deletion; specific insertion; specific residue or motif.

Distribution of ARP families among eukaryotes

(blastp, ballast, DbClustal, Rascal, DPC)

Presence and absence patterns reveal pairs of ARPs that occur together (ARP2 with ARP3, ARP4 with ARP6, and ARP5 with ARP8). These pairs correlate strongly with the biological data available for ARP-containing complexes.
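One simple way to look for such pairs is to compare the presence and absence profiles of the families across organisms. The sketch below ranks family pairs by the Jaccard similarity of their profiles. The similarity measure, the function names and the toy profiles (with placeholder organism labels) are assumptions chosen for illustration; the poster does not state how the pairs were derived.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets of organisms (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_pairs(profiles: dict[str, set[str]]) -> list[tuple[float, str, str]]:
    """Return all family pairs, most similar profiles first."""
    pairs = [
        (jaccard(profiles[x], profiles[y]), x, y)
        for x, y in combinations(sorted(profiles), 2)
    ]
    # Sort by decreasing similarity, then by family names for stable output.
    return sorted(pairs, key=lambda t: (-t[0], t[1], t[2]))


# Invented toy profiles: each family maps to the organisms where it is found.
toy_profiles = {
    "ARP2": {"org1", "org2", "org3"},
    "ARP3": {"org1", "org2", "org3"},
    "ARP5": {"org1", "org4"},
    "ARP8": {"org1", "org4"},
}

if __name__ == "__main__":
    for similarity, fam_a, fam_b in rank_pairs(toy_profiles):
        print(f"{fam_a} / {fam_b}: {similarity:.2f}")
```

Families whose profiles are identical get a similarity of 1.0 and appear at the top of the ranking.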

•The major ARP families are the nuclear ARP4 and ARP6.


Highlights specific features such as conserved residues or motifs and insertions for ARP1-9. No specific features have been defined for the divergent ARP10 and ARP11.

In-depth searches of the protein database (Uniprot) to retrieve the maximum number of different ARP sequences, using for each family distinct queries from distantly related organisms (i.e. H. sapiens, D. melanogaster and S. cerevisiae) and the PipeAlign program.

Assessment of 11 ARP family classifications.

High family conservation (FamID) for ARP1-3, the main cytoplasmic ARPs, in contrast to the nuclear ARPs and the most divergent ARP10 and ARP11 families.

Altschul, S.F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402.

Creation of an ARP family Knowledge Filter, a cornerstone of the ARP annotation process.

Legend: actin sequence; actin subdomains 1, 2, 3 and 4.

S. pombe and Y. lipolytica have no ARP7 but are the only yeasts out of 31 to have a second ARP4 (ARP4*).

Web interface

[Screenshot: ARPAnno web interface at http://bips.u-strasbg.fr/ARPAnno. Input: a Fasta sequence (example: Q5ZM58_CHICK, hypothetical protein). Output: table results for Actin and ARP1 to ARP11; coloured multiple alignment available.]

Poch, O., and Winsor, B. (1997). Who's who among the Saccharomyces cerevisiae actin-related proteins? A classification and nomenclature proposal for a large family. Yeast 13, 1053-1058.

Plewniak, F., et al. (2003). PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res 31, 3829-3832.

% identity to group of 29 actins. Legend: FamID, mean percent identity inside a family; IniID, initial percent identity used to classify ARP families; RefID, mean ARP family percent identity to reference actin.

http://bips.u-strasbg.fr/PipeAlign

A multi-step process

Input: an unknown potential actin-like protein, for example:

>Q5ZM58_CHICK Hypothetical protein.
MESYDVIANQPVVIDNGSGVIKAGFAGDQIPKYCFPNYVGRPKHVRVMAGALEGDIFIGPKAEEHRGLLSIRYPMEHGIVKDWNDMERIWQYVYSKDQLQTFSEEHPVLLTEAPLNPRKNRERAAEVFFETFNVPALFISMQAVLSLYATGRTTGVVLDSGDGVTHAVPIYEGFAMPHSMRIDIAGRDVSRFLRLYLRKEGYDFHTTSEFEIVKTIKERACYLSINPQKDETLETEKAQYYLPDGSTIEIGSARFRAPELLFRPDLIGEECEGLHEVLVFAIQKSDMDLRRTLFSNIVLSGGSTLFKGFGDRLLSEVKKLAPKDVKIRISAPQERLYSTWIGGSILASLDTFKKMWVSKKEYEEDGARAIHRKTF

1. Local alignment with blastp and determination of the families eligible for the next step using GID and pCover.

2. Global alignment against the reference alignment of each eligible family using clustalw.

3. Filtering for specific residues and motifs (pDR) and insertions (pDI).

4. Calculation of one score for each eligible family and determination of the most suitable ARP family.
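The eligibility and scoring steps of this process can be sketched in Python as follows. The weights 0.2, 0.1, 0.4 and 0.3 for GID, pCover, pDR and pDI are those of the poster's score formula. Everything else is an assumption made for illustration: the function names, the eligibility thresholds `min_gid` and `min_pcover` (the poster gives no cut-off values) and the feature values of the two toy candidates.

```python
# Weights of the four components in the family score (from the poster).
WEIGHTS = {"GID": 0.2, "pCover": 0.1, "pDR": 0.4, "pDI": 0.3}


def eligible_families(candidates, min_gid, min_pcover):
    """Keep families passing the GID and pCover cut-offs.

    Assumption: both thresholds are hypothetical parameters.
    """
    return {
        family: features
        for family, features in candidates.items()
        if features["GID"] >= min_gid and features["pCover"] >= min_pcover
    }


def arp_score(features):
    """Weighted sum of the four percentages for one family."""
    return sum(weight * features[name] for name, weight in WEIGHTS.items())


def best_family(candidates):
    """Return the top-scoring family and the scores of all candidates."""
    if not candidates:
        raise ValueError("no eligible family")
    scores = {family: arp_score(f) for family, f in candidates.items()}
    return max(scores, key=scores.get), scores


# Invented feature values (all in percent) for a hypothetical query.
toy_candidates = {
    "ARP2": {"GID": 60.0, "pCover": 95.0, "pDR": 90.0, "pDI": 100.0},
    "ARP3": {"GID": 40.0, "pCover": 90.0, "pDR": 20.0, "pDI": 0.0},
    "ARP10": {"GID": 15.0, "pCover": 30.0, "pDR": 0.0, "pDI": 0.0},
}

if __name__ == "__main__":
    kept = eligible_families(toy_candidates, min_gid=20.0, min_pcover=50.0)
    family, scores = best_family(kept)
    print("eligible:", sorted(kept))
    print("best family:", family, scores)
```

Because pDR and pDI together carry 70% of the weight, family-specific residues and insertions dominate the decision, while the local alignment mainly serves to select eligible families.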

Four hot spots of insertion/deletion (A, B, C, D) can be seen at the periphery of the core fold.

73340 proteins were detected, representing 4200 non-redundant, “non-fragment” sequences. Unrelated sequences and proteins with ≤ 15% amino acid identity were not included in the final alignment.

Actin is present in all eukaryotic organisms explored.

•The correlation of the distribution of ARPs across organisms with functional data is a benchmark case for phylogenetic profiling methods.

Thompson, J.D., et al. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680.