Insights into metazoan evolution from alvinella pompejana ...

Gagnière et al. BMC Genomics 2010, 11:634
http://www.biomedcentral.com/1471-2164/11/634

RESEARCH ARTICLE Open Access

Insights into metazoan evolution from Alvinella pompejana cDNAs

Nicolas Gagnière1, Didier Jollivet2,3, Isabelle Boutet2,3, Yann Brélivet1, Didier Busso1, Corinne Da Silva4, Françoise Gaill5, Dominique Higuet6, Stéphane Hourdez2,3, Bernard Knoops7, François Lallier2,3, Emmanuelle Leize-Wagner8, Jean Mary2,3, Dino Moras1, Emmanuel Perrodou1, Jean-François Rees7, Béatrice Segurens4, Bruce Shillito6, Arnaud Tanguy2,3, Jean-Claude Thierry1, Jean Weissenbach4, Patrick Wincker4, Franck Zal2,3, Olivier Poch1, Odile Lecompte1*

Abstract

Background: Alvinella pompejana is a representative of Annelids, a key phylum for evo-devo studies that is still poorly studied at the sequence level. A. pompejana inhabits deep-sea hydrothermal vents and is currently known as one of the most thermotolerant Eukaryotes in marine environments, withstanding the largest known chemical and thermal ranges (from 5 to 105°C). This tube-dwelling worm forms dense colonies on the surface of hydrothermal chimneys and can withstand long periods of hypo/anoxia and long phases of exposure to hydrogen sulphides. A. pompejana specifically inhabits chimney walls of hydrothermal vents on the East Pacific Rise. To survive, Alvinella has developed numerous adaptations at the physiological and molecular levels, such as an increase in the thermostability of proteins and protein complexes. It represents an outstanding model organism for studying adaptation to harsh physicochemical conditions and for isolating stable macromolecules resistant to high temperatures.

Results: We have constructed four full length enriched cDNA libraries to investigate the biology and evolution of this intriguing animal. Analysis of more than 75,000 high quality reads led to the identification of 15,858 transcripts and 9,221 putative protein sequences. Our annotation reveals a good coverage of most animal pathways and networks with a prevalence of transcripts involved in oxidative stress resistance, detoxification, anti-bacterial defence, and heat shock protection. Alvinella proteins seem to show a slow evolutionary rate and a higher similarity with proteins from Vertebrates compared to proteins from Arthropods or Nematodes. Their composition shows enrichment in positively charged amino acids that might contribute to their thermostability. The gene content of Alvinella reveals that an important pool of genes previously considered to be specific to Deuterostomes were in fact already present in the last common ancestor of the Bilaterian animals, but have been secondarily lost in model invertebrates. This pool is enriched in glycoproteins that play a key role in intercellular communication, hormonal regulation and immunity.

Conclusions: Our study starts to unravel the gene content and sequence evolution of a deep-sea annelid, revealing key features in eukaryote adaptation to extreme environmental conditions and highlighting the proximity of Annelids and Vertebrates.

* Correspondence: lecompte@igbmc.fr
1Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), CERBM F-67400 Illkirch, France; INSERM, U596, F-67400 Illkirch, France; CNRS, UMR7104, F-67400 Illkirch, France; Faculté des Sciences de la Vie, Université de Strasbourg, F-67000 Strasbourg, France
Full list of author information is available at the end of the article

© 2010 Gagnière et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Background

Annelids, commonly known as segmented worms, are typical triploblastic coelomate animals belonging to the Protostomes. Annelids, and especially polychaetous annelids, are important systems for understanding evolution and development in animals (for recent reviews, see [1,2]). Fossil records [3], as well as comparative morphology studies [4], suggested that the urbilaterian (the last common ancestor of bilaterally symmetric animals) may have resembled annelids. Although such an assumption is difficult to verify, it is widely accepted that polychaetes exhibit many ancestral traits in their body plan and embryonic development [5].

Despite this long history as evo-devo model organisms, polychaete annelids, and more generally Lophotrochozoan representatives, are still poorly represented in sequence databases. Sequencing projects have mainly focused on Deuterostomes (Chordates and Echinoderms) and Ecdysozoa, i.e. molting Protostomes including arthropods and nematodes [6]. The recent enlargement and diversification of the sequencing project panel has played a decisive role in obtaining a more realistic picture of animal evolution. For instance, the analysis of genomic loci in the marine polychaete Platynereis dumerilii has revealed the intron-rich nature of annelid genes [7]. More recently, the genome of a bilaterian sister group, the cnidarian sea anemone Nematostella vectensis, has proved to be more complex than expected, with a gene repertoire, exon-intron structure, and large-scale gene linkage more similar to vertebrates than to flies or nematodes [8].

Among polychaete annelids, Alvinella pompejana [9], the “Pompeii worm”, has attracted attention since it is currently considered to be one of the most thermotolerant eukaryotes on Earth, withstanding the largest known chemical and thermal ranges (from 5 to 105°C) [10-12]. This tube-dwelling worm forms dense colonies on the surface of hydrothermal chimneys and can withstand long periods of hypo/anoxia and long phases of exposure to hydrogen sulphides [13]. A. pompejana specifically inhabits chimney walls of hydrothermal vents on the East Pacific Rise [14,15]. It often co-occurs with Alvinella caudata, a very closely related species, and can be found in variable proportions according to the chemical conditions. The chimney walls are characterised by high flows of vent fluid, and therefore the highest temperatures for vent metazoans (temperatures usually range between 25 and 60°C, with exceptional bursts up to 105°C [10,11,13]), as well as high concentrations of potentially toxic compounds (e.g. H2S). The thermotolerance of alvinellid worms has been confirmed by laboratory observations of Paralvinella sulfincola thermotaxis [16]. To survive, Alvinella has developed numerous adaptations at the physiological and molecular levels, such as an increase in the thermostability of proteins and protein complexes [17-22]. As such, A. pompejana constitutes a precious source of thermostable proteins and macromolecular complexes of eukaryotic origin for the biochemical, biophysical or structural characterisation of proteins of fundamental or biomedical relevance. It has been selected as a model organism for structural studies by the Structural Proteomics IN Europe 2 (SPINE2) initiative. The pertinence of the model is confirmed by the recent study of the superstable superoxide dismutase recombinant protein [23]. The crystal structure at 0.99 Å resolution reveals anchoring interaction motifs in loops and termini, accounting for the enhanced stability of the A. pompejana protein compared to its human ortholog.

Here, we report the construction, sequencing and analysis of four full-length enriched cDNA libraries of A. pompejana. One of these libraries has been constructed from complete animals after removal of the dorsal tegument, which harbours an episymbiont community of Epsilonproteobacteria [24]. The other three libraries have been prepared from distinct tissues: the gills located at the anterior part of the animal and oriented towards the outside of the tube, the ventral tissues and the posterior region. The latter includes the pygidium and the subterminal growth zone of the animal where cell proliferation takes place. These three tissues, radically different with respect to their physiological roles, have been chosen in order to improve the transcriptome coverage. They also represent samples along the antero-posterior axis of A. pompejana, since it has been reported that the body of the animal experiences a temperature gradient, the posterior part of A. pompejana being exposed to higher temperatures [10,12].

Analysis of the cDNA sequences allowed the determination of 9,221 putative protein sequences that were annotated using a new integrative functional annotation pipeline. The observed abundance of transcripts related to oxidative stress resistance, detoxification, anti-bacterial defence and heat shock protection, as well as the compositional bias detected in the protein sequences, may contribute to the adaptation of the Pompeii worm to its challenging habitat. From an evolutionary perspective, our analysis reveals striking similarities between Annelids and Deuterostomes, both in terms of sequence conservation and gene repertoires, that raise interesting issues concerning the nature of the Bilaterian ancestor.

Results and discussion

cDNA libraries, EST assembly and cDNA characterization

Four non-normalized libraries enriched in full-length cDNA were constructed, with an average insert size of 2.5-3 kb. The first library was prepared using whole adult individuals while the others were constructed from dissected tissues: gills, ventral tissue, and pygidium. A total of 100,177 chromatograms were processed by our semi-automated assembly pipeline (see Methods). 76% of the initial raw sequences fulfil our stringent quality criteria, and have a mean length of 674 bp (Table 1). The percentage of sequences exhibiting a poly(A) tail ranges between 8 and 12% in the different libraries, except in the ventral tissue library in which 42% of the sequences contain a poly(A) tail. The sequences were submitted to the EST section of the EMBL database under accession numbers FP489021 to FP539727 and FP539730 to FP565142.

Table 1 Summary of A. pompejana cDNA libraries and assemblies

                          Whole animal   Pygidium        Ventral tissue   Gills           Global assembly
Library method            CloneMiner     Oligo-capping   Oligo-capping    Oligo-capping
Initial chromatograms     20,549         36,648          16,411           26,569          100,177
Clean sequences           19,739 (96%)   25,419 (69%)    12,871 (78%)     18,105 (68%)    76,134 (76%)
3' poly(A) (%)            1,599 (8%)     3,156 (12%)     5,465 (43%)      1,467 (8%)      11,687 (15%)
Mean length (bp)          633            610             720              776             674
Unigenes after assembly   5,425          8,682           2,831            3,760           15,858
Contigs                   1,365          2,327           917              1,193           4,993
Singletons                4,060          6,355           1,914            2,567           10,865
Contig mean length (bp)   993            951             852              931             1,017

The assembly of the 76,134 selected sequences (12.4 Mb) was performed by CAP3 using stringent parameters to avoid misassembly problems such as creation of chimeric contigs or combination of paralogs. We performed a global assembly of all sequences as well as a separate assembly for each library (Table 1). The global assembly yielded 4,993 contigs and 10,865 singlets (15,858 unigene sequences). Contig size ranges from 2 to 7,845 reads with a mean of 13 and a median of 3 (see Additional file 1, Figure S1). The average length of contigs is 1,017 bp. Taking into account our conservative approach, each contig and singlet of this assembly ideally represents a unique version of an expressed gene, i.e. paralogs, divergent alleles or splicing variants should not coalesce into the same contig.
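The assembly counts lend themselves to a quick arithmetic cross-check: in every assembly, unigenes should equal contigs plus singletons, and the clean-sequence percentage is clean reads divided by initial chromatograms. The following is a minimal sketch that hard-codes the Table 1 values; the dictionary layout and the function names are illustrative assumptions, not part of the assembly pipeline described here.

```python
# Values copied from Table 1. The tuple layout and names are hypothetical.
# library: (initial chromatograms, clean sequences, contigs, singletons, unigenes)
TABLE1 = {
    "whole_animal": (20549, 19739, 1365, 4060, 5425),
    "pygidium": (36648, 25419, 2327, 6355, 8682),
    "ventral_tissue": (16411, 12871, 917, 1914, 2831),
    "gills": (26569, 18105, 1193, 2567, 3760),
    "global": (100177, 76134, 4993, 10865, 15858),
}


def unigenes_consistent(row):
    """True when contigs + singletons equals the reported unigene count."""
    _, _, contigs, singletons, unigenes = row
    return contigs + singletons == unigenes


def clean_fraction(row):
    """Fraction of initial chromatograms that passed quality filtering."""
    initial, clean = row[0], row[1]
    return clean / initial


for name, row in TABLE1.items():
    print(f"{name}: consistent={unigenes_consistent(row)}, "
          f"clean={clean_fraction(row):.0%}")
```

Note that the global unigene count is not the sum of the per-library counts, since the global assembly merges reads from all four libraries.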

The 15,858 unique sequences from the global assembly were analysed using two independent methods (ESTScan and a similarity-based approach) to determine the boundaries of the CoDing Sequences (CDS) and UnTranslated Regions (UTR). A total of 9,221 coding regions was obtained, including 2,932 complete CDS. The complete CDS lengths range from 90 nt to 2,694 nt with a mean of 540 nt (see the distribution of the corresponding protein lengths in Additional file 2, Figure S2). The mean GC content in the CDS is 46.2%. As previously observed in eukaryotic mRNAs [25-27], the mean GC content in A. pompejana is higher for the 5'UTR (45.7%) than for the 3'UTR (39.7% without poly(A) tail) and is comparable to the value previously observed in Annelids (43.7% and 34.1% respectively, data compiled from UTRdb [28]).
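For readers who want to reproduce this kind of composition statistic, the sketch below computes the overall GC fraction of a sequence and the GC fraction at third codon positions (GC3) of an in-frame CDS. It is a minimal illustration under the assumption that the CDS starts in frame; the function names are hypothetical and this is not the authors' code.

```python
def gc_content(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_content(cds):
    """GC fraction at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)


example = "ATGGCCAAA"  # codons ATG, GCC, AAA
print(gc_content(example), gc3_content(example))
```

Ambiguous bases such as N are skipped rather than counted as non-GC, which is one of several reasonable conventions.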

A. pompejana genes undergo a neutral evolutionary process?

To investigate the evolutionary model driving the GC content in A. pompejana, an in-depth GC analysis was performed on the 84 almost complete mRNAs coding for ribosomal proteins, including 15 genes expressed at low levels associated with the mitochondrial ribosome, and a set of genes with mid-to-high expression associated with the nuclear ribosome. The analysis shows that the GC3 content of CDS, i.e. the GC content at third codon positions (0.481 ± 0.075), was significantly higher than the overall GC content of CDS (0.422 ± 0.029) and both UTR regions (0.389 ± 0.051; pairwise t-test, p < 0.0001). Unexpectedly, the GC3 was found to be constant, regardless of the length of the coding regions (F = 0.92, p-value = 0.341), even though a significant positive relationship exists between the cDNA length and its level of transcription (F = 18.18, p-value = 5 × 10^-5) when estimated from the number of gene repeats in cDNA libraries. This number ranges from one (mt S proteins) to 223 copies (P0). Another striking result was the non-linear evolution of the GC3 content of ribosomal protein transcripts with the level of gene expression. Both GC3 (CDS) and GC (UTRs) contents rapidly increased with the number of copies until they reached a plateau at a threshold value of ca. 25-30 copies (see Additional file 3, Figure S3). The GC3 and GC content asymptotic values of CDS and UTR were close to 0.55 and 0.40, respectively. The correlation was significant (coding regions: F = 9.41, p-value = 0.003), although it was not significant when the mt ribosomal-protein genes were removed from the dataset. Analysis of codon usage in the ribosomal set revealed that eight of the most frequent codons are terminated by C or G (Phe, Leu, Tyr, His, Gln, Asn, Lys and Glu), seven by A or T (Ser, Pro, Thr, Asp, Cys, Arg, Gly) and three by two equally-frequent codons (Ile, Val and Ala). The search for optimal (Fop) codons using the factorial multivariate analysis tool in the CodonW software indicated that 7 AT-ended codons (TTA, CTA, ATA, TTT, AGT, GAA and TCA) were clearly associated with low-expressed mitochondrial ribosomal genes (first axis, inertia = 9.95% of the whole variance: data not shown). Five GC-ended codons (TAC, GCG, ATC, AAG, GCC) and CGT (Arg) were positioned at the other extremity of the first axis but without a clear relationship with the level of expression.
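The tally of most frequent codons per amino acid, and whether each ends in G/C or A/T, can be sketched as follows. This is an illustrative simplification using the standard genetic code; it does not reproduce the CodonW correspondence analysis, the function names are hypothetical, and ties between equally frequent codons are broken arbitrarily by first occurrence.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def most_frequent_codons(cds_list):
    """Map each amino acid to its most frequent codon across in-frame CDS."""
    counts = defaultdict(Counter)
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            aa = CODON_TABLE.get(codon)  # None for codons with ambiguous bases
            if aa and aa != "*":
                counts[aa][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def ending_class(codon):
    """Classify a codon by its third base: 'GC' or 'AT'."""
    return "GC" if codon[-1] in "GC" else "AT"


# Toy input for illustration only (not A. pompejana data).
preferred = most_frequent_codons(["TTCTTCTTTAAG"])
print({aa: (codon, ending_class(codon)) for aa, codon in preferred.items()})
```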

In general, synonymous codon usage biases are explained by two alternative but non-exclusive models: a neutral mutational-bias model and a selective model [29]. The expectation of the mutational-bias model corresponds to a positive relationship between the base composition of synonymous sites and their neighbouring silent sites (i.e. UTR and/or introns). In agreement with this model, we found a positive correlation between the GC3 and the GC(UTR), suggesting that both GC classes are evolving in the same way. The selective model postulates a co-evolution between synonymous codon usage and the abundance of tRNA to optimize the translation efficiency (notion of ‘optimal’ codons). According to Eyre-Walker [30], selection maximizes the speed of the translation and minimizes the costs of proofreading, resulting in a codon usage correlated with the expression level and the mRNA length. Such correlations have been observed in Drosophila and Caenorhabditis [31] but not in Vertebrates [29]. In A. pompejana, there is no correlation between the GC3 and the level of gene expression within the set of nuclear ribosomal genes used. Even if differences were observed when considering the level of ribosomal gene expression, this was mainly due to the presence of two distinct sets of ribosomal protein genes (i.e. mitochondrial ribosome versus nuclear ribosome). Such a result also held for the Fop codons since no clear gradient of expression was found among the nuclear ribosomal genes along the first axis of the COA although the preferred AT-ended codons were clearly associated with the mitochondrial ribosomal genes. Additionally, no correlation was found between GC3 and cDNA length. Thus, there is no evidence for a selective process acting on silent sites, although extended analyses on GC content bias over the genome and the whole transcriptome [32] are clearly necessary to validate the predominance of the neutral model in A. pompejana.
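The mutational-bias expectation, a positive relationship between GC3 and the GC content of neighbouring silent sites, can be tested with a simple correlation coefficient. The sketch below implements Pearson's r in plain Python; the input values are hypothetical placeholders chosen for illustration, not measurements from this study, and the text does not state which correlation statistic was actually used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-transcript values, for illustration only.
gc3_values = [0.40, 0.45, 0.50, 0.55]
utr_gc_values = [0.33, 0.36, 0.38, 0.41]
print(f"r = {pearson_r(gc3_values, utr_gc_values):.3f}")
```

A positive r would be in line with the neutral model; a significance test would still be needed before drawing conclusions from real data.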

Integrative functional annotation

The 7,353 protein sequences from the global assembly that exhibit homology in the protein databases were analysed using a new integrative functional annotation pipeline (Gagnière et al., manuscript in preparation). The originality of this pipeline lies in the exploitation of the evolutionary context of the protein sequences based on a clustered Multiple Alignment of Complete Sequences (MACS) [33]. The sequences are thus analysed in the framework of the overall family and subfamily, allowing a reliable propagation of sequence annotations in conserved regions. Thanks to this novel pipeline, 6,252 (85.0%) of the 7,353 protein sequences with homology were annotated with either a text mining definition, an EC number, Pfam-A domains or a Gene Ontology term (Table 2).

Table 2 Overview of the annotation results

Level 1 annotations          Number of proteins (%)
Proteins with homologs       7,353 (100%)
Pfam-A domains               4,767 (65%)
Gene Ontology                5,949 (81%)
  GO Biological Process      5,072 (69%)
  GO Cellular Component      4,530 (62%)
  GO Molecular Function      5,601 (76%)
Text mining definition       4,611 (63%)
Enzyme classification        1,243 (17%)
  EC level 4 (X.X.X.X)       1,180 (16%)
Annotated proteins           6,252 (85%)

Level 2 annotations          Number of networks
KEGG pathways                345
Mapped pathways              202
Coverage > 50%               82
STRING subnetworks           385
Mapped subnetworks           264
Coverage > 50%               63

This primary annotation was used to map the A. pompejana query proteins to the networks of the KEGG and EMBL STRING databases. The reconstructed metabolic pathways and interaction networks constitute an integrative second level annotation which is essential to the study of the biological processes at work in A. pompejana. Among the 345 metabolic reference pathways of the KEGG database, 202 were mapped and 82 of these (40.6% of the mapped pathways) have been populated by A. pompejana proteins at a level greater than 50% (in terms of number of distinct enzymes). Since the reference pathways are highly redundant (several enzymes can catalyse the same reaction), the A. pompejana cDNAs appear to provide a good coverage of a large panel of metabolic pathways. Common pathways such as glycolysis, gluconeogenesis, citrate cycle, purine and pyrimidine metabolism are functionally complete or almost complete. More specific pathways are also well represented, such as androgen and estrogen metabolism or steroid biosynthesis, as illustrated in Additional file 4, Figure S4. 3,582 A. pompejana proteins (48.1%) have also been mapped to human networks in the STRING database, which were first cut into smaller sub-networks (see Methods). 68.8% of these human sub-networks were populated by at least one A. pompejana ortholog.
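The coverage criterion used here, the share of a pathway's distinct enzymes that are matched by at least one annotated protein, can be expressed in a few lines. The sketch below is an assumption-laden illustration: the pathway names and their EC memberships are toy placeholders rather than actual KEGG content, and the function names are hypothetical.

```python
def pathway_coverage(reference_ecs, annotated_ecs):
    """Fraction of a pathway's distinct EC numbers found in the annotation."""
    reference = set(reference_ecs)
    if not reference:
        return 0.0
    return len(reference & set(annotated_ecs)) / len(reference)


def well_covered(pathways, annotated_ecs, threshold=0.5):
    """Names of pathways whose coverage is strictly greater than threshold."""
    return [name for name, ecs in pathways.items()
            if pathway_coverage(ecs, annotated_ecs) > threshold]


# Toy pathways and annotations, for illustration only.
toy_pathways = {
    "toy_pathway_A": ["1.1.1.1", "2.7.1.1", "4.1.2.13"],
    "toy_pathway_B": ["1.9.3.1", "1.11.1.9", "1.15.1.1", "3.1.3.11"],
}
annotated = {"1.1.1.1", "2.7.1.1", "1.9.3.1"}
print(well_covered(toy_pathways, annotated))

# The 82 pathways above the threshold, out of 202 mapped pathways (Table 2):
print(f"{82 / 202:.1%}")
```

Counting distinct EC numbers rather than proteins avoids inflating coverage when several transcripts encode the same enzyme.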


In the absence of a complete annotated genome of a close relative of A. pompejana and considering the diversity of genome size reported in polychaete annelids (from 58 Mb to 7 Gb [34], A. pompejana being intermediate with 675 Mb [35] for 32 chromosome pairs [36]), we cannot estimate the coverage of our cDNA libraries. However, the mapping of A. pompejana proteins on KEGG metabolic pathways and on STRING networks emphasizes the broad coverage of our cDNA databases and provides an integrative framework for the interpretation of A. pompejana’s proteome features.

A. pompejana database and website

The sequences and annotation features are stored in a relational database [37] that maintains fine-grained information about (i) the library origin of each clone, (ii) the nucleic sequences and their phred quality values, (iii) the different assemblies and their associated parameters, (iv) the predicted protein sequences and (v) all the results of the annotation process.
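The five kinds of information listed above map naturally onto a handful of related tables. The sketch below shows one possible layout using an in-memory SQLite database; the actual schema is not described in the text, so every table name, column name and the choice of SQLite are hypothetical assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical schema: one table per kind of information listed in the text.
SCHEMA = """
CREATE TABLE library (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE clone (
    id INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(id),
    name TEXT NOT NULL);
CREATE TABLE est_read (
    id INTEGER PRIMARY KEY,
    clone_id INTEGER NOT NULL REFERENCES clone(id),
    sequence TEXT NOT NULL,
    phred TEXT NOT NULL);          -- space-separated quality values
CREATE TABLE assembly (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL,
    parameters TEXT);
CREATE TABLE unigene (
    id INTEGER PRIMARY KEY,
    assembly_id INTEGER NOT NULL REFERENCES assembly(id),
    accession TEXT NOT NULL,
    protein TEXT);                 -- predicted protein, if any
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    unigene_id INTEGER NOT NULL REFERENCES unigene(id),
    source TEXT NOT NULL,          -- e.g. 'Pfam-A', 'GO', 'EC'
    value TEXT NOT NULL);
"""


def build_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = build_db()
conn.execute("INSERT INTO library (name) VALUES (?)", ("gills",))
print(conn.execute("SELECT COUNT(*) FROM library").fetchone()[0])
```

Keeping annotations in a single source/value table makes it straightforward to add new annotation types without changing the schema, at the cost of weaker typing.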

The data are accessible via a user-friendly web interface. To allow intuitive querying of the database and convenient data visualisation, we developed dedicated tools, ranging from trace visualization to the interactive display of the alignment of A. pompejana protein sequences and their families (Figure 1). Two modules are available for querying the database: a homology search module using NCBI BLAST and a module for performing full-text searches of all annotations. Subsets of relevant data can then be selected from the search results for display purposes. Different views are available: (1) nucleic cDNA sequence, (2) EST trace and six-frame translation, (3) contig schematic representation, (4) Consed-like [38] contig alignment view, (5) MACSIMS annotated protein alignment with customisable features display, (6) integrative view of the annotation process (text mining definition, EC number, Gene Ontology, Pfam-A domains, KEGG pathway mapping, EMBL STRING mapping).
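As an illustration of what a six-frame translation view computes, the sketch below translates a nucleotide sequence in the three forward frames and the three frames of its reverse complement, using the standard genetic code. It is a minimal example with hypothetical function names and is not the website's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return translations keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```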

Figure 1 Screenshots of the Alvinella website illustrating some of the visualisation tools.

Adaptation to hypoxia, oxidative stress and heavy metals

As an endemic species of the hydrothermal vent ecosystem colonizing the chimney walls, A. pompejana has to deal with very variable conditions that are the result of a chaotic mixing of vent fluid (350°C, anoxic, CO2- and sulphide-rich) and deep-sea water (2°C, mildly hypoxic). In this environment, oxygen and CO2 concentrations, pH and sulphide levels vary quickly and over a wide range. These challenging environmental conditions seem to be reflected in the highly expressed gene pool detected using two complementary indicators: the size of the cDNA clusters (in terms of the number of reads) (Table 3) and the abundance of transcripts corresponding to a PFAM-A domain (Figure 2). Domains involved in protein-protein or protein-DNA/RNA interactions are particularly abundant (calcium-binding domain EF-hand, WD-40 repeat, RNA recognition motif RRM_1, ankyrin...), as frequently observed in studies of eukaryotic transcriptomes [39-41]. The highly expressed genes also include genes encoding extracellular structural proteins, such as collagen, as well as cytoskeleton proteins (actin, myosin, tropomyosin, calponin, tubulin, troponin). More interestingly, the most abundant transcripts encompass genes clearly linked to oxygen homeostasis, oxidative stress resistance, detoxification and, to a lesser extent, antibacterial defense and heat shock protection. Considering the stress of depressurization, dramatic temperature decrease, and the general trauma endured by animals during sampling, the transcript abundance should be interpreted cautiously. A more detailed analysis using species-specific oligo microarrays on a set of specimens from an isobaric collection would be required to safely deduce the precise role of a given gene in the environmental response of the worm. Nevertheless, highly expressed genes related to the environment provide interesting clues about general aspects of the worm's adaptation.

Table 3 Highly expressed genes in A. pompejana libraries

Access      Function                                                     Reads*
TERA04282   Cytochrome c oxidase subunit 1 (EC 1.9.3.1)                  7845
TERA02741   Hypothetical protein                                         2029
TERA02189   Hypothetical protein                                         917
TERA02142   Hypothetical protein                                         879
TERA03177   Actin                                                        533
TERA00344   Extracellular globin (Haemoglobin A2 chain precursor)        524
TERA02067   Hypothetical protein                                         424
TERA00650   Hypothetical protein                                         422
TERA03305   Intracellular haemoglobin                                    386
TERA00833   Extracellular haemoglobin (Haemoglobin B2 chain precursor)   370
TERA00205   Extracellular haemoglobin linker L1                          349
TERA03100   Cytochrome c oxidase subunit 5A (EC 1.9.3.1)                 344
TERA00354   Cytochrome b                                                 338
TERA04769   Hypothetical protein                                         335
TERA00845   Extracellular haemoglobin (Haemoglobin B1 chain precursor)   322
TERA02090   Hypothetical protein                                         304
TERA01907   Small heat shock protein (sHSP)                              295
TERA02261   Myosin essential light chain                                 285
TERA01929   Hypothetical protein                                         275
TERA00421   Extracellular haemoglobin linker L3                          267
TERA03231   Glutathione peroxidase                                       264
TERA01189   Hypothetical protein                                         263
TERA02903   Hypothetical protein                                         258
TERA01828   Tropomyosin                                                  232
TERA00984   Hypothetical protein                                         230
TERA02160   Elongation factor 1-alpha                                    218
TERA01465   Peptidoglycan recognition protein                            200

*The last column indicates the number of reads in the global assembly.

Among these genes, the most important fraction corresponds to proteins from the respiratory chain and three main types of hemoglobins (Hb) reported in Alvinellidae, namely a non-circulating cytoplasmic globin, the extracellular giant annelid hexagonal bilayer HBL-Hb of the vascular system and the circulating intracellular Hb found in the coelomic fluid (for a review, see [42]). The abundance of Hbs in A. pompejana and their high oxygen affinities [43] may be determinant in the respiratory adaptation to hypoxic/anoxic environments. Interestingly, the set of highly expressed genes includes several “hypothetical proteins” (Table 3). They exhibit sequence segments with a biased residue composition and have no significant similarity with known proteins, except in some cases for low complexity regions. One of these specific proteins, TERA02189, belongs to a family of proteins that is highly conserved in A. pompejana. A comparative proteomics study [44] revealed that three members (TERA02082, TERA02935 and TERA08242) of this family are differentially expressed depending on the oxygen concentration. These oxygen-responsive genes represent potential candidates that may contribute to oxygen homeostasis.

Despite the hypoxia encountered in its environment, the Pompeii worm can be subject to exogenous oxidative stress [45]. High levels of ferrous iron and sulphide have been reported to favour the formation of reactive sulphide species (RSS), analogs of ROS. This is in agreement with the high level of expression observed for the major antioxidative enzymes in A. pompejana: Mn and Cu/Zn superoxide dismutases, peroxiredoxins, glutathione peroxidases (GPX) and thioredoxin. Interestingly, no catalase cDNA was detected in our set. Although we cannot exclude the existence of a catalase gene (possibly expressed at a low level), the absence of cDNA in our sample suggests that H2O2 formation might not be the most common mechanism of detoxification. An earlier study [45] also questioned the presence of catalase in Paralvinella grasslei, a close relative of A. pompejana. The authors suggested that alternative H2O2 scavengers, such as antioxidant osmolytes or other enzymes, might replace the catalase activity. Indeed, taking into account the diversity and level of expression of glutathione peroxidases and peroxiredoxins in A. pompejana, SOD-derived H2O2 could be degraded by peroxidases rather than by catalases, as suggested by Dixon and colleagues [46].


Figure 2. The 30 most frequent PFAM-A domains identified in A. pompejana protein sequences. The Y-axis indicates the total number of reads encoding proteins with this domain. We used a logarithmic scale for representation constraints. Bars are coloured according to functional role categories: respiration, oxidative stress and/or detoxification in blue, protein-protein or protein-DNA/RNA interactions in green, structural proteins in red, anti-bacterial defence in orange and others in grey.

In addition to hypoxia and oxidative stress, A. pompejana has to deal with large amounts of heavy metals. Invertebrates possess a variety of cellular detoxification pathways that reduce the concentrations of potentially toxic metals circulating in the blood (reviewed in [47]). These pathways include metal binding by cysteine-rich proteins known as metallothioneins, followed by their elimination through the lysosomal endomembrane system. We detected a single EST coding for a metallothionein-like protein in our library, suggesting that the involvement of metallothioneins is not the major detoxification process. However, we cannot exclude the possibility that the pool of highly expressed genes of unknown function contains genes coding for new metallothionein-like proteins, since the pool of unknown genes appears to be enriched in cysteine-rich proteins. Another alternative for heavy metal detoxification is intracellular sequestration in specific vacuoles producing solid granules [47]. This would be in agreement with the presence of arsenic, zinc and copper detected in A. pompejana epidermal cells [48] and the production of a large amount of iron-containing granules by A. pompejana mucocytes [49]. Finally, rhodanese also appears to be preponderant among the highly expressed genes. Rhodanese can perform a variety of roles (reviewed in [50]), including the modulation of general detoxification processes and the maintenance of redox homeostasis.

Thermo-adaptive features in amino-acid composition
As one of the most thermotolerant eukaryotes known to date, the Pompeii worm clearly provides a unique model for the study of adaptation to high temperature in this domain of life. Its thermal regime generally fluctuates between 25 and 60°C, with exceptional bursts up to 105°C [11,13]. These high and variable temperatures require adaptations at the physiological and molecular levels, even though we are far from the optimal temperature range reported in hyperthermophilic prokaryotes. At the molecular level, several studies have revealed the higher thermostability of A. pompejana proteins and complexes compared to their orthologs from other eukaryotes [17-23]. We thus investigated the composition of A. pompejana proteins and found a biased composition compared to their orthologs from 5 major Metazoa lineages (Vertebrates, Arthropods, Nematodes, Platyhelminthes, Cnidaria) (Additional file 5, Figure S5). The amino acid composition differed significantly among taxa (homogeneity statistic: χ² = 0.01153; G = 0.01153).


A. pompejana exhibits the highest proportion of charged amino acids (nearly 25.5%), a characteristic also shared by the cnidarian N. vectensis (25.4%). This is mainly due to an increase of the positively-charged amino acids lysine and arginine (12.6%). This excess of charged residues may enhance protein stability in thermophilic eukaryotes, notably by increasing salt-bridges, if they can interact with negatively-charged residues. This would be in keeping with the structural analysis of the superoxide dismutase of A. pompejana [23], suggesting that extra salt-bridged interactions may be involved in the superstability of this protein.
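As a rough illustration of the kind of proportion quoted above, the fraction of charged residues in a set of protein sequences can be obtained by simple counting. The sketch below is not the PAML-based procedure described in the Methods; it assumes that lysine and arginine are the positively-charged residues (as stated above), that aspartate and glutamate are the negatively-charged ones, and that histidine is not counted. The function name `charged_fractions` is hypothetical.

```python
from collections import Counter

# Assumption: K/R are counted as positive and D/E as negative; histidine is ignored.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charged_fractions(sequences):
    """Return the fractions of positive, negative and all charged residues."""
    counts = Counter()
    for seq in sequences:
        # Only letters are counted, so gap characters such as '-' are skipped.
        counts.update(c for c in seq.upper() if c.isalpha())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues to count")
    positive = sum(counts[a] for a in POSITIVE) / total
    negative = sum(counts[a] for a in NEGATIVE) / total
    return {"positive": positive, "negative": negative, "charged": positive + negative}


if __name__ == "__main__":
    print(charged_fractions(["KRDEAAAA"]))
```

Applied to the concatenated orthologous positions of each taxon, such counts would give per-taxon proportions comparable in spirit to those reported here, although the exact values depend on the residue classes chosen.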

In prokaryotes, thermoadaptive molecular features appear to be multiple and variable [51-56]. Berezovsky and Shakhnovich [55] suggested two distinct evolutionary strategies to reconcile these conflicting observations. In prokaryotes with an ancestral thermophilic character (e.g. Archaea such as Pyrococcus), proteins may be significantly more compact and more hydrophobic than their mesophilic counterparts. Conversely, organisms that recently colonized a hot environment, such as the bacterium Thermotoga maritima, may have evolved under a more “sequence-based” mechanism of thermostability. In this latter case, a few charged amino acid replacements or amino acid deletions increased the occurrence of hydrogen bonds and inter-subunit electrostatic interactions, or decreased the length of surface loops, respectively. The enrichment in charged residues detected in A. pompejana suggests that the sequence-based mechanism of thermostability reported to hold in Thermotoga may also apply to A. pompejana. However, the molecular basis of eukaryotic thermotolerance is probably more complex than that of the bacterial/archaeal domains, and additional studies are needed to decipher the molecular basis of thermostability in A. pompejana. These may include massive structural comparisons between A. pompejana proteins and mesophilic orthologs, as well as in-depth comparisons of amino acid compositions between close relatives of A. pompejana, since thermoadaptive features are often masked by a background of evolutionary sequence divergence.

Proximity between Annelids and Vertebrates
Contradictory results have been obtained in whole-genome based phylogenetic studies that favour either the classical “Coelomata hypothesis” positioning Nematodes and Platyhelminthes as early branching clades [57,58] or the new animal phylogeny dividing Protostomes into Ecdysozoa (including Arthropods and Nematodes) and Lophotrochozoa (including Annelids, Molluscs and Platyhelminthes) [8]. To address the evolutionary relationships between A. pompejana and other animals, we first investigated the phylogenetic position of A. pompejana using a pool of 76 ribosomal proteins. A phylogenetic tree reconstruction was performed on the concatenation of the corresponding multiple alignments using MrBayes (Figure 3). The obtained topology, together with a maximum parsimony analysis (data not shown), supports the phylogenetic tree proposed by Aguinaldo et al., with the exception of Schistosoma japonicum (Platyhelminthes) that clustered within the Ecdysozoa represented by Arthropods and Nematodes, rather than Lophotrochozoa (represented by Molluscs and Annelids).

Compared to other Protostomes used in this study, the annelids A. pompejana and Lumbricus rubellus and the mollusc Argopecten irradians show relatively small branch lengths. Thus, the slow evolutionary rate previously observed in diverse species of the polychaete lineage [6,7] seems also to hold in A. pompejana, despite its challenging habitat. In contrast, fast evolutionary rates are observed in the parasitic worm Schistosoma and in Caenorhabditis, which might lead to a long-branch attraction artefact between these two lineages in the present analysis and partly explain the contradictory results obtained in whole-genome based phylogenetic studies. Differences in the rate of evolution are reflected in the mean percent identity between orthologs observed in a pool of 556 unambiguous ortholog families (86,727 positions without gaps) conserved in 6 major Metazoa lineages (Table 4). A. pompejana, Homo sapiens and N. vectensis exhibit high sequence conservation relative to S. japonicum or C. elegans, while D. melanogaster appears intermediate. Thus, the proximity between annelid and Vertebrate sequences previously reported for proteins from the polychaete Platynereis dumerilii [7] is now observable at a larger scale and can be partly extended to Cnidaria.
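The pairwise values discussed above can be illustrated with a minimal sketch of a percent-identity calculation restricted to ungapped alignment columns. It assumes pairwise rows extracted from a multiple alignment, with '-' as the gap character, and it averages per-family values; whether the published means were computed per family or over the concatenated positions is not stated, so this averaging choice is an assumption. Both function names are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns are ignored
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def mean_percent_identity(aligned_pairs):
    """Mean of per-family percent identities for one pair of species (assumed scheme)."""
    values = [percent_identity(a, b) for a, b in aligned_pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    pairs = [("MKT-A", "MRTQA"), ("AAAA", "AAAA")]
    print(mean_percent_identity(pairs))
```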

Differential gene losses
The proximity between Alvinella, Vertebrates and Cnidaria discussed above is also observable in the gene repertoire of the Pompeii worm. Only 135 protein families present in A. pompejana are specific to Protostomes. Among them, only 13 are conserved in all the Arthropod, Nematode and Platyhelminth representative species, illustrating the prevalence of differential gene losses in Protostomes. Moreover, only 5 proteins in our dataset are specific to Annelids and Platyhelminthes. As we only have access to the proteome of a parasitic Platyhelminth, this apparent split may actually be the consequence of massive gene losses in Schistosoma linked to its parasitic lifestyle. This illustrates the urgent need for the sequencing of complete genomes from free-living species of Platyhelminthes, and more generally from diverse representatives of Lophotrochozoa, in order to identify the relationships and synapomorphies unifying the Spiralia (Annelids and Molluscs) and Platyhelminthes within Lophotrochozoa. Considering the high


Figure 3. Bayesian phylogeny of Metazoa. The analysis was performed on a concatenation of 76 ribosomal protein family alignments, with Saccharomyces cerevisiae as the outgroup. The scale bar indicates the expected number of amino acid substitutions per aligned position. All nodes were resolved in 100% of the sampled topologies from the Bayesian analysis, except the node indicated by a square (support value of 96%). Taxa shown, with their group labels: Homo sapiens, Mus musculus, Gallus gallus, Xenopus laevis and Brachydanio rerio (Vertebrates); Ciona intestinalis (Urochordates); Alvinella pompejana and Lumbricus rubellus (Annelids); Argopecten irradians (Molluscs); Drosophila melanogaster and Anopheles gambiae (Arthropods); Caenorhabditis briggsae and Caenorhabditis elegans (Nematodes); Schistosoma japonicum (Platyhelminthes); Nematostella vectensis (Cnidaria); Monosiga brevicollis (Choanoflagellates); Saccharomyces cerevisiae (outgroup).

proportion of A. pompejana “specific” genes (20%), these genomes would be especially valuable for the discrimination of genes that are truly specific to Alvinellidae (and possibly linked to environmental adaptation) and those that are in fact shared by other lineages of Lophotrochozoa.

In contrast to the small sets of proteins specific to Protostomes, 203 A. pompejana proteins belong to

Table 4. Mean percent identities between orthologous protein sequences of Metazoa

Species                         Ap      Hs      Dm      Ce      Sj      Nv
Alvinella pompejana (Ap)        100.0
Homo sapiens (Hs)               65.9    100.0
Drosophila melanogaster (Dm)    62.7    61.7    100.0
Caenorhabditis elegans (Ce)     56.3    55.3    55.1    100.0
Schistosoma japonicum (Sj)      57.7    56.0    54.8    51.0    100.0
Nematostella vectensis (Nv)     65.1    64.5    60.8    54.9    55.3    100.0

families or superfamilies specific to Deuterostomes. This Deuterostomia-set is significantly enriched in glycoproteins (34 proteins out of 203) that play a key role in many biological processes, in particular intercellular communication and adhesion, hormonal regulation, or immunity. For instance, the protein TERA08399 belongs to the superfamily of secreted cysteine-rich factors and its N-terminal domain sequence exhibits the idiosyncratic features of the IGFBP (Insulin-Like Growth Factor Binding Protein) family reported to be vertebrate-specific [59]. Another noteworthy result is the enrichment in proteins containing an epidermal growth factor (EGF)-like domain that is frequently found in the extracellular part of membrane-bound proteins or in proteins known to be secreted. In addition, the Deuterostomia-set is enriched in proteins involved in the I-kappaB kinase/NF-kappaB cascade or in death-domain containing proteins that can be involved in the regulation of apoptosis and inflammation or linked to innate


immunity. This includes close homologs of the CRADD and DEDD/DEDD2 protein families (TERA03000 and TERA04373, respectively) that play a role in the stress-induced apoptosis signalling pathway and are important mediators for death receptors [60,61]. If we exclude the possibility of horizontal gene transfer, these genes encoding important functions previously considered as specific novelties of Deuterostomes were in fact already present in the Bilaterian ancestor and were subsequently lost in Ecdysozoa and Platyhelminth model species or diverged beyond recognition in the representative species of Ecdysozoa and Platyhelminthes used in our study.

In addition to this Deuterostomia-set, 147 A. pompejana protein families are specifically present in both Deuterostomes and Cnidaria, while 32 are specifically found in both Cnidaria and at least one Protostome. These 147 families present in the last common ancestor of the Eumetazoa may also have been lost in the Ecdysozoa and Platyhelminth representatives. Interestingly, this set exhibits an enrichment in the selenium binding function. Notably, it includes the ortholog of selenoprotein N, involved in the regulation of oxidative stress and calcium homeostasis [62]. Differential gene losses in Ecdysozoa are also observed for more ancestral genes. For instance, the Pompeii worm possesses orthologs of the components of the phagocytic NADPH oxidase (Nox) (gp91phox and p22phox) and of some of its regulatory proteins (p47phox, p67phox) that play a critical role in innate immunity of Deuterostomes. p47phox and p22phox genes are present in the cnidarian N. vectensis and the unicellular choanoflagellate M. brevicollis, but are absent in several lineages of ecdysozoans including Drosophila and Caenorhabditis.

Indeed, with the multiplication of genome and EST sequencing projects in invertebrates, many “vertebrate novelties” have been shown to be present in Cnidaria and/or Placozoa, but lost in the canonical model Protostomes, i.e. D. melanogaster and C. elegans [8,63-66]. There is now an increasing body of evidence supporting the prominent role of lineage-specific losses in animal evolution (for a review, see [1]), especially in Protostomes. The present analysis suggests that massive losses are not a shared trait of the Protostomes, since genes involved in major metazoan functions are retained in A. pompejana, a Lophotrochozoan representative. However, we cannot exclude that losses exist in Annelids and/or Molluscs. For instance, no enzyme involved in urea excretion has been identified in the Alvinella database, as expected from previous studies reporting an incomplete or non-functional urea cycle in a number of annelid species (loss of the citrulline-arginine segment, see [67]).

Genes are not the only genome features differentially lost in the course of Metazoan evolution, as suggested by a study of genomic regions of the Polychaete Platynereis dumerilii that revealed intron-rich genes in Annelids [7]. According to the authors' estimates, two-thirds of human introns would have been present in the bilaterian ancestor and retained in Annelids, while lost in the insect and Nematode genomes. The hypothesis of an intron-rich Bilaterian ancestor (discussed in [68]) has been extended to the ancestor of Metazoa through the examination of the exon-intron structure of Nematostella and Trichoplax genes [8,66]. Thus, the emerging picture of evolution is one of a complex ancestor of Metazoa, with a gene toolkit and a gene structure closer to those of extant Vertebrates and Annelids than to model Ecdysozoa. This contradicts the intuitive view of a linear evolution, from simple ancestral networks to more complex ones in Vertebrates, although it is in line with several studies suggesting a reductive evolution from a complex community of ancestors as a general trend in the evolution of life (see [69] and references therein).

Conclusions
The construction and sequencing of four non-normalized cDNA libraries from A. pompejana, one of the most thermotolerant eukaryotes known to date, resulted in 15,858 unique cDNA sequences and 9,221 annotated protein sequences. As indicated by the pathways and interaction networks mapped, our cDNA libraries provided a good coverage of the A. pompejana gene repertoire. Our analysis revealed that, apart from house-keeping genes, the most abundant annotated transcripts were directly related to adaptation to the challenging physico-chemical conditions encountered by A. pompejana, in particular hypoxia, oxidative stress and heavy metals. In addition to these annotated genes, we also detected an important pool of unknown, specific and highly expressed genes that represent valuable targets for the study of A. pompejana adaptation. We also detected a compositional bias that may enhance protein thermostability in this eukaryote facing an extreme and variable thermal regime.

From an evolutionary perspective, our analyses support the new animal phylogeny and seem to indicate a slow evolutionary rate in A. pompejana despite its challenging environmental conditions. Moreover, we found that an important pool of ancestral genes involved in major metazoan functions are lost in other representatives of Protostomes but retained in A. pompejana, suggesting that massive losses are not a shared trait of the Protostomes. Sequence conservation together with ancestral gene retention identified a surprising proximity between A. pompejana and Deuterostomes. This makes A. pompejana thermostable proteins outstanding models for the study of human protein targets.


Special attention has been paid to making the sequences, assembly and annotations accessible via a user-friendly web site. They represent a significant contribution to the successful exploitation of A. pompejana proteins in the future annotation of genomes from annelids and related phyla and will hopefully stimulate future research on metazoan evolution and adaptation.

Methods
cDNA libraries and EST sequencing
A. pompejana samples were collected during the BioSpeedo 2004 oceanographic cruise on the south East Pacific Rise at latitudes ranging from 14°S to 21°33'S (25 individuals, of which 6 were dissected on board). Animals were collected with the telemanipulated arm of Nautile in insulated boxes (4-5°C), recovered at atmospheric pressure, and dissected within a few minutes of recovery on board in RNALater stabilization and storage solution. All individuals and/or tissues were conserved in liquid nitrogen. Four non-normalized and full-length-enriched cDNA libraries were constructed at the CNS Genoscope. One was prepared from 6 whole animals while the others were constructed from specific tissues: gills (5 individuals), ventral tissue (5 individuals) and pygidium (3 individuals). The whole animal and pygidium libraries were constructed with the CloneMiner cDNA construction kit (Invitrogen), which is designed to construct cDNA libraries without the use of traditional restriction enzyme cloning methods. This technology combines the action of SuperScript II reverse transcriptase with the Gateway Technology. Single-stranded mRNA was converted into double-stranded cDNA containing attB sequences on each end. Through site-specific recombination, attB-flanked cDNA was cloned directly into an attP-containing donor vector by homologous recombination. The gill and ventral tissue libraries were prepared using the oligo-capping approach. Full-length RNAs were enriched by the action of bacterial alkaline phosphatase to digest 5'-uncapped mRNAs. A 30-mer 5' oligo was linked using T4 RNA ligase after removing the 5' cap using tobacco acid pyrophosphatase. The first strand cDNA was primed with an oligo(dT)-SfiI primer, made double stranded using specific 5' and 3' primers and amplified by PCR. The SfiI-digested PCR cDNA products were size selected to exclude fragments smaller than 1 kb and then ligated into the DraIII-digested pME18S-FL3 vector. A total of 100,177 different clones were sequenced.

EST filtering, assembly and clustering
We developed a semi-automatic pipeline (TCL scripts) to manage and process the data, from the chromatograms to the assembled sequences. The raw SCF chromatogram files were clipped with respect to quality, repeats, and vector content. Quality clipping was based on phred [70,71] quality values: a window of 20 bp was slid through the quality files from both sides, and the clip positions (left/right) were determined by the first window position with a phred value above a threshold of 13. Vector masking was performed with cross_match against the UniVec database [72] and the pME18S and pDONR222 vector sequences. Masked sequences were cleaned of empty vector sequences, and short sequences (<100 nt) were filtered out. To avoid misassembly, 5' poly(A) and 3' poly(T) sequence boundaries were masked using a 20/25 nt sliding window (in-house TCL script).
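The quality-clipping step can be sketched as follows. This is not the in-house TCL pipeline: it is a minimal Python illustration that assumes the criterion is the mean phred value of the 20 bp window exceeding 13 (the text does not say whether the mean or another window statistic was used). The function name `clip_positions` and the half-open return convention are illustrative choices.

```python
def clip_positions(quals, window=20, threshold=13):
    """Return (left, right) such that quals[left:right] is kept, or None.

    A window is slid from each end of the read; the clip position on each
    side is the first window whose mean quality exceeds the threshold
    (assumed criterion).
    """
    n = len(quals)
    if n < window:
        return None  # read shorter than one window

    def window_ok(start):
        return sum(quals[start:start + window]) / window > threshold

    # Scan from the 5' side for the first acceptable window.
    left = next((i for i in range(n - window + 1) if window_ok(i)), None)
    if left is None:
        return None  # no window passes: the whole read is discarded
    # Scan from the 3' side; a passing window exists because `left` passed.
    right_start = next(i for i in range(n - window, -1, -1) if window_ok(i))
    return left, right_start + window


if __name__ == "__main__":
    read_quals = [2] * 15 + [30] * 40 + [2] * 15
    print(clip_positions(read_quals))
```

With a mean-based criterion, a few low-quality bases remain at each end of the retained region; a stricter variant could require every base in the window to exceed the threshold.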

The EST assembly was performed using CAP3 [73] with default parameters, with the exception of the ‘overlap percent identity cutoff’ (-p) and ‘clipping range’ (-y) parameters, set to 90% and 30 nt, respectively. The assembly was performed independently on each library as well as on the full complement of sequences.

cDNA characterization
Complete and partial CoDing Sequences (CDS) were determined from assembled sequences using two independent approaches, similarity and ab initio prediction. Contig and singleton sequences were compared to protein sequences of the UniprotKB [74] and PDB [75] databases using BLASTX [76]. Coding frames were deduced from BLASTX best hit alignments (E-value < 1e-05). Then, the CDS were created by extending the matching region in both 5' and 3' directions to the end of the cDNA sequence or a stop codon. If a stop codon was encountered in the 5' end, the first ATG codon following this stop codon was chosen as the initiation codon. When a frameshift was detected in the cDNA sequence, the translation of the incriminated region was replaced by masking symbols.
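The extension rule described above can be sketched in a few lines. The code below is a simplified illustration, not the authors' implementation: it assumes a forward-strand match given as 0-based, end-exclusive nucleotide coordinates already in frame, the standard stop codons, and no frameshift handling. It also assumes that when no ATG is found between an upstream stop codon and the match, the CDS starts right after that stop codon. The function name `extend_cds` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def extend_cds(cdna, match_start, match_end):
    """Extend an in-frame match to a CDS; returns (start, end), end exclusive."""
    cdna = cdna.upper()
    if (match_end - match_start) % 3 != 0:
        raise ValueError("match must span whole codons")

    # 3' extension: read codons until a stop codon (included) or the cDNA end.
    end = match_end
    while end + 3 <= len(cdna):
        codon = cdna[end:end + 3]
        end += 3
        if codon in STOP_CODONS:
            break

    # 5' extension: walk upstream until a stop codon or the cDNA start.
    start = match_start
    found_upstream_stop = False
    while start - 3 >= 0:
        if cdna[start - 3:start] in STOP_CODONS:
            found_upstream_stop = True
            break
        start -= 3

    # After an upstream stop, the first in-frame ATG becomes the initiation codon.
    if found_upstream_stop:
        for pos in range(start, match_start + 1, 3):
            if cdna[pos:pos + 3] == "ATG":
                start = pos
                break
    return start, end


if __name__ == "__main__":
    seq = "CCTAAGCTATGAAAGGGCCCTGATT"
    s, e = extend_cds(seq, 11, 17)
    print(seq[s:e])
```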

In parallel, we used ESTScan [77] to detect CDS. Since no large set of coding and noncoding sequences of annelids or molluscs is available for training, we used the H. sapiens model. In order to optimise a threshold for the ESTScan score, we established the distribution of sequences with or without homology according to ESTScan cut-off values. By setting an optimal cut-off value > 200, we obtained a specificity of 70% and a sensitivity of 66%. To be considered as complete, a CDS must start with an initiation codon and end with a stop codon. Additionally, for CDS sequences deduced from BLASTX, the protein sequence must cover at least 80% of the best BLASTX hit.
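The completeness criteria stated above translate into a short check. The sketch below assumes ATG as the only initiation codon and the standard stop codons, and it takes the fraction of the best BLASTX hit covered by the protein as a precomputed number; the function name `is_complete_cds` and its parameters are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def is_complete_cds(cds, hit_coverage=None, min_coverage=0.80):
    """Check the completeness criteria for a CDS.

    hit_coverage is the fraction of the best BLASTX hit covered by the
    protein; pass None for CDS predicted ab initio (no hit to compare with).
    """
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("ATG"):
        return False  # must start with an initiation codon
    if cds[-3:] not in STOP_CODONS:
        return False  # must end with a stop codon
    if hit_coverage is not None and hit_coverage < min_coverage:
        return False  # BLASTX-derived CDS must cover at least 80% of the hit
    return True


if __name__ == "__main__":
    print(is_complete_cds("ATGAAATGA"))
    print(is_complete_cds("ATGAAATGA", hit_coverage=0.5))
```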

The GC content study on mRNA encoding ribosomal proteins was performed for 84 almost complete cDNA (including 15 mitochondrial cDNA). The GC content was plotted against the number of repeats and subsequently tested with several regression models (linear,


exponential, logarithmic and power) using the software SigmaPlot. The model that best fitted the dataset was a power function (y = ax^b). A search for optimal (Fop) codons was performed using a multifactorial correspondence analysis (CoA) of codon usage implemented in the CodonW software [78,79].

Integrative functional annotation
MACS [33] protein alignments were generated with the PipeAlign [80] toolkit. Integrative annotation was based on the MACSIMS [81] and GOAnno [82] software frameworks. MACSIMS divides the multiple alignments into subfamilies according to conservation patterns. It then validates or corrects functional and structural information mined from public databases before propagation to the query sequence. Pfam-A [83] annotations are extracted from MACSIMS. GOAnno provides Gene Ontology [84] annotations for the query, after analysis of the GO terms obtained for the query subfamily. In addition to these programs, we have developed new software (Gagnière et al., manuscript in preparation) to: (1) generate a text mining functional definition from close orthologs, (2) generate a consensus Enzyme Commission number from close orthologs, (3) map the annotated proteins to the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways [85] and the EMBL STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database [86].

The mapping of A. pompejana proteins to the STRING database was performed by retrieving data for the closest human Uniprot homolog. This homolog was then used to search the STRING database using different identifiers (Uniprot ID, Uniprot Ensembl, RefSeq and Genome Reviews accession numbers and gene names, in this order). If no STRING homolog was found using this textual search, a BLASTP search was performed on STRING human protein sequences and the first best hit (E-value < 1e-05) was chosen. Then, STRING networks were built (combined score cut-off > 0.9) and sub-networks were extracted by retrieving level 1 neighbours for each protein. These small sub-networks were scored by the Ratio of consistency (Rc), defined as the ratio between the observed number of sub-network edges (protein-protein interactions) and the maximum theoretical number of edges. Rc will be high in sub-networks exhibiting a high level of intra sub-network interactions. In contrast, a low ratio indicates large sub-networks with few intra sub-network interactions. In order to reduce the set of sub-networks, two sub-networks A and B were fused if they matched the following criteria:

1 − (1 − Nc) × (1 − RcA) × (1 − RcB) > 0.7

Nc > 0.5

with Nc the ratio between the number of nodes shared by A and B and the number of nodes in the smaller sub-network. These criteria help to preferentially fuse highly related sub-networks, while avoiding low consistency sub-networks that would otherwise agglomerate weakly related sub-networks. The final sub-networks were visualized using Cytoscape [87].
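The scoring and fusion rules above can be written out as a short sketch. It assumes undirected sub-networks without self-loops, so that the maximum theoretical number of edges for n nodes is n(n − 1)/2; this formula is an assumption, since the text does not spell it out. The function names `consistency_ratio` and `should_fuse` are hypothetical.

```python
def consistency_ratio(nodes, edges):
    """Rc: observed edges among `nodes` over the maximum n*(n-1)/2 (assumed)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep each undirected edge once, ignoring self-loops and foreign nodes.
    observed = {frozenset(e) for e in edges if len(set(e)) == 2 and set(e) <= nodes}
    return len(observed) / (n * (n - 1) / 2)


def should_fuse(nodes_a, edges_a, nodes_b, edges_b, score_min=0.7, nc_min=0.5):
    """Apply both fusion criteria to sub-networks A and B."""
    a, b = set(nodes_a), set(nodes_b)
    if not a or not b:
        return False
    # Nc: shared nodes relative to the smaller sub-network.
    nc = len(a & b) / min(len(a), len(b))
    rc_a = consistency_ratio(a, edges_a)
    rc_b = consistency_ratio(b, edges_b)
    score = 1 - (1 - nc) * (1 - rc_a) * (1 - rc_b)
    return score > score_min and nc > nc_min


if __name__ == "__main__":
    net_a = ({"P1", "P2", "P3"}, [("P1", "P2"), ("P2", "P3"), ("P1", "P3")])
    net_b = ({"P2", "P3", "P4"}, [("P2", "P3")])
    print(should_fuse(*net_a, *net_b))
```

Because the score multiplies the three complements, a fully connected sub-network (Rc = 1) always satisfies the first criterion, so the separate Nc > 0.5 condition is what prevents it from absorbing sub-networks with which it shares few nodes.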

Functional enrichment
The functional annotation clustering tool in the DAVID [88] software was used to study the sets of differentially conserved genes. For each set, the closest homologs of the A. pompejana proteins from a given species (depending on the set under study) were processed against the background of this species. Enrichment with a P-value < 0.01 was considered to be significant.

Phylogeny and molecular evolution
Phylogenetic reconstruction was performed on 17 model taxa covering the main eukaryotic lineages (choanoflagellate: 1, cnidaria: 1, platyhelminth: 1, nematode: 2, arthropod: 2, lophotrochozoa: 3 including A. pompejana, chordate: 6). The tree was rooted with the yeast Saccharomyces cerevisiae. The phylogenetic tree and rates of amino acid substitution for each branch were inferred on a concatenated alignment of 76 ribosomal protein families using MrBayes 3 [89] under the WAG model.

Both observed and simulated amino-acid frequencies associated with the orthologous set of protein coding regions (65,167 amino acids) were obtained using the codeML package of the software PAML v3.14 [90] and the ‘universal’ genetic code. Amino-acid alignments were validated manually, concatenated and exported in PHYLIP format using the software Se-Al v2.0 [91]. Regions containing gaps, misalignments or uncertainties were excluded from the analysis. PAML analyses were performed using a reference tree previously obtained from the ProML package of PHYLIP 3.68 [92] for the 9 taxa, using the JTT model of amino-acid substitutions. Amino acid frequencies were calculated using the aaML package (aadist = ‘equal’, with the jones.dat matrix) and standard deviations of frequencies were obtained from 100 rearrangements (bootstrap) of the dataset. This allowed us to estimate the proportion of hydrophobic, positively-charged and negatively-charged amino acids associated with the translated sequences across taxa, and to calculate hydrophobicities using the hydrophobic index based on the OMH scale of Sweet & Eisenberg [93]. This index is known to take into account the ability of an amino acid to be replaced by another during the course of evolution.


Comparative genomics
Sequences were compared to the Uniprot database using BlastP and classified according to the taxonomy of their hits. The taxonomic groups taken into account each include at least one species with a complete proteome represented in Uniprot: Deuterostomia, Nematoda, Arthropoda, Platyhelminthes, Cnidaria, other Metazoa, Choanoflagellida, “protists” (Alveolata, Diplomonadida, Cryptophyta, Entamoebidae, Euglenozoa, Mycetozoa, Parabasalidea, Rhodophyta and Stramenopiles), Viridiplantae, Fungi, other Eukaryota, Prokaryotes and Viruses. Hits obtained in Annelids or Mollusca were ignored. When all BlastP hits (E-value < 1e-05) of an A. pompejana protein were restricted to a unique taxon (with the exception of Annelids and Molluscs), this protein was considered to be specific to this taxon and Annelids (and potentially Molluscs).
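The classification rule above amounts to a simple set operation. The sketch below assumes that each retained BlastP hit (E-value < 1e-05) has already been assigned to one of the taxonomic groups listed above; the group labels, the function name `taxon_specificity` and the choice of returning None for unclassified proteins are illustrative.

```python
# Hits in these groups are ignored, as described in the text.
IGNORED_TAXA = {"Annelida", "Mollusca"}


def taxon_specificity(hit_taxa):
    """Return the single taxonomic group shared by all retained hits, else None.

    hit_taxa lists the taxonomic group of every BlastP hit that passed the
    E-value cut-off for one A. pompejana protein.
    """
    groups = {taxon for taxon in hit_taxa if taxon not in IGNORED_TAXA}
    if len(groups) == 1:
        return next(iter(groups))
    return None  # no informative hit, or hits spread over several groups


if __name__ == "__main__":
    print(taxon_specificity(["Deuterostomia", "Annelida", "Deuterostomia"]))
    print(taxon_specificity(["Deuterostomia", "Cnidaria"]))
```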

Alvinella database and website

The database is managed by the PostgreSQL relational database management system and is backed up on a weekly basis. The Alvinella website uses an Apache HTTP server and PHP5 and was built from scratch using the Smarty template engine, the ADOdb database abstraction library and the phpGACL generic access control list library.

Database accession numbers

The sequences have been submitted to the EST section of the EMBL database under accession numbers FP489021 to FP539727 and FP539730 to FP565142.
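For readers who wish to retrieve the records programmatically, the two accession ranges can be expanded into individual identifiers. The sketch below is only a convenience example; the helper name `expand_range` is hypothetical and no database is queried.

```python
def expand_range(first, last):
    """Expand an accession range such as FP489021..FP539727 into a list."""
    prefix = first[:2]
    if last[:2] != prefix or len(first) != len(last):
        raise ValueError("accessions must share prefix and length")
    width = len(first) - len(prefix)
    start, end = int(first[2:]), int(last[2:])
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


# The two ranges quoted in the text.
RANGES = [("FP489021", "FP539727"), ("FP539730", "FP565142")]
accessions = [acc for first, last in RANGES for acc in expand_range(first, last)]

# 50,707 + 25,413 identifiers in total
print(len(accessions))  # 76120
```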

Additional material

Additional file 1: Figure S1. Size distributions of contigs in the global assembly.

Additional file 2: Figure S2. Distribution of complete protein lengths.

Additional file 3: Figure S3. GC content and expression level of ribosomal genes.

Additional file 4: Figure S4. Example of Alvinella proteins mapped on a KEGG pathway.

Additional file 5: Figure S5. Amino acid composition across model taxa.

Acknowledgements

We wish to acknowledge Julie Thompson for a critical reading of the manuscript and Marc Robinson-Rechavi for fruitful discussions. We would also like to thank the Bioinformatics Platform of Strasbourg and the Structural Biology and Genomics Platform for their assistance during this work. We thank the crew and pilots of the RV L'Atalante and the DSV Nautile for their assistance and technical support during the cruise BioSpeedo'04 for collecting animals, and Florence Pradillon for sending preliminary samples helping in the present studies. This work was supported by institutional funds from INSERM, CNRS and UDS, by European Commission funding through the SPINE2-COMPLEXES project LSHG-CT-2006-031220 and by ANR-05-BLAN-0407 grant.

Author details

1Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), CERBM F-67400 Illkirch, France; INSERM, U596, F-67400 Illkirch, France; CNRS, UMR7104, F-67400 Illkirch, France; Faculté des Sciences de la Vie, Université de Strasbourg, F-67000 Strasbourg, France. 2CNRS, UMR 7144, Adaptation et Diversité en Milieu Marin, Station Biologique de Roscoff, 29682, Roscoff, France. 3UPMC Université Paris 6, Station Biologique de Roscoff, 29682, Roscoff, France. 4Genoscope - Centre National de Séquençage, 2 rue Gaston Crémieux CP5706 91057 Evry cedex, France. 5CNRS Institut Ecologie et Environnement (INEE), 3 rue Michel-Ange, 75794, Paris cedex 16, France. 6UPMC Université Paris 6, UMR 7138, Systématique, Adaptation et Evolution, Campus de Jussieu, 75005 Paris, France. 7Université Catholique de Louvain, Laboratoire de Biologie Cellulaire, Institut des Sciences de la vie, Croix du sud 5, B-1348, Louvain-la-Neuve, Belgium. 8UMR 7177 CNRS-UDS, LDSM2 Institut de Chimie de Strasbourg, 1 rue Blaise Pascal - BP 296 R8, 67008 Strasbourg cedex, France.

Authors' contributions

DJ, FG, DH, BK, FL, ELW, DM, JFR, BS, JCT, FZ, OP and OL were founding members of the project. FZ provided biological samples. BS, CDS, JW and PW were involved in the construction and sequencing of cDNA libraries. NG, YB, DB, EP replicated the cDNA libraries. NG, OP and OL carried out the sequence assembly and annotation, the database and website construction, the gene expression studies and the phylogenetic and comparative genomics analyses. DJ carried out the GC and codon analyses. DJ and NG performed the search for thermo-adaptive features in amino-acid composition. IB, SH, BK, JM and AT analysed genes involved in adaptation to extreme environment. NG, DJ, AT, JM, OP and OL wrote the paper. All authors read and approved the final manuscript.

Received: 9 February 2010 Accepted: 16 November 2010 Published: 16 November 2010

References

1. De Robertis EM: Evo-devo: variations on ancestral themes. Cell 2008, 132(2):185-195.

2. McDougall C, Hui JH, Monteiro A, Takahashi T, Ferrier DE: Annelids in evolutionary developmental biology and comparative genomics. Parasite 2008, 15(3):321-328.

3. Morris SC: The Crucible of Creation: The Burgess Shale and the Rise of Animals. Oxford University Press; 1998.

4. Arendt D, Nubler-Jung K: Inversion of dorsoventral axis? Nature 1994, 371(6492):26.

5. Irvine SM, Martindale MQ: Cellular and molecular mechanisms of segmentation in annelids. Seminars In Cell & Developmental Biology 1996, 7(4):593-604.

6. Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA: Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 1997, 387(6632):489-493.

7. Raible F, Tessmar-Raible K, Osoegawa K, Wincker P, Jubin C, Balavoine G, Ferrier D, Benes V, de Jong P, Weissenbach J, et al: Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science 2005, 310(5752):1325-1326.

8. Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, et al: Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 2007, 317(5834):86-94.

9. Desbruyeres D, Laubier L: Alvinella pompejana gen. sp. nov., aberrant Ampharetidae from East Pacific Rise hydrothermal vents. Oceanol Acta 1980, 3:267-274.

10. Cary SC, Shank T, Stein J: Worms bask in extreme temperatures. Nature 1998, 391(6667):545-546.

11. Chevaldonné P, Desbruyeres D, Childress JJ: Some like it hot... and some even hotter. Nature 1992, 359(6396):593-594.

12. Le Bris N, Zbinden M, Gaill F: Processes controlling the physico-chemical micro-environments associated with Pompeii worms. Deep-Sea Research Part I-Oceanographic Research Papers 2005, 52(6):1071-1083.

13. Le Bris N, Gaill F: How does the annelid Alvinella pompejana deal with an extreme hydrothermal environment? Reviews In Environmental Science and BioTechnology 2007, 6:102-119.

14. Desbruyeres D, Chevaldonné P, Alayse AM, Jollivet D, Lallier FH, Jouin-Toulmond C, Zal F, Sarradin PM, Cosson R, Caprais JC, et al: Biology and ecology of the Pompeii worm (Alvinella pompejana Desbruyères and Laubier), a normal dweller of an extreme deep-sea environment: A synthesis of current knowledge and recent developments. Deep-sea research 1998, 45(1-3):383-422.

15. Pradillon F, Zbinden M, Mullineaux LS, Gaill F: Colonisation of newly-opened habitat by a pioneer species, Alvinella pompejana (Polychaeta: Alvinellidae), at East Pacific Rise vent sites. Marine Ecology-Progress Series 2005, 302:147-157.

16. Girguis PR, Lee RW: Thermal preference and tolerance of alvinellids. Science 2006, 312(5771):231.

17. Dahlhoff E, Somero GN: Pressure and temperature adaptation of cytosolic malate dehydrogenases of shallow and deep-living marine invertebrates: evidence for high body temperatures in hydrothermal vent animals. Journal of Experimental Biology 1991, 159:473-487.

18. Jollivet D, Desbruyeres D, Ladrat C, Laubier L: Evidence for differences in the allozyme thermostability of deep-sea hydrothermal vent polychaetes (Alvinellidae): a possible selection by habitat. Marine Ecology Progress Series 1995, 123:125-136.

19. Burjanadze TV: New analysis of the phylogenetic change of collagen thermostability. Biopolymers 2000, 53:523-528.

20. Sicot FX, Mesnage M, Masselot M, Exposito JY, Garrone R, Deutsch J, Gaill F: Molecular adaptation to an extreme environment: origin of the thermal stability of the pompeii worm collagen. J Mol Biol 2000, 302(4):811-820.

21. Piccino P, Viard F, Sarradin PM, Le Bris N, Le Guen D, Jollivet D: Thermal selection of PGM allozymes in newly founded populations of the thermotolerant vent polychaete Alvinella pompejana. Proc Biol Sci 2004, 271(1555):2351-2359.

22. Henscheid KL, Shin DS, Cary SC, Berglund JA: The splicing factor U2AF65 is functionally conserved in the thermotolerant deep-sea worm Alvinella pompejana. Biochim Biophys Acta 2005, 1727(3):197-207.

23. Shin DS, Didonato M, Barondeau DP, Hura GL, Hitomi C, Berglund JA, Getzoff ED, Cary SC, Tainer JA: Superoxide dismutase from the eukaryotic thermophile Alvinella pompejana: structures, stability, mechanism, and insights into amyotrophic lateral sclerosis. J Mol Biol 2009, 385(5):1534-1555.

24. Grzymski JJ, Murray AE, Campbell BJ, Kaplarevic M, Gao GR, Lee C, Daniel R, Ghadiri A, Feldman RA, Cary SC: Metagenome analysis of an extreme microbial symbiosis reveals eurythermal adaptation and metabolic flexibility. Proc Natl Acad Sci USA 2008, 105(45):17516-17521.

25. Pesole G, Grillo G, Larizza A, Liuni S: The untranslated regions of eukaryotic mRNAs: structure, function, evolution and bioinformatic tools for their analysis. Brief Bioinform 2000, 1(3):236-249.

26. Zhang L, Kasif S, Cantor CR, Broude NE: GC/AT-content spikes as genomic punctuation marks. Proc Natl Acad Sci USA 2004, 101(48):16855-16860.

27. Bechtel JM, Wittenschlaeger T, Dwyer T, Song J, Arunachalam S, Ramakrishnan SK, Shepard S, Fedorov A: Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structures. BMC Genomics 2008, 9:284.

28. Mignone F, Grillo G, Licciulli F, Iacono M, Liuni S, Kersey PJ, Duarte J, Saccone C, Pesole G: UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 2005, 33(Database issue):D141-146.

29. Duret L: Evolution of synonymous codon usage in metazoans. Current Opinion in Genetics & Development 2002, 12(6):640-649.

30. Eyre-Walker A: Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy? Mol Biol Evol 1996, 13(6):864-872.

31. Duret L, Mouchiroud D: Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci USA 1999, 96(8):4482-4487.

32. dos Reis M, Wernisch L: Estimating translational selection in eukaryotic genomes. Mol Biol Evol 2009, 26(2):451-461.

33. Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O: Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 2001, 270(1-2):17-30.

34. Animal Genome Size Database. [http://www.genomesize.com].

35. Bonnivard E, Catrice O, Ravaux J, Brown SC, Higuet D: Survey of genome size in 28 hydrothermal vent species covering 10 families. Genome 2009, 52(6):524-536.

36. Dixon DR, Jolly MT, Vevers WF, Dixon LRJ: Chromosomes of Pacific hydrothermal vent invertebrates: towards a greater understanding of the relationship between chromosome and molecular evolution. Journal of the Marine Biological Association of the United Kingdom 2009, 90:15-31.

37. Alvinella pompejana website. [http://alvinella.igbmc.fr/Alvinella/].

38. Gordon D: Viewing and editing assembled sequences using Consed. Curr Protoc Bioinformatics 2003, Chapter 11, Unit 11.2.

39. Udall JA, Swanson JM, Haller K, Rapp RA, Sparks ME, Hatfield J, Yu Y, Wu Y, Dowd C, Arpat AB, et al: A global assembly of cotton ESTs. Genome Res 2006, 16(3):441-450.

40. Pavy N, Paule C, Parsons L, Crow JA, Morency MJ, Cooke J, Johnson JE, Noumen E, Guillet-Claude C, Butterfield Y, et al: Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters. BMC Genomics 2005, 6:144.

41. Karim N, Jones JT, Okada H, Kikuchi T: Analysis of expressed sequence tags and identification of genes encoding cell-wall-degrading enzymes from the fungivorous nematode Aphelenchus avenae. BMC Genomics 2009, 10:525.

42. Hourdez S, Weber RE: Molecular and functional adaptations in deep-sea hemoglobins. J Inorg Biochem 2005, 99(1):130-141.

43. Hourdez S, Lallier FH, De Cian MC, Green BN, Weber RE, Toulmond A: Gas transfer system in Alvinella pompejana (Annelida polychaeta, Terebellida): functional properties of intracellular and extracellular hemoglobins. Physiol Biochem Zool 2000, 73(3):365-373.

44. Mary J, Rogniaux H, Rees JF, Zal F: Response of Alvinella pompejana to variable oxygen stress: a proteomic approach. Proteomics 10(12):2250-2258.

45. Marie B, Genard B, Rees J, Zal F: Effect of ambient oxygen concentration on activities of enzymatic antioxidant defences and aerobic metabolism in the hydrothermal vent worm, Paralvinella grasslei. Marine Biology 2006, 150:273-284.

46. Dixon D, Dixon L, Shillito B, Gwynn J: Background and induced levels of DNA damage in Pacific deep-sea vents polychaetes: the case for avoidance. Cahier de Biologie Marine 2002, 43:333-336.

47. Ahearn GA, Mandal PK, Mandal A: Mechanisms of heavy-metal sequestration and detoxification in crustaceans: a review. J Comp Physiol B 2004, 174(6):439-452.

48. Gaill F, Halpern S, Quintana C, Desbruyeres D: Presence intracellulaire d'arsenic et de zinc associés au soufre chez une Polychète des sources hydrothermales. C R Acad Sci III 1984, 298:331-335.

49. Vovelle J, Gaill F: Données morphologiques, histochimiques et microanalytiques sur l'élaboration du tube organominéral d'Alvinella pompejana, Polychète des sources hydrothermales, et leurs implications phylogénétiques. Zool Scripta 1986, 15(1):33-43.

50. Cipollone R, Ascenzi P, Visca P: Common themes and variations in the rhodanese superfamily. IUBMB Life 2007, 59(2):51-59.

51. Vogt G, Woell S, Argos P: Protein thermal stability, hydrogen bonds, and ion pairs. J Mol Biol 1997, 269(4):631-643.

52. Haney PJ, Stees M, Konisky J: Analysis of thermal stabilizing interactions in mesophilic and thermophilic adenylate kinases from the genus Methanococcus. J Biol Chem 1999, 274(40):28453-28458.

53. Szilagyi A, Zavodszky P: Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: results of a comprehensive survey. Structure 2000, 8(5):493-504.

54. Nishio Y, Nakamura Y, Kawarabayasi Y, Usuda Y, Kimura E, Sugimoto S, Matsui K, Yamagishi A, Kikuchi H, Ikeo K, et al: Comparative complete genome sequence analysis of the amino acid replacements responsible for the thermostability of Corynebacterium efficiens. Genome Res 2003, 13(7):1572-1579.

55. Berezovsky IN, Shakhnovich EI: Physics and evolution of thermophilic adaptation. Proc Natl Acad Sci USA 2005, 102(36):12742-12747.

56. Robinson-Rechavi M, Alibes A, Godzik A: Contribution of electrostatic interactions, compactness and quaternary structure to protein thermostability: lessons from structural genomics of Thermotoga maritima. J Mol Biol 2006, 356(2):547-557.

57. Wolf YI, Rogozin IB, Koonin EV: Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res 2004, 14(1):29-36.

58. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science 2006, 311(5765):1283-1287.

59. Vilmos P, Gaudenz K, Hegedus Z, Marsh JL: The Twisted gastrulation family of proteins, together with the IGFBP and CCN families, comprise the TIC superfamily of cysteine rich secreted factors. Mol Pathol 2001, 54(5):317-323.

60. Alcivar A, Hu S, Tang J, Yang X: DEDD and DEDD2 associate with caspase-8/10 and signal cell death. Oncogene 2003, 22(2):291-297.

61. Tinel A, Tschopp J: The PIDDosome, a protein complex implicated in activation of caspase-2 in response to genotoxic stress. Science 2004, 304(5672):843-846.

62. Lescure A, Rederstorff M, Krol A, Guicheney P, Allamand V: Selenoprotein function and muscle disease. Biochim Biophys Acta 2009, 1790(11):1569-1574.

63. Hughes AL, Friedman R: Differential loss of ancestral gene families as a source of genomic divergence in animals. Proc Biol Sci 2004, 271(Suppl 3):S107-109.

64. Zmasek CM, Zhang Q, Ye Y, Godzik A: Surprising complexity of the ancestral apoptosis network. Genome Biol 2007, 8(10):R226.

65. Miller DJ, Hemmrich G, Ball EE, Hayward DC, Khalturin K, Funayama N, Agata K, Bosch TC: The innate immune repertoire in cnidaria - ancestral complexity and stochastic gene loss. Genome Biol 2007, 8(4):R59.

66. Srivastava M, Begovic E, Chapman J, Putnam NH, Hellsten U, Kawashima T, Kuo A, Mitros T, Salamov A, Carpenter ML, et al: The Trichoplax genome and the nature of placozoans. Nature 2008, 454(7207):955-960.

67. Natesan S, Jayasundaramma B, Ramamurthi R, Reddy SR: Presence of a partial urea cycle in the leech, Poecilobdella granulosa. Experientia 1992, 48(8):729-731.

68. Roy SW: Intron-rich ancestors. Trends Genet 2006, 22(9):468-471.

69. Glansdorff N, Xu Y, Labedan B: The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol Direct 2008, 3:29.

70. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8(3):175-185.

71. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8(3):186-194.

72. The Univec database. [http://www.phrap.org/phrap_documentation.html].

73. Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res 1999, 9(9):868-877.

74. UniProt Consortium: The universal protein resource (UniProt). Nucleic Acids Res 2008, 36(Database issue):D190-195.

75. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.

76. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.

77. Lottaz C, Iseli C, Jongeneel CV, Bucher P: Modeling sequencing errors by combining Hidden Markov models. Bioinformatics 2003, 19(Suppl 2):ii103-112.

78. CodonW. [http://sourceforge.net/projects/codonw].

79. Peden JF: Analysis of codon usage. PhD thesis. University of Nottingham; 1999.

80. Plewniak F, Bianchetti L, Brelivet Y, Carles A, Chalmel F, Lecompte O, Mochel T, Moulinier L, Muller A, Muller J, et al: PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res 2003, 31(13):3829-3832.

81. Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O: MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics 2006, 7:318.

82. Chalmel F, Lardenois A, Thompson JD, Muller J, Sahel JA, Leveillard T, Poch O: GOAnno: GO annotation based on multiple alignment. Bioinformatics 2005, 21(9):2095-2096.

83. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(Database issue):D247-251.

84. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.

85. Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, Goto S, Kanehisa M: KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res 2008, 36(Web Server issue):W423-426.

86. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P: STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 2007, 35(Database issue):D358-362.

87. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, Christmas R, Avila-Campilo I, Creech M, Gross B, et al: Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2007, 2(10):2366-2382.

88. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3.

89. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19(12):1572-1574.

90. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 1997, 13(5):555-556.

91. Se-Al. [http://tree.bio.ed.ac.uk/software/seal/].

92. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5:164-166.

93. Sweet RM, Eisenberg D: Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J Mol Biol 1983, 171(4):479-488.

doi:10.1186/1471-2164-11-634

Cite this article as: Gagnière et al.: Insights into metazoan evolution from Alvinella pompejana cDNAs. BMC Genomics 2010 11:634.
