Analysis of Phenetic Trees Based on Metabolic Capabilites Across the Three Domains of Life


Daniel Aguilar1, Francesc X. Aviles1, Enrique Querol1 and Michael J. E. Sternberg2*

1Institut de Biotecnologia i Biomedicina, Universitat Autònoma de Barcelona, 08193 Bellaterra (Barcelona), Spain

2Structural Bioinformatics Group, Biochemistry Building, Department of Biological Sciences, Imperial College London, South Kensington campus, London SW7 2AZ, UK

Here, we used data of complete genomes to study comparatively the metabolism of different species. We built phenetic trees based on the enzymatic functions present in different parts of metabolism. Seven broad metabolic classes, comprising a total of 69 metabolic pathways, were comparatively analyzed for 27 fully sequenced organisms of the domains Eukarya, Bacteria and Archaea. Phylogenetic profiles based on the presence/absence of enzymatic functions for each metabolic class were determined and distance matrices for all the organisms were then derived from the profiles. Unrooted phenetic trees based upon the matrices revealed the distribution of the organisms according to their metabolic capabilities, reflecting the ecological pressures and adaptations that those species underwent during their evolution. We found that organisms that are closely related in phylogenetic terms could be distantly related metabolically and that the opposite is also true. For example, obligate bacterial pathogens were usually grouped together in our metabolic trees, demonstrating that obligate pathogens share common metabolic features regardless of their diverse phylogenetic origins. The branching order of proteobacteria often did not match their classical phylogenetic classification and Gram-positive bacteria showed diverse metabolic affinities. Archaea were found to be metabolically as distant from free-living bacteria as from eukaryotes, and sometimes were placed close to the metabolically highly specialized group of obligate bacterial pathogens. Metabolic trees represent an integrative approach for the comparison of the evolution of the metabolism and its correlation with the evolution of the genome, helping to find new relationships in the tree of life.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: functional genomics; metabolic pathways; metabolic databases; comparative genomics; phylogenetic profiles

*Corresponding author. E-mail address of the corresponding author: [email protected]

Abbreviations used: KEGG, Kyoto encyclopedia of genes and genomes; ORFs, open reading frames.

0022-2836/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.jmb.2004.04.059 J. Mol. Biol. (2004) 340, 491–512

Introduction

The classification of organisms is one of the major challenges in biology.1 A classification based on rRNA sequences as a phylogenetic marker established the three-domain tree of life, namely Eukarya, Bacteria (eubacteria) and Archaea (archaebacteria),2,3 although this classification has been questioned.4–6 Once all the information present in several genomes has become available, there are vast amounts of data that can be used comparatively to examine multiple features of the genomes of different species, thereby giving rise to a “genome-based” phylogenetic approach. Studies involving sequence comparison include the analysis of protein motifs7 and the presence of orthologous groups.8–11 Analysis not directly related to sequence similarity includes the presence/absence of protein families,12 the species-specific codon usage and C + G content,13,14 the amino acid composition,15,16 the distribution of protein folds,15,17 the presence of conserved gene pairs18 and the comparative gene order of orthologs.11,19,20 The overall gene repertoire of the genome in terms of shared genes has also been studied.21,22 Comparison with the classical 16S rRNA phylogeny2,23 has shown that most of the genome-based trees broadly agree with the classical phylogeny15,18,22,24 although detailed examination of the distribution of organisms shows new relationships between lineages, revealing evolutionary phenomena such as preferential horizontal transfer and lineage-specific gene loss.

There are, however, other levels of analysis beyond the genotype. Genes usually do not act individually but generally form networks and pathways of varying complexity. Metabolism is a good example of this, with the enzymes being the building blocks which can be combined in a variety of ways.25–28 This enzymatic shuffling provides the metabolic plasticity which is essential for the successful adaptation to different niches. The distribution of enzymatic functions in the different lineages can be correlated with the physiological traits of the lineages, the environmental pressures that those lineages underwent during their evolutionary history and features such as pathogenicity. Furthermore, knowledge of the metabolome (i.e. all the metabolic machinery present in a cell at a given time, including metabolites and coenzymes as well as the enzymes) has potential applications in many fields of molecular biology such as the characterization of unknown gene products and metabolic engineering.29–32

Here, we present an integrative approach to the analysis of the relationships between species of the three domains of life, complementary to previous whole-genome analyses. We analyze the topologies of a number of trees based on the main metabolic pathways for a set of representative organisms, providing an insight into their metabolic features and the correspondence of those features with the evolution of their genomes. We refer to our trees as “metabolic trees”. The basis of our method is different from that of sequence-based whole-genome analysis and, due to the nature of the data we are using, our trees are “phenetic”: they do not necessarily reflect evolutionary relationships but classify organisms according to their metabolic features. We established relationships according to the distribution of enzymatic functions in metabolism, showing that metabolism reflects the evolutionary pressures on different lineages through their specializations to survive in particular ecological niches, and we compare how those changes are reflected in their classical phylogenetic distribution based on rRNA.

Construction of the Metabolic Trees

Selecting the set of organisms

We required a set of organisms covering a representative range of evolutionary lines. We calculated the total number of proteins in the SWISS-PROT (manual annotation) and TrEMBL (automated annotation) databases for all the fully sequenced organisms of each domain of the tree of life (namely, Eukarya, Bacteria and Archaea). Then we selected those organisms that fulfilled two conditions: (i) the number of proteins for an organism in the SWISS-PROT database is >2000 for eukaryotes, >600 for bacteria or >400 for Archaea; and (ii) the total number of proteins for an organism in the combined SWISS-PROT and TrEMBL databases is >5000 for eukaryotes, >2000 for bacteria or >1500 for Archaea. Three obligate bacterial pathogens were additionally included in the set because of the metabolic interest of their adaptations to a peculiar lifestyle. The Archaea Thermoplasma acidophilum and Aeropyrum pernix were included as well. The resulting set of organisms contains species of diverse origin and lifestyle, prioritizing those with better-annotated genomes. Table 1 shows the detailed list of organisms and a brief description of their main characteristics.
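The two selection conditions can be illustrated with a minimal sketch. The thresholds are those quoted above; the `ProteinCounts` record, the `passes_filter` helper and the candidate counts are hypothetical and serve only to show how the two conditions combine. The obligate pathogens and the two Archaea that were added by hand are outside the scope of this filter.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: (SWISS-PROT minimum, SWISS-PROT + TrEMBL minimum).
THRESHOLDS = {
    "Eukarya": (2000, 5000),
    "Bacteria": (600, 2000),
    "Archaea": (400, 1500),
}


@dataclass
class ProteinCounts:
    name: str        # organism label (hypothetical examples below)
    domain: str      # "Eukarya", "Bacteria" or "Archaea"
    swissprot: int   # number of proteins in SWISS-PROT
    trembl: int      # number of proteins in TrEMBL


def passes_filter(org: ProteinCounts) -> bool:
    """Return True when both annotation-count conditions are satisfied."""
    min_swissprot, min_combined = THRESHOLDS[org.domain]
    combined = org.swissprot + org.trembl
    return org.swissprot > min_swissprot and combined > min_combined


# Invented counts, for illustration only.
candidates = [
    ProteinCounts("bacterium A", "Bacteria", 900, 1500),
    ProteinCounts("bacterium B", "Bacteria", 500, 4000),
    ProteinCounts("archaeon C", "Archaea", 450, 1000),
    ProteinCounts("eukaryote D", "Eukarya", 2500, 3000),
]
selected = [org.name for org in candidates if passes_filter(org)]
print(selected)
```

Both conditions must hold, so an organism with a large TrEMBL entry count but few manually annotated SWISS-PROT entries (such as the invented "bacterium B") is rejected.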

Selecting the metabolic pathways, metabolic classes and enzymes

Metabolic pathways were taken from the Kyoto encyclopedia of genes and genomes (KEGG) database,33 where they are grouped hierarchically and without overlap into 11 metabolic classes†, the main criterion for grouping being the biochemical nature of the compounds metabolized. We will refer to a group of related metabolic pathways (according to KEGG’s classification) as a “metabolic class”. The classes named Metabolism of Other Amino Acids, Biosynthesis of Secondary Metabolites and Metabolism of Cofactors and Vitamins were not included in our analysis, since they do not comprise a metabolically coherent group of metabolic pathways, and the class named Biodegradation of Xenobiotics was not included because the metabolic pathways it contains are too rare among the majority of organisms studied. The number of metabolic classes analyzed was thereby reduced to seven (Amino Acid Metabolism, Energy Metabolism, Nucleotide Metabolism, Carbohydrate Metabolism, Lipid Metabolism, Metabolism of Complex Lipids, Metabolism of Complex Carbohydrates) comprising 69 different metabolic pathways.

Enzymatic functions (as represented by the IUPAC four-digit EC code) present in the metabolic pathways were taken from the LIGAND database‡,34 where they are related to the different metabolic pathways. Since EC codes do not contain species-specific information, organism-specific enzymes were identified and extracted by parsing the GENES section of the enzyme file of the LIGAND database, which lists the genes encoding proteins with each particular EC function.

†http://www.genome.ad.jp/kegg/metabolism.html
‡http://www.genome.ad.jp/ligand/
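The extraction of organism-specific enzymes described above can be sketched as follows. The record layout used here is an assumption made for illustration (keywords starting in the first column, indented continuation lines, `///` as record terminator); it is a simplification and is not claimed to match the enzyme file exactly. The organism codes, gene names and the `functions_by_organism` helper are invented.

```python
from collections import defaultdict

# Toy records in an ASSUMED, simplified layout (not real database content):
# keywords start in column 1, continuation lines are indented, "///" ends a record.
# Organism codes (ORGA, ...) and gene names (g001, ...) are invented.
SAMPLE = """\
ENTRY       EC 1.1.1.1
NAME        hypothetical enzyme one
GENES       ORGA: g001 g002
            ORGB: g101
///
ENTRY       EC 2.7.1.1
GENES       ORGA: g003
            ORGC: g201
///
"""


def functions_by_organism(text):
    """Map each organism code to the set of EC numbers listed under GENES."""
    result = defaultdict(set)
    current_ec = None
    in_genes = False
    for line in text.splitlines():
        if line.startswith("///"):
            current_ec, in_genes = None, False
            continue
        if line[:1].isspace():
            keyword, value = "", line.strip()   # continuation of the previous field
        else:
            keyword, _, value = line.partition(" ")
            value = value.strip()
        if keyword == "ENTRY":
            current_ec = value.split()[1]       # "EC 1.1.1.1" -> "1.1.1.1"
            in_genes = False
        elif keyword:
            in_genes = keyword == "GENES"
        if in_genes and current_ec and ":" in value:
            result[value.split(":", 1)[0]].add(current_ec)
    return dict(result)


ec_sets = functions_by_organism(SAMPLE)
print(sorted(ec_sets))
```

The result is one set of EC numbers per organism, which is the species-specific information that the EC codes alone do not carry.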

492 Trees Based on Metabolic Capabilities

Table 1. Organisms in our dataset

Organism                    Code   Domain    Taxonomy                              Gram classification  Pathogenicity              Lifestyle    Carbon source
Haemophilus influenzae      HAEIN  Bacteria  γ-Proteobacteria (Pasteurellales)     Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Pasteurella multocida       PASMU  Bacteria  γ-Proteobacteria (Pasteurellales)     Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Salmonella typhimurium      SALTY  Bacteria  γ-Proteobacteria (Enterobacteriales)  Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Shigella flexneri           SHIFL  Bacteria  γ-Proteobacteria (Enterobacteriales)  Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Yersinia pestis             YERPE  Bacteria  γ-Proteobacteria (Enterobacteriales)  Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Escherichia coli            ECOLI  Bacteria  γ-Proteobacteria (Enterobacteriales)  Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Pseudomonas aeruginosa      PSEAE  Bacteria  γ-Proteobacteria (Pseudomonadales)    Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Vibrio cholerae             VIBCH  Bacteria  γ-Proteobacteria (Vibrionales)        Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Xylella fastidiosa          XILFA  Bacteria  γ-Proteobacteria (Xanthomonadales)    Gram (−)             Facultative phytopathogen  Mesophile    Heterotroph
Neisseria meningitidis      NEIME  Bacteria  β-Proteobacteria                      Gram (−)             Facultative pathogen       Mesophile    Heterotroph
Mycobacterium tuberculosis  MYCTU  Bacteria  Firmicutes (Actinobacteria)           Gram (+), high GC    Facultative pathogen       Mesophile    Heterotroph
Streptococcus pyogenes      STRPY  Bacteria  Firmicutes (Bacilli)                  Gram (+), low GC     Facultative pathogen       Mesophile    Heterotroph
Staphylococcus aureus       STAAU  Bacteria  Firmicutes (Bacilli)                  Gram (+), low GC     Facultative pathogen       Mesophile    Heterotroph
Bacillus subtilis           BACSU  Bacteria  Firmicutes (Bacilli)                  Gram (+), low GC     Non-pathogenic             Mesophile    Heterotroph
Mycoplasma pneumoniae       MYCPN  Bacteria  Firmicutes (Mollicutes)               Gram (+), low GC     Obligate pathogen          Mesophile    Heterotroph
Treponema pallidum          TREPA  Bacteria  Spirochaetes                          Gram (−)             Obligate pathogen          Mesophile    Heterotroph
Chlamydia trachomatis       CHLMU  Bacteria  Chlamydiae                            Gram (−)             Obligate pathogen          Mesophile    Heterotroph
Archaeoglobus fulgidus      ARCFU  Archaea   Euryarchaeota                         –                    –                          Thermophile  Autotroph
Pyrococcus abyssi           PYRAB  Archaea   Euryarchaeota                         –                    –                          Thermophile  Autotroph
Thermoplasma acidophilum    THEAC  Archaea   Euryarchaeota                         –                    –                          Thermophile  Heterotroph
Aeropyrum pernix            AERPE  Archaea   Crenarchaeota                         –                    –                          Thermophile  Heterotroph
Arabidopsis thaliana        ARATH  Eukarya   Viridiplantae                         –                    –                          Mesophile    Heterotroph
Mus musculus                MOUSE  Eukarya   Metazoa (Mammalia)                    –                    –                          Mesophile    Heterotroph
Homo sapiens                HUMAN  Eukarya   Metazoa (Mammalia)                    –                    –                          Mesophile    Heterotroph
Caenorhabditis elegans      CAEEL  Eukarya   Metazoa (Nematoda)                    –                    –                          Mesophile    Heterotroph
Saccharomyces cerevisiae    YEAST  Eukarya   Fungi                                 –                    –                          Mesophile    Heterotroph
Schizosaccharomyces pombe   SCHPO  Eukarya   Fungi                                 –                    –                          Mesophile    Heterotroph

All organisms are chemotrophs. References: A. fulgidus;69 A. pernix;70 A. thaliana;71,72 B. subtilis;73 C. elegans;74 C. trachomatis;75 E. coli;76 H. influenzae;77 H. sapiens;78,79 M. musculus;80,81 M. pneumoniae;47 M. tuberculosis;49 N. meningitidis;82,83 P. abyssi;84 P. aeruginosa;85 P. multocida;86 S. aureus;87 S. cerevisiae;88 S. flexneri;89,90 S. pombe;91 S. pyogenes;92,93 S. typhimurium;94 T. acidophilum;95 T. pallidum;48 V. cholerae;96 X. fastidiosa;97 Y. pestis.98

Building the metabolic trees

The core of our analysis is closely related to the so-called “phylogenetic profiles” approach.35 For each metabolic class in our database, we built a binary profile for each organism according to the presence/absence of the enzymatic functions belonging to that metabolic class in that particular organism (an outline of the procedure can be seen in Figure 1). The nature of our data is, thus, character-based. No metabolic reconstruction has been attempted: enzymatic functions were added to the profile in no specific order. Using a normalized Hamming distance (the number of bits that must be changed to turn one binary string into another, divided by the length of the strings) we constructed an “all versus all” organism distance matrix for each metabolic class. The subsequent tree derived from that matrix was constructed with the neighbor-joining algorithm present in the NEIGHBOR program from the PHYLIP software suite.36 The final step of our analysis was to build a metabolic tree using the enzymatic data of all the metabolic pathways considered (the total metabolic tree). Bootstrap analysis with 1000 replicates was performed in order to test the trees. The trees were finally drawn with the program PHYLODENDRON (© 1997 by D. G. Gilbert).
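The construction of the binary profiles and of the "all versus all" distance matrix can be illustrated with a minimal sketch. The presence data, organism labels and function names below are invented. Two points are assumptions rather than statements from the text: the Hamming distance is normalized by the profile length, and the bootstrap resamples profile columns (enzymatic functions) with replacement, as is usual for character data. Tree building itself (NEIGHBOR) is not reproduced here.

```python
import random
from itertools import combinations

# Invented presence data for one hypothetical metabolic class:
# organism label -> enzymatic functions (EC numbers) annotated for it.
presence = {
    "ORG_A": {"1.1.1.1", "2.7.1.1", "4.1.2.13"},
    "ORG_B": {"1.1.1.1", "2.7.1.1"},
    "ORG_C": {"5.3.1.9"},
}


def build_profiles(presence):
    """Turn presence sets into equal-length 0/1 profiles over all functions."""
    functions = sorted(set().union(*presence.values()))
    profiles = {
        org: [1 if ec in ecs else 0 for ec in functions]
        for org, ecs in presence.items()
    }
    return functions, profiles


def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(profiles):
    """All-versus-all distances as a nested dict, symmetric with a zero diagonal."""
    names = list(profiles)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        matrix[n][m] = matrix[m][n] = normalized_hamming(profiles[n], profiles[m])
    return matrix


def bootstrap_replicate(profiles, rng):
    """Resample profile columns with replacement (assumed bootstrap scheme)."""
    length = len(next(iter(profiles.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {org: [bits[c] for c in columns] for org, bits in profiles.items()}


functions, profiles = build_profiles(presence)
matrix = distance_matrix(profiles)
replicate = bootstrap_replicate(profiles, random.Random(0))
print(matrix["ORG_A"]["ORG_B"])  # the two profiles differ in 1 of 4 functions
```

Because the order of the functions in the profile does not affect the Hamming distance, adding functions "in no specific order" leaves the matrix unchanged.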

Some authors have used character-based methods (e.g. parsimony) for the construction of trees based on presence/absence data.15,18 In our study, we considered it preferable to use a distance-based method such as neighbor-joining, because distance-based methods are less affected by the great divergence of the organisms studied.15 Furthermore, the parsimony method was, in our opinion, not the most suitable for the complex nature of our data: the assumptions of lack of homoplasy and independent evolution of the states37–39 were unlikely to be met by our data, which consist of enzymatic functions organized in a complex network (where all elements are inter-connected and inter-dependent). For comparative purposes, however, we considered it informative to analyze our data with a character-based method as well, since it would provide an extra validation of the reliability of the constructed trees. Parsimonious trees for all the metabolic classes were constructed from the same presence data using the MIX program from the PHYLIP software suite.

Building the rRNA tree

We built an rRNA-based tree for the organisms in our set in order to compare the topology of our metabolic trees with that of a standard phylogenetic tree. Sequences of the rRNA genes for all the organisms in our dataset were retrieved from the European Ribosomal RNA Database.40 A sequence distance matrix was later obtained after aligning the sequences with the CLUSTAL W program.41 The derived tree was built using the NEIGHBOR program of the PHYLIP software suite and was drawn with the program PHYLODENDRON. The resulting tree is shown in Figure 2.

Figure 1. Construction of the metabolic trees. Profile of a hypothetical metabolic class for a set of five organisms. Fa…Fl are enzymatic functions. In the binary matrix, 0/1 represents the absence/presence of a particular enzymatic function. The organism distance matrix is built by calculating the normalized Hamming distance for each pair of organisms. The tree is built by neighbor-joining from the distance matrix.

Building the gene content tree

Since our approach is complementary to other whole-genome comparisons (see Introduction), we also wanted to compare our trees to a tree based on gene content. For this purpose, we constructed a tree based on the contents of the genomes using the SHOT web-based tool.42

Comparing metabolic trees

All the metabolic trees and the rRNA tree were compared on an “all versus all” basis with the TREEDIST program of the PHYLIP package, which uses the Symmetric Distance algorithm43 to compare the topology of two trees. Distances are shown in Table 2.
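The idea behind a symmetric distance between two topologies can be sketched as follows: each internal branch of an unrooted tree splits the organisms into two groups, and the distance counts the splits found in only one of the two trees. This is an illustrative sketch, not a reimplementation of TREEDIST. The nested-tuple tree encoding, the function names and the normalization by the total number of splits are assumptions, and the example trees are invented.

```python
def nontrivial_splits(tree):
    """Return the internal bipartitions of a tree written as nested tuples of leaf names."""
    def collect_leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(collect_leaves(child) for child in node))

    all_leaves = collect_leaves(tree)
    reference = min(all_leaves)   # fixes which side of each split is stored
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip splits that separate a single leaf (or nothing) from the rest.
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(below if reference not in below else all_leaves - below)
        return below

    visit(tree)
    return found


def symmetric_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))


def normalized_symmetric_distance(tree_a, tree_b):
    """Scale to [0, 1] by the total number of splits (an assumed normalization)."""
    total = len(nontrivial_splits(tree_a)) + len(nontrivial_splits(tree_b))
    return symmetric_distance(tree_a, tree_b) / total if total else 0.0


# Two invented five-organism topologies that differ in one internal branch each.
tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "C"), "B", ("D", "E"))
print(symmetric_distance(tree_1, tree_2), normalized_symmetric_distance(tree_1, tree_2))
```

The measure depends only on topology: branch lengths play no role, which suits a comparison between trees built from very different kinds of data.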

Results

We built one tree for each of the seven metabolic classes mentioned, which together comprise a total of 69 different metabolic pathways. We also built a total metabolic tree comprising the enzymes of all the metabolic classes. We analyzed the distribution of the organisms in the trees, relating it to their environmental needs and phylogenetic affinities. Using the Symmetric Distance algorithm, we compared the topology of each tree to the rRNA tree and to the other trees (Table 2).

Figure 2. Tree based upon the alignment of rRNA sequences for the organisms in our dataset.

Table 2. Symmetric distances between the trees for each metabolic class and the rRNA tree (from Figure 2)

Tree                                      (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)
(1) rRNA tree                             –      0.294  0.294  0.275  0.333  0.275  0.353  0.255  0.216
(2) Amino acid metabolism                        –      0.314  0.235  0.333  0.314  0.314  0.314  0.216
(3) Carbohydrate metabolism                             –      0.216  0.294  0.275  0.333  0.176  0.235
(4) Energy metabolism                                          –      0.294  0.275  0.216  0.235  0.176
(5) Lipid metabolism                                                  –      0.314  0.373  0.275  0.275
(6) Metabolism of complex carbohydrates                                      –      0.333  0.235  0.255
(7) Metabolism of complex lipids                                                    –      0.314  0.275
(8) Nucleotide metabolism                                                                  –      0.196
(9) Total metabolism                                                                              –

Column numbers refer to the trees as numbered in the first column. Distances range from 0 (identical topology) to 1 (maximum topological difference), see Materials and Methods.

Total metabolic tree

This tree (Figure 3) includes the enzymes present in all of the 69 metabolic pathways of the seven metabolic classes of the KEGG classification (as explained above). Of all the trees examined, this tree is topologically closest to the rRNA tree. It also shares with the tree based on energy metabolism the highest topological similarity found between any two trees (Table 2). The main features of the total metabolic tree are:

. Eukaryotic organisms are placed together in a distribution that resembles that of the classical rRNA tree. The mammals Homo sapiens and Mus musculus are placed closely together and lie together with Caenorhabditis elegans, forming a metazoan group. The plant Arabidopsis thaliana and the fungi Saccharomyces cerevisiae and Schizosaccharomyces pombe join deeper in the eukaryotic branch. Eukaryotes show a coherent metabolism that is distinct from the other domains in all the metabolic trees analyzed: they are clustered together in six of the seven metabolic classes and in the total metabolic tree, with a similar topology in all of them.

. The archaeal organisms in our set (A. pernix, Archaeoglobus fulgidus, Pyrococcus abyssi and T. acidophilum) are clustered together. Their branch joins the branch of the obligate bacterial pathogens deeper in the tree, forming an unusual cluster of prokaryotes adapted to highly specialized environments.

. All obligate pathogens (Mycoplasma pneumoniae, Treponema pallidum and Chlamydia trachomatis) are clustered together despite their divergent positions in the rRNA tree, possibly as a result of a convergence of their metabolism due to a shared lifestyle. Obligate parasites can potentially lose all of the metabolic pathways present in the host cell cytoplasm. Furthermore, they live in small isolated populations, which enhances fixation rates for deleterious mutations44,45 and limits the lateral gene transfer rate. This leads to a reductive, convergent evolution which defines a minimal set of genes for the cellular functions which the organism cannot derive from the host.

. The low-GC and high-GC Gram-positive organisms Bacillus subtilis, Staphylococcus aureus and Mycobacterium tuberculosis are grouped. However, Streptococcus pyogenes (a close relative of B. subtilis and S. aureus in the classical phylogenies and with a similar genome size) is placed distantly from the rest of the Gram-positives and branches off the branch of obligate bacterial pathogens, probably because of its limited biosynthetic ability.46 A similar distribution was found in other trees.

. Proteobacteria are grouped in three separate but related branches. The deepest branch contains the β-proteobacterium N. meningitidis and the γ-proteobacteria Xylella fastidiosa and Pseudomonas aeruginosa, all of them organisms with moderate-sized genomes. The next branch groups Haemophilus influenzae and Pasteurella multocida, close relatives that both belong to the Pasteurellales order of the γ-proteobacteria. The terminal branch contains γ-proteobacteria with large genomes: Salmonella typhimurium, Escherichia coli, Shigella flexneri and Yersinia pestis (all four members of the order Enterobacteriales) and Vibrio cholerae.

. Topological comparison of the total tree with the rRNA tree shows that this tree is the closest to the rRNA tree in topological terms (together with the tree based on nucleotide metabolism).

Figure 3. Total metabolic tree. This tree is based on the distribution of the enzymes of the seven metabolic classes considered in this work. The colored squares indicate the level of bootstrap support: green, 90–100%; orange, 70–90%; yellow, 50–70%.
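The topological comparisons with the rRNA tree quoted above can be checked directly against the first row of Table 2. The short sketch below only ranks the published distances; the dictionary and variable names are illustrative.

```python
# Symmetric distances to the rRNA tree, copied from the first row of Table 2.
distance_to_rrna = {
    "Amino acid metabolism": 0.294,
    "Carbohydrate metabolism": 0.294,
    "Energy metabolism": 0.275,
    "Lipid metabolism": 0.333,
    "Metabolism of complex carbohydrates": 0.275,
    "Metabolism of complex lipids": 0.353,
    "Nucleotide metabolism": 0.255,
    "Total metabolism": 0.216,
}

# Rank the metabolic trees from the most to the least similar to the rRNA tree.
ranked = sorted(distance_to_rrna, key=distance_to_rrna.get)
closest = ranked[0]
most_distant = ranked[-1]
closest_single_class = next(name for name in ranked if name != "Total metabolism")
print(closest, closest_single_class, most_distant, sep=" | ")
```

The ranking places the total metabolic tree first and the nucleotide metabolism tree second, with the tree for the metabolism of complex lipids furthest from the rRNA tree.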

Amino acid metabolism

Figure 4 shows a tree based on the metabolism of the more common amino acid residues. This tree shows some remarkable features that reflect lineage-independent processes of loss of metabolic capabilities:

. Eukaryotes show the most unusual distribution of all the trees, being split into two distant groups: the fungi and the plant A. thaliana are found closer to the bacterial and archaeal branches, whereas the metazoans H. sapiens, M. musculus and C. elegans are clustered closer to the branch of the obligate parasites. This clustering reflects the fact that eukaryotes belonging to the fungal clade share a higher number of enzymatic functions related to amino acid biosynthesis with bacteria than with other eukaryotes. Eukaryotes belonging to the metazoan taxon show a diminished amino acid metabolism, which relies largely on the amino acid intake from the diet and from the symbiotic relationship with gastrointestinal bacteria.

. The low-GC Gram-positive S. pyogenes, capable of synthesizing only a few amino acid residues,46 branches off the branch of the obligate pathogens. The remaining low-GC and high-GC Gram-positives are clustered together.

. Obligate bacterial pathogens are clearly placed in a single group, indicating the development of specialized strategies regarding amino acid biosynthesis and metabolization, generally by parasitic use of the amino acid production of the host.47–49 This partial lack of amino acid biosynthetic capability also explains the relation between the higher eukaryotes and obligate bacterial pathogens in this tree.

. Proteobacteria are split into two main branches. One contains the enterobacteria plus P. aeruginosa and V. cholerae. The other branch groups X. fastidiosa, N. meningitidis and the close relatives H. influenzae and P. multocida.

Carbohydrate metabolism

This KEGG metabolic class is perhaps the most arbitrary, since it contains some metabolic pathways that could be placed within other metabolic classes. For instance, the glycolysis/gluconeogenesis pathway has strong implications for amino acid metabolism, the TCA cycle pathway influences both amino acid and energy metabolism, and the pentose phosphate pathway is related to the biosynthesis of complex carbohydrates. Using the classification of pathways as it is, the main features of the resulting tree (Figure 5) are:

. The distribution of Eukaryotes, Archaea and obligate bacterial pathogens agrees with that of the total metabolic tree.

. Low-GC Gram-positives are closer together than in the total metabolic tree. S. pyogenes is paired with its relative S. aureus and they are not far from B. subtilis. The high-GC Gram-positive M. tuberculosis is surprisingly placed in an isolated branch close to the eukaryotic cluster of the tree.

Energy metabolism

This is another somewhat arbitrary part of the KEGG metabolic classification, since some of its metabolic pathways could be considered as members of other parts of the metabolism, especially oxidative phosphorylation, carbon fixation and the reductive carboxylate cycle (all of them closely related to carbohydrate metabolism). Other metabolic pathways, however, are purely energetic, such as ATP synthesis or photosynthesis. The results undoubtedly reflect the criterion chosen to categorize the metabolic pathways. Figure 6 shows the tree, whose main features are:

. The distribution of Eukaryotes and Archaea agrees with that of the total metabolic tree.

. Obligate bacterial pathogens form a clearly separated division, since most of them, to some extent, use the energy metabolism of the host as a result of reductive genomic evolution. For instance, the genus Chlamydia exploits host-derived ATP production through a suitable transport system.50,51 M. pneumoniae derives much of its energy from the degradation of host-derived lipids, although its needs for an internal source of energy are minimal.45 T. pallidum also has limited energetic capabilities.48

. Although proteobacteria show a distribution reminiscent of that of the total metabolic tree, Gram-positives have a taxon-independent distribution (interspersed with proteobacteria) that is only marginally supported by the bootstrap analysis. The Gram-positive S. pyogenes is placed not far from the cluster of the obligate pathogens.

Lipid metabolism

This tree is one of the most different from the rRNA tree in topological terms (Table 2). However, the bootstrap support for the topology of this tree is low, especially in the terminal branches of the bacterial domain. The main features of this tree (Figure 7) are:

. Eukaryotes and Archaea have a distribution similar to that of the total metabolic tree.

. The distribution of obligate bacterial pathogens in this tree neither shows the metabolic convergence present in other metabolic trees nor reflects the phylogenetic signal of the rRNA tree. T. pallidum is placed deep within a branch of proteobacteria with moderate-sized genomes. M. pneumoniae and C. trachomatis are placed halfway between proteobacteria and the archaeal group.

. The Gram-positives B. subtilis and M. tuberculosis are found within the Gram-negative proteobacterial cluster. The position of the low-GC Gram-positives S. aureus and S. pyogenes is noteworthy: they are widely separated from their relative B. subtilis and placed between the archaeal and eukaryotic branches.

. The distribution of proteobacteria basically reflects the separation between enterobacterial and non-enterobacterial organisms.

Figure 4. Tree based on the distribution of the enzymes related to the amino acid metabolism. The colored squares indicate the level of bootstrap support as in Figure 3.

Nucleotide metabolism

Of all the trees built on the individual metabolicclasses of the KEGG classification, this tree(Figure 8) has the closest topology to the rRNAtree. Only the total metabolic tree has a closer top-ology to the rRNA tree (Table 2). The main featuresof this tree are:

. Eukaryotes have a distribution similar to thatof the total metabolic tree. Proteobacteriaalso show a similar branching order to thatof the total metabolic tree, in spite of the

presence of Low-GC Gram-positives branch-ing off the g-proteobacterial branch andhigh-GC Gram-positive M. tuberculosis joiningthe branch of proteobacteria with moderategenome sizes.

. Archaea A. pernix is unusually placed half-way between obligate bacterial pathogensand the remaining archaeas.

. Obligate bacterial pathogens are grouped together. Obligate parasites show deficiencies in purine and pyrimidine biosynthesis with different degrees of severity.46,47,49,50

Metabolism of complex carbohydrates

The features of this metabolic tree (Figure 9) are:

. Eukaryotes and Archaea have a distribution similar to that of the total metabolic tree.

. Obligate pathogenic lifestyle does not seem to affect this part of the metabolism, since obligate bacterial pathogens are unusually unrelated in this tree. M. pneumoniae and C. trachomatis are consistently placed in the archaeal branch, indicating that these organisms have a unique composition of this part of the metabolism compared to other bacteria. T. pallidum is isolated between the archaeal and bacterial branches.

. Low-GC Gram-positives B. subtilis, S. aureus and S. pyogenes share a branch, agreeing with the rRNA tree but not with the total metabolic tree. Together with M. tuberculosis, placed nearby, they form a Gram-positive group.

Figure 5. Tree based on the distribution of the enzymes related to the carbohydrate metabolism. The colored squares indicate the level of bootstrap support as in Figure 3.

Figure 6. Tree based on the distribution of the enzymes related to the energy metabolism. The colored squares indicate the level of bootstrap support as in Figure 3.

Figure 7. Tree based on the distribution of the enzymes related to the lipid metabolism. The colored squares indicate the level of bootstrap support as in Figure 3.

Figure 8. Tree based on the distribution of the enzymes related to the nucleotide metabolism. The colored squares indicate the level of bootstrap support as in Figure 3.

Metabolism of complex lipids

This tree has the most different topology with respect to the rRNA tree (Table 2). Bootstrap values for the archaeal and bacterial domains are generally low and the bacterial domain is inconsistent both with the rRNA phylogeny and with the total metabolic tree. The main features of this metabolic tree (Figure 10) are:

. Eukaryotes, Archaea and obligate bacterial pathogens have a distribution similar to that of the total metabolic tree.

. All proteobacteria are together but the branching order is quite inconsistent. Enterobacteria are grouped with the highest bootstrap support, but Gram-positives B. subtilis and M. tuberculosis pair with proteobacteria X. fastidiosa and P. aeruginosa, respectively. S. pyogenes is closer to obligate bacterial pathogens than to B. subtilis or S. aureus.

Comparison to parsimony trees

We built parsimonious trees for all the metabolic classes. As an example, the parsimonious tree based on all the enzymatic functions (the parsimonious equivalent to the total metabolic tree in Figure 3) is shown in Figure 11. All the parsimonious trees are available†. Parsimonious trees showed a remarkable similarity to the neighbor-joining trees. The three-domain system is broadly maintained but less so than in neighbor-joining trees. In particular, eukaryotes form a clearly consistent group and the Archaea domain showed a less consistent branching order than in neighbor-joining trees. The atypical clustering of eukaryotes in the neighbor-joining amino acid metabolism tree (Figure 4) is reflected in the parsimonious tree based on the amino acid metabolism. The grouping of obligate bacterial pathogens agreed with that of the neighbor-joining trees. Proteobacteria also showed roughly the same pattern observed in the trees constructed by the neighbor-joining method but Gram-positive bacteria showed less consistency as an independent group.

Comparison to gene content trees

A gene content tree constructed using the program SHOT is shown in Figure 12. The tree is based on the similarity between genomes at the level of shared orthologous genes. The resulting tree is broadly similar to the classical phylogeny. Obligate bacterial pathogens are not placed together as they were in our metabolic trees, probably because SHOT normalizes genome sizes and thereby avoids the effect of genome size on the clustering. The organization of the terminal proteobacterial branches disagrees both with the rRNA tree and with the organization of the terminal branches in our trees (despite the obvious clustering of close relatives H. influenzae and P. multocida). Gram-positives are placed according to the classical phylogeny, including S. pyogenes, which was placed closer to obligate bacterial pathogens in our trees.

Figure 9. Tree based on the distribution of the enzymes related to the metabolism of complex carbohydrates. The colored squares indicate the level of bootstrap support as in Figure 3.

Figure 10. Tree based on the distribution of the enzymes related to the metabolism of complex lipids. The colored squares indicate the level of bootstrap support as in Figure 3.

Figure 11. Total metabolic tree constructed using parsimony. This tree is based on the distribution of the enzymes of the seven metabolic classes considered in this work.

†http://bioinf.uab.es/phenetictrees/index.html

A similar study based on the comparison of shared gene products from complete proteomes21 also yielded trees similar to the rRNA tree. However, other trees did not share such a similarity with the classical phylogeny. Trees based on the presence/absence of orthologous genes built by Wolf et al.18 clustered obligate bacterial pathogens and grouped proteobacteria with moderate-sized genomes (H. influenzae, N. meningitidis and P. multocida) in a fashion found in most of our metabolic trees, probably because of the lack of correction for gene sizes. Furthermore, archaeal organisms are somewhat closer in the mentioned trees18 to bacterial pathogens than to free-living bacteria, agreeing with our results.

Discussion

Analysis of metabolic pathways

We analyzed the distribution of enzymatic functions and built a series of trees based on the occurrence of those functions throughout the metabolome of a set of organisms, using seven of the metabolic classes present in the KEGG database (see Selecting the metabolic pathways, metabolic classes and enzymes). The topology of the tree for each metabolic class reflects the adaptation of the organisms to their niche by means of mutation, gene loss and horizontal gene acquisition. The relationships between the organisms in each tree can be interpreted in evolutionary terms for the particular part of the metabolism that the tree represents, showing the evolutionary relationships between species living in similar or different niches.

Because we are working with enzymatic functions, and not with protein sequences, we avoid two sources of the discrepancies often found when comparing phylogenetic trees: gene duplications and paralogy, which contribute in part to the complexity of the trees based on sequence homology, and the changes in mutation rates of sequences when comparing species separated by large evolutionary periods.52 Thus, our trees are phenetic trees which hierarchically classify the metabolic capabilities of the organisms but do not establish evolutionary descent. On the other hand, our results are strongly dependent on the compartmentalization of the metabolism into pathways. The criteria chosen to define where one metabolic pathway ends and another begins can be arbitrary (as discussed above). Furthermore, the non-overlapping clustering of individual metabolic pathways into higher-level metabolic classes adds another dimension to this problem.

Set of organisms

Although the set of organisms used covers a wide range of lineages and metabolic regimes (see Table 1), there is a problem that is, so far, unavoidable: because no fully sequenced organism has had all the enzymatic functions encoded in its genome experimentally determined, our data are a partial view of the metabolome. Furthermore, there is a bias towards preferentially sequencing genomes from non-free-living bacterial pathogens, which is reflected in our set of organisms. However, using a set of organisms rich in pathogenic bacteria provides interesting results when comparing metabolomes because of their highly divergent and specialized lifestyle. Some approaches to genome-tree building found that the presence of parasitic organisms distorted the topology of the trees.12,21,53 This was not the case for our metabolic trees, where obligate pathogens usually formed a homogeneous cluster and their removal from the set did not substantially change the distribution of other clades (data not shown). Two extra archaeal organisms were included in the set in order to reflect the diversity of the Archaea domain.

Conclusion

The main results produced by our analysis of the metabolic trees are:

. The total metabolic tree, which includes all the enzymes in the 69 metabolic pathways considered, is broadly similar to the rRNA tree in terms of the placement of organisms into the three domains. Although the distribution of the organisms inside the eukaryotic and archaeal clusters agrees with their rRNA-based counterparts, the topology of the branches inside the bacterial division shows some differences.

. The tree based on the nucleotide metabolism agreed best in terms of topology with the rRNA tree and was one of the most similar to the total metabolic tree. This might be the consequence of nucleotide metabolism being one of the more essential parts of metabolism. The tree with the best topological agreement with the total metabolic tree was the energy metabolism tree. The trees with the poorest topological agreement with the rRNA tree were those based on lipid metabolism and complex lipid metabolism.

Figure 12. Gene content tree constructed using the program SHOT.42 The genome size was considered to be the number of coding open reading frames (ORFs) with at least one homologue in the other completed genomes in the database of the program. Four of the organisms in our set (S. typhimurium, S. flexneri, Y. pestis and M. musculus) were not present in the database of the program and could not be included in the tree.

. Eukaryotes were found to be clustered together (and clearly separated from Bacteria and Archaea) in all trees but the amino acid metabolism tree. This location of eukaryotes in the metabolic trees is similar to that found in the classical rRNA tree, showing that the clear eukaryotic divergence from the other domains is observed both in the rRNA sequence and in the evolution of the metabolism.

. In the total metabolic tree and in five of the seven other metabolic trees, Archaea and obligate bacterial pathogens are clearly sister branches. In the metabolism of complex carbohydrates tree, archaeal organisms and obligate bacterial pathogens are intermixed. This is probably because their characteristic lifestyles make them metabolically distant from both eukaryotes and free-living bacteria. Thus, the location of the archaeal organisms separates them from the free-living bacteria as well as from eukaryotes. Indeed, the relationship of Archaea with the rest of prokaryotes and even the true nature of Archaea as a meaningful domain have been debated according to ecological and molecular data.54–59 A closer inspection of the different trees reveals that archaeal organisms are sometimes closer to eukaryotes than to bacteria but in other trees are metabolically closer to free-living bacteria than to eukaryotes.

. Three obligate pathogens (M. pneumoniae, T. pallidum and C. trachomatis) were included in our analysis. Although these organisms are distant from one another in the rRNA trees, in five of the seven metabolic trees and in the total metabolic tree they are clearly grouped. This shows that adaptation to obligate pathogenic lifestyle has been the cause of a similar type of reduction of the metabolism in these obligate bacterial pathogens. Indeed, the loss of genetic material (and, thus, of metabolic capabilities) of these organisms is a classical example of convergent evolution.45

. Proteobacteria are consistently split into three different but usually related groups in all the trees, including the total metabolic tree. One group contains enterobacteria and V. cholerae, another group contains the two Pasteurellales in our set of organisms and the third group clusters P. aeruginosa, N. meningitidis and X. fastidiosa, three proteobacteria from diverse proteobacterial orders but with moderate-sized genomes.

. Gram-positives show an overall inconsistent distribution. S. pyogenes, in particular, is found closer to the obligate bacterial pathogens than to close relatives B. subtilis and S. aureus in four of the seven metabolic trees and in the total metabolic tree.

To conclude, metabolic trees represent an integrative approach for the comparison of the evolution of metabolism. This high-level view of metabolism can identify relationships between organisms, which in turn can assist in suggesting required metabolic capabilities whose presence could be searched for in uncharacterized gene products. Relationships between lineages at a metabolic level are strongly related to ecological factors and should help to find new relationships in the tree of life.

Materials and Methods

Protein databases

We used SWISS-PROT version 41, the TrEMBL version released on 24th June 2003 and LIGAND database release 27.0.

rRNA databases

Sequences of the rRNA genes for all the organisms in our dataset were retrieved from the European Ribosomal RNA Database†. Sequences were aligned with the CLUSTALW program (v. 1.83).

Selecting the metabolic pathways, metabolic classes and enzymes

Metabolic pathways were taken from the KEGG database†. Enzymatic functions for each metabolic pathway and genes encoding the enzymes were retrieved by automated text parsing of the LIGAND database‡. Specifically, the genes were retrieved from the GENES section of the enzyme file distributed with the LIGAND database, which, in turn, is a subset of the GENES database‡. GENES is a secondary database built semi-automatically, mainly from three databases: GenBank, NCBI RefSeq and the EMBL database. The information is then subject to internal re-annotation. For our set of 27 organisms, there were 1051 different enzymatic functions (EC numbers) in the release of the database used. For the same set of organisms, SWISS-PROT contains 1415 enzymatic functions and TrEMBL contains 1501. The difference arises either because some enzymatic functions are not present in the LIGAND database or because they are not assigned to any KEGG metabolic pathway. We are aware of the current bias in the enzymatic annotation: (a) the number of annotated genes in public databases for a given organism usually differs;60 and (b) because protein function is related to amino acid sequence, a large number of proteins in the databases are assigned a function by automated means. Some authors have raised concerns about the use of available annotation information from genome sequencing projects,61–63 arguing that the majority of the protein functions present in the databases have been inferred from sequence similarities, although sequence similarity does not necessarily imply the same function (the reverse is also true). Furthermore, genes with no assigned function account for 20–60% of the total number of genes in most genomes.64 On the other hand, including postulated protein functions in our dataset allowed us to study a more reliable picture of the metabolism, thus avoiding the bias caused by using a partial representation of the enzymatic pool. The high degree of integration between the GENES database and the other databases used here allowed a consistent assignment of the enzymatic functions to the individual metabolic pathways.

†http://oberon.fvms.ugent.be:8080/rRNA/index.html
‡http://www.genome.ad.jp/kegg/kegg2.html#genes
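The retrieval of enzymatic functions and their conversion into presence/absence profiles can be pictured with a minimal sketch. It assumes a simplified, hypothetical record layout modelled on the GENES section of the enzyme file (a 12-column field key, an `ENTRY` line carrying the EC number, organism codes followed by a colon, and `///` between records); the function names, the toy records with invented gene identifiers, and the mismatch-fraction distance are illustrative assumptions and not the pipeline actually used.

```python
from collections import defaultdict


def parse_enzyme_records(text):
    """Map each EC number to the organism codes listed in its GENES section.

    Assumed (hypothetical) layout: field key in columns 1-12, 'ENTRY' lines of
    the form 'EC x.x.x.x', GENES lines such as 'ECO: b0001', '///' as separator.
    """
    ec_to_orgs = defaultdict(set)
    ec, in_genes = None, False
    for line in text.splitlines():
        if line.startswith("///"):          # end of record
            ec, in_genes = None, False
            continue
        key, rest = line[:12].strip(), line[12:].strip()
        if key == "ENTRY":
            ec = rest.split()[1]            # 'EC 1.1.1.1' -> '1.1.1.1'
            in_genes = False
        elif key:                           # any other named field
            in_genes = key == "GENES"
        # Continuation lines have an empty key and inherit the current field.
        if in_genes and ec and ":" in rest:
            ec_to_orgs[ec].add(rest.split(":", 1)[0].strip())
    return dict(ec_to_orgs)


def build_profiles(ec_to_orgs, organisms):
    """Return the sorted EC list and one 0/1 tuple per organism."""
    ecs = sorted(ec_to_orgs)
    profiles = {org: tuple(int(org in ec_to_orgs[ec]) for ec in ecs)
                for org in organisms}
    return ecs, profiles


def profile_distance(p, q):
    """Fraction of enzymatic functions present in only one organism (assumed metric)."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def _field(key, rest):
    return f"{key:<12}{rest}"


# Toy records with invented gene identifiers, for illustration only.
TOY_RECORDS = "\n".join([
    _field("ENTRY", "EC 1.1.1.1"), _field("GENES", "ECO: b0001"), _field("", "HSA: 101"), "///",
    _field("ENTRY", "EC 2.7.1.1"), _field("GENES", "ECO: b0002"), _field("", "MPN: m001"), "///",
    _field("ENTRY", "EC 4.1.1.1"), _field("GENES", "HSA: 102"), "///",
])
```

Calling `build_profiles(parse_enzyme_records(TOY_RECORDS), ["ECO", "HSA", "MPN"])` yields one binary profile per organism; applying `profile_distance` to every pair fills a distance matrix of the kind a neighbor-joining program takes as input.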

Building the metabolic trees

The NEIGHBOR program of the PHYLIP software suite uses the neighbor-joining algorithm.65 The MIX program from the same package was used to construct the parsimonious trees under the assumptions of equal probability of change between states and unknown ancestral states.66,67 Random input order of the species was used. The most parsimonious trees were then merged using the CONSENSE program with default settings. The PHYLODENDRON tree-drawing program by D.G. Gilbert is available through a web-based interface† and can be downloaded‡.
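The parsimony criterion described above (binary characters, changes in either direction equally likely, ancestral states unknown) can be illustrated by counting the minimum number of state changes a fixed tree requires. The sketch below uses the standard Fitch counting procedure for this purpose; it is not the MIX program, it only scores a given topology rather than searching for the best one, and the nested-tuple tree encoding and function names are assumptions made for the example.

```python
def fitch_changes(tree, states):
    """Minimum number of 0/1 changes for one character on a fixed binary tree.

    tree   -- nested 2-tuples with species names at the tips, e.g. (("A", "B"), "C")
    states -- {species: 0 or 1}, absence/presence of one enzymatic function
    """
    def visit(node):
        if isinstance(node, str):                 # tip: observed state, no change
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:                                # children can agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, profiles):
    """Total number of changes over all characters of the presence/absence profiles."""
    n_chars = len(next(iter(profiles.values())))
    return sum(
        fitch_changes(tree, {sp: prof[i] for sp, prof in profiles.items()})
        for i in range(n_chars)
    )
```

With four hypothetical organisms whose profiles are A = (1, 1, 0), B = (1, 0, 0), C = (0, 0, 1) and D = (0, 1, 1), the topology that pairs A with B and C with D needs four changes, fewer than either alternative pairing, and would therefore be preferred under this criterion.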

Bootstrap analysis with 1000 replicates was performed in order to test the resulting trees. The replicates were made with the SEQBOOT program of the PHYLIP software suite (version 3.5). SEQBOOT allows the user to generate multiple data sets that are resampled versions of an input data set, which in our study had the Discrete Morphological Character data type. The bootstrap tree was then constructed with the CONSENSE program from the same package. In our case, the CONSENSE program carries out an extended majority rule consensus, part of the M_l methods.68 The number at each node of the resulting tree indicates how many times the group which consists of the species to the right of (descended from) the node occurred. Thus, the distances in the drawn tree show the degree of conservation of the distribution of the species.
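The resampling idea behind the bootstrap can be sketched as follows: each replicate draws, with replacement, as many characters (enzymatic functions) as the original data set contains, and the support of a group is the number of replicate trees in which it appears. This is a minimal illustration and not the SEQBOOT or CONSENSE programs; the function names, the fixed random seed and the representation of a replicate tree as a set of species groups are assumptions.

```python
import random


def bootstrap_replicates(profiles, n_replicates, seed=0):
    """Resample the characters (columns) of binary profiles with replacement.

    profiles -- {species: tuple of 0/1}, all tuples of equal length
    Returns a list of n_replicates dictionaries with the same shape as the input.
    """
    rng = random.Random(seed)
    species = list(profiles)
    n_chars = len(profiles[species[0]])
    replicates = []
    for _ in range(n_replicates):
        # The same column indices are used for every species, so that each
        # resampled character keeps its presence/absence pattern intact.
        cols = [rng.randrange(n_chars) for _ in range(n_chars)]
        replicates.append({sp: tuple(profiles[sp][c] for c in cols) for sp in species})
    return replicates


def group_count(replicate_groups, group):
    """Number of replicate trees (each a set of frozensets) that contain the group."""
    group = frozenset(group)
    return sum(group in groups for groups in replicate_groups)
```

Each replicate would then be converted into a tree by the same procedure as the original data, and `group_count` applied to the groups found in those trees gives the value reported at a node.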

Tree topologies were compared with the TREEDIST program of the PHYLIP package, which computes the Symmetric Distance described by Robinson and Foulds.43 It should be noted that this algorithm considers only tree topologies, not branch distances. Considering a partition as a branch of the tree dividing the set of organisms into two groups (those connected to one end of the branch and those connected to the other), the distance between two trees, their symmetric distance, is a count of how many partitions there are, among the two trees, that are on one tree and not on the other. When the two trees are fully resolved bifurcating trees, their symmetric distance must be an even number; it ranges from 0 to twice the number of internal branches, which for n species is 4n − 6. Distances in Table 2 are shown as a fraction of that maximum possible distance and range from 0 to 1.
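The partition counting described above can be made concrete with a short sketch. It is not the TREEDIST program: trees are encoded as nested tuples of species names (an assumption of the example), only internal branches are considered, and the normalising maximum is left as an explicit argument supplied by the caller rather than derived inside the code.

```python
def _leaves(node):
    """List the species names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in _leaves(child)]


def bipartitions(tree):
    """Partitions induced by the internal branches of an unrooted tree."""
    all_leaves = frozenset(_leaves(tree))
    parts = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only branches with at least two species on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            parts.add(frozenset([below, all_leaves - below]))
        return below

    visit(tree)
    return parts


def symmetric_distance(tree_a, tree_b):
    """Count the partitions present on one tree and not on the other."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def fractional_distance(tree_a, tree_b, maximum):
    """Symmetric distance as a fraction of a caller-supplied maximum."""
    return symmetric_distance(tree_a, tree_b) / maximum
```

For example, `(("A", "B"), ("C", ("D", "E")))` and `(("A", "C"), ("B", ("D", "E")))` share the partition separating D and E from the rest but differ in one partition each, giving a symmetric distance of 2; two different drawings of the same topology give 0.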

The gene content tree was constructed using the SHOT program§.42 Default parameters were used. We used genes with homologs in sequenced species as the genome size definition, which means that the genome size was considered to be the number of coding open reading frames (ORFs) with at least one homologue in the other completed genomes in the database of the program. Four of the organisms in our set (S. typhimurium, S. flexneri, Y. pestis and M. musculus) were not present in the database of the program and could not be included in the tree.

Acknowledgements

We thank Dr Baldomero Oliva, Dr Florencio Pazos and Dr David Leak for helpful discussions. D.A. thanks the Integrated Approaches for Functional Genomics program of the European Science Foundation and the Red Nacional de Bioinformática (Spain) for financial support. E.Q. acknowledges the grant MCYT BIO2001-2064 from the CICYT (Ministerio de Ciencia y Tecnología, Spain). F.X.A. acknowledges the grant MYCT BIO2001-2046 from the CICYT (Ministerio de Ciencia y Tecnología, Spain).

References

1. Woese, C. R. (1994). There must be a prokaryote somewhere: microbiology’s search for itself. Microbiol. Rev. 58, 1–9.

2. Woese, C. R., Kandler, O. & Wheelis, M. L. (1990). Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl Acad. Sci. USA, 87, 4576–4579.

3. Brown, J. R. & Doolittle, W. F. (1997). Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61, 456–502.

4. Gupta, R. S. (1998). Protein phylogenies and signature sequences: a reappraisal of evolutionary relationships among Archaea, eubacteria, and eukaryotes. Microbiol. Mol. Biol. Rev. 62, 1435–1491.

5. Gupta, R. S. (1998). Life’s third domain (Archaea): an established fact or an endangered paradigm? Theor. Popul. Biol. 54, 91–104.

6. Mayr, E. (1998). Two empires or three? Proc. Natl Acad. Sci. USA, 95, 9720–9723.

7. Koonin, E. V., Mushegian, A. R., Galperin, M. Y. & Walker, D. R. (1997). Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 25, 619–637.

8. Montague, M. G. & Hutchison, C. A., III (2000). Gene content phylogeny of herpesviruses. Proc. Natl Acad. Sci. USA, 97, 5334–5339.

9. Natale, D. A., Galperin, M. Y., Tatusov, R. L. & Koonin, E. V. (2000). Using the COG database to improve gene recognition in complete genomes. Genetica, 108, 9–17.

10. Bansal, A. K. & Meyer, T. E. (2002). Evolutionary analysis by whole-genome comparisons. J. Bacteriol. 184, 2260–2272.

11. Wolf, Y. I., Rogozin, I. B., Grishin, N. V. & Koonin, E. V. (2002). Genome trees and the tree of life. Trends Genet. 18, 472–479.

12. Fitz-Gibbon, S. T. & House, C. H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucl. Acids Res. 27, 4218–4222.

†http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
‡http://iubio.bio.indiana.edu/soft/molbio/java/apps/trees/
§http://www.bork.embl-heidelberg.de/SHOT/

13. Osawa, S., Jukes, T. H., Watanabe, K. & Muto, A. (1992). Recent evidence for evolution of the genetic code. Microbiol. Rev. 56, 229–264.

14. Jukes, T. H. & Osawa, S. (1993). Evolutionary changes in the genetic code. Comp. Biochem. Physiol. B, 106, 489–494.

15. Lin, J. & Gerstein, M. (2000). Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818.

16. Tekaia, F., Yeramian, E. & Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Wolf, Y. I., Brenner, S. E., Bash, P. A. & Koonin, E. V. (1999). Distribution of protein folds in the three superkingdoms of life. Genome Res. 9, 17–26.

18. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., Tatusov, R. L. & Koonin, E. V. (2001). Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1, 8.

19. Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F. & Cedergren, R. (1992). Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl Acad. Sci. USA, 89, 6575–6579.

20. Boore, J. L. & Brown, W. M. (1998). Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr. Opin. Genet. Dev. 8, 668–674.

21. Snel, B., Bork, P. & Huynen, M. A. (1999). Genome phylogeny based on gene content. Nature Genet. 21, 108–110.

22. Tekaia, F., Lazcano, A. & Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res. 9, 550–557.

23. Woese, C. R. & Fox, G. E. (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA, 74, 5088–5090.

24. Ling, L., Wang, J., Cui, Y., Li, W. & Chen, R. (2002). Proteome-wide analysis of protein function composition reveals the clustering and phylogenetic properties of organisms. Mol. Phylogenet. Evol. 25, 101–111.

25. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabasi, A. L. (2000). The large-scale organization of metabolic networks. Nature, 407, 651–654.

26. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabasi, A. L. (2002). Hierarchical organization of modularity in metabolic networks. Science, 297, 1551–1555.

27. Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S. & Gilles, E. D. (2002). Metabolic network structure determines key aspects of functionality and regulation. Nature, 420, 190–193.

28. Fiehn, O. & Weckwerth, W. (2003). Deciphering metabolic networks. Eur. J. Biochem. 270, 579–588.

29. Brazhnik, P., de la Fuente, A. & Mendes, P. (2002). Gene networks: how to put the function in genomics. Trends Biotechnol. 20, 467–472.

30. Fiehn, O. (2002). Metabolomics—the link between genotypes and phenotypes. Plant Mol. Biol. 48, 155–171.

31. Forster, J., Gombert, A. K. & Nielsen, J. (2002). A functional genomics approach using metabolomics and in silico pathway analysis. Biotechnol. Bioeng. 79, 703–712.

32. Osterman, A. & Overbeek, R. (2003). Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol. 7, 238–251.

33. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. (2002). The KEGG databases at GenomeNet. Nucl. Acids Res. 30, 42–46.

34. Goto, S., Okuno, Y., Hattori, M., Nishioka, T. & Kanehisa, M. (2002). LIGAND: database of chemical compounds and reactions in biological pathways. Nucl. Acids Res. 30, 402–404.

35. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.

36. Felsenstein, J. (1989). PHYLIP—phylogeny inference package (version 3.2). Cladistics, 5, 164–166.

37. Felsenstein, J. (1973). Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool. 22, 240–249.

38. Felsenstein, J. (1983). Parsimony in systematics: biological and statistical issues. Annu. Rev. Ecol. Syst. 14, 313–333.

39. Felsenstein, J. (1988). Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet. 22, 521–565.

40. Wuyts, J., Van de Peer, Y., Winkelmans, T. & De Wachter, R. (2002). The European database on small subunit ribosomal RNA. Nucl. Acids Res. 30, 183–185.

41. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673–4680.

42. Korbel, J. O., Snel, B., Huynen, M. A. & Bork, P. (2002). SHOT: a web server for the construction of genome phylogenies. Trends Genet. 18, 158–162.

43. Robinson, D. F. & Foulds, L. R. (1981). Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.

44. Andersson, S. G. (1998). Bioenergetics of the obligate intracellular parasite Rickettsia prowazekii. Biochim. Biophys. Acta, 1365, 105–111.

45. Andersson, S. G. & Kurland, C. G. (1998). Reductive evolution of resident genomes. Trends Microbiol. 6, 263–268.

46. Ferretti, J. J., McShan, W. M., Ajdic, D., Savic, D. J., Savic, G., Lyon, K. et al. (2001). Complete genome sequence of an M1 strain of Streptococcus pyogenes. Proc. Natl Acad. Sci. USA, 98, 4658–4663.

47. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B. C. & Herrmann, R. (1996). Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucl. Acids Res. 24, 4420–4449.

48. Fraser, C. M., Norris, S. J., Weinstock, G. M., White, O., Sutton, G. G., Dodson, R. et al. (1998). Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science, 281, 375–388.

49. Meseguer, M. A., Alvarez, A., Rejas, M. T., Sanchez, C., Perez-Diaz, J. C. & Baquero, F. (2003). Mycoplasma pneumoniae: a reduced-genome intracellular bacterial pathogen. Infect. Genet. Evol. 3, 47–55.

50. Zomorodipour, A. & Andersson, S. G. (1999). Obligate intracellular parasites: Rickettsia prowazekii and Chlamydia trachomatis. FEBS Letters, 452, 11–15.

51. Davis, R. W. & Stephens, R. S. (1999). Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nature Genet. 21, 385–389.

52. Castresana, J. (2001). Comparative genomics and bioenergetics. Biochim. Biophys. Acta, 1506, 147–162.

53. Wolf, Y. I., Rogozin, I. B., Kondrashov, A. S. & Koonin, E. V. (2001). Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 11, 356–372.

54. Gupta, R. S. (1998). What are Archaea: life’s third domain or monoderm prokaryotes related to Gram-positive bacteria? A new proposal for the classification of prokaryotic organisms. Mol. Microbiol. 29, 695–707.

55. Lopez-Garcia, P. & Moreira, D. (1999). Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. 24, 88–93.

56. Penny, D. & Poole, A. (1999). The nature of the last universal common ancestor. Curr. Opin. Genet. Dev. 9, 672–677.

57. Glansdorff, N. (2000). About the last common ancestor, the universal life-tree and lateral gene transfer: a reappraisal. Mol. Microbiol. 38, 177–185.

58. Woese, C. R. (2000). Interpreting the universal phylogenetic tree. Proc. Natl Acad. Sci. USA, 97, 8392–8396.

59. Cavalier-Smith, T. (2002). The neomuran origin of Archaea, the negibacterial root of the universal tree and bacterial megaclassification. Int. J. Syst. Evol. Microbiol. 52, 7–76.

60. Ouzounis, A. & Karp, P. D. (2002). The past, present and future of genome-wide re-annotation. Genome Biol. 3, 2.

61. Karp, P. D., Paley, S. & Zhu, J. (2001). Database verification studies of SWISS-PROT and GenBank. Bioinformatics, 17, 526–532.

62. Eisen, J. A. (1998). Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8, 163–167.

63. Sicheritz-Ponten, T. & Andersson, S. G. (2001). A phylogenomic approach to microbial evolution. Nucl. Acids Res. 29, 545–552.

64. Osterman, A. & Overbeek, R. (2003). Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol. 7, 238–251.

65. Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.

66. Farris, J. S. (1970). Methods for computing Wagner trees. Syst. Zool. 19, 83–92.

67. Swofford, D. L. & Maddison, W. P. (1987). Reconstructing ancestral character states under Wagner parsimony. Math. Biosci. 87, 199–229.

68. Margush, T. & McMorris, F. R. (1981). Consensus n-trees. Bull. Math. Biol. 43, 239–244.

69. Klenk, H. P., Clayton, R. A., Tomb, J. F., White, O., Nelson, K. E., Ketchum, K. A. et al. (1997). The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature, 390, 364–370.

70. Kawarabayasi, Y., Hino, Y., Horikawa, H., Yamazaki, S., Haikawa, Y., Jin-no, K. et al. (1999). Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res. 6, 83–101. See also pp. 145–152.

71. Tabata, S., Kaneko, T., Nakamura, Y., Kotani, H., Kato, T., Asamizu, E. et al. (2000). Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature, 408, 823–826.

72. Salanoubat, M., Lemcke, K., Rieger, M., Ansorge, W., Unseld, M., Fartmann, B. et al. (2000). Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature, 408, 820–822.

73. Kunst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni, G., Azevedo, V. et al. (1997). The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature, 390, 249–256.

74. Consortium TCeS (1998). Genome sequence of thenematode C. elegans: a platform for investigatingbiology. Science, 282, 2012–2018.

75. Stephens, R. S., Kalman, S., Lammel, C., Fan, J.,Marathe, R., Aravind, L. et al. (1998). Genomesequence of an obligate intracellular pathogen ofhumans: Chlamydia trachomatis. Science, 282, 754–759.

76. Blattner, F. R., Plunkett, G., III, Bloch, C. A., Perna,N. T., Burland, V., Riley, M. et al. (1997). The completegenome sequence of Escherichia coli K-12. Science, 277,1453–1474.

77. Fleischmann, R. D., Adams, M. D., White, O.,Clayton, R. A., Kirkness, E. F., Kerlavage, A. R. et al.(1995). Whole-genome random sequencing andassembly of Haemophilus influenzae Rd. Science, 269,496–512.

78. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C.,Zody, M. C., Baldwin, J. et al. (2001). Initial sequenc-ing and analysis of the human genome. Nature, 409,860–921.

79. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W.,Mural, R. J. et al. (2001). The sequence of the humangenome. Science, 291, 1304–1351.

80. Marra, M., Hillier, L., Kucaba, T., Allen, M., Barstead,R., Beck, C. et al. (1999). An encyclopedia of mousegenes. Nature Genet. 21, 191–194.

81. Carninci, P., Waki, K., Shiraki, T., Konno, H., Shibata, K., Itoh, M. et al. (2003). Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res. 13, 1273–1289.

82. Parkhill, J., Achtman, M., James, K. D., Bentley, S. D., Churcher, C., Klee, S. R. et al. (2000). Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature, 404, 502–506.

83. Tettelin, H., Saunders, N. J., Heidelberg, J., Jeffries, A. C., Nelson, K. E., Eisen, J. A. et al. (2000). Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science, 287, 1809–1815.

84. Cohen, G. N., Barbe, V., Flament, D., Galperin, M., Heilig, R., Lecompte, O. et al. (2003). An integrated analysis of the genome of the hyperthermophilic archaeon Pyrococcus abyssi. Mol. Microbiol. 47, 1495–1512.

85. Stover, C. K., Pham, X. Q., Erwin, A. L., Mizoguchi, S. D., Warrener, P., Hickey, M. J. et al. (2000). Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature, 406, 959–964.

86. May, B. J., Zhang, Q., Li, L. L., Paustian, M. L., Whittam, T. S. & Kapur, V. (2001). Complete genomic sequence of Pasteurella multocida, Pm70. Proc. Natl Acad. Sci. USA, 98, 3460–3465.

87. Kuroda, M., Ohta, T., Uchiyama, I., Baba, T., Yuzawa, H., Kobayashi, I. et al. (2001). Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet, 357, 1225–1240.

88. Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H. et al. (1996). Life with 6000 genes. Science, 274, 563–567.

89. Jin, Q., Yuan, Z., Xu, J., Wang, Y., Shen, Y., Lu, W. et al. (2002). Genome sequence of Shigella flexneri 2a: insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157. Nucl. Acids Res. 30, 4432–4441.

90. Wei, J., Goldberg, M. B., Burland, V., Venkatesan, M. M., Deng, W., Fournier, G. et al. (2003). Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T. Infect. Immunol. 71, 2775–2786.

91. Wood, V., Gwilliam, R., Rajandream, M. A., Lyne, M., Lyne, R., Stewart, A. et al. (2002). The genome sequence of Schizosaccharomyces pombe. Nature, 415, 871–880.

92. Ferretti, J. J., McShan, W. M., Ajdic, D., Savic, D. J., Savic, G., Lyon, K. et al. (2001). Complete genome sequence of an M1 strain of Streptococcus pyogenes. Proc. Natl Acad. Sci. USA, 98, 4658–4663.

93. Nakagawa, I., Kurokawa, K., Yamashita, A., Nakata, M., Tomiyasu, Y., Okahashi, N. et al. (2003). Genome sequence of an M3 strain of Streptococcus pyogenes reveals a large-scale genomic rearrangement in invasive strains and new insights into phage evolution. Genome Res. 13, 1042–1055.

94. McClelland, M., Sanderson, K. E., Spieth, J., Clifton, S. W., Latreille, P., Courtney, L. et al. (2001). Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature, 413, 852–856.

95. Ruepp, A., Graml, W., Santos-Martinez, M. L., Koretke, K. K., Volker, C., Mewes, H. W. et al. (2000). The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature, 407, 508–513.

96. Heidelberg, J. F., Eisen, J. A., Nelson, W. C., Clayton, R. A., Gwinn, M. L., Dodson, R. J. et al. (2000). DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature, 406, 477–483.

97. Simpson, A. J., Reinach, F. C., Arruda, P., Abreu, F. A., Acencio, M., Alvarenga, R. et al. (2000). The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature, 406, 151–157.

98. Parkhill, J., Wren, B. W., Thomson, N. R., Titball, R. W., Holden, M. T., Prentice, M. B. et al. (2001). Genome sequence of Yersinia pestis, the causative agent of plague. Nature, 413, 523–527.

Edited by J. Thornton

(Received 17 December 2003; received in revised form 26 April 2004; accepted 29 April 2004)
