

Identify gene markers for different taxonomic groups in Archaea and Bacteria Genomes

Dongying Wu 1,2, Jonathan A. Eisen 1,2

1. DOE Joint Genome Institute, Walnut Creek, California 94598, USA
2. University of California, Davis, Davis, California 95616, USA

Introduction

The sequencing and phylogenetic analysis of rRNA molecules demonstrated that all organisms could be placed on a single tree of life. Because highly conserved, homologous 16S rRNA genes are present in all organismal lineages, they are the only universal marker that has been adopted by biologists. Unfortunately, phylogenetic trees based on rRNA sequences do not always accurately reflect the evolutionary history of the organisms represented, owing to lateral gene transfer, different rates of evolution in different lineages, and convergent evolution that can make rRNA sequences from distantly related species more similar to each other over time. There are also difficulties in generating accurate alignments of rRNA genes. For metagenomic studies of bacteria and archaea, more phylogenetic markers are needed in addition to the small subunit rRNA gene. Phylogenetic analysis using protein markers has been limited in scope. To date, the AMPHORA package developed by our lab includes only 31 protein markers for bacteria. To address this issue, we started systematically identifying phylogenetic markers at different taxonomic levels. The progress of genomic sequencing, especially the phylogenetic-diversity-driven Genome Encyclopedia of Bacteria and Archaea (GEBA) project, provides a great dataset for our top-down approach to marker identification.

Conclusions

We’ve established a protocol for the automatic identification of phylogenetic markers for any given phylogenetic group. The protocol uses BLAST and MCL clustering algorithms to generate gene families for a selection of genomes. Phylogenetic trees are built for the gene families, and clades from the trees are automatically sampled and evaluated for universality and evenness in terms of their distributions across the genomes of interest. HMM profiles are built for universally distributed single-copy genes that form monophyletic clades on phylogenetic trees. HMM searches against the genomes of interest are then used to decide which families are potential phylogenetic markers.

We’ve built gene families and captured potential markers at the following taxonomic levels: Bacteria, Archaea, Actinobacteria, Bacteroides, Chlamydiae, Chloroflexi, Cyanobacteria, Firmicutes, Spirochaetes, Thermi, Thermotogae, and five classes of Proteobacteria. HMM profiles were built for 5133 families that can be potential markers at the taxonomic levels we’ve examined. We’ve also carried out clustering and tree-building analysis for all the taxon-specific marker families. As a result, we’ve identified 209 super-families, each of which can be a potential phylogenetic marker for at least 5 of the taxonomic groups we’ve studied. We are currently using the novel phylogenetic markers we’ve discovered to analyze metagenomic data, thus evaluating their impact on community diversity and richness studies.

A. Gene Family Building for Different Taxonomic Groups

Only taxonomic groups with more than five completed genomes were selected for marker identification (see Table 1 for the list). Amino acid sequences were downloaded from the JGI Integrated Microbial Genomes (IMG) website.

Peptides from 9 gene families with extraordinarily high copy numbers (>1000) in the bacterial-level family building were filtered out first. All-vs-all BLASTP searches were performed on the entire proteomes in each group (with the e-value cutoff set at 1e-10), followed by MCL clustering (with the inflation value set at 2). Neighbor-joining trees were built for all the MCL families. We've developed a tool to parse all the trees and identify clades with single-copy genes across the genomes in the group. HMM profiles were built for the selected clades, and an HMM search against the entire proteomes in the group was used to evaluate how distinct the gene families were.
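The hand-off from the BLASTP search to MCL can be sketched as follows. This is a minimal sketch, not the authors' pipeline: it assumes BLAST tabular output (`-outfmt 6`, where the e-value is the 11th column) and MCL's label-based `abc` input. The function name `blast_tab_to_abc` and the edge weight (minus log10 of the e-value, capped at 200) are illustrative assumptions; the poster specifies only the e-value cutoff of 1e-10 and the inflation value of 2.

```python
import math


def blast_tab_to_abc(lines, evalue_cutoff=1e-10):
    """Turn BLASTP tabular hits into (query, subject, weight) edges for MCL.

    Assumes the 12-column tabular layout: query id in column 1,
    subject id in column 2, e-value in column 11.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue  # self-hits carry no clustering information
        if evalue > evalue_cutoff:
            continue  # weaker than the 1e-10 cutoff used on the poster
        # Assumed weighting: -log10(e-value), capped because BLAST
        # reports 0.0 for very strong hits.
        weight = 200.0 if evalue == 0.0 else min(200.0, -math.log10(evalue))
        edges.append((query, subject, weight))
    return edges


def write_abc(edges, path):
    """Write edges as tab-separated 'label label weight' lines."""
    with open(path, "w") as handle:
        for query, subject, weight in edges:
            handle.write(f"{query}\t{subject}\t{weight:.2f}\n")
```

The resulting file could then be clustered with a command along the lines of `mcl hits.abc --abc -I 2 -o families.txt`, where the file names are placeholders.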

We consider a gene family as a marker candidate if all the genomes in the group have only one copy of the gene family members (high universality and high evenness, see Figure 1), and the family members can be distinctly picked up by the HMM profile built for the family. 5133 families were identified that met our requirements (Table 1).

Methods and Results

Monophyletic factor = comparison of the clade distribution to the ideal monophyletic scenario:

Clade size (sorted by size)    Monophyletic assumption    Square of distance
$S_1$                          $n$                        $(n - S_1)^2$
$S_2$                          $0$                        $S_2^2$
…                              $0$                        …
$S_c$                          $0$                        $S_c^2$

Euclidean distance = $\sqrt{(n - S_1)^2 + S_2^2 + \dots + S_c^2}$

Clade number factor = $1 - [(c-1)/n]^2$

Monophyletic value = $100 \times$ clade number factor $\times$ monophyletic factor

Table 1. Phylogenetic Marker Candidates for Different Taxonomic Groups.

Phylogenetic group       Genome Number    Gene Number    Marker Candidates
Archaea                  62               145415         106
Actinobacteria           63               267783         136
Alphaproteobacteria      94               347287         121
Betaproteobacteria       56               266362         311
Gammaproteobacteria      126              483632         118
Deltaproteobacteria      25               102115         206
Epsilonproteobacteria    18               33416          455
Bacteroides              25               71531          286
Chlamydiae               13               13823          560
Chloroflexi              10               33577          323
Cyanobacteria            36               124080         590
Firmicutes               106              312309         87
Spirochaetes             18               38832          176
Thermi                   5                14160          974
Thermotogae              9                17037          684

B. Grouping Families at Different Taxonomic Levels into Super-families

Many gene families identified for individual phylogenetic groups might be sub-families of families at higher taxonomic levels. To capture the relationships among the gene families and identify super-families that span multiple phylogenetic groups, we used a tree-based method to sort and cluster the gene families. HMM profiles were built for each of the 5133 gene families that can be marker candidates at local taxonomic levels, and one consensus sequence was emitted from each HMM profile. The consensus sequences were subsequently grouped into single-linkage clusters according to an all-vs-all BLASTP search among the consensus sequences. A neighbor-joining tree was built for each cluster with FastTree. Using the same tree sampling program that we used to identify phylogenetic marker candidates at local levels, we picked the clades that contained only one consensus sequence per taxonomic group and whose sequences were distinct from the other consensus sequences in the tree (Figure 2).
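The single-linkage step can be illustrated with a small union-find sketch: any two consensus sequences connected by a BLASTP hit, directly or through a chain of hits, end up in the same cluster. This is an assumption-laden illustration rather than the program used for the poster; the function name, the input format (a list of sequence ids plus a list of hit pairs) and the example ids are hypothetical.

```python
def single_linkage_clusters(ids, hits):
    """Group ids into single-linkage clusters given pairwise hits.

    ids  : iterable of consensus sequence identifiers
    hits : iterable of (id_a, id_b) pairs that passed the BLASTP cutoff
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    # Hypothetical consensus ids: <taxonomic group>_<family number>
    ids = ["Archaea_12", "Firmicutes_3", "Thermi_40", "Cyano_7", "Cyano_8"]
    hits = [("Archaea_12", "Firmicutes_3"), ("Firmicutes_3", "Thermi_40")]
    for cluster in single_linkage_clusters(ids, hits):
        print(cluster)
```

Single linkage is permissive by design: one hit is enough to join two groups, which is why the poster follows it with a tree for each cluster to separate the actual super-families.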

Figure 2. A neighbor-joining tree of all the consensus sequences in the carbamoyl-phosphate synthase cluster. Our tree parsing and sampling program automatically identified 10 marker super-families that span multiple taxonomic groups (colored).

C. Building Phylogenetic Marker Super-families

HMM profiles were built for all the potential marker super-families identified by consensus sequence trees. Genes belonging to the super-families that had not been included were retrieved by hmmsearch.

D. Evaluate the Marker Super-families at Different Taxonomic Levels

A family might be a good marker for certain taxonomic groups but not suitable for other groups. To make a good marker, a family needs to be universally distributed across the genomes at a given taxonomic level and present as a single copy in each genome. Furthermore, the genes for a given taxonomic group need to be monophyletic in a phylogenetic tree. We developed a monophyletic measurement because we have to tolerate a certain degree of deviation from strict monophyly. The monophyletic measurement is demonstrated in Figure 3. The monophyletic value calculation for a gene family at different taxonomic levels is exemplified by the ribosomal protein S4 family (see Figure 4).

Figure 3. Monophyletic value calculation. In a phylogenetic tree, a list of taxa (total number n) can be divided into a number of monophyletic clades (clade number c). S is the number of taxa in a given monophyletic clade.
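The parts of the Figure 3 calculation that survive in this transcript can be sketched as below. The clade number factor, 1 - [(c-1)/n]^2, and the squared-distance terms come from the figure; the exact formula that turns the Euclidean distance into the monophyletic factor is not legible here, so the sketch takes the monophyletic factor as an input instead of guessing it. All function names are illustrative.

```python
import math


def clade_number_factor(clade_sizes):
    """1 - [(c - 1) / n]^2, with c clades covering n taxa in total."""
    n = sum(clade_sizes)
    c = len(clade_sizes)
    return 1.0 - ((c - 1) / n) ** 2


def distance_from_ideal(clade_sizes):
    """Euclidean distance between the sorted clade sizes and the ideal
    monophyletic case, in which one clade holds all n taxa."""
    sizes = sorted(clade_sizes, reverse=True)
    n = sum(sizes)
    squares = [(n - sizes[0]) ** 2] + [s ** 2 for s in sizes[1:]]
    return math.sqrt(sum(squares))


def monophyletic_value(clade_sizes, monophyletic_factor):
    """100 x clade number factor x monophyletic factor.

    monophyletic_factor is supplied by the caller because its formula
    is not recoverable from the transcript.
    """
    return 100.0 * clade_number_factor(clade_sizes) * monophyletic_factor


if __name__ == "__main__":
    # Ten taxa split into one clade of 8 and one clade of 2.
    sizes = [8, 2]
    print(clade_number_factor(sizes))
    print(distance_from_ideal(sizes))
```

In the ideal case of a single clade containing all n taxa, the distance is 0 and the clade number factor is 1, so the monophyletic value is determined entirely by the monophyletic factor.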

Figure 4. The monophyletic value at different taxonomic levels, based on a PHYML tree of the ribosomal protein S4 family.

E. Identify Super-families that can be Markers for at least 5 Taxonomic Levels

For each super-family, a phylogenetic tree was built. At all the taxonomic levels we were interested in, we calculated the universality (0-100), the evenness (0-100) and the monophyletic value (0-100) of the super-family members in terms of genome distributions. We kept only the super-families that can be marker candidates for at least 5 taxonomic levels: for each taxonomic group of interest, the product of universality, evenness and monophyletic value had to be larger than 729000 (90 × 90 × 90).
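The selection rule can be written down in a few lines. This is a sketch under stated assumptions: the per-group scores are taken as already computed, and the dictionary layout, function names and example numbers are hypothetical, not values from the poster. Only the threshold (729000, which equals 90 × 90 × 90) and the minimum of 5 taxonomic levels come from the text.

```python
THRESHOLD = 729000  # 90 * 90 * 90, from the poster
MIN_GROUPS = 5      # minimum number of taxonomic levels, from the poster


def passing_groups(scores, threshold=THRESHOLD):
    """Return the taxonomic groups where universality x evenness x
    monophyletic value is strictly larger than the threshold.

    scores maps group name -> (universality, evenness, monophyletic value),
    each on a 0-100 scale.
    """
    return [group for group, (u, e, m) in scores.items()
            if u * e * m > threshold]


def is_multi_group_marker(scores, min_groups=MIN_GROUPS, threshold=THRESHOLD):
    """True if the super-family qualifies in at least min_groups groups."""
    return len(passing_groups(scores, threshold)) >= min_groups


if __name__ == "__main__":
    # Made-up scores for one hypothetical super-family.
    example = {
        "Archaea": (100, 100, 98),
        "Actinobacteria": (98, 97, 96),
        "Cyanobacteria": (100, 95, 99),
        "Firmicutes": (97, 96, 95),
        "Thermi": (100, 100, 100),
        "Chloroflexi": (90, 90, 90),   # exactly 729000, so it does not pass
    }
    print(passing_groups(example))
    print(is_multi_group_marker(example))
```

Because the three scores are multiplied, a family that is perfect on two measures can still fall below the cutoff if the third drops below 72.9.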

Figure 5. The product of universality, evenness and monophyletic value at different taxonomic levels for the 209 super-families.

Evenness = $100 \times e^{-\frac{4}{N_g} \sum_i |N_i - N_m|}$

$N_i$: the number of gene family members from genome $i$
$N_m$: the median of $N_i$ over all the genomes with the family
$N_g$: the number of genomes with the family

Universality = $100 \times \frac{\text{Number of genomes covered by the family}}{\text{Total number of genomes}}$

Figure 1. Gene family universality and evenness calculations.
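The two Figure 1 scores can be computed as in the sketch below. Universality follows the definition directly. For evenness, the formula is damaged in this transcript, so the code assumes the reading 100 × exp(-(4/Ng) × Σ|Ni - Nm|), with Nm taken as the median copy number over the genomes that contain the family. The function names and the input layout (a mapping from genome id to copy number) are illustrative.

```python
import math
from statistics import median


def universality(copy_counts, total_genomes):
    """100 x (genomes covered by the family) / (total number of genomes)."""
    covered = sum(1 for n in copy_counts.values() if n > 0)
    return 100.0 * covered / total_genomes


def evenness(copy_counts):
    """Assumed reading: 100 x exp(-(4 / Ng) x sum_i |Ni - Nm|).

    Ni is the copy number in genome i, Nm the median of Ni, and Ng the
    number of genomes with the family. Genomes lacking the family are
    ignored here, since they are already penalised by universality.
    """
    counts = [n for n in copy_counts.values() if n > 0]
    if not counts:
        return 0.0
    nm = median(counts)
    ng = len(counts)
    deviation = sum(abs(n - nm) for n in counts)
    return 100.0 * math.exp(-4.0 * deviation / ng)


if __name__ == "__main__":
    # Hypothetical family: single copy in three genomes, absent from a fourth.
    counts = {"genome_1": 1, "genome_2": 1, "genome_3": 1, "genome_4": 0}
    print(universality(counts, total_genomes=4))
    print(evenness(counts))
```

Under this reading, a family with the same copy number in every genome that carries it scores an evenness of 100, and the score decays exponentially as copy numbers spread away from the median.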