BMC Microbiology (BioMed Central), Open Access
Methodology article

Bacterial flora-typing with targeted, chip-based Pyrosequencing

Andreas Sundquist* 1, Saharnaz Bigdeli 2, Roxana Jalili 2, Maurice L Druzin 3, Sarah Waller 3, Kristin M Pullen 3, Yasser Y El-Sayed 3, M Mark Taslimi 3, Serafim Batzoglou 1 and Mostafa Ronaghi 2

Address: 1 Department of Computer Science, Stanford University, Stanford, CA 94305, USA, 2 Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304, USA and 3 Department of Obstetrics and Gynecology, Stanford University Medical Center, Palo Alto, CA 94305, USA

Email: Andreas Sundquist* - [email protected]; Saharnaz Bigdeli - [email protected]; Roxana Jalili - [email protected]; Maurice L Druzin - [email protected]; Sarah Waller - [email protected]; Kristin M Pullen - [email protected]; Yasser Y El-Sayed - [email protected]; M Mark Taslimi - [email protected]; Serafim Batzoglou - [email protected]; Mostafa Ronaghi - [email protected]

* Corresponding author

Published: 30 November 2007
Received: 6 July 2007
Accepted: 30 November 2007

BMC Microbiology 2007, 7:108 doi:10.1186/1471-2180-7-108

This article is available from: http://www.biomedcentral.com/1471-2180/7/108

© 2007 Sundquist et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The metagenomic analysis of microbial communities holds the potential to improve our understanding of the role of microbes in clinical conditions. Recent, dramatic improvements in DNA sequencing throughput and cost will enable such analyses on individuals. However, such advances in throughput generally come at the cost of shorter read-lengths, limiting the discriminatory power of each read. In particular, classifying the microbial content of samples by sequencing the < 1,600 bp 16S rRNA gene will be affected by such limitations.

Results: We describe a method for identifying the phylogenetic content of bacterial samples using high-throughput Pyrosequencing targeted at the 16S rRNA gene. Our analysis is adapted to the shorter read-lengths of such technology and uses a database of 16S rDNA to determine the most specific phylogenetic classification for reads, resulting in a weighted phylogenetic tree characterizing the content of the sample. We present results for six samples obtained from the human vagina during pregnancy that corroborate previous studies using conventional techniques.

Next, we analyze the power of our method to classify reads at each level of the phylogeny using simulation experiments. We assess the impacts of read-length and database completeness on our method, and predict how it will perform as technology improves and more bacteria are sequenced. Finally, we study the utility of targeting specific 16S variable regions and show that such an approach considerably improves results for certain types of microbial samples. Using simulation, our method can be used to determine the most informative variable region.

Conclusion: This study provides positive validation of the effectiveness of targeting 16S metagenomes using short-read sequencing technology. Our methodology allows us to infer the most specific assignment of the sequence reads within the phylogeny, and to identify the most discriminative variable region to target. The analysis of high-throughput Pyrosequencing on human flora samples will accelerate the study of the relationship between the microbial world and ourselves.


Background

Metagenomics enables the genomic study of microbial communities that are sampled directly from their environment, eliminating the need for isolating and cultivating specific microbes [1-3]. Metagenomic analyses of human flora samples [4] are a new type of assay with intriguing potential applications for the diagnosis and prediction of clinical outcomes [5]. Studies of human vaginal bacteria during pregnancy have so far relied on direct culture methods and conventional PCR studies of clinically suspected infectious microorganisms. Although infection and inflammation likely play a major role in the pathogenesis of preterm labor and delivery [6,7], these studies reveal only a fraction of the potential microbial inhabitants. A comprehensive identification and catalog of these organisms will enable future investigators to target a defined population of species that may be correlated with preterm labor, premature rupture of amniotic membranes, chorioamnionitis, and other complications of pregnancy [8-12].

Metagenomic analyses will become increasingly practical as DNA sequencing costs fall dramatically with the advent of new technologies [13,14], including Pyrosequencing™ [15]. One challenge common to these revolutionary sequencing technologies is the short length of reads, which limits the amount of unique, discriminating sequence available within each read. Sequencing the 16S rRNA gene (16S rDNA) using conventional Sanger sequencing produces reads of at least 500 bp in length, which is sufficient to identify the precise source species for each gene [3]. In fact, though there is a danger of producing chimeras, the reads are often long enough that they can be assembled into near-complete 16S rDNA sequences [16]. Despite the promise of high-throughput technologies like Pyrosequencing, current versions produce short reads, making the accurate identification of the source of these reads a daunting task. One solution used chip-based Pyrosequencing targeted at a small variable region within the 16S rDNA to show that there exists a much greater variety of rare microorganisms than previously thought [17].

We describe a methodology for phylogenetic classification based on short, 16S rRNA gene sequence reads and apply the technique to reads obtained via high-throughput, chip-based Pyrosequencing of human vaginal flora samples during pregnancy. The resulting phylogenetic trees reveal the vast diversity of bacterial inhabitants seen in other studies, and will assist in future investigations of the link between microorganisms and pregnancy complications. Next, we examine the ability of our methodology to classify reads at different levels in the phylogeny and discuss limitations of our technique. Using simulations, we study the effect of read-length on our methodology to understand the consequence of using high-throughput Pyrosequencing instead of conventional technologies. Finally, we explore the effectiveness of isolating specific 16S variable regions using validated universal primers. Our methodology for analyzing short 16S rDNA sequence reads will enable the accurate and informative study of human flora samples using new, high-throughput sequencing technologies.

Results and Discussion

Methodology overview

Twelve samples from vaginal epithelial tissue and discharge from pregnant women in all three trimesters were collected. DNA extraction was performed, followed by target-specific PCR amplification of approximately 1500 bp of the 16S rDNA using universal primers. The products were subjected to nebulization and clonal amplification, followed by Pyrosequencing of six samples with the Genome Sequencer 20 system (454 Life Sciences). As a result, 100,000 to 200,000 sequence reads of 100 bp average length were obtained for each of the six samples (details are provided in Additional file 1).

In this paper, we independently determine for every read the most specific classification within the bacterial phylogeny, and produce a weighted tree that expresses the phylogenetic makeup of the sample. For each read, we use BLAT, the BLAST-like alignment tool [18], to search for homology against a database of bacterial 16S rDNA sequences obtained from the Ribosomal Database Project [19] and archaeal 16S rDNA sequence from prokMSA [20]. We score each resulting homology between the read and a 16S rDNA sequence from the database, filter out weak homologies, and thus produce a set of possible organisms from which the read was obtained. Finally, we assign the read to the most specific location within the phylogeny that includes all these potential organisms (details of this algorithm are described in Methods). By assigning all reads to the phylogeny with the above procedure, we construct a weighted phylogeny representing the 16S rDNA content of the sample. This process is depicted in Figure 1.
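The final accumulation step can be illustrated with a short sketch. It is not the authors' implementation: it assumes each read has already been placed at a node, and it represents that node by a hypothetical tuple of taxon names from the root downward (the lineages shown are abbreviated and purely illustrative).

```python
from collections import Counter

def accumulate_reads(placements):
    """Count, for every node, the reads placed at or below it.

    `placements` is an iterable of lineages, each a tuple of taxon names
    from the root down to the node where a read was placed (a hypothetical
    representation chosen for this sketch).
    """
    weights = Counter()
    for lineage in placements:
        # A read placed at a node also contributes to each of its ancestors.
        for depth in range(1, len(lineage) + 1):
            weights[lineage[:depth]] += 1
    return weights

placements = [
    ("Bacteria", "Firmicutes", "Lactobacillus"),
    ("Bacteria", "Firmicutes", "Lactobacillus"),
    ("Bacteria", "Firmicutes"),  # a read resolved only to this level
    ("Bacteria", "Bacteroidetes", "Prevotella"),
]
weights = accumulate_reads(placements)
total = len(placements)
for node, count in sorted(weights.items()):
    # Proportion of all reads carried by each edge of the weighted tree.
    print(" > ".join(node), f"{count / total:.2f}")
```

Dividing each count by the total number of reads gives the edge widths of a weighted tree of the kind the text describes.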

Further analysis of bacterial samples involving the translation of read counts to organism concentrations must be undertaken conservatively due to the following caveats. First, there may be an amplification bias of 16S rDNA sequence due to differences in primer annealing preference. Also, variation in 16S rDNA multiplicity in diverse bacterial genomes, among other complications, may result in the over- or under-representation of certain organisms' 16S sequences [18,21].

Our ability to place reads in the phylogeny has two distinct limitations, namely short read-length and unrepresented organisms in the 16S rDNA sequence database. Short read-lengths often lead to high-fidelity matches to multiple 16S sequences in the database. This situation occurs whenever the region from which the read was sampled is highly similar across species of a given genus, family, or even phylum. In this case we are resolution-limited in placing a read below a certain depth in the phylogeny. On the other hand, because of the incomplete nature of the 16S rDNA database, a read may not match in its entire length to any known 16S sequence. However, since we believe a priori that all reads are derived from amplified 16S rDNA sequences, the closest partial matches of the read to known organisms still allow us to assign the read to the subtree that contains these organisms, although its placement below that level is labeled unknown.

Figure 1. Methodology overview. After collecting the bacterial sample, DNA is extracted followed by amplification of 16S rDNA using universal primers. These fragments are then sequenced with high-throughput Pyrosequencing. Each read is queried against a database of known 16S rDNA sequence (mostly obtained from the Ribosomal Database Project) using the program BLAT and assigned to the most specific and confident node in the phylogeny. Accumulating all the reads in this fashion yields a weighted phylogenetic tree characterizing the bacterial content of the sample.

[Figure 1 flowchart steps: collect sample; extract DNA; PCR amplify 16S sequence; sequence; BLAT against known sequences (RDP); score match fidelity; place read in phylogeny; accumulate reads in phylogeny.]

Sample analysis

Samples subjected to the above analysis demonstrated substantial overlap with similar studies previously reported [16], as well as significant differences between the samples. Weighted phylogenetic trees obtained from applying our analysis to the six samples are shown in Additional file 2. Figure 2 presents a composite tree generated by accumulating these six trees in equal proportions. Starting from the top, the width of the tree edges represents the proportion of reads that can be confidently placed at that level in the phylogeny. A tree edge that fades into white represents reads that were resolution-limited below that level, while reads whose placement is unknown below a particular node are represented by tree edges that fade into black. In Figure 3 we list the top 30 genera discovered in the six samples and identify the proportions of reads belonging to these genera within each sample. Corroborating other studies performed on vaginal bacterial flora, we identified Lactobacillus as the dominant genus and detected a significant presence of other genera, including Psychrobacter, Magnetobacterium, Prevotella, Bifidobacterium, and Veillonella [16]. Aside from the common presence of Lactobacillus, each sample exhibited a unique profile of other bacteria, which may be useful in the future for diagnosing abnormal conditions such as vaginosis [5] or predicting the onset of preterm labor [6]. Additional file 3 lists the top 30 genera identified in each sample along with the percentage of reads classified within the genera.

Figure 2. Combined sample phylogenetic content. The proposed methodology was applied to the reads obtained from six samples and results were aggregated in a single tree. Branch widths indicate the proportion of reads assigned down those branches in the phylogeny. Branches fading into white represent reads that are resolution limited due to similarity among multiple sub-branches in the phylogeny. Branches fading into black represent reads that do not have full-length homology with any known 16S sequence.

In Figure 4 we graph our ability to classify reads into a particular branch at each level of the phylogeny. For those reads that cannot be classified we show the proportion that is resolution-limited versus unknown. Figure 5 plots these results for each sample separately. Our methodology recognizes 89–97% of the reads in each sample as bacterial and fewer than 2% as archaeal; the remaining reads are unrecognizable in our database. While we were able to categorize the genus of 28–39% of the reads, only 3–12% could be identified with a particular species. Under-representation of 16S rDNA sequence in the database appears to be our dominant limitation in identifying reads at the levels of domain through genus. Fortunately, we expect this limitation to diminish as more 16S rDNA sequences are added to the database. Our ability to identify a particular species, however, is primarily resolution-limited due to the overwhelming similarity between species within a genus.

Effect of read-length

To study the effect of read-length on our ability to place reads in the phylogeny, we simulated the sampling of reads from hypothetical profiles of bacteria for a range of read-lengths from 30 to 800 bp. We analyzed reads sampled from two distinct profiles of bacteria: a random profile of 387 diverse bacteria selected from across the entire known phylogenetic tree and a sample profile with concentrations of 330 bacteria derived from the analysis of our six samples. The results of applying our methodology to these samples are graphed in Figure 6. Solid lines show the proportion of reads that were placed within a particular branch at each level of the phylogeny. Dotted lines with the same color show the proportion of reads that were correctly placed at each level. As both graphs illustrate, read-lengths of 30 and 60 bp are not very effective for discriminating between different bacteria, even at such a broad level as phylum and class.

For the sample profile of bacteria our ability to identify genera improves substantially when read-lengths are increased beyond 100 bp due to the high degree of similarity between bacteria in the actual samples we examined (Figure 6b). At 800 bp, we are able to accurately determine almost all of the reads at the genus level, and also correctly determine the species for over half of the reads. As demonstrated in Figure 6b, running the simulation for 100 bp read-lengths closely reproduced the read resolution graph obtained from our six samples, which lends confidence to the stability of the methodology.

Figure 3. Top identified genera in samples. The top 30 genera identified in all six samples are listed in descending order. The percentages of reads in each sample belonging to these genera are indicated by the height of the bar.

For the more diverse, random profile of bacteria, the 16S rDNA sequences are sufficiently different that read-lengths greater than 100 bp do not provide much additional benefit (Figure 6a). A read-length of 100 bp, which corresponds to sample data presented here, appears to be competitive with even the longest read-length of 800 bp. Thus, with a very wide diversity of bacteria, it seems that our methodology does not require much greater read-lengths than 100 bp. In practice, however, the sample profile may be more relevant, and therefore longer reads are desirable to improve the resolution of read placement.

There is evidence to suggest that the classification of species within the RDP phylogeny has errors that limit the ability of our methodology to unambiguously classify reads down to the lowest levels of the phylogeny. As an example, suppose we have two species A and B that truly belong to the same genus X, but that species B was misclassified in genus Y. Then, a read that matches both species A and B will be assigned to the family of genera X and Y or an even broader classification. A more accurate database classification will improve the ability of our methodology to identify the genera of the reads.
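The misclassification example can be made concrete with a small sketch. It assumes, for simplicity, that the read matches species A and B equally well, so that placement reduces to finding the lowest common ancestor of the two lineages; the family name "Family F" is a hypothetical label introduced only for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages."""
    shared = []
    for ranks in zip(*lineages):
        # Stop at the first rank where the lineages disagree.
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)

# Species A and B both correctly filed under genus X.
correct = [
    ("Family F", "Genus X", "Species A"),
    ("Family F", "Genus X", "Species B"),
]
# Species B wrongly filed under genus Y.
misclassified = [
    ("Family F", "Genus X", "Species A"),
    ("Family F", "Genus Y", "Species B"),
]
print(lowest_common_ancestor(correct))        # ('Family F', 'Genus X')
print(lowest_common_ancestor(misclassified))  # ('Family F',)
```

With the correct taxonomy the read is resolved to the genus; with the misclassified entry the same read can only be resolved to the family.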

Restriction to variable regions

Our reads often sampled regions in the 16S rDNA that are indistinguishable between species, genera, and even phyla. Restricting the sequencing to short, specific variable regions within the 16S sequence can provide more informative reads [22]. We performed further simulations to assess the effectiveness of such an approach for 100 bp-long Pyrosequencing reads, designing primers for amplifying seven regions each containing one of the 16S rDNA variable regions V1–V6 and V9, and one region containing V7 and V8 [23]. We describe the construction and verification of the primers in Methods and list the eight amplified regions in Table 1. Figure 7 graphs the read resolution and accuracy results. In both graphs, the bold, black line shows the simulation results when we sample the reads from across the entire 16S rDNA sequence instead of restricting it to a particular variable region.

For the random profile of bacteria we could slightly improve our resolving power by restricting reads to shorter variable regions, particularly with region V1 (Figure 7a). For the more realistic sample profile, by choosing the appropriate variable region we could improve results dramatically and achieve a resolution similar to 150–200 bp reads sampled from across the entire 16S gene. When reading from region V1 we were able to identify the genus of 79% of the reads instead of 57% when sampling across the entire 16S sequence and 44% of the species instead of only 14% (Figure 7). Region V2 was best able to determine the classification for the reads at the level of order, correctly identifying 84% of the reads compared to 70% when sampling from across the entire 16S gene. Thus, our study suggests that identifying the phylogenetic content of bacterial communities with short reads will be best achieved by targeting variable regions that are specifically chosen for each class of bacterial environment.

Figure 4. Read resolution within phylogenetic tree for combined samples. Combined categorization of the reads in all six samples into their proportion of reads identified (dark blue)/resolution limited (medium)/unknown (light).

Figure 5. Read resolution within phylogenetic tree for samples. Individual categorization of reads in six samples into their proportion of reads identified (solid line)/resolution limited (dotted line)/unknown (dashed line).

Conclusion

By combining high-throughput Pyrosequencing with a novel analysis methodology, we identified phylogenies of bacteria present in the human vagina during pregnancy. Previous studies of the correlation between identified bacteria and preterm labor, and attempts to treat such microorganisms, have produced conflicting results [24-26]. Our methodology for studying the ecology of human pregnancy in depth will assist in understanding the correlation between vaginal microorganisms and complications in pregnancy.

Our simulations indicate that the methodology is currently limited by two factors: short read-lengths of Pyrosequencing and the incomplete nature of 16S rDNA databases. As more bacteria are sequenced and added to the database, the effects of the second limitation will decrease. Improvements in sequencing technology will increase read-lengths and enhance our ability to distinguish between genera. To best identify particular species, our methodology can be used to identify and isolate the most informative 16S variable region.

Methods

Identifying 16S sequence

The first stage of our analysis matches reads against known 16S rDNA sequences, or finds the closest matches to known organisms. We leverage the Ribosomal Database Project release 9 update 39 for its catalog of bacteria and their phylogenetic relationships [19] and the prokMSA database as a representative set of archaeal sequence [20]. For each read, we used the tool BLAT (BLAST-Like Alignment Tool) [18] to quickly identify matches between reads and the combined bacterial/archaeal database, using a minimum identity of 90% and a minimum match/mismatch score of 15 bases.
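As a rough illustration, a BLAT search with these thresholds might be launched as sketched below. The file names are hypothetical, and mapping the "minimum match/mismatch score of 15" onto BLAT's `-minScore` option is an assumption; the text does not give the exact command line that was used.

```python
def blat_command(database_fa, reads_fa, output_psl, min_identity=90, min_score=15):
    """Build a BLAT command line using the thresholds quoted in the text.

    File names are placeholders; -minScore is assumed to correspond to the
    'minimum match/mismatch score' mentioned above.
    """
    return [
        "blat",
        database_fa,   # combined bacterial/archaeal 16S rDNA database (FASTA)
        reads_fa,      # sequence reads (FASTA)
        f"-minIdentity={min_identity}",
        f"-minScore={min_score}",
        output_psl,    # alignments in PSL format
    ]

cmd = blat_command("16s_database.fa", "reads.fa", "hits.psl")
print(" ".join(cmd))
# To execute (requires BLAT on the PATH): subprocess.run(cmd, check=True)
```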

Figure 6. Simulated read resolution for varying read-lengths in diverse and representative samples. Simulation results are presented for (a) a diverse set of 387 bacteria and (b) 330 species representative of our samples. Simulated reads had a 20% standard deviation in read-length and a 1% sequencing error rate. Solid lines show assignments made by our methodology, while dashed lines show the proportion that are correctly assigned.

To score the homology between a read and a database sequence, we approximate the probability that the read came from an organism with a p = 98% sequence similarity to the given read as follows:

$$P(\text{related}) = \prod_{i=1}^{L} \left[ M_i \, p \, Q_i + (1 - M_i)(1 - p \, Q_i) \right],$$

where the product runs over the L positions of the read, $M_i$ are indicator variables that are 1 if position i in the read matches with the database sequence in their alignment and 0 otherwise. The variables $Q_i$ are the probabilities that the read bases were called correctly, derived from the sequence quality scores.

Then, to judge whether or not we believe a read came from the organism's phylotype, we compare this probability against the probability for a hypothetical read that falls at the boundary of similarity:

$$P(\text{related limit}) = p^{pL} (1 - p)^{(1 - p)L}.$$

Reads that score above this probability limit are classified as known, while reads that score below the limit are classified as unknown.
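The scoring rule and its similarity limit can be sketched in a few lines. This is an illustrative reading of the formulas in this section rather than the authors' code: it works in log space to avoid underflow for long reads, and the uniform base-call accuracy of 0.999 is an assumed value chosen only for the example.

```python
import math

def log_p_related(matches, qualities, p=0.98):
    """log of prod_i [M_i p Q_i + (1 - M_i)(1 - p Q_i)]."""
    total = 0.0
    for m, q in zip(matches, qualities):
        # Matching positions contribute p*Q_i, mismatches 1 - p*Q_i.
        term = p * q if m else 1.0 - p * q
        total += math.log(term)
    return total

def log_p_limit(length, p=0.98):
    """log of p^(pL) (1 - p)^((1 - p)L), the boundary-of-similarity score."""
    return p * length * math.log(p) + (1.0 - p) * length * math.log(1.0 - p)

def is_known(matches, qualities, p=0.98):
    """A read is 'known' if its score exceeds the similarity limit."""
    return log_p_related(matches, qualities, p) > log_p_limit(len(matches), p)

# A 100 bp read with an assumed uniform base-call accuracy of 0.999.
quals = [0.999] * 100
one_mismatch = [1] * 99 + [0]
ten_mismatches = [1] * 90 + [0] * 10
print(is_known(one_mismatch, quals))    # True
print(is_known(ten_mismatches, quals))  # False
```

With p = 0.98 the limit corresponds to roughly two mismatches per 100 bases, so a read with one mismatch passes and a read with ten does not.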

Figure 7. Simulated read resolution for targeted variable regions in diverse and representative samples. Simulation results are presented for (a) a diverse set of 387 bacteria and (b) 330 species representative of our samples. Simulated reads had lengths of 100 ± 20 bp and a 1% sequencing error rate. Solid lines show assignments made by our methodology, while dashed lines show the proportion that are correctly assigned.

Table 1: 16S variable region range definitions.

Variable   E. coli 16S rDNA range    5' primer                  3' primer
region     start   end    length
V1         8       120    113        5'-AGAGTTTGATCMTGGCTCAG    5'-TTACTCACCCGTICGCCRCT
V2         101     361    261        5'-AGYGGCGIACGGGTGAGTAA    5'-CYIACTGCTGCCTCCCGTAG
V3         338     534    197        5'-ACTCCTACGGGAGGCAGCAG    5'-ATTACCGCGGCTGCTGG
V4         519     806    288        5'-TGCCAGCAGCCGCGGTAA      5'-GGACTACARGGTATCTAAT
V5         787     926    140        5'-ATTAGATACCYTGTAGTCC     5'-CCGTCAATTCMTTTGAGTTT
V6         907     1073   167        5'-AAACTCAAAKGAATTGACGG    5'-ACGAGCTGACGACARCCATG
V7 & V8    1054    1406   353        5'-CATGGYTGTCGTCAGCTCGT    5'-ACGGGCGGTGTGTAC
V9         1392    1507   116        5'-GTACACACCGCCCGT         5'-TACCTTGTTACGACTT

Regions were chosen to be mostly non-overlapping, each containing one or two variable regions. Coordinates are given relative to the 1542 bp E. coli K12 16S rDNA sequence.
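The coordinates in Table 1 can be encoded directly, for example to restrict simulated reads to one region. The sketch below assumes 1-based, inclusive coordinates (which reproduces the listed lengths) and a sequence that is already in E. coli coordinates without gaps; real 16S sequences would first need to be aligned to the E. coli reference, a step not shown here.

```python
# (start, end) on the 1542 bp E. coli K12 16S rDNA, taken from Table 1.
REGIONS = {
    "V1": (8, 120),
    "V2": (101, 361),
    "V3": (338, 534),
    "V4": (519, 806),
    "V5": (787, 926),
    "V6": (907, 1073),
    "V7-V8": (1054, 1406),
    "V9": (1392, 1507),
}

def region_length(name):
    """Length of a region under 1-based, inclusive coordinates."""
    start, end = REGIONS[name]
    return end - start + 1

def extract_region(sequence, name):
    """Slice a region out of a sequence given in E. coli coordinates."""
    start, end = REGIONS[name]
    return sequence[start - 1:end]

for name in REGIONS:
    print(name, region_length(name))
```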


Placing reads in the phylogeny

Each read that results in a known match will typically also match with many additional organisms. For example, a read may match with several species within the same genus, in which case we cannot identify the exact species of the read. However, if all the known hits at least fall within the same genus then we are confident the read was sampled from an organism belonging to that genus. In this way, for each read our goal is to determine the most specific classification within the phylogeny that likely contains the organism from which the read was obtained.

Figure 8. Placing read in phylogeny. Computing the most specific and confident placement of a read in the phylogenetic tree occurs in two stages. First, for each internal node we compute a score that is equal to the maximum of the scores of its children. Second, we traverse down the tree from the root until we find a node for which the child with the second maximum score is within a threshold T of the maximum score, or until we reach a leaf node.

[Figure 8 panels: (a) each species s receives B(s) = P(read ↔ s) and each internal node the maximum over its children, e.g. B(gA) = max[B(s1), B(s2), B(s3)] and B(gB) = max[B(s4), B(s5)], up to B(root) = max[B(i) ∀i]; (b) at each node the two highest-scoring children max1 and max2 are compared, with the cases B(max1)/B(max2) > T and B(max1)/B(max2) < T shown for Family X (Genus A, B, C) and Genus A (Species 1, 2, 3).]

We analyze each read r using the following algorithm. For every node i in the phylogenetic tree, we assign a value B(i) as follows. For leaf nodes, we set B(i) = P(r related to i), as defined above, for organisms with a scored BLAT hit, and B(i) = 0 otherwise. For internal nodes, we set B(i) = max_{j ∈ children(i)} B(j). This process is illustrated in Figure 8a. At the root node we will therefore have B(root) = max_i P(r related to i).

Next, we traverse down the tree starting at the root node. At each internal node, if the ratio of the scores of the two maximal child nodes j and k exceeds a threshold T (i.e. B(j)/B(k) > T), or if j is the only child with a non-zero score, then we descend to node j and repeat the procedure. Once the procedure terminates in an internal node or a leaf node i, we believe with a confidence level related to T that the read came from an organism belonging to the subtree rooted at i. An example of this is illustrated in Figure 8b. We experimented with the choice of T over several orders of magnitude and found that the resulting analysis varied only very slightly. For the analyses performed in this study, we used T = 0.01.
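The two stages can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the study: the `Node` class, the function names, and the dictionary of leaf scores are hypothetical. It also assumes that the ratio test is applied in the form B(k) < T · B(j), with j the best and k the second-best child, which is one reading under which a value such as T = 0.01 is selective, and that a read with no scored hit is left unplaced.

```python
class Node:
    """A node of the phylogenetic tree (hypothetical structure for this sketch)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.score = 0.0  # B(i)


def assign_scores(node, leaf_scores):
    """Stage 1: set B(i) bottom-up. Leaves take P(r related to i), or 0 if no hit."""
    if not node.children:
        node.score = leaf_scores.get(node.name, 0.0)
    else:
        node.score = max(assign_scores(c, leaf_scores) for c in node.children)
    return node.score


def place_read(root, leaf_scores, T=0.01):
    """Stage 2: descend from the root while a single child clearly dominates.

    Assumption: "dominates" means B(k) < T * B(j) for the best child j and the
    second-best child k, or that j is the only child with a non-zero score.
    """
    assign_scores(root, leaf_scores)
    if root.score == 0.0:
        return None  # assumption: reads without any scored hit stay unplaced
    node = root
    while node.children:
        ranked = sorted(node.children, key=lambda c: c.score, reverse=True)
        best = ranked[0]
        second = ranked[1].score if len(ranked) > 1 else 0.0
        if second == 0.0 or second < T * best.score:
            node = best  # unambiguous: keep descending
        else:
            break  # two children are comparably likely: stop here
    return node
```

For a read that matches two species of the same genus about equally well, the sketch stops at the genus node; for a read with a single hit, it descends all the way to that species.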

Simulating reads

For our analysis of read-length and variable region resolving power, we simulated reads from two hypothetical collections of species. The random profile consists of 387 species of bacteria, selected by randomly traversing down the phylogenetic tree from the root to a leaf, picking each branch with uniform probability, resulting in very high diversity. The sample profile consists of 330 species of bacteria, created by sampling species from a distribution of genera and species that was consistent with the analysis results from our six samples.
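The construction of the random profile can be illustrated with a short sketch. The tree representation (a dictionary mapping each node to its list of children), the function names, and the decision to redraw when a species has already been selected are assumptions made for illustration; the text specifies only the uniform choice of branch at each step.

```python
import random


def sample_species(tree, root, rng):
    """Walk from the root to a leaf, choosing each branch with uniform probability."""
    node = root
    while tree.get(node):  # a node without children is a leaf (species)
        node = rng.choice(tree[node])
    return node


def random_profile(tree, root, n_species, rng, max_draws=100_000):
    """Collect n_species distinct species by repeated random walks.

    Assumption: duplicates are simply redrawn; max_draws bounds the loop in
    case the tree has fewer than n_species leaves.
    """
    chosen = set()
    for _ in range(max_draws):
        if len(chosen) == n_species:
            break
        chosen.add(sample_species(tree, root, rng))
    return chosen
```

Because every branch is equally likely at each node, species in sparsely populated clades are selected as readily as those in densely sampled ones, which is what gives such a profile its high diversity.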

For each simulation, a read simulator generated reads sampled from 16S rDNA sequences selected randomly according to either the random or the sample profile. Reads were sampled with uniform probability from across the rDNA sequence, with a read-length drawn from a Gaussian distribution with an average read-length of 30, 60, 100, 150, 200, 400, or 800 bp and a standard deviation of 20%. Sequencing errors were introduced into the reads at a rate of 1%, consisting of mutations, insertions, deletions, and homopolymer run count errors characteristic of Pyrosequencing.
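A read simulator of this kind might look as follows. This is a minimal sketch under stated assumptions, not the simulator used in the study: the function names are hypothetical, the four error types are chosen with equal probability, a homopolymer error is modeled as a single duplicated base, and the read-length is clipped to the length of the source sequence. None of these details are specified in the text.

```python
import random

BASES = "ACGT"


def draw_length(mean_len, rng, rel_sd=0.2, max_len=None):
    """Draw a read-length from a Gaussian with standard deviation rel_sd * mean."""
    length = max(1, round(rng.gauss(mean_len, rel_sd * mean_len)))
    return min(length, max_len) if max_len else length


def add_errors(read, rng, rate=0.01):
    """Introduce errors at the given per-base rate (error-type mix is assumed)."""
    out = []
    for base in read:
        if rng.random() >= rate:
            out.append(base)
            continue
        kind = rng.choice(("sub", "ins", "del", "homo"))
        if kind == "sub":      # mutation to a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "ins":    # random base inserted after this one
            out.extend((base, rng.choice(BASES)))
        elif kind == "homo":   # homopolymer over-call: base counted twice
            out.extend((base, base))
        # "del": the base is dropped
    return "".join(out)


def simulate_read(seq, mean_len, rng, rate=0.01):
    """Sample one read uniformly from seq and add sequencing errors."""
    length = draw_length(mean_len, rng, max_len=len(seq))
    start = rng.randint(0, len(seq) - length)  # uniform start position
    return add_errors(seq[start:start + length], rng, rate)
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible.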

To understand the effect of read-length on the resolving power of our methodology, we simulated reads from both the random and sample profile with the seven read-lengths L of 30 to 800 bp. For each of the 2 × 7 = 14 parameter sets we produced 30 Mb of simulated read data (N reads of L bp each, such that N · L = 30 · 10^6 bp) and applied our analysis. By annotating the source species for each read we were able to measure the accuracy of its placement in the phylogenetic tree, as in Figure 6.
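As a worked example of the relation N · L = 30 · 10^6 bp, the following snippet computes the nominal number of reads for each average read-length. The names are illustrative, and because actual read-lengths are drawn from a distribution, the counts in a real run would only approximate these values.

```python
TOTAL_BP = 30_000_000  # 30 Mb of simulated read data per parameter set
READ_LENGTHS = (30, 60, 100, 150, 200, 400, 800)  # average read-lengths L in bp


def reads_needed(mean_len, total_bp=TOTAL_BP):
    """Nominal number of reads N such that N * L equals total_bp."""
    return total_bp // mean_len


counts = {L: reads_needed(L) for L in READ_LENGTHS}
for L, n in counts.items():
    print(f"L = {L:>3} bp  ->  N = {n:,} reads")
```

The shortest reads (30 bp) therefore require 1,000,000 reads per data set, while 800 bp reads require only 37,500.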

To study the effectiveness of restricting the sequencing to known variable regions, we first selected a set of eight minimally-overlapping regions within the 16S rDNA sequence: seven regions each contained one of the known variable regions V1 to V6 and V9, and one region contained both V7 and V8 [27]. These regions are listed in Table 1 with their E. coli 16S rDNA sequence coordinate ranges as well as 5' and 3' broad-range amplification primers, which we validated by PCR amplifying 16S rDNA from E. coli. For each primer pair we performed 15 cycles of touch-down PCR, going from 94°C for 30 s, to an annealing temperature ranging from 70°C to 50°C for 30 s, and finally extending at 72°C for 30 s. We then performed 30 additional cycles at 94°C for 45 s, 50°C for 45 s, and 72°C for 45 s, and verified the resulting products via gel electrophoresis.
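The cycling program described above can be written out explicitly. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the annealing temperature drops in equal steps from 70°C in the first touch-down cycle to 50°C in the last, which the text does not specify.

```python
def touchdown_schedule(n_touchdown=15, anneal_start=70.0, anneal_end=50.0, n_fixed=30):
    """Return a list of cycles; each cycle is a list of (temperature_C, seconds) steps.

    Requires n_touchdown >= 2 so that the step size is defined.
    """
    cycles = []
    for i in range(n_touchdown):
        # Assumption: equal-sized steps from anneal_start down to anneal_end.
        anneal = anneal_start + (anneal_end - anneal_start) * i / (n_touchdown - 1)
        cycles.append([(94.0, 30), (anneal, 30), (72.0, 30)])
    for _ in range(n_fixed):
        # Fixed-temperature cycles following the touch-down phase.
        cycles.append([(94.0, 45), (50.0, 45), (72.0, 45)])
    return cycles
```

Under this assumption the annealing temperature falls by 20/14, or about 1.4°C, per touch-down cycle, and the full program comprises 45 cycles.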

Next, we again produced sets of simulated reads for both the random and the sample profile, restricted to each region, for a total of 2 × 8 = 16 sets. Each read data set consisted of 300,000 reads with an average read-length of 100 bp and a 1% error rate. We applied our analysis and measured the accuracy of read placement in the phylogeny, as shown in Figure 7.

Authors' contributions

AS, S Batzoglou, and MR designed the analysis methodology. AS implemented the methodology and conducted the analysis. S Bigdeli, RJ, and MR performed the sample preparation and sequencing. MLD, SW, KMP, YYES, and MMT collected the samples. AS, YYES, S Batzoglou, and MR wrote the manuscript. All authors read and approved the final manuscript.

Additional material

Additional File 1. Sequencing statistics for six samples. This spreadsheet provides statistics for the sequence read data obtained for the six samples. [http://www.biomedcentral.com/content/supplementary/1471-2180-7-108-S1.xls]

Additional File 2. Weighted phylogenetic trees for six samples. The images represent weighted phylogenetic trees resulting from the application of our analysis to each of the six samples. [http://www.biomedcentral.com/content/supplementary/1471-2180-7-108-S2.pdf]


Acknowledgements

The authors would like to thank the members of the Batzoglou lab for their support for this work. AS is partly supported by an SAP Labs Stanford Graduate Fellowship. This project is supported in part by NIH grant R01HG003571-02, the Stanford Pediatric Research Fund, and research funds from the Division of Maternal-Fetal Medicine at Stanford University.

References

1. Amann RI, Ludwig W, Schleifer KH: Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 1995, 59:143-169.

2. Rappé MS, Giovannoni SJ: The Uncultured Microbial Majority. Annu Rev Microbiol 2003, 57:369-394.

3. Tringe SG, Rubin EM: Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet 2005, 6:805-814.

4. Anderson BE, Dawson JE, Jones DC, Wilson KH: Ehrlichia chaffeensis, a new species associated with human ehrlichiosis. J Clin Microbiol 1991, 29:2838-2842.

5. Verhelst R, Verstraelen H, Claeys G, Verschraegen G, Delanghe J, Van Simaey L, De Ganck C, Temmerman M, Vaneechoutte M: Cloning of 16S rRNA genes amplified from normal and disturbed vaginal microflora suggests a strong association between Atopobium vaginae, Gardnerella vaginalis and bacterial vaginosis. BMC Microbiol 2004, 4:16-20.

6. Goldenberg RL, Hauth JC, Andrews WW: Intrauterine Infection and Preterm Delivery. N Engl J Med 2000, 342:1500-1507.

7. Gravett MG, Novy MJ, Rosenfeld RG, Reddy AP, Jacob T, Turner M: Diagnosis of intra-amniotic infection by proteomic profiling and identification of novel biomarkers. JAMA 2004, 292:462-469.

8. Sbarra AJ, Selvaraj RJ, Cetrulo CL, Feingold M, Newton E, Thomas GB: Infection and phagocytosis as possible mechanisms of rupture in premature rupture of the membranes. Am J Obstet Gynecol 1985, 153:38-43.

9. McGregor JA, Lawellin D, Franco-Buff A, Todd JK, Makowski EL: Protease production by microorganisms associated with reproductive tract infection. Am J Obstet Gynecol 1986, 154:109-114.

10. Andrews WW, Hauth JC, Goldenberg RL, Gomez R, Romero R, Cassell GH: Amniotic fluid interleukin-6: correlation with upper genital tract microbial colonization and gestational age in women delivered after spontaneous labor versus indicated delivery. Am J Obstet Gynecol 1995, 173:606-612.

11. Fortunato SJ, Menon RP, Swan KF, Menon R: Inflammatory cytokine (interleukins 1, 6 and 8 and tumor necrosis factor-alpha) release from cultured human fetal membranes in response to endotoxic lipopolysaccharide mirrors amniotic fluid concentrations. Am J Obstet Gynecol 1996, 174:1855-1861.

12. Fortunato SJ, Menon R, Lombarda SJ: Role of tumor necrosis factor-alpha in the premature rupture of membranes and preterm labor pathways. Am J Obstet Gynecol 2002, 187:1159-1162.

13. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BZC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437:376-380.

14. Rogers YH, Venter JC: Genomics: massively parallel sequencing. Nature 2005, 437:326-327.

15. Ronaghi M, Karamohamed S, Petterson B, Uhlen M, Nyren P: Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 1996, 242:84-89.

16. Hyman RW, Fukushima M, Diamona L, Kumm J, Giudice LC, Davis RW: Microbes on the human vaginal epithelium. Proc Natl Acad Sci 2005, 102:7952-7957.

17. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci 2006, 103:12115-12120.

18. Kent WJ: BLAT – the BLAST-Like Alignment Tool. Genome Res 2002, 12:656-664.

19. Cole JR, Chai B, Farris RJ, Wang Q, Kulam SA, McGarrell DM, Garrity GM, Tiedje JM: The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res 2005, 33:D294-D296.

20. DeSantis TZ, Dubosarskiy I, Murray SR, Andersen GL: Comprehensive aligned sequence construction for automated design of effective probes (CASCADE-P) using 16S rDNA. Bioinformatics 2003, 19:1461-1468.

21. Von Wintzingerode F, Gobel UB, Stackebrandt E: Determination of microbial diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol Rev 1997, 21:213-229.

22. Monstein H, Nikpour-Badi S, Jonasson J: Rapid molecular identification and subtyping of Helicobacter pylori by Pyrosequencing of the 16S rDNA variable V1 and V3 regions. FEMS Microbiol Lett 2001, 199:103-107.

23. Gray MW, Sankoff D, Cedergren RJ: On the evolutionary descent of organisms and organelles: a global phylogeny based on a highly conserved structural core in small subunit ribosomal RNA. Nucleic Acids Res 1984, 12:5837-5852.

24. Carey JC, Klebanoff MA, Hauth JC, Hillier SL, Thom EA, Ernest JM, Heine RP, Nugent RP, Fischer ML, Leveno KJ, Wapner R, Varner M, Trout W, Moawad A, Sibai BM, Miodovnik M, Dombrowski M, O'Sullivan MJ, Van Dorsten JP, Langer O, Roberts J: Metronidazole to prevent preterm delivery in pregnant women with asymptomatic bacterial vaginosis. N Engl J Med 2000, 342:534-540.

25. Klebanoff MA, Carey JC, Hauth JC, Hillier SL, Nugent RP, Thom EA, Ernest JM, Heine RP, Wapner RJ, Trout W, Moawad A, Miodovnik M, Sibai BM, Van Dorsten JP, Dombrowski MP, O'Sullivan MJ, Varner M, Langer O, McNellis D, Roberts JM, Leveno KJ: Failure of Metronidazole to Prevent Preterm Delivery among Pregnant Women with Asymptomatic Trichomonas vaginalis Infection. N Engl J Med 2001, 345:487-493.

26. Thinkhamrop J, Hofmeyr GJ, Adetoro O, Lumbiganon P: Prophylactic antibiotic administration in pregnancy to prevent infectious morbidity and mortality. Cochrane Database Syst Rev 2002:CD002250.

27. Neefs J, Van de Peer Y, De Rijk P, Goris A, De Wachter R: Compilation of small ribosomal subunit RNA sequences. Nucleic Acids Res 1991, 19:1987-2015.

Additional File 3. Top genera identified in six samples. This spreadsheet lists the top 30 genera identified in each sample along with their proportion of reads. [http://www.biomedcentral.com/content/supplementary/1471-2180-7-108-S3.xls]
