Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental...

10
DOI: 10.1126/science.1093857 , 66 (2004); 304 Science et al. J. Craig Venter, Sargasso Sea Environmental Genome Shotgun Sequencing of the www.sciencemag.org (this information is current as of January 29, 2008 ): The following resources related to this article are available online at http://www.sciencemag.org/cgi/content/full/304/5667/66 version of this article at: including high-resolution figures, can be found in the online Updated information and services, http://www.sciencemag.org/cgi/content/full/1093857/DC1 can be found at: Supporting Online Material found at: can be related to this article A list of selected additional articles on the Science Web sites http://www.sciencemag.org/cgi/content/full/304/5667/66#related-content http://www.sciencemag.org/cgi/content/full/304/5667/66#otherarticles , 14 of which can be accessed for free: cites 34 articles This article 619 article(s) on the ISI Web of Science. cited by This article has been http://www.sciencemag.org/cgi/content/full/304/5667/66#otherarticles 92 articles hosted by HighWire Press; see: cited by This article has been http://www.sciencemag.org/cgi/collection/genetics Genetics : subject collections This article appears in the following http://www.sciencemag.org/about/permissions.dtl in whole or in part can be found at: this article permission to reproduce of this article or about obtaining reprints Information about obtaining registered trademark of AAAS. is a Science 2004 by the American Association for the Advancement of Science; all rights reserved. The title Copyright American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the Science on January 29, 2008 www.sciencemag.org Downloaded from

Transcript of Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental...

Page 1: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

DOI: 10.1126/science.1093857 , 66 (2004); 304Science

et al.J. Craig Venter,Sargasso SeaEnvironmental Genome Shotgun Sequencing of the

www.sciencemag.org (this information is current as of January 29, 2008 ):The following resources related to this article are available online at

http://www.sciencemag.org/cgi/content/full/304/5667/66version of this article at:

including high-resolution figures, can be found in the onlineUpdated information and services,

http://www.sciencemag.org/cgi/content/full/1093857/DC1 can be found at: Supporting Online Material

found at: can berelated to this articleA list of selected additional articles on the Science Web sites

http://www.sciencemag.org/cgi/content/full/304/5667/66#related-content

http://www.sciencemag.org/cgi/content/full/304/5667/66#otherarticles, 14 of which can be accessed for free: cites 34 articlesThis article

619 article(s) on the ISI Web of Science. cited byThis article has been

http://www.sciencemag.org/cgi/content/full/304/5667/66#otherarticles 92 articles hosted by HighWire Press; see: cited byThis article has been

http://www.sciencemag.org/cgi/collection/geneticsGenetics

: subject collectionsThis article appears in the following

http://www.sciencemag.org/about/permissions.dtl in whole or in part can be found at: this article

permission to reproduce of this article or about obtaining reprintsInformation about obtaining

registered trademark of AAAS. is aScience2004 by the American Association for the Advancement of Science; all rights reserved. The title

CopyrightAmerican Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by theScience

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 2: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

Environmental Genome ShotgunSequencing of the Sargasso SeaJ. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3

Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3

Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3

Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6

Michael W. Lomas,6 Ken Nealson,5 Owen White,3

Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6

Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4

Hamilton O. Smith1

Wehave applied “whole-genome shotgun sequencing” tomicrobial populationscollected enmasse on tangential flow and impact filters from seawater samplescollected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairsof nonredundant sequencewas generated, annotated, and analyzed to elucidatethe gene content, diversity, and relative abundance of the organisms withinthese environmental samples. These data are estimated to derive from at least1800 genomic species based on sequence relatedness, including 148 previouslyunknown bacterial phylotypes. We have identified over 1.2 million previouslyunknown genes represented in these samples, including more than 782 newrhodopsin-like photoreceptors. Variation in species present and stoichiometrysuggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of thebiogeochemical cycles that shape the environ-ment of Earth and its oceans. Yet, these organ-isms are the least well understood on Earth, asthe ability to study and understand the metabol-ic potential of microorganisms has been ham-pered by the inability to generate pure cultures.Recent studies have begun to explore environ-mental bacteria in a culture-independent man-ner by isolating DNA from environmental sam-ples and transforming it into large insert clones.For example, a previously unknown light-drivenproton pump, proteorhodopsin, was discoveredwithin a bacterial artificial chromosome (BAC)from the genome of a SAR86 ribotype (1), andsoil microbial DNA libraries have been construct-ed and screened for specific activities (2).

Here we have applied whole-genome shot-gun sequencing to environmental-pooled DNAsamples to test whether new genomic approach-es can be effectively applied to gene and spe-cies discovery and to overall environmental

characterization. To help ensure a tractable pilotstudy, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, weconcentrated on the genetic material captured onfilters sized to isolate primarily microbial inhabit-ants of the environment, leaving detailed analysisof dissolved DNA and viral particles on one endof the size spectrum and eukaryotic inhabitants onthe other, for subsequent studies.The Sargasso Sea. The northwest Sar-

gasso Sea, at the Bermuda Atlantic Time-seriesStudy site (BATS), is one of the best-studiedand arguably most well-characterized regionsof the global ocean. The Gulf Stream representsthe western and northern boundaries of thisregion and provides a strong physical boundary,separating the low nutrient, oligotrophic openocean from the more nutrient-rich waters of theU.S. continental shelf. The Sargasso Sea hasbeen intensively studied as part of the 50-yeartime series of ocean physics and biogeochem-istry (3, 4) and provides an opportunity forinterpretation of environmental genomic data inan oceanographic context. In this region, for-mation of subtropical mode water occurs eachwinter as the passage of cold fronts across theregion erodes the seasonal thermocline andcauses convective mixing, resulting in mixedlayers of 150 to 300 m depth. The introductionof nutrient-rich deep water, following thebreakdown of seasonal thermoclines into thebrightly lit surface waters, leads to the bloom-ing of single cell phytoplankton, including twocyanobacteria species, Synechococcus and Pro-

chlorococcus, that numerically dominate thephotosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters)were collected aboard the RV Weatherbird IIfrom three sites off the coast of Bermuda inFebruary 2003. Additional samples were col-lected aboard the SV Sorcerer II from “Hydro-station S” in May 2003. Sample site locationsare indicated on Fig. 1 and described in tableS1; sampling protocols were fine-tuned fromone expedition to the next (5). Genomic DNAwas extracted from filters of 0.1 to 3.0 !m, andgenomic libraries with insert sizes ranging from2 to 6 kb were made as described (5). Theprepared plasmid clones were sequenced fromboth ends to provide paired-end reads at the J.Craig Venter Science Foundation Joint Tech-nology Center on ABI 3730XL DNA sequenc-ers (Applied Biosystems, Foster City, CA).Whole-genome random shotgun sequencing ofthe Weatherbird II samples (table S1, samples 1 to4) produced 1.66 million reads averaging 818 bpin length, for a total of approximately 1.36 Gbp ofmicrobial DNA sequence. An additional 325,561sequences were generated from the Sorcerer IIsamples (table S1, samples 5 to 7), yielding ap-proximately 265 Mbp of DNA sequence.Environmental genome shotgun as-

sembly. Whole-genome shotgun sequencingprojects have traditionally been applied to iden-tify the genome sequence(s) from one particularorganism, whereas the approach taken here isintended to capture representative sequencefrom many diverse organisms simultaneously.Variation in genome size and relative abun-dance determines the depth of coverage of anyparticular organism in the sample at a givenlevel of sequencing and has strong implicationsfor both the application of assembly algorithmsand for the metrics used in evaluating the re-sulting assembly. Although we would expectabundant species to be deeply covered and wellassembled, species of lower abundance may berepresented by only a few sequences. For asingle genome analysis, assembly coveragedepth in unique regions should approximate aPoisson distribution. The mean of this distribu-tion can be estimated from the observed data,looking at the depth of coverage of contigsgenerated before any scaffolding. The assem-bler used in this study, the Celera Assembler(6), uses this value to heuristically identifyclearly unique regions to form the backbone ofthe final assembly within the scaffolding phase.However, when the starting material consists ofa mixture of genomes of varying abundance, athreshold estimated in this way would classifysamples from the most abundant organism(s) asrepetitive, due to their greater-than-averagedepth of coverage, paradoxically leaving themost abundant organisms poorly assembled.We therefore used manual curation of an initial

1The Institute for Biological Energy Alternatives, 2TheCenter for the Advancement of Genomics, 1901 Re-search Boulevard, Rockville, MD 20850, USA. 3TheInstitute for Genomic Research, 9712 Medical CenterDrive, Rockville, MD 20850, USA. 4The J. Craig VenterScience Foundation Joint Technology Center, 5 Re-search Place, Rockville, MD 20850, USA. 5University ofSouthern California, 223 Science Hall, Los Angeles, CA90089–0740, USA. 6Bermuda Biological Station forResearch, Inc., 17 Biological Lane, St George GE 01,Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org66

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 3: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

assembly to identify a set of large, deeply as-sembling nonrepetitive contigs. This was used toset the expected coverage in unique regions (to23") for a final run of the assembler. This al-lowed the deep contigs to be treated as uniquesequence when they would otherwise be labeledas repetitive. We evaluated our final assemblyresults in a tiered fashion, looking at well-sampledgenomic regions separately from those barelysampled at our current level of sequencing.

The 1.66 million sequences from theWeatherbird II samples (table S1; samples 1 to4; stations 3, 11, and 13), were pooled andassembled to provide a single master assemblyfor comparative purposes. The assembly gener-ated 64,398 scaffolds ranging in size from 826bp to 2.1 Mbp, containing 256 Mbp of uniquesequence and spanning 400 Mbp. After assem-bly, there remained 217,015 paired-end reads,or “mini-scaffolds,” spanning 820.7 Mbp aswell as an additional 215,038 unassembled sin-gleton reads covering 169.9 Mbp (table S2,column 1). The Sorcerer II samples providedalmost no assembly, so we consider for thesesamples only the 153,458 mini-scaffolds, span-ning 518.4 Mbp, and the remaining 18,692singleton reads (table S2, column 2). In total,1.045 Gbp of nonredundant sequence was gen-erated. The lack of overlapping reads within theunassembled set indicates that lack of addition-al assembly was not due to algorithmic limita-tions but to the relatively limited depth of se-quencing coverage given the level of diversitywithin the sample.

The whole-genome shotgun (WGS) assemblyhas been deposited at DDBJ/EMBL/GenBankunder the project accession AACY00000000,and all traces have been deposited in a corre-sponding TraceDB trace archive. The versiondescribed in this paper is the first version,AACY01000000. Unlike a conventional WGSentry, we have deposited not just contigs andscaffolds but the unassembled paired singletonsand individual singletons in order to accurate-ly reflect the diversity in the sample andallow searches across the entire sample with-in a single database.Genomes and large assemblies. Our

analysis first focused on the well-sampled ge-nomes by characterizing scaffolds with at least3" coverage depth. There were 333 scaffoldscomprising 2226 contigs and spanning 30.9Mbp that met this criterion (table S3), account-ing for roughly 410,000 reads, or 25% of thepooled assembly data set. From this set of well-sampled material, we were able to cluster andclassify assemblies by organism; from the rarespecies in our sample, we used sequence similar-ity based methods together with computationalgene finding to obtain both qualitative and quan-titative estimates of genomic and functional diver-sity within this particular marine environment.

We employed several criteria to sort themajor assembly pieces into tentative organism“bins”; these include depth of coverage, oligo-

nucleotide frequencies (7), and similarity topreviously sequenced genomes (5). With thesetechniques, the majority of sequence assignedto the most abundant species (16.5 Mbp of the30.9 Mb in the main scaffolds) could be sepa-rated based on several corroborating indicators.In particular, we identified a distinct group ofscaffolds representing an abundant populationclearly related to Burkholderia (fig. S2) andtwo groups of scaffolds representing two dis-tinct strains closely related to the published

Shewanella oneidensis genome (8) (fig. S3).There is a group of scaffolds assembling at over6" coverage that appears to represent the ge-nome of a SAR86 (table S3). Scaffold setsrepresenting a conglomerate of Prochlorococ-cus strains (Fig. 2), as well as an unculturedmarine archaeon, were also identified (table S3;Fig. 3). Additionally, 10 putative mega plasmidswere found in the main scaffold set, coveredat depths ranging from 4" to 36" (indicatedwith shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image ofocean chlorophyll in the Sargasso Sea gridabout the BATS site from 22 February2003. The station locations are overlainwith their respective identifications. Notethe elevated levels of chlorophyll (greencolor shades) around station 3, which arenot present around stations 11 and 13.

Fig. 2. Gene conser-vation among closelyrelated Prochlorococ-cus. The outermostconcentric circle ofthe diagram depictsthe competed genom-ic sequence of Pro-chlorococcus marinusMED4 (11). Fragmentsfrom environmentalsequencing were com-pared to this complet-ed Prochlorococcus ge-nome and are shown inthe inner concentriccircles and were givenboxed outlines. Genesfor the outermost cir-cle have been as-signed psuedospec-trum colors based onthe position of thosegenes along the chro-mosome, where genesnearer to the start ofthe genome are col-ored in red, and genesnearer to the end of the genome are colored in blue. Fragments from environmental sequencingwere subjected to an analysis that identifies conserved gene order between those fragments andthe completed Prochlorococcus MED4 genome. Genes on the environmental genome segmentsthat exhibited conserved gene order are colored with the same color assignments as theProchlorococcus MED4 chromosome. Colored regions on the environmental segments exhibitingcolor differences from the adjacent outermost concentric circle are the result of conserved geneorder with other MED4 regions and probably represent chromosomal rearrangements. Genes thatdid not exhibit conserved gene order are colored in black.

R E S E A R C H A R T I C L E

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 67

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 4: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

Fig. 4). Other organisms were not so readilyseparated, presumably reflecting some combi-nation of shorter assemblies with less “taxo-nomic signal,” less distinctive sequence, andgreater divergence from previously sequencedgenomes (9).Discrete species versus a population

continuum. The most deeply covered of thescaffolds (21 scaffolds with over 14" coverageand 9.35 Mb of sequence), contain just over 1single nucleotide polymorphism (SNP) per10,000 base pairs, strongly supporting the pres-ence of discrete species within the sample. Inthe remaining main scaffolds (table S3), theSNP rate ranges from 0 to 26 per 1000 bp, witha length-weighted average of 3.6 per 1000 bp.We closely examined the multiple sequencealignments of the contigs with high SNP ratesand were able to classify these into two fairlydistinct classes: regions where several closelyrelated haplotypes have been collapsed, in-creasing the depth of coverage accordingly(10), and regions that appear to be a relativelyhomogenous blend of discrepancies from theconsensus without any apparent separation intohaplotypes, such as the Prochlorococcus scaf-fold region (Fig. 5). Indeed, the Prochlorococ-cus scaffolds display considerable heterogene-ity not only at the nucleotide sequence level(Fig. 5) but also at the genomic level, wheremultiple scaffolds align with the same region ofthe MED4 (11) genome but differ due to geneor genomic island insertion, deletion, rearrange-ment events. This observation is consistent withprevious findings (12). For instance, scaffolds2221918 and 2223700 share gene synteny witheach other and MED4 but differ by the insertionof 15 genes of probable phage origin, likelyrepresenting an integrated bacteriophage. Thesegenomic differences are displayed graphicallyin Fig. 2, where it is evident that up to fourconflicting scaffolds can align with the sameregion of the MED4 genome. More than 85%of the Prochlorococcus MED4 genome can bealigned with Sargasso Sea scaffolds greaterthan 10 kb; however, there appear to be acouple of regions of MED4 that are not repre-sented in the 10-kb scaffolds (Fig. 2). Thelarger of these two regions (PMM1187 toPMM1277) consists primarily of a gene clustercoding for surface polysaccharide biosynthesis,which may represent a MED4-specific polysac-charide absent or highly diverged in our Sar-gasso Sea Prochlorococcus bacteria. The heter-ogeneity of the Prochlorococcus scaffolds suggestthat the scaffolds are not derived from a singlediscrete strain, but instead probably represent aconglomerate assembled from a population ofclosely related Prochlorococcus biotypes.The gene complement of the Sargasso.

The heterogeneity of the Sargasso sequencescomplicates the identification of microbialgenes. The typical approach for microbial an-notation, model-based gene finding, relies en-tirely on training with a subset of manually

Fig. 3. Comparison ofSargasso Sea scaf-folds to Crenarchaealclone 4B7. Predictedproteins from 4B7and the scaffoldsshowing significanthomology to 4B7 bytBLASTx are arrayedin positional orderalong the x and yaxes. Colored boxesrepresent BLASTpmatches scoring atleast 25% similarityand with an e valueof better than 1e-5.Black vertical andhorizontal lines delin-eate scaffold borders.

Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward directionare shown in the outer concentric circle; reverse coding genes are shown in the inner concentriccircle. The genes have been given role category assignment and colored accordingly: amino acidbiosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cellenvelope, light green; cellular processes, red; central intermediary metabolism, brown; DNAmetabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta;protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange;regulatory functions and signal transduction, olive; transcription, dark green; transport and bindingproteins, blue-green; genes with no known homology to other proteins and genes with homologyto genes with no known function, white; genes of unknown function, gray; Tick marks are placedon 10-kb intervals.

R E S E A R C H A R T I C L E

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org68

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 5: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

identified and curated genes. With the vast ma-jority of the Sargasso sequence in short (lessthan 10 kb), unassociated scaffolds and single-tons from hundreds of different organisms, it isimpractical to apply this approach. Instead, wedeveloped an evidence-based gene finder (5).Briefly, evidence in the form of protein align-ments to sequences in the bacterial portion ofthe nonredundant amino acid (nraa) data set(13) was used to determine the most likelycoding frame. Likewise, approximate start andstop positions were determined from the bound-ing coordinates of the alignments and refined toidentify specific start and stop codons. Thisapproach identified 1,214,207 genes coveringover 700 MB of the total data set. This repre-sents approximately an order of magnitudemore sequences than currently archived in thecurated SwissProt database (14), which con-tains 137,885 sequence entries at the time ofwriting; roughly the same number of sequencesas have been deposited into the uncuratedREM-TrEMBL database (14) since its incep-tion in 1996. After excluding all intervals cov-ered by previously identified genes, additionalhypothetical genes were identified on the basisof the presence of conserved open readingframes (5). A total of 69,901 novel genes be-longing to 15,601 single link clusters were iden-tified. The predicted genes were categorized

Fig. 5. Prochlorococcus-related scaffold 2223290 illustrates the assembly of a broad commu-nity of closely related organisms, distinctly nonpunctate in nature. The image represents (A)global structure of Scaffold 2223290 with respect to assembly and (B) a sample of the multiplesequence alignment. Blue segments, contigs; green segments, fragments; and yellow segments,stages of the assembly of fragments into the resulting contigs. The yellow bars indicate thatfragments were initially assembled in several different pieces, which in places collapsed toform the final contig structure. The multiple sequence alignment for this region shows ahomogenous blend of haplotypes, none with sufficient depth of coverage to provide aseparate assembly.

Table 1. Gene count breakdown by TIGR rolecategory. Gene set includes those found on as-semblies from samples 1 to 4 and fragment readsfrom samples 5 to 7. A more detailed table, sep-arating Weatherbird II samples from the Sorcerer IIsamples is presented in the SOM (table S4). Notethat there are 28,023 genes which were classifiedin more than one role category.

TIGR role category Totalgenes

Amino acid biosynthesis 37,118Biosynthesis of cofactors,prosthetic groups, and carriers

25,905

Cell envelope 27,883Cellular processes 17,260Central intermediary metabolism 13,639DNA metabolism 25,346Energy metabolism 69,718Fatty acid and phospholipidmetabolism

18,558

Mobile and extrachromosomalelement functions

1,061

Protein fate 28,768Protein synthesis 48,012Purines, pyrimidines, nucleosides,and nucleotides

19,912

Regulatory functions 8,392Signal transduction 4,817Transcription 12,756Transport and binding proteins 49,185Unknown function 38,067Miscellaneous 1,864Conserved hypothetical 794,061

Total number of roles assigned 1,242,230

Total number of genes 1,214,207

R E S E A R C H A R T I C L E

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 69

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 6: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

using the curated TIGR role categories (5). Abreakdown of predicted genes by category isgiven in Table 1.

The samples analyzed here represent onlyspecific size fractions of the sampled environ-ment, dictated by the pore size of the collectionfilters. By our selection of filter pore sizes, wedeliberately focused this initial study on theidentification and analysis of microbial organ-isms. However, we did examine the data for thepresence of eukaryotic content as well. Al-though the bulk of known protists are 10 !mand larger, there are some known in the rangeof 1 to 1.5 !m in diameter [for example, Os-treococcus tauri (15) and the Bolidomonas spe-cies (16)], and such organisms could potentiallywork their way through a 0.80 !m prefilter. Aninitial screening for 18S ribosomal RNA(rRNA), a commonly used eukaryotic marker,identified 69 18S rRNA genes, with 63 of theseon singletons and the remaining 6 on verysmall, lowcoverage assemblies. These 18SrRNAs are similar to uncultured marine eu-karyotes and are indicative of a eukaryotic pres-ence but inconclusive on their own. Becausebacterial DNA contains a much greater densityof genes than eukaryotic DNA, the relativeproportion of gene content can be used as an-other indicator to distinguish eukaryotic mate-rial in our sample. An inverse relation wasobserved between the pore size of the pre-filtersand collection filters and the fraction of se-quence coding for genes (table S5). This rela-tion, together with the presence of 18S rRNAgenes in the samples, is strong evidence thateukaryotic material was indeed captured.Diversity and species richness. Most

phylogenetic surveys of uncultured organismshave been based on studies of rRNA genesusing polymerase chain reaction (PCR) withprimers for highly conserved positions in thosegenes. More than 60,000 small subunit rRNAsequences from a wide diversity of prokaryotictaxa have been reported (17). However, PCR-based studies are inherently biased, because notall rRNA genes amplify with the same “univer-sal” primers. Within our shotgun sequence dataand assemblies, we identified 1164 distinctsmall subunit rRNA genes or fragments ofgenes in the Weatherbird II assemblies andanother 248 within the Sorcerer II reads (5).Using a 97% sequence similarity cutoff to dis-tinguish unique phylotypes, we identified 148previously unknown phylotypes in our samplewhen compared against the RDP II database(17). With a 99% similarity cutoff, this numberincreases to 643. Though sequence similarity isnot necessarily an accurate predictor of func-tional conservation and sequence divergencedoes not universally correlate with the biologi-cal notion of “species,” defining species (alsoknown as phylotypes) by sequence similaritywithin the rRNA genes is the accepted standardin studies of uncultured microbes. All sampledrRNAs were then assigned to taxonomic groups

using an automated rRNA classification pro-gram (5). Our samples are dominated by rRNAgenes from Proteobacteria (primarily membersof the #, $, and % subgroups) with moderatecontributions from Firmicutes (low-GC Grampositive), Cyanobacteria, and species in theCFB phyla (Cytophaga, Flavobacterium, andBacteroides) (fig. S4A; Fig. 6). The patterns wesee are similar in broad outline to those ob-served by rRNA PCR studies from the SargassoSea (18), but with some quantitative differencesthat reflect either biases in PCR studies or dif-ferences in the species found in our sampleversus those in other studies.

An additional disadvantage associated withrelying on rRNA for estimates of species diver-sity and abundance is the varying number ofcopies of rRNA genes between taxa (more thanan order of magnitude among prokaryotes)(19). Therefore, we constructed phylogenetictrees (fig. S4, B to E) using other representedphylogenetic markers found in our data set,[RecA/RadA, heat shock protein 70 (HSP70),elongation factor Tu (EF-Tu), and elongationfactor G (EF-G)]. Each marker gene interval inour data set (with a minimum length of 75amino acids) was assigned to a putative taxo-nomic group using the phylogenetic analysisdescribed for rRNA. For example, our data set

contains over 600 recA homologs fromthroughout the bacterial phylogeny, includingrepresentatives of Proteobacteria, low- andhigh-GC Gram positives, Cyanobacteria, greensulfur and green nonsulfur bacteria, and othergroups. Assignment to phylogenetic groupsshows a broad consensus among the differentphylogenetic markers. For most taxa, therRNA-based proportion is the highest or lowestin comparison to the other markers. We believethis is due to the large amount of variation incopy number of rRNA genes between species.For example, the rRNA-based estimate of theproportion of %Proteobacteria is the highest,while the estimate for cyanobacteria is the low-est, which is consistent with the reports thatmembers of the %-Proteobacteria frequentlyhave more than five rRNA operon copies,whereas cyanobacteria frequently have fewerthan three (19).

Just as phylogenetic classification isstrengthened by a more comprehensive markerset, so too is the estimation of species richness.In this analysis, we define “genomic” species asa clustering of assemblies or unassembled readsmore than 94% identical on the nucleotide lev-el. This cutoff, adjusted for the protein-codingmarker genes, is roughly comparable to the97% cutoff traditionally used for rRNA. Thus

Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. Therelative contribution of organisms from different major phylogenetic groups (phylotypes) wasmeasured using multiple phylogenetic markers that have been used previously in phylogeneticstudies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). Therelative proportion of different phylotypes for each sequence (weighted by the depth of coverageof the contigs from which those sequences came) is shown. The phylotype distribution wasdetermined as follows: (i) Sequences in the Sargasso data set corresponding to each of these geneswere identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for eachphylogenetic marker identified in the Sargasso data separately compared with all members of thatgene family in all complete genome sequences (only complete genomes were used to control forthe differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of eachsequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.

R E S E A R C H A R T I C L E

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org70

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 7: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

defined, the mean number of species at thepoint of deepest coverage was 451 [averagedover the six genes analyzed; range 341 to 569(Table 2)]; this serves as the most conservativeestimate of species richness.

Although counts of observed species in asample are directly obtainable, the true numberof distinct species within a sample is almostcertainly greater than that which can be ob-served by finite sequence sampling. That is,given a diverse sample at any given level ofsequencing, random sampling is likely to en-tirely miss some subset of the species at lowestabundance. We considered three approaches toestimating the true diversity: nonparametricmethods for small sample corrections (20),parametric methods assuming a log-normal dis-tribution of species abundance (21), and a novelmethod based on fitting the observed depth ofcoverage to a theoretical model of assemblyprogress for a sample corresponding to a mix-

ture of organisms at different abundances. Allthree methods agree on a minimum of at least300 species per sample, with more than 1000species in the combined sample (5). We de-scribe in detail a model based on assemblydepth of coverage (5). Assuming standard ran-dom models for shotgun sequencing, sequenc-ing of an environmental sample should result indepths of sequence coverage reflecting a mix-ture of Poisson distributions. We computed theempirical distribution of coverage depth at ev-ery position in the full set of assemblies (includ-ing single fragment contigs, but not countinggaps between contigs), and compared it withhand-constructed mixtures of Poisson distribu-tions. The three depth of coverage–based mod-els shown in Table 3 indicate that there are atleast 1800 species in the combined sample, andthat a minimum of 12-fold deeper samplingwould be required to obtain 95% of the uniquesequence. However, these are only lower

bounds. The depth of coverage modeling isconsistent with as much as 80% of the assem-bled sequence being contributed by organismsat very low individual abundance and thuswould be compatible with total diversity ordersof magnitude greater than the lower bound. Theassembly coverage data also implies that morethan 100 Mbp of genome (i.e., probably morethan 50 species) is present at coverage highenough to permit assembly of a complete ornearly complete genome were we to sequenceto 5- to 10-fold greater sampling depth.

Taking the well-known marine rRNA cladeSAR11 as an example, one can readily get amore tangible view of the diversity within oursample. The SAR11 rRNA group accounts for26% of all RNA clones that have been identi-fied by culture-independent PCR amplificationof seawater (22) and has been found in nearlyevery pelagic marine bacterioplankton commu-nity. However, there are very few cultured rep-resentatives of this clade (23), and little isknown about its metabolic diversity. In total, 89scaffolds and 291 singletons from our data setcontain a SAR11 rRNA sequence. However,even with these nearly 400 representatives ofSAR11 within our sample, assembly depth ofcoverage ranges from only from 0.94- to 2.2-fold, and the largest scaffold is quite small at21,000 bp. This indicates a much more diversepopulation of organisms than previously attrib-uted to SAR11 based on rRNA PCR methods.Variability in species abundance. A

key advantage to the random sampling ap-proach described here is that it allows assess-ment of the stoichiometry (that is, estimation ofthe relative abundance of the dominant organ-isms). For example, we found that more thanhalf of all assemblies with more than 50 frag-ments were from organisms that were not atequal relative abundance in all of the samplescollected, suggesting widespread “patchiness”

Table 2. Diversity of ubiquitous single copy protein coding phylogenetic markers. Protein column usessymbols that identify six proteins encoded by exactly one gene in virtually all known bacteria. SequenceID specifies the GenBank identifier for corresponding E. coli sequence. Ortholog cutoff identifies BLASTxe-value chosen to identify orthologs when querying the E. coli sequence against the complete SargassoSea data set. Maximum fragment depth shows the number of reads satisfying the ortholog cutoff at thepoint along the query for which this value is maximal. Observed “species” shows the number of distinctclusters of reads from the maximum fragment depth column, after grouping reads whose containingassemblies had an overlap of at least 40 bp with & 94% nucleotide identity (single-link clustering).Singleton “species” shows the number of distinct clusters from the observed “species” column thatconsist of a single read. Most abundant column shows the fraction of the maximum fragment depth thatconsists of single largest cluster.

Protein Sequence ID Orthologcutoff

Max.fragmentdepth

Observed“species”

Singleton“species”

Mostabundant(%)

AtpD NTL01EC03653 1e-32 836 456 317 6GyrB NTL01EC03620 1e-11 924 569 429 4Hsp70 NT01EC0015 1e-31 812 515 394 4RecA NTL01EC02639 1e-21 592 341 244 8RpoB NTL01EC03885 1e-41 669 428 331 7TufA NTL01EC03262 1e-41 597 397 307 3

Table 3. Diversity models based on depth of coverage. Each row corre-sponds to an abundance class of organisms. The first column in eachmodel “fr(asm)” gives the fraction of the assembly consensus modeleddue to organisms at an abundance giving “Depth” coverage depth (second

column) in the sample. The third column [“E(s)”] gives the fraction of sucha genome expected to be sampled. The fourth column (“Genomes”)gives the resulting estimated number of genomes in the abun-dance class.

Model 1 Model 2 Model 3

fr(asm) Depth E(s) Genomes fr(asm) Depth E(s) Genomes fr(asm) Depth E(s) Genomes

0.0055 25 1.00E'00 2.5 0.0055 25 1.00E'00 2.5 0.0055 25 1.00E'00 2.50.005 21 1.00E'00 2.3 0.005 21 1.00E'00 2.3 0.005 21 1.00E'00 2.30.0035 13 1.00E'00 1.6 0.0035 13 1.00E'00 1.6 0.0035 13 1.00E'00 1.60.004 9 1.00E'00 1.8 0.004 9 1.00E'00 1.8 0.004 9 1.00E'00 1.80.008 7 9.99E-01 3.6 0.0088 7 9.99E-01 4 0.0088 7 9.99E-01 40.0047 6 9.98E-01 2.1 0.0047 6 9.98E-01 2.1 0.0047 6 9.98E-01 2.10.01 4 9.82E-01 4.6 0.029 2.4 9.09E-01 14.4 0.029 2.4 9.09E-01 14.40.0258 2.4 9.09E-01 12.8 0.096 2 8.65E-01 50 0.097 2 8.65E-01 50.50.07 2 8.65E-01 36.4 0.0235 1 6.32E-01 16.7 0.0225 1 6.32E-01 160.8635 0.25 2.21E-01 1,756.7 0.06 0.5 3.93E-01 68.6 0.06 0.5 3.93E-01 68.6

0.76 0.09 8.61E-02 3,973.6 0.66 0.124 1.17E-01 2,546.70.1 0.001 1.00E-03 45,022.5

Total 1824.4 4137.6 47,733

R E S E A R C H A R T I C L E

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 71

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 8: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

in the bacterioplankton distributions (table S3).A patchy distribution in the abundance of largereukaryotic planktonic organisms (i.e., phyto-plankton and zooplankton) within marine eco-systems has been well documented (24, 25).However, equivalent patchiness in the bacterio-plankton (including autotrophic cyanobacteria)has not been well studied, largely due to obser-vations that bacterial numbers appear not toexhibit large spatial or temporal changes withinspecific oceanographic regimes (26, 27). Ourapproach provides a means to better elucidatebacterioplankton distributions.

Microheterogeneity of environments andthe existence of microbes on easily disruptedparticulate matter (such as the tiny bits of debrisknown as “marine snow” or copepod fecal pel-lets) is an increasingly wellrecognized compo-nent of biology and biogeochemistry of theocean (28–30) and is supported by our findings.Within water samples as large as the ones usedin this study ((200 to 340 l), there are likely tobe numerous particles providing microhabitatsfor specific bacterioplankton, and disruptionduring the collection process would allow at-tached organisms to be captured within our sizefraction. For example, unique to sample 1 werescaffolds with 6- and 7-fold coverage that seemto represent two distinct strains of Shewanella,comprising roughly 14.4% of the sample 1 data,and a group of 21-fold coverage scaffolds withsimilarity to a Burkholderia spp. (table S3),comprising roughly 38.5% of the sample 1 data.These findings were initially quite surprising;Shewanella is an abundant genus in aquatic,usually nutrient-rich environments (31), andBurkholderia is typically found only in terres-trial environmental samples. Our rigorous pro-tocols (5) lead us to believe these sequences donot represent contamination. The finding ofBurkholderia in samples taken from marinemammals (32) provides a plausible mechanismfor the transfer of this typically terrestrial orcoastal organism to the open ocean, and micro-habitats such as marine snow could certainlyprovide point sources of protein material forgrowth of these species.

Some organisms expected to be more com-mon at lower depths were also represented inour surface samples in significant numbers. Forexample, scaffolds from an uncultured marinearchaeon were identified (based on a 16S rDNAsequence and two 5S ribosomal sequences). Amore inclusive set of 18 scaffolds covering 396kb were identified and putatively designated asbeing derived from Archaea (33) and examinedfurther. Several of these scaffolds (Fig. 3) weresimilar to (&90% identity) and have syntenywith a genomic fragment that was derived fromplanktonic marine Archaea clone 4B7 (34),collected at a depth of 200 m in the easternNorth Pacific. Further, the collection of scaf-folds related to clone 4B7 can be separatedfrom another group of archaeal-like assemblies,with sequence depth approximately fourfold,

with between 40 and 60% sequence identity tosequences derived from various completely se-quenced archaea including Pyrococcus horiko-shii and Sulfolobus solfataricus. Among theactivities encoded in these archaeal assembliesare pathways for chorismate biosynthesis,oxoacid-ferrodoxin oxidoreductases, and gluta-mate and phosphoglycerate dehydrogenases.The separation of the archaeal assemblies intotwo groups of (2" and (4" coverage, re-spectively, the presence of at least two distinctgroups of ribosomal proteins, and the associa-tion of the assemblies with different specieswithin this phylum suggest that these assem-blies represent genomic sequences from at leasttwo distinct archaeal lineages.Plasmids. Interesting as vehicles of lateral

gene transfer and as reservoirs of genes forimportant environmental adaptations, plasmidsare also abundant in the assembly. In the scaf-fold set, we find evidence for six putative plas-mids larger than 100 kbp in length, two plas-mids (70 to 80 kbp, and two plasmids under10 kbp (table S3; Fig. 4). In addition to genesfor plasmid replication, putative genes for ar-senate, mercury, copper, and cadmium resist-ance were also found. Additionally, putativeconjugal transfer systems (type IV secretion)and putative plasmid transfer genes (i.e., tragenes) were identified, suggesting that at leastsome of the plasmids from this sample may bereadily mobilized. One of the plasmids encodeshomologs of the UmuCD DNA damage-induced DNA polymerase of Escherchia coli.In the Sargasso Sea, UmuCD genes could playa role in ultraviolet (UV) resistance (the UmuCDNA polymerase of E. coli is able to replicateDNA even when it has been damaged by UVirradiation). Alternatively, these genes couldconvey a mutator phenotype on their hosts, asother plasmid-encoded UmuCD homologshave been shown to do in various bacterialspecies. Further studies on the specificity of theputative trace metal resistance genes will fur-ther our understanding of the role of microbesin trace metal cycling in the oligotrophic ocean.Bacteriophage. Due to the experimental

design and approach in this sampling, onlydouble-stranded DNA bacteriophages were ob-servable. Searching the assembly againstknown phage sequence databases (5), we iden-tified 71 scaffolds greater than 10 kb in lengthcontaining identifiable clusters of phage genes,with Burkholderia- and Shewanella-associatedscaffolds accounting for roughly one-third ofthese. Because each functional phage containsone copy of the major capsid, portal protein,and large terminase gene (among others), weused matches to these to determine that thereare at least 50 major capsid, portal, and/or largeterminase groupings genes in the scaffolds andthree times as many of these in the singletons.

The primary phage sequences identifiedwere matches to lytic phages and the archaealphage, Sulfolobus islandicus filamentous virus

(SIFV), with the marine Vibrio T-4–likeKVP40 phage that appears most frequently.The distribution of best hits across viral familiesis similar to observations in a recent study ofsamples from Mission Bay and Scripps Pier(35). Among the matches to lytic phages areproteins classically involved in recycling orscavenging phosphate (e.g., ribonucleoside or-tide reductases, phoH, thymidine synthetases,and endo- or exonucleases). Historically, 1 to4% of planktonic free-living bacteria are visiblyinfected (36). On the basis of gene order, nocomplete replicative phage genomes could beidentified to corroborate this number. Given thehigh diversity of phages in the population, it isreasonable to hypothesize that our current dataset was not sampled deeply enough to com-pletely assemble a clonal phage.Photobiology of the Sargasso Sea. Mi-

crobial photosynthesis in the Sargasso Sea isthought to be dominated by the cyanobacteriaProchlorococcus and Synechococcus. Howev-er, more than 90% of the cyanobacterial-likescaffolds across our samples appear to be de-rived from Prochlorococcus rather than Syn-echococcus species. This probably reflects, inpart, the relative distribution of these two gen-era in different gradients of open ocean waters(37) as well as the larger average size of Syn-echococcus cells that would have excludedmany from 0.2- to 0.8-!m size fraction. Thediversity of homologs of ribulose bis-phosphatecarboxylase (RubisCO) was low in our samples(e.g., we only found 37 BLASTp matches to thelarge subunit of RubisCO in the predicted geneset). This was surprising, given that RubisCO isthe key enzyme of the Calvin Cycle, the mainprocess for carbon fixation in plants, cyanobac-teria, and many types of photosynthetic bacte-ria, but might simply reflect the dominance of asingle group of autotrophs (i.e., Prochlorococ-cus) in the size fractions studied here.

The recent discovery of a homolog ofbacteriorhodopsin in an uncultured %-Proteobacteria from the Monterey Bay revealedthe basis of a form of phototrophy in marinesystems (38) that was observed previously byoceanographers (39, 40). Bacteriorhodopsin al-lows for the coupling of light energy harvestingand carbon cycling in the ocean through anon–chlorophyll-based pathway. Environmentalculture-independent gene surveys with PCRhave since shown that proteorhodopsin is notlimited to a single oceanographic locationand revealed some 67 additional closely re-lated proteorhodopsin homologs (41). Morethan 650 rhodopsin homologs were identifiedwithin the Weatherbird II samples (samples 1to 4), and an additional 132 were identified inthe Sorcerer II samples (samples 5 to 7),increasing the total number of proteorhodop-sin examples by almost an order of magni-tude. Phylogenetic analysis of all availablerhodopsin-like genes reveals that the addi-tional sequences we identified in the Sargasso

R E S E A R C H A R T I C L E

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org72

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 9: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

Sea populations also represent a great increase inthe sequence diversity of proteorhodopsin-likegenes, in comparison to what had been seenpreviously (Fig. 7). In total, we identified 13distinct subfamilies of rhodopsin-like genes.These include four families of proteinsknown from cultured organisms (halorhodp-sin, bacterorhodopsin, sensory opsins, andfungal opsin), and nine families from uncul-tured species of which seven are only knownfrom the Sargasso Sea populations. Of thesesubfamilies, many are quite distant from ei-ther proteorhodopsin or from rhodopsins incultured species. In situ protein profiles forthis area are not available, so expressionlevels of these genes remain to be deter-mined. The capability of our approach to iden-tify genes of particular biological function incontext with phylogenetically informativemarkers allowed us to look beyond the diversityof the rhodopsins themselves to the phyloge-netic diversity of the organisms containing theidentified rhodopsins. Analysis of the scaffoldscontaining both a rhodopsin and a phylogeneticmarker demonstrate that the rhodopsins arefound well outside the groups of proteobacteriawhere they had previously been discovered(39). For example, on one scaffold we observedthese as well as a DNA-directed RNA polymer-ase ) subunit originating from the CFB group(fig. S10). Such findings are consistent withthe hypothesis that rhodopsins are wide-spread and abundant in marine planktonicbacteria; however, this study demonstratesthat the distribution of rhodopsins is evengreater than previously established.

Conclusion. We demonstrate here thatshotgun sequencing provides a wealth of phy-logenetic markers that can be used to assess thephylogenetic diversity of a sample with morepower than conventional PCR-based rRNAstudies allow. We find that although the quali-tative picture that emerges is similar to thatbased on analysis of rRNA genes alone, thequantitative picture is significantly different forcertain taxonomic groups. Further, just as shot-gun sequencing provides a relatively unbiasedway to examine species diversity, it can alsoallow a relatively unbiased identification of thediversity of genes in particular gene families.Using these results to drive gene expressionsurveys and physiological studies provides amore comprehensive approach to environmen-tal biology than previously available.

We are a long way from a full understand-ing of the biology of the organisms sampledhere, but this relatively small genomics studydemonstrates areas where important insightsmay be gained. For example, because it hasbeen believed that only members of the bac-terial domain were capable of oceanic nitri-fication, it is interesting to note that an am-monium monooxygenase gene was found onan archaeal-associated scaffold within ourdata set. Though ultraviolet (UV) light in thesurface ocean is likely to inhibit ammoniaoxidation by chemoautotrophs (42, 43), ni-trite concentrations are nearly as great asnitrate concentrations during the period ofwinter mixing when our samples were col-lected. Together with our finding, this sug-gests that the high nitrite concentrations may

result from archaeal ammonium oxidation,which would not be inhibited by UV light.Our study also sheds light on the mechanismsby which the Sargasso Sea inhabitants mayefficiently utilize the available phosphorus inthis extremely phosphorus-limited environ-ment. Transport mechanisms for phospho-nates that were recently been identified in theProchlorococcus and Synechococcus ge-nomes (11, 44) appear in our data set, inaddition to a large number of published genesresponsible for utilization of polyphosphatesand pyrophosphates, which are also availablein this environment. Further, the high-affinityinorganic phosphorus transporter PstS (45,46), one of the Pho Regulon group of genesassociated with Synechococcus, was alsoidentified within our samples. Indeed, itwould appear that a great variety of themicrobial population present at the time ofthis sampling possessed mechanisms to uti-lize the dominant form of phosphorus in theSargasso Sea. This opens a door to furtherstudies of expression and physiology.

There are obviously challenges to genomeassembly in the environmental context. Partic-ularly well-conserved regions (e.g., 16S rRNA)may assemble across species, similar to anover-collapsed repeat in a standard eukaryoticassembly project. Such over-collapsing, whichresults in a pattern of inconsistent mate-pairrelationships, causes scaffolds to break, result-ing in suboptimal assembly. Similarly, twoclosely related organisms may contain stretchesof high similarity that co-assemble, whereasother, more dissimilar regions assemble appro-priately by organism. A third challenge is thatposed by the low sequence coverage of the lessabundant organisms, which may call for moreaggressive use of mate links to achieve assem-bly. With deeper coverage, multiple mate linksbetween contigs provide greater statistical sup-port for scaffolds. At lower coverage, theregenerally will be fewer confirming matelinks between contigs, and therefore lessstatistical support for the scaffolds. Mis-mated mate reads, while rare, occur withsome probability and, in a low coveragescenario, could contribute to misassembly.None of these issues is unique to environ-mental assembly, and advances in theseareas will find application in single organ-ism assembly projects as well. Our assem-bly results demonstrate that despite thesechallenges, one can apply whole-genomeassembly algorithms successfully in an en-vironmental context, with the only real lim-itation being the sequencing cost.

Although it may be expensive to sequenceat great depths at this particular point in time,the economics of sequencing is rapidly chang-ing. Were sequencing costs not to drop dramat-ically (as we fully expect they will), normaliza-tion procedures to remove over-sampled organ-isms from the DNA sample to better study the

Fig. 7. Phylogenetictree of rhodopsinlikegenes in the SargassoSea data along withall homologs of thesegenes in GenBank.The sequences arecolored according tothe type of sample inwhich they werefound: blue, culturedspecies; yellow, se-quences from uncul-tured organisms inother environmentalsamples; and red, se-quences from uncul-tured species in theSargasso Sea. The treewas divided into whatwe propose are dis-tinct subfamilies ofsequences, which arelabeled on the right.The tree was con-structed as follows: (i)All homologs of halo-rhodopsin were iden-tified in the predictedproteins from the Sargasso Sea assemblies using BLASTp searches with representatives of previ-ously identified halorhodpsinlike protein families as query sequences. (ii) All sequences greater than75 amino acids in length were aligned to each other using CLUSTALw, and a neighbor-joiningphylogenetic tree was inferred using the protdist and neighbor programs of Phylip.

R E S E A R C H A R T I C L E

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 73

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from

Page 10: Environmental Genome Shotgun Sequencing of the Science The … · 2008. 1. 30. · Environmental Genome Shotgun Sequencing of the Sargasso Sea J. Craig Venter, 1 * Karin Remington,

remainder or to isolate DNA of particular inter-est based on the initial survey make this a verypractical method for examining the genomiccontent of environmental communities.

References and Notes1. O. Beja et al., Science 289, 1902 (2000).2. M. R. Rondon et al., Appl. Environ. Microbiol. 66, 2541(2000).

3. M. H. Conte, Oceanus 40, 15 (1998).4. D. K. Steinberg et al., Deep-Sea Res. II 48, 1405 (2001).5. Materials and Methods are available as supportingonline material (SOM) on Science Online.

6. E. W. Myers et al., Science 287, 2196 (2000).7. S. Karlin, C. Burge, Trends Genet. 11, 283 (1995).8. J. F. Heidelberg et al., Nature Biotechnol. 20, 1118(2002).

9. For example, Burkholderia may be the only well-represented genus of $-Proteobacteria, with highlydistinctive GC content, whereas there may be manydifferent %-Proteobacteria.

10. These putative strains can be easily separated post-hoc by reassembly with more stringent overlap cri-teria. See SOM for details.

11. G. Rocap et al., Nature 424, 1042 (2003).12. B. Palenik, Appl. Environ. Microbiol. 60, 3212 (1994).13. Searches were performed on the sequences longerthan 40 bp with BLASTx against the entire nraa dataset archived at GenBank.

14. B. Boeckmann et al., Nucleic Acids Res. 31, 365 (2003).15. M. J. Chretiennotdinet et al., Phycologia 34, 285 (1995).16. L. Guillou et al., J. Phycol. 35, 368 (1999).17. J. R. Cole et al., Nucleic Acids Res. 31, 442 (2003).18. C. A. Carlson et al., Aquat. Microb. Ecol. 30, 19 (2002).19. J. A. Klappenbach, J. M. Dunbar, T. M. Schmidt, Appl.Environ. Microbiol. 66, 1328 (2000).

20. A. Chao, Scand. J. Stat. 11, 265 (1984).

21. T. P. Curtis, W. T. Sloan, J. W. Scannell, Proc. Natl.Acad. Sci. U.S.A. 99, 10494 (2002).

22. S. Giovanonni, M. S. Rappe, in Microbial Ecology ofthe Oceans, D. L. Kirchman, Ed. (Wiley, New York,2000), pp. 47–84.

23. M. S. Rappe, S. A. Connon, K. L. Vergin, S. J. Giovan-noni, Nature 418, 630 (2002).

24. J. Raymont, Plankton and Productivity in the Oceans(William Clowes Limited, London, ed. 2, 1980).

25. T. Parsons, M. Takahashi, B. Hargrave, BiologicalOceanographic Processes (BPCC Wheatons Ltd., Exeter,UK, ed. 3, 1984).

26. J. E. Hobbie, R. J. Daley, S. Jasper, Appl. Environ.Microbiol. 33, 1225 (1977).

27. P. del Giorgio, Y. Prairie, D. Bird, Microb. Ecol. 34, 144(1997).

28. A. Alldredge, Y. Cohen, Science 235, 689 (1987).29. E. F. DeLong, D. Franks, A. Alldredge, Limnol. Ocean-ogr. 38, 924 (1993).

30. U. Passow, Prog. Oceanogr. 55, 287 (2002).31. K. H. Nealson, J. Scott, in The Prokaryotes: An EvolvingElectronic Resource for the Microbiological Commu-nity, E. A. Dworkin, Ed. (Springer-Verlag, NY, 2003).

32. C. L. Hicks, R. Kinoshita, P. W. Ladds, Aust. Vet. J. 78,193 (2000).

33. The criteria for this filtering were that more than50% of the assembly had a database hit and morethan 30% of the sequence had the best hit to ar-chaeal gene sequences.

34. J. L. Stein, T. L. Marsh, K. Y. Wu, H. Shizuya, E. F.DeLong, J. Bacteriol. 178, 591 (1996).

35. M. Breitbart et al., Proc. Natl. Acad. Sci. U.S.A. 99,14250 (2002).

36. L. M. Proctor, J. A. Fuhrman, Nature 343, 60 (1990).37. K. K. Cavender-Bares, D. M. Karl, S. W. Chisholm,Deep-Sea Res. I 48, 2373 (2001).

38. O. Beja et al., Science 289, 1902 (2000).39. J. R. de la Torre et al., Proc. Natl. Acad. Sci. U.S.A.100, 12830 (2003).

40. O. Beja, E. N. Spudich, J. L. Spudich, M. Leclerc, E. F.DeLong, Nature 411, 786 (2001).

41. G. Sabehi et al., Environ. Microbiol. 5, 842 (2003).42. M. Guerero, R. Jones, Mar. Ecol. Prog. Ser. 141, 193(1996).

43. M. Guerrero, R. Jones, Mar. Ecol. Prog. Ser. 141, 183(1996).

44. B. Palenik et al., Nature 424, 1037 (2003).45. D. Scanlan et al., Appl. Environ. Microbiol. 63, 2411(1997).

46. D. Scanlan, W. H. Wilson, Hydrobiologia 401, 149 (1999).47. The authors would like to acknowledge P. Lethaby, M.Roadman, D. Clougherty, N. Buck, J. Selengut, A.Delcher, M. Pop, H. Koo, R. Doering, M. Wu, J. Badger, K.Moffat, S. Yooseph, E. Kirkness, D. Karl, K. Heidelberg, B.Friedman, H. Kowalski, and the staff of the J. CraigVenter Science Foundation Joint Technology Center.We also acknowledge the help of N. Nelson at UCSB-ICESS for assistance in acquiring the satellite image inFig. 1. Further, we acknowledge the NSF Division ofOcean Sciences for its ongoing support of the BATSProgram and the RV Weatherbird II, and the Depart-ment of Energy Genomes to Life program for its supportof K. Nealson. This research was supported by theOffice of Science (B.E.R.), U.S. Department of Energygrant no. DE-FG02-02ER63453, and the J. Craig VenterScience Foundation. This is contribution No. 1646 of theBermuda Biological Station for Research, Inc.

Supporting Online Materialwww.sciencemag.org/cgi/content/full/1093857/DC1Materials and MethodsSOM TextFigs. S1 to S10Tables S1 to S5References

20 November 2003; accepted 20 February 2004Published online 4 March 2004;10.1126/science.1093857Include this information when citing this paper.

REPORTSApproaching the Quantum Limitof a Nanomechanical ResonatorM. D. LaHaye,1,2 O. Buu,1,2 B. Camarota,1,2 K. C. Schwab1*

By coupling a single-electron transistor to a high–quality factor, 19.7-mega-hertz nanomechanical resonator, we demonstrate position detection approach-ing that set by the Heisenberg uncertainty principle limit. At millikelvin tem-peratures, position resolution a factor of 4.3 above the quantum limit isachieved and demonstrates the near-ideal performance of the single-electrontransistor as a linear amplifier. We have observed the resonator’s thermal motionat temperatures as lowas56millikelvin,withquantumoccupation factors ofNTH*58. The implications of this experiment reach from the ultimate limits of forcemicroscopy to qubit readout for quantum information devices.

Since the development of quantum mechanics,it has been appreciated that there is a funda-mental limit to the precision of repeated posi-tion measurements (1). This is a consequence of

the Heisenberg uncertainty principle (2), whichplaces a limit on the simultaneous knowledge

of position x and momentum p: +x • +p !,

2,

where 2- • , is Planck’s constant. When ap-plied to a simple harmonic oscillator of mass mand angular resonant frequency .0, this rela-tionship places a limit on the precision of twoinstantaneous, strong position measurements,

what is called the “standard quantum limit,”

+xSQL * ! ,

2m.0(3).

Although the standard quantum limit capturesthe physics of the uncertainty principle, it is farfrom the situation found when one continuouslymeasures the position with a linear detector. Lin-ear amplifiers not only detect and amplify theincoming desired signal but also impose back-action onto the object under study (4); the currentnoise emanating from the input of a voltage pre-amplifier or the momentum noise imparted to amirror in an optical interferometer are manifesta-tions of this back-action. The uncertainty principleagain appears and places a quantum limit on theminimum possible back-action for a linear ampli-fier. Previous work (5) has concluded that theminimum possible amplifier noise temperature is

TQL *,.0

ln3 • kB. Applying this result to the con-

tinuous readout of a simple harmonic oscillatoryields the ultimate position resolution (6):

+xQL * ! ,

ln3 • m.0/ 1.35 • +xSQL, which

1Laboratory for Physical Sciences, 8050 Greenmead Drive,College Park, MD 20740, USA. 2Department of Physics,University of Maryland, College Park, MD 20740, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

R E S E A R C H A R T I C L E

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org74

on

Janu

ary

29, 2

008

www.

scie

ncem

ag.o

rgDo

wnlo

aded

from