Creating the Genomic Encyclopedia for Bacteria and Archaea

Rob Edwards, Jonathan A. Eisen, Ross Overbeek, George Garrity, Veronika Vonstein, Sveta Gerdes, Folker Meyer, Kevin White, Tim Lilburn, Barney Whitman, et al.

Rick Stevens
Argonne National Laboratory
The University of Chicago

Eddy Rubin
Joint Genome Institute
Berkeley Lab

The Basic Idea of the Project

• To build an enterprise that can take advantage of the expected exponential improvements of sequencing capabilities to sequence “all known” cultured and described prokaryotes
  • Ride the expected “Moore’s law” of sequencing capability

• To develop a distributed high-throughput “industrial” approach to the cultivation, characterization, sequencing, annotation and analysis of prokaryotic genomes
  • Build a team from groups that have expertise and track records

• To build and curate a database of genome sequences, metabolic reconstructions, and standardized phenotype assays associated with each target organism
  • Streamline the release of data, provide a foundation for derivative projects

Concept of the Bergey’s/GEBA Sequencing Project

• A fixed-cost annual investment
  • Each year more can be sequenced as sequencing costs decrease and as cultivation efficiencies improve based on experience

• Leverage the expected improvement of sequencing costs
  • Address the overall scope within 5 to 6 years
  • Increase the number of near-complete sequences per year

• Optimize the choice of organisms to maximize diversity at each stage
  • Exploit the Bergey’s Trust and International Committee on Systematics of Prokaryotes for taxonomic coverage (e.g. Garrity and Whitman)
  • Involve the microbiology community for prioritization

• Industrialize the pipeline
  • Biological Resource Centers to produce and characterize type material
  • DOE JGI, NIAID/DMID Centers, NSF/USDA Centers for Sequencing
  • Laboratories for bioinformatics (Argonne, JGI, TIGR, ORNL, etc.)
  • Universities and Laboratories for modeling and analysis

The Question Is Not If, but When and How?

• Why should we want to accelerate this transition?

• Why not just let it happen as a matter of course?

• What is in the current sequencing pipeline?

            Completed Genomes   Ongoing/In the Pipeline
Archaeal                   29                        56
Bacterial                 397                       991
Eukaryal                   44                       631

• The existing process of bottom-up selection of organisms for sequencing is leaving many important groups underrepresented; closure will take a long time

• There are groups that are well represented in the literature, but not in the sequencing databases

• Underrepresentation is also an issue in environmental sequencing data

Tapping into prokaryotic biodiversity - Industrial Biotechnology

Hans E. Schoemaker, et al. 2003. Science 299:1694-97

• Rapidly growing field
  • by 2010 biocatalysis will be used in production of 60% of fine chemicals (McKinsey analysis)

• In US coordinated by USDA Biobased Products and Bioenergy Coordination Council (BBCC)

• Applications:
  • pharmaceuticals
  • food ingredients (sweeteners, vitamins)
  • feed additives and other agrochemicals
  • organic solvents
  • polymer raw materials
  • biofuels

• Advantages over chemical methods:
  • exquisite substrate specificity
  • excellent chemo-, regio- and stereoselectivity
  • environmentally friendly “green chemistry” based on biorenewables

• Needed:
  • novel enzymes and pathways
  • “Periodic table” of biochemical transformations

Straathof et al. 2002. Curr Opinion Biotech 13:548-56

~150 compounds are currently produced on industrial scale using biocatalysts. Examples:

Analysis of 1000s of new bacterial genomes will likely yield completely novel pathways and enzymes for industrial applications

Still to be discovered: the enzymes involved in the biosynthesis or catabolism of approximately 40 naturally occurring chemical functional groups

Examples of recently discovered biocatalytic transformations of novel organic functional groups:

• Hydroxylaminobenzene mutase

• Aldoxime dehydratase

• Azetidine-2-carboxylate hydrolase

• Benzylsuccinate synthase

• Phenylboronic acid oxygenase

L.P. Wackett. 2004. Current Opinion in Biotechnology, 15:280–284

• Current approaches to discovery of new enzymes:
  • Screening environmental samples by enrichment cultures (BUT: only <<1% of prokaryotes are currently culturable)
  • Metagenome approach: cloning & expression of DNA samples in a surrogate host, then screening for desired function (BUT: only known functions can be screened for, new biochemistry cannot be discovered)
  • Sequence-based discovery (growing explosively, generating knowledge base for basic sciences and biotechnological applications)

Building the Case

• There is a disparity between the literature and the existing genomes
  • We can’t fully exploit the community’s historical knowledge and investments without closing this gap

• There is a disparity between the rank/abundance curves from 16S studies and from environmental sequencing projects and the existing genomes
  • We can’t fully understand the new datasets without closing this gap (i.e. lack of complete sequence coverage of known culturables is holding back future work)

• There are likely to be new biochemical pathways and novel enzymes in the set of culturable but unsequenced organisms; sequencing non-cultured organisms would further expand diversity
  • These represent the low-hanging fruit for discovery since the investment has already been made in determining culture conditions

• A comprehensive database produced under controlled conditions that includes phenotype data and genotype data will accelerate research in understanding the genotype-phenotype relationship
  • Genome-scale reconstruction and modeling will be dramatically accelerated by comprehensive databases that include phenotype data

Estimated Sequencing Rates

Year                             2007   2008   2009   2010   2011   2012   2013   2014  Notes
Base pairs per dollar             200    300    450    675  1,013  1,519  2,278  3,417  50% improvement per year
Bacterial genome cost ($)      20,000 13,333  8,889  5,926  3,951  2,634  1,756  1,171  ~4M bp per genome
Number of genomes for $5M         250    375    563    844  1,266  1,898  2,848  4,271
Cumulative genomes sequenced      250    625  1,188  2,031  3,297  5,195  8,043 12,314
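The projection in the table above follows from three figures stated on the slide: 200 base pairs per dollar in 2007, a 50% improvement per year, and roughly 4M bp per bacterial genome, with a fixed $5M annual budget. The sketch below recomputes the rows from those inputs. The function and parameter names are illustrative, not from the slides, and the half-up rounding and the accumulation of unrounded yearly counts are assumptions chosen because they reproduce the printed values.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with .5 rounding up.

    Python's built-in round() rounds halves to even (562.5 -> 562), which
    would not match the table, so half-up rounding is assumed here.
    """
    return math.floor(x + 0.5)


def project_rates(first_year=2007, n_years=8, bp_per_dollar=200.0,
                  yearly_gain=1.5, genome_bp=4_000_000, budget=5_000_000):
    """Recompute the projection rows from the slide's stated assumptions."""
    rows = []
    cumulative = 0.0  # running total of unrounded yearly genome counts
    for k in range(n_years):
        rate = bp_per_dollar * yearly_gain ** k   # base pairs per dollar
        genomes = budget * rate / genome_bp       # genomes the budget buys
        cumulative += genomes
        rows.append({
            "year": first_year + k,
            "bp_per_dollar": round_half_up(rate),
            "genome_cost": round_half_up(genome_bp / rate),
            "genomes": round_half_up(genomes),
            "cumulative": round_half_up(cumulative),
        })
    return rows


if __name__ == "__main__":
    for row in project_rates():
        print(row)
```

Changing `yearly_gain` or `budget` shows how sensitive the cumulative total is to the assumed rate of improvement in sequencing cost.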

Pipeline stages (from the slide diagram): Selection of Targets, Produce DNA, Sequencing/Assembly, Rapid Annotation (24 Hours), Metabolic Reconstruction, Model Generation, Phenotype Prediction, Database Repository

Technical Feasibility FAQ

• How many genomes would the project propose to sequence?
  • About 5000 over 5-7 years

• Who would produce the biomass needed for DNA extraction?
  • Type culture centers until enrichment and environmental methods mature

• Will the biomass/DNA be available for distribution?
  • Yes, both the DNA and the libraries could be stored for distribution

• What throughput is needed for DNA production?
  • From ~300 taxa per year at the beginning of the project to 2000 per year at the end

• What combinations of sequencing technologies need to be employed?
  • Sanger and Pyrosequencing initially, others as they come online

• What throughput is needed for annotation?
  • 24-hour turnaround from assembled sequence to initial availability; this has already been achieved at Argonne, TIGR and elsewhere

• Is it possible to have a standard set of phenotype assays given the broad spectrum of organisms and conditions?
  • We are considering Biolog as a model, but it is too limited

• How would the genomes be selected and prioritized?
  • At each cycle we choose genomes (e.g. via 16S) to minimize the diversity gaps
  • Community input would be solicited to ensure the project is tracking the community’s interests

• Is it necessary to “close” the genomes?
  • We think no. Libraries would be archived for groups that might be interested in closing.
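The FAQ above says that at each cycle genomes are chosen (e.g. via 16S) to minimize the diversity gaps, but the slides do not say how. One simple way to read that, offered here only as an assumption, is a greedy farthest-first rule: repeatedly pick the candidate whose nearest already-covered relative is most distant. In the sketch below the taxa names, the distance values, and the function `pick_targets` are all hypothetical; real distances would come from something like 16S sequence comparisons.

```python
def pick_targets(distance, sequenced, candidates, n_picks):
    """Greedy farthest-first selection (an assumed strategy, not the project's).

    distance   -- dict mapping frozenset({a, b}) to a pairwise distance
    sequenced  -- taxa that already have genomes
    candidates -- taxa available for sequencing
    n_picks    -- how many targets to choose this cycle
    """
    covered = list(sequenced)
    remaining = sorted(candidates)  # sorted for deterministic tie-breaking
    chosen = []

    def gap(taxon):
        # Distance from this taxon to its nearest covered relative.
        return min((distance[frozenset((taxon, s))] for s in covered),
                   default=float("inf"))

    while remaining and len(chosen) < n_picks:
        best = max(remaining, key=gap)  # candidate filling the largest gap
        chosen.append(best)
        covered.append(best)            # it now counts as covered
        remaining.remove(best)
    return chosen


# Hypothetical toy distances between four taxa (not real data).
TOY_DISTANCE = {
    frozenset(pair): d
    for pair, d in {
        ("A", "B"): 0.02, ("A", "C"): 0.20, ("A", "D"): 0.25,
        ("B", "C"): 0.21, ("B", "D"): 0.24, ("C", "D"): 0.05,
    }.items()
}

if __name__ == "__main__":
    # With A already sequenced, B is a near-duplicate and is picked last.
    print(pick_targets(TOY_DISTANCE, ["A"], ["B", "C", "D"], 2))
```

A rule of this kind favors deep-branching, poorly covered groups first. Community prioritization, as the FAQ mentions, would still have to be layered on top.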

The Project Would Provide a Comprehensive Set of Genome Sequences for:

• Biofuels, and bioproduction of alternative feedstocks
• Understanding and managing the microbial carbon cycle
• Soil and subsurface microbial ecology
• Bioremediation and bioconversion of waste streams
• Evolution and microbial ecological dynamics
• Context for environmental sequencing and metagenomics
• Basis for developing predictive models of phenotypes
• Source of components for synthetic biology
• Improving our understanding of cultivability
• Dramatically improving the reliability and quality of genome annotations

How Many Known Cultured Organisms?

• Latest version of the Prokaryotic Taxonomic Outline will contain 7951 named species of Bacteria and Archaea.

• Of these, 178 are non-cultivable or not represented by viable type material.

• An additional 1222 are synonyms.

• Of the 6543 type strains for which viable material is reportedly deposited, we have assembled a minimal set of 6389 strains that are available from 16 major public culture collections or biological resource centers in the US, Europe, and Asia.

• The remaining 154 are in minor or non-public collections.

• This information is derived from Release 6.1 of the Taxonomic Outline of the Prokaryotes which will be published in 2007 and is current through May 2006.

What Has Been Sequenced or is In Play

• Of the 6400 strains available from public sources
  • About 380 are human, animal or plant pathogens
  • On the order of 1/3 to 1/2 of the known pathogens have been sequenced

• 360 complete prokaryotic genomes published

• 56 archaeal and 940 bacterial genomes in progress

• Of 897 prokaryotic genomes in progress in GOLD
  • ~400 are pathogens (many duplicate taxa)
  • ~221 are supported by DOE (156 biotech, 51 environment)

• Approximately 5000 prokaryotes not yet in play
  • We estimate about 4800 non-pathogen taxa

Strain Distribution in Collections

Collection                                                Strains
US Collections / BRCs
  American Type Culture Collection (ATCC)                    4027
  USDA ARS Collection (NRRL)                                  223
European Collections
  Deutsche Sammlung von Mikroorganismen (DSMZ)               1302
  Culture Collection, University of Göteborg (CCUG)           183
  Pasteur Institute (CIP)                                     170
  Laboratory for Microbiology, Gent (LMG)                     101
  National Collection of Industrial and Marine Bacteria        25
  French Collection of Phytopathogens (CFBP)                   15
  National Collection of Type Cultures (NCTC)                  12
  National Collection of Phytopathogenic Bacteria              11
Asia
  Japan Collection of Microorganisms (JCM)                    185
  Institute of Fermentation, Osaka (IFO)                       34
  Korean Collection of Type Cultures (KCTC)                    28
  Institute of Applied Microbiology, Tokyo (IAM)               26
  National Institute of Technology and Evaluation (NBRC)       24
  All-Russian Collection of Microorganisms (VKM)               13

Distribution of Genome Sizes in the Pipeline

[Figure: Microbial Genome Size. x-axis: Taxa (1 to 521); y-axis: Base pairs (1,000's), 0 to 14,000]

Average Sequence ~ 4Mbp

Getting Value from the Genomes

• Genomes would be assembled by the groups doing the sequencing

• Assembled contigs would be sent to the initial high-throughput annotation server for draft annotations and immediately published on-line

• The accumulated (additional) genomes will be used to improve annotations (gene calls, functional coupling)

• Genomes will be integrated into databases to support comparative analysis and evolutionary analysis

• Annotated genomes can be used to semi-automatically construct genome-scale models which could be used to make metabolic phenotype predictions

Background

Online at:
• http://www.sequencingbergeys.org
• login required (just ask us)
• guest read-only access after the meeting?
• make maximum information available: Bergey hierarchy, NCBI taxonomy, 16S rRNA, strain collections, GOLD, SEED

http://www.sequencingbergeys.org/

List of organisms for sequencing, based on 16S clusters

Cluster Page

select strain for cluster

“Bergey” Browser

Species Page
