The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the...

18
The Sequence of the Human Genome J. Craig Venter, 1 * Mark D. Adams, 1 Eugene W. Myers, 1 Peter W. Li, 1 Richard J. Mural, 1 Granger G. Sutton, 1 Hamilton O. Smith, 1 Mark Yandell, 1 Cheryl A. Evans, 1 Robert A. Holt, 1 Jeannine D. Gocayne, 1 Peter Amanatides, 1 Richard M. Ballew, 1 Daniel H. Huson, 1 Jennifer Russo Wortman, 1 Qing Zhang, 1 Chinnappa D. Kodira, 1 Xiangqun H. Zheng, 1 Lin Chen, 1 Marian Skupski, 1 Gangadharan Subramanian, 1 Paul D. Thomas, 1 Jinghui Zhang, 1 George L. Gabor Miklos, 2 Catherine Nelson, 3 Samuel Broder, 1 Andrew G. Clark, 4 Joe Nadeau, 5 Victor A. McKusick, 6 Norton Zinder, 7 Arnold J. Levine, 7 Richard J. Roberts, 8 Mel Simon, 9 Carolyn Slayman, 10 Michael Hunkapiller, 11 Randall Bolanos, 1 Arthur Delcher, 1 Ian Dew, 1 Daniel Fasulo, 1 Michael Flanigan, 1 Liliana Florea, 1 Aaron Halpern, 1 Sridhar Hannenhalli, 1 Saul Kravitz, 1 Samuel Levy, 1 Clark Mobarry, 1 Knut Reinert, 1 Karin Remington, 1 Jane Abu-Threideh, 1 Ellen Beasley, 1 Kendra Biddick, 1 Vivien Bonazzi, 1 Rhonda Brandon, 1 Michele Cargill, 1 Ishwar Chandramouliswaran, 1 Rosane Charlab, 1 Kabir Chaturvedi, 1 Zuoming Deng, 1 Valentina Di Francesco, 1 Patrick Dunn, 1 Karen Eilbeck, 1 Carlos Evangelista, 1 Andrei E. Gabrielian, 1 Weiniu Gan, 1 Wangmao Ge, 1 Fangcheng Gong, 1 Zhiping Gu, 1 Ping Guan, 1 Thomas J. Heiman, 1 Maureen E. Higgins, 1 Rui-Ru Ji, 1 Zhaoxi Ke, 1 Karen A. Ketchum, 1 Zhongwu Lai, 1 Yiding Lei, 1 Zhenya Li, 1 Jiayin Li, 1 Yong Liang, 1 Xiaoying Lin, 1 Fu Lu, 1 Gennady V. Merkulov, 1 Natalia Milshina, 1 Helen M. Moore, 1 Ashwinikumar K Naik, 1 Vaibhav A. Narayan, 1 Beena Neelam, 1 Deborah Nusskern, 1 Douglas B. Rusch, 1 Steven Salzberg, 12 Wei Shao, 1 Bixiong Shue, 1 Jingtao Sun, 1 Zhen Yuan Wang, 1 Aihui Wang, 1 Xin Wang, 1 Jian Wang, 1 Ming-Hui Wei, 1 Ron Wides, 13 Chunlin Xiao, 1 Chunhua Yan, 1 Alison Yao, 1 Jane Ye, 1 Ming Zhan, 1 Weiqing Zhang, 1 Hongyu Zhang, 1 Qi Zhao, 1 Liansheng Zheng, 1 Fei Zhong, 1 Wenyan Zhong, 1 Shiaoping C. Zhu, 1 Shaying Zhao, 12 Dennis Gilbert, 1 Suzanna Baumhueter, 1 Gene Spier, 1 Christine Carter, 1 Anibal Cravchik, 1 Trevor Woodage, 1 Feroze Ali, 1 Huijin An, 1 Aderonke Awe, 1 Danita Baldwin, 1 Holly Baden, 1 Mary Barnstead, 1 Ian Barrow, 1 Karen Beeson, 1 Dana Busam, 1 Amy Carver, 1 Angela Center, 1 Ming Lai Cheng, 1 Liz Curry, 1 Steve Danaher, 1 Lionel Davenport, 1 Raymond Desilets, 1 Susanne Dietz, 1 Kristina Dodson, 1 Lisa Doup, 1 Steven Ferriera, 1 Neha Garg, 1 Andres Gluecksmann, 1 Brit Hart, 1 Jason Haynes, 1 Charles Haynes, 1 Cheryl Heiner, 1 Suzanne Hladun, 1 Damon Hostin, 1 Jarrett Houck, 1 Timothy Howland, 1 Chinyere Ibegwam, 1 Jeffery Johnson, 1 Francis Kalush, 1 Lesley Kline, 1 Shashi Koduru, 1 Amy Love, 1 Felecia Mann, 1 David May, 1 Steven McCawley, 1 Tina McIntosh, 1 Ivy McMullen, 1 Mee Moy, 1 Linda Moy, 1 Brian Murphy, 1 Keith Nelson, 1 Cynthia Pfannkoch, 1 Eric Pratts, 1 Vinita Puri, 1 Hina Qureshi, 1 Matthew Reardon, 1 Robert Rodriguez, 1 Yu-Hui Rogers, 1 Deanna Romblad, 1 Bob Ruhfel, 1 Richard Scott, 1 Cynthia Sitter, 1 Michelle Smallwood, 1 Erin Stewart, 1 Renee Strong, 1 Ellen Suh, 1 Reginald Thomas, 1 Ni Ni Tint, 1 Sukyee Tse, 1 Claire Vech, 1 Gary Wang, 1 Jeremy Wetter, 1 Sherita Williams, 1 Monica Williams, 1 Sandra Windsor, 1 Emily Winn-Deen, 1 Keriellen Wolfe, 1 Jayshree Zaveri, 1 Karena Zaveri, 1 Josep F. Abril, 14 Roderic Guigo ´, 14 Michael J. Campbell, 1 Kimmen V. Sjolander, 1 Brian Karlak, 1 Anish Kejariwal, 1 Huaiyu Mi, 1 Betty Lazareva, 1 Thomas Hatton, 1 Apurva Narechania, 1 Karen Diemer, 1 Anushya Muruganujan, 1 Nan Guo, 1 Shinji Sato, 1 Vineet Bafna, 1 Sorin Istrail, 1 Ross Lippert, 1 Russell Schwartz, 1 Brian Walenz, 1 Shibu Yooseph, 1 David Allen, 1 Anand Basu, 1 James Baxendale, 1 Louis Blick, 1 Marcelo Caminha, 1 John Carnes-Stine, 1 Parris Caulk, 1 Yen-Hui Chiang, 1 My Coyne, 1 Carl Dahlke, 1 Anne Deslattes Mays, 1 Maria Dombroski, 1 Michael Donnelly, 1 Dale Ely, 1 Shiva Esparham, 1 Carl Fosler, 1 Harold Gire, 1 Stephen Glanowski, 1 Kenneth Glasser, 1 Anna Glodek, 1 Mark Gorokhov, 1 Ken Graham, 1 Barry Gropman, 1 Michael Harris, 1 Jeremy Heil, 1 Scott Henderson, 1 Jeffrey Hoover, 1 Donald Jennings, 1 Catherine Jordan, 1 James Jordan, 1 John Kasha, 1 Leonid Kagan, 1 Cheryl Kraft, 1 Alexander Levitsky, 1 Mark Lewis, 1 Xiangjun Liu, 1 John Lopez, 1 Daniel Ma, 1 William Majoros, 1 Joe McDaniel, 1 Sean Murphy, 1 Matthew Newman, 1 Trung Nguyen, 1 Ngoc Nguyen, 1 Marc Nodell, 1 Sue Pan, 1 Jim Peck, 1 Marshall Peterson, 1 William Rowe, 1 Robert Sanders, 1 John Scott, 1 Michael Simpson, 1 Thomas Smith, 1 Arlan Sprague, 1 Timothy Stockwell, 1 Russell Turner, 1 Eli Venter, 1 Mei Wang, 1 Meiyuan Wen, 1 David Wu, 1 Mitchell Wu, 1 Ashley Xia, 1 Ali Zandieh, 1 Xiaohong Zhu 1 T HE H UMAN G ENOME 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 1304

Transcript of The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the...

Page 1: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

The Sequence of the Human GenomeJ. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1

Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1

Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1

Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1

Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1

George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5

Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9

Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1

Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1

Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1

Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1

Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1

Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1

Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1

Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1

Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1

Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12

Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1

Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1

Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1

Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1

Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1

Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1

Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1

Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1

Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1

Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1

Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1

Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1

Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1

Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1

Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1

Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1

Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1

Josep F. Abril,14 Roderic Guigo,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1

Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1

Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1

Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1

Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1

Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1

Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1

Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1

Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1

Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1

Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1

Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1

Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1

Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1304

Page 2: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion ofthe human genome was generated by the whole-genome shotgun sequencingmethod. The 14.8-billion bp DNA sequence was generated over 9 months from27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)from both ends of plasmid clones made from the DNA of five individuals. Twoassembly strategies—a whole-genome assembly and a regional chromosomeassembly—were used, each combining sequence data from Celera and thepublicly funded genome effort. The public data were shredded into 550-bpsegments to create a 2.9-fold coverage of those genome regions that had beensequenced, without including biases inherent in the cloning and assemblyprocedure used by the publicly funded group. This brought the effective cov-erage in the assemblies to eightfold, reducing the number and size of gaps inthe final assembly over what would be obtained with 5.11-fold coverage. Thetwo assembly strategies yielded very similar results that largely agree withindependent mapping data. The assemblies effectively cover the euchromaticregions of the human chromosomes. More than 90% of the genome is inscaffold assemblies of 100,000 bp or more, and 25% of the genome is inscaffolds of 10 million bp or larger. Analysis of the genome sequence revealed26,588 protein-encoding transcripts for which there was strong corroboratingevidence and an additional ;12,000 computationally derived genes with mousematches or other weak supporting evidence. Although gene-dense clusters areobvious, almost half the genes are dispersed in low G1C sequence separatedby large tracts of apparently noncoding sequence. Only 1.1% of the genomeis spanned by exons, whereas 24% is in introns, with 75% of the genome beingintergenic DNA. Duplications of segmental blocks, ranging in size up to chro-mosomal lengths, are abundant throughout the genome and reveal a complexevolutionary history. Comparative genomic analysis indicates vertebrate ex-pansions of genes associated with neuronal function, with tissue-specific de-velopmental regulation, and with the hemostasis and immune systems. DNAsequence comparisons between the consensus sequence and publicly fundedgenome data provided locations of 2.1 million single-nucleotide polymorphisms(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per1250 on average, but there was marked heterogeneity in the level of poly-morphism across the genome. Less than 1% of all SNPs resulted in variation inproteins, but the task of determining which SNPs have functional consequencesremains an open challenge.

Decoding of the DNA that constitutes thehuman genome has been widely anticipatedfor the contribution it will make toward un-

derstanding human evolution, the causationof disease, and the interplay between theenvironment and heredity in defining the hu-man condition. A project with the goal ofdetermining the complete nucleotide se-quence of the human genome was first for-mally proposed in 1985 (1). In subsequentyears, the idea met with mixed reactions inthe scientific community (2). However, in1990, the Human Genome Project (HGP) wasofficially initiated in the United States underthe direction of the National Institutes ofHealth and the U.S. Department of Energywith a 15-year, $3 billion plan for completingthe genome sequence. In 1998 we announcedour intention to build a unique genome-sequencing facility, to determine the se-quence of the human genome over a 3-yearperiod. Here we report the penultimate mile-stone along the path toward that goal, a nearlycomplete sequence of the euchromatic por-tion of the human genome. The sequencingwas performed by a whole-genome randomshotgun method with subsequent assembly ofthe sequenced segments.

The modern history of DNA sequencingbegan in 1977, when Sanger reported his meth-od for determining the order of nucleotides of

DNA using chain-terminating nucleotide ana-logs (3). In the same year, the first human genewas isolated and sequenced (4). In 1986, Hoodand co-workers (5) described an improvementin the Sanger sequencing method that includedattaching fluorescent dyes to the nucleotides,which permitted them to be sequentially readby a computer. The first automated DNA se-quencer, developed by Applied Biosystems inCalifornia in 1987, was shown to be successfulwhen the sequences of two genes were obtainedwith this new technology (6). From early se-quencing of human genomic regions (7), itbecame clear that cDNA sequences (which arereverse-transcribed from RNA) would be es-sential to annotate and validate gene predictionsin the human genome. These studies were thebasis in part for the development of the ex-pressed sequence tag (EST) method of geneidentification (8), which is a random selection,very high throughput sequencing approach tocharacterize cDNA libraries. The EST methodled to the rapid discovery and mapping of hu-man genes (9). The increasing numbers of hu-man EST sequences necessitated the develop-ment of new computer algorithms to analyzelarge amounts of sequence data, and in 1993 atThe Institute for Genomic Research (TIGR), analgorithm was developed that permitted assem-bly and analysis of hundreds of thousands ofESTs. This algorithm permitted characteriza-tion and annotation of human genes on the basisof 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lamb-da genome sequence was determined by ashotgun restriction digest method in 1982(11). When considering methods for sequenc-ing the smallpox virus genome in 1991 (12),a whole-genome shotgun sequencing methodwas discussed and subsequently rejected ow-ing to the lack of appropriate software toolsfor genome assembly. However, in 1994,when a microbial genome-sequencing projectwas contemplated at TIGR, a whole-genomeshotgun sequencing approach was consideredpossible with the TIGR EST assembly algo-rithm. In 1995, the 1.8-Mbp Haemophilusinfluenzae genome was completed by awhole-genome shotgun sequencing method(13). The experience with several subsequentgenome-sequencing efforts established thebroad applicability of this approach (14, 15).

A key feature of the sequencing approachused for these megabase-size and larger ge-nomes was the use of paired-end sequences(also called mate pairs), derived from sub-clone libraries with distinct insert sizes andcloning characteristics. Paired-end sequencesare sequences 500 to 600 bp in length fromboth ends of double-stranded DNA clones ofprescribed lengths. The success of using endsequences from long segments (18 to 20 kbp)of DNA cloned into bacteriophage lambda inassembly of the microbial genomes led to thesuggestion (16 ) of an approach to simulta-

1Celera Genomics, 45 West Gude Drive, Rockville, MD20850, USA. 2GenetixXpress, 78 Pacific Road, PalmBeach, Sydney 2108, Australia. 3Berkeley DrosophilaGenome Project, University of California, Berkeley, CA94720, USA. 4Department of Biology, Penn State Uni-versity, 208 Mueller Lab, University Park, PA 16802,USA. 5Department of Genetics, Case Western ReserveUniversity School of Medicine, BRB-630, 10900 EuclidAvenue, Cleveland, OH 44106, USA. 6Johns HopkinsUniversity School of Medicine, Johns Hopkins Hospi-tal, 600 North Wolfe Street, Blalock 1007, Baltimore,MD 21287–4922, USA. 7Rockefeller University, 1230York Avenue, New York, NY 10021–6399, USA. 8NewEngland BioLabs, 32 Tozer Road, Beverly, MA 01915,USA. 9Division of Biology, 147-75, California Instituteof Technology, 1200 East California Boulevard, Pasa-dena, CA 91125, USA. 10Yale University School ofMedicine, 333 Cedar Street, P.O. Box 208000, NewHaven, CT 06520–8000, USA. 11Applied Biosystems,850 Lincoln Centre Drive, Foster City, CA 94404, USA.12The Institute for Genomic Research, 9712 MedicalCenter Drive, Rockville, MD 20850, USA. 13Faculty ofLife Sciences, Bar-Ilan University, Ramat-Gan, 52900Israel. 14Grup de Recerca en Informatica Medica, In-stitut Municipal d’Investigacio Medica, UniversitatPompeu Fabra, 08003-Barcelona, Catalonia, Spain.

*To whom correspondence should be addressed. E-mail: [email protected]

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1305

Page 3: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

neously map and sequence the human ge-nome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs)(17, 18). The end sequences spanned byknown distances provide long-range continu-ity across the genome. A modification of theBAC end-sequencing (BES) method was ap-plied successfully to complete chromosome 2from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposedwhole-genome shotgun sequencing of thehuman genome. Their proposal was not wellreceived (21). However, by early 1998, asless than 5% of the genome had been se-quenced, it was clear that the rate of progressin human genome sequencing worldwidewas very slow (22), and the prospects forfinishing the genome by the 2005 goal wereuncertain.

In early 1998, PE Biosystems (now AppliedBiosystems) developed an automated, high-throughput capillary DNA sequencer, subse-quently called the ABI PRISM 3700 DNAAnalyzer. Discussions between PE Biosystemsand TIGR scientists resulted in a plan to under-take the sequencing of the human genome withthe 3700 DNA Analyzer and the whole-genomeshotgun sequencing techniques developed atTIGR (23). Many of the principles of operationof a genome-sequencing facility were estab-lished in the TIGR facility (24). However, thefacility envisioned for Celera would have acapacity roughly 50 times that of TIGR, andthus new developments were required for sam-ple preparation and tracking and for whole-genome assembly. Some argued that the re-quired 150-fold scale-up from the H. influenzaegenome to the human genome with its complexrepeat sequences was not feasible (25). TheDrosophila melanogaster genome was thuschosen as a test case for whole-genome assem-bly on a large and complex eukaryotic genome.In collaboration with Gerald Rubin and theBerkeley Drosophila Genome Project, the nu-cleotide sequence of the 120-Mbp euchromaticportion of the Drosophila genome was deter-mined over a 1-year period (26–28). The Dro-sophila genome-sequencing effort resulted intwo key findings: (i) that the assembly algo-rithms could generate chromosome assemblieswith highly accurate order and orientation withsubstantially less than 10-fold coverage, and (ii)that undertaking multiple interim assemblies inplace of one comprehensive final assembly wasnot of value.

These findings, together with the dramaticchanges in the public genome effort subsequentto the formation of Celera (29), led to a modi-fied whole-genome shotgun sequencing ap-proach to the human genome. We initially pro-posed to do 10-fold sequence coverage of thegenome over a 3-year period and to make in-terim assembled sequence data available quar-terly. The modifications included a plan to per-form random shotgun sequencing to ;5-fold

coverage and to use the unordered and unori-ented BAC sequence fragments and subassem-blies published in GenBank by the publiclyfunded genome effort (30) to accelerate theproject. We also abandoned the quarterly an-nouncements in the absence of interim assem-blies to report.

Although this strategy provided a reason-able result very early that was consistent with awhole-genome shotgun assembly with eight-fold coverage, the human genome sequence isnot as finished as the Drosophila genome waswith an effective 13-fold coverage. However, itbecame clear that even with this reduced cov-erage strategy, Celera could generate an accu-rately ordered and oriented scaffold sequence ofthe human genome in less than 1 year. Humangenome sequencing was initiated 8 September1999 and completed 17 June 2000. The firstassembly was completed 25 June 2000, and theassembly reported here was completed 1 Octo-ber 2000. Here we describe the whole-genomerandom shotgun sequencing effort applied tothe human genome. We developed two differ-ent assembly approaches for assembling the ;3billion bp that make up the 23 pairs of chromo-somes of the Homo sapiens genome. Any Gen-Bank-derived data were shredded to removepotential bias to the final sequence from chi-meric clones, foreign DNA contamination, ormisassembled contigs. Insofar as a correctlyand accurately assembled genome sequencewith faithful order and orientation of contigsis essential for an accurate analysis of thehuman genetic code, we have devoted a con-siderable portion of this manuscript to thedocumentation of the quality of our recon-struction of the genome. We also describe ourpreliminary analysis of the human geneticcode on the basis of computational methods.Figure 1 (see fold-out chart associated withthis issue; files for each chromosome can befound in Web fig. 1 on Science Online atwww.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical over-view of the genome and the features encodedin it. The detailed manual curation and inter-pretation of the genome are just beginning.

To aid the reader in locating specific an-alytical sections, we have divided the paperinto seven broad sections. A summary of themajor results appears at the beginning of eachsection.

1 Sources of DNA and Sequencing Methods2 Genome Assembly Strategy and

Characterization3 Gene Prediction and Annotation4 Genome Structure5 Genome Evolution6 A Genome-Wide Examination of

Sequence Variations7 An Overview of the Predicted Protein-

Coding Genes in the Human Genome8 Conclusions

1 Sources of DNA and SequencingMethods

Summary. This section discusses the rationaleand ethical rules governing donor selection toensure ethnic and gender diversity along withthe methodologies for DNA extraction and li-brary construction. The plasmid library con-struction is the first critical step in shotgunsequencing. If the DNA libraries are not uni-form in size, nonchimeric, and do not randomlyrepresent the genome, then the subsequent stepscannot accurately reconstruct the genome se-quence. We used automated high-throughputDNA sequencing and the computational infra-structure to enable efficient tracking of enor-mous amounts of sequence information (27.3million sequence reads; 14.9 billion bp of se-quence). Sequencing and tracking from bothends of plasmid clones from 2-, 10-, and 50-kbplibraries were essential to the computationalreconstruction of the genome. Our evidenceindicates that the accurate pairing rate of endsequences was greater than 98%.

Various policies of the United States and theWorld Medical Association, specifically theDeclaration of Helsinki, offer recommenda-tions for conducting experiments with humansubjects. We convened an Institutional Re-view Board (IRB) (31) that helped us estab-lish the protocol for obtaining and using hu-man DNA and the informed consent processused to enroll research volunteers for theDNA-sequencing studies reported here. Weadopted several steps and procedures to pro-tect the privacy rights and confidentiality ofthe research subjects (donors). These includ-ed a two-stage consent process, a secure ran-dom alphanumeric coding system for speci-mens and records, circumscribed contact withthe subjects by researchers, and options foroff-site contact of donors. In addition, Celeraapplied for and received a Certificate of Con-fidentiality from the Department of Healthand Human Services. This Certificate autho-rized Celera to protect the privacy of theindividuals who volunteered to be donors asprovided in Section 301(d) of the PublicHealth Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the ini-tial version of a completed human genomeshould be a composite derived from multipledonors of diverse ethnic backgrounds Pro-spective donors were asked, on a voluntarybasis, to self-designate an ethnogeographiccategory (e.g., African-American, Chinese,Hispanic, Caucasian, etc.). We enrolled 21donors (32).

Three basic items of information fromeach donor were recorded and linked by con-fidential code to the donated sample: age,sex, and self-designated ethnogeographicgroup. From females, ;130 ml of whole,heparinized blood was collected. From males,;130 ml of whole, heparinized blood was

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1306

Page 4: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

collected, as well as five specimens of semen,collected over a 6-week period. Permanentlymphoblastoid cell lines were created byEpstein-Barr virus immortalization. DNAfrom five subjects was selected for genomicDNA sequencing: two males and three fe-males—one African-American, one Asian-Chinese, one Hispanic-Mexican, and twoCaucasians (see Web fig. 2 on Science Onlineat www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA tosequence was based on a complex mix of fac-tors, including the goal of achieving diversity aswell as technical issues such as the quality ofthe DNA libraries and availability of immortal-ized cell lines.

1.1 Library construction andsequencingCentral to the whole-genome shotgun sequenc-ing process is preparation of high-quality plas-mid libraries in a variety of insert sizes so thatpairs of sequence reads (mates) are obtained,one read from both ends of each plasmid insert.High-quality libraries have an equal representa-tion of all parts of the genome, a small numberof clones without inserts, and no contaminationfrom such sources as the mitochondrial genomeand Escherichia coli genomic DNA. DNA fromeach donor was used to construct plasmid librar-ies in one or more of three size classes: 2 kbp, 10kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing pro-cess, we focused on developing a simplesystem that could be implemented in a robustand reproducible manner and monitored ef-fectively (Fig. 2) (34 ).

Current sequencing protocols are based on

the dideoxy sequencing method (35), whichtypically yields only 500 to 750 bp of sequenceper reaction. This limitation on read length hasmade monumental gains in throughput a pre-requisite for the analysis of large eukaryoticgenomes. We accomplished this at the Celerafacility, which occupies about 30,000 squarefeet of laboratory space and produces sequencedata continuously at a rate of 175,000 totalreads per day. The DNA-sequencing facility issupported by a high-performance computation-al facility (36).

The process for DNA sequencing was mod-ular by design and automated. Intermodulesample backlogs allowed four principalmodules to operate independently: (i) li-brary transformation, plating, and colonypicking; (ii) DNA template preparation;(iii) dideoxy sequencing reaction set-upand purification; and (iv) sequence deter-mination with the ABI PRISM 3700 DNAAnalyzer. Because the inputs and outputsof each module have been carefullymatched and sample backlogs are continu-ously managed, sequencing has proceededwithout a single day’s interruption since theinitiation of the Drosophila project in May1999. The ABI 3700 is a fully automatedcapillary array sequencer and as such canbe operated with a minimal amount ofhands-on time, currently estimated at about15 min per day. The capillary system alsofacilitates correct associations of sequenc-ing traces with samples through the elimi-nation of manual sample loading and lane-tracking errors associated with slab gels.About 65 production staff were hired andtrained, and were rotated on a regular basis

through the four production modules. Acentral laboratory information managementsystem (LIMS) tracked all sample plates byunique bar code identifiers. The facility wassupported by a quality control team that per-formed raw material and in-process testingand a quality assurance group with responsi-bilities including document control, valida-tion, and auditing of the facility. Critical tothe success of the scale-up was the validationof all software and instrumentation beforeimplementation, and production-scale testingof any process changes.

1.2 Trace processingAn automated trace-processing pipeline hasbeen developed to process each sequence file(37 ). After quality and vector trimming, theaverage trimmed sequence length was 543bp, and the sequencing accuracy was expo-nentially distributed with a mean of 99.5%and with less than 1 in 1000 reads being lessthan 98% accurate (26 ). Each trimmed se-quence was screened for matches to contam-inants including sequences of vector alone, E.coli genomic DNA, and human mitochondri-al DNA. The entire read for any sequencewith a significant match to a contaminant wasdiscarded. A total of 713 reads matched E.coli genomic DNA and 2114 reads matchedthe human mitochondrial genome.

1.3 Quality assessment and controlThe importance of the base-pair level ac-curacy of the sequence data increases as thesize and repetitive nature of the genome tobe sequenced increases. Each sequenceread must be placed uniquely in the ge-

Table 1. Celera-generated data input into assembly.

IndividualNumber of reads for different insert libraries

Total number ofbase pairs

2 kbp 10 kbp 50 kbp Total

No. of sequencing reads A 0 0 2,767,357 2,767,357 1,502,674,851B 11,736,757 7,467,755 66,930 19,271,442 10,464,393,006C 853,819 881,290 0 1,735,109 942,164,187D 952,523 1,046,815 0 1,999,338 1,085,640,534F 0 1,498,607 0 1,498,607 813,743,601

Total 13,543,099 10,894,467 2,834,287 27,271,853 14,808,616,179

Fold sequence coverage A 0 0 0.52 0.52(2.9-Gb genome) B 2.20 1.40 0.01 3.61

C 0.16 1.17 0 0.32D 0.18 0.20 0 0.37F 0 0.28 0 0.28

Total 2.54 2.04 0.53 5.11

Fold clone coverage A 0 0 18.39 18.39B 2.96 11.26 0.44 14.67C 0.22 1.33 0 1.54D 0.24 1.58 0 1.82F 0 2.26 0 2.26

Total 3.42 16.43 18.84 38.68

Insert size* (mean) Average 1,951 bp 10,800 bp 50,715 bpInsert size* (SD) Average 6.10% 8.10% 14.90%% Mates† Average 74.50 80.80 75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1307

Page 5: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

nome, and even a modest error rate canreduce the effectiveness of assembly. Inaddition, maintaining the validity of mate-pair information is absolutely critical forthe algorithms described below. Proceduralcontrols were established for maintainingthe validity of sequence mate-pairs as se-quencing reactions proceeded through theprocess, including strict rules built into theLIMS. The accuracy of sequence data pro-duced by the Celera process was validatedin the course of the Drosophila genomeproject (26 ). By collecting data for the

entire human genome in a single facility,we were able to ensure uniform qualitystandards and the cost advantages associat-ed with automation, an economy of scale,and process consistency.

2 Genome Assembly Strategy andCharacterizationSummary. We describe in this section the twoapproaches that we used to assemble the ge-nome. One method involves the computationalcombination of all sequence reads with shred-ded data from GenBank to generate an indepen-

dent, nonbiased view of the genome. The sec-ond approach involves clustering all of the frag-ments to a region or chromosome on the basisof mapping information. The clustered datawere then shredded and subjected to computa-tional assembly. Both approaches provided es-sentially the same reconstruction of assembledDNA sequence with proper order and orienta-tion. The second method provided slightlygreater sequence coverage (fewer gaps) andwas the principal sequence used for the analysisphase. In addition, we document the complete-ness and correctness of this assembly process

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,selected, and processed in compliance with standard operating proce-dures, with a focus on quality within and across departments. Eachprocess has defined inputs and outputs with the capability to exchange

samples and data with both internal and external entities according todefined quality guidelines. Manufacturing pipeline processes, products,quality control measures, and responsible parties are indicated and aredescribed further in the text.

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1308

Page 6: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

and provide a comparison to the public genomesequence, which was reconstructed largely byan independent BAC-by-BAC approach. Ourassemblies effectively covered the euchromaticregions of the human chromosomes. More than90% of the genome was in scaffold assembliesof 100,000 bp or greater, and 25% of the ge-nome was in scaffolds of 10 million bp orlarger.

Shotgun sequence assembly is a classicexample of an inverse problem: given a setof reads randomly sampled from a targetsequence, reconstruct the order and the po-sition of those reads in the target. Genomeassembly algorithms developed for Dro-sophila have now been extended to assemblethe ;25-fold larger human genome. Celera as-semblies consist of a set of contigs that areordered and oriented into scaffolds that are thenmapped to chromosomal locations by usingknown markers. The contigs consist of a col-lection of overlapping sequence reads that pro-vide a consensus reconstruction for a contigu-ous interval of the genome. Mate pairs are acentral component of the assembly strategy.They are used to produce scaffolds in which thesize of gaps between consecutive contigs isknown with reasonable precision. This is ac-complished by observing that a pair of reads,one of which is in one contig, and the other ofwhich is in another, implies an orientation anddistance between the two contigs (Fig. 3). Fi-nally, our assemblies did not incorporate allreads into the final set of reported scaffolds.This set of unincorporated reads is termed“chaff,” and typically consisted of reads fromwithin highly repetitive regions, data from otherorganisms introduced through various routes asfound in many genome projects, and data ofpoor quality or with untrimmed vector.

2.1 Assembly data setsWe used two independent sets of data for ourassemblies. The first was a random shotgundata set of 27.27 million reads of average length543 bp produced at Celera. This consistedlargely of mate-pair reads from 16 librariesconstructed from DNA samples taken from fivedifferent donors. Libraries with insert sizes of 2,10, and 50 kbp were used. By looking at howmate pairs from a library were positioned inknown sequenced stretches of the genome, wewere able to characterize the range of insertsizes in each library and determine a mean andstandard deviation. Table 1 details the numberof reads, sequencing coverage, and clone cov-erage achieved by the data set. The clone cov-erage is the coverage of the genome in clonedDNA, considering the entire insert of eachclone that has sequence from both ends. Theclone coverage provides a measure of theamount of physical DNA coverage of the ge-nome. Assuming a genome size of 2.9 Gbp, theCelera trimmed sequences gave a 5.13 cover-age of the genome, and clone coverage was3.423, 16.403, and 18.843 for the 2-, 10-, and50-kbp libraries, respectively, for a total of38.73 clone coverage.

The second data set was from the publiclyfunded Human Genome Project (PFP) and isprimarily derived from BAC clones (30). TheBAC data input to the assemblies came from adownload of GenBank on 1 September 2000(Table 2) totaling 4443.3 Mbp of sequence.The data for each BAC is deposited at one offour levels of completion. Phase 0 data are a setof generally unassembled sequencing readsfrom a very light shotgun of the BAC, typicallyless than 13. Phase 1 data are unordered as-semblies of contigs, which we call BAC contigsor bactigs. Phase 2 data are ordered assembliesof bactigs. Phase 3 data are complete BAC

sequences. In the past 2 years the PFP hasfocused on a product of lower quality and com-pleteness, but on a faster time-course, by con-centrating on the production of Phase 1 datafrom a 33 to 43 light-shotgun of each BACclone.

We screened the bactig sequences for con-taminants by using the BLAST algorithmagainst three data sets: (i) vector sequencesin Univec core (38), filtered for a 25-bpmatch at 98% sequence identity at the endsof the sequence and a 30-bp match internalto the sequence; (ii) the nonhuman portionof the High Throughput Genomic (HTG)Seqences division of GenBank (39), fil-tered at 200 bp at 98%; and (iii) the non-redundant nucleotide sequences from Gen-Bank without primate and human virus en-tries, filtered at 200 bp at 98%. Whenever25 bp or more of vector was found within50 bp of the end of a contig, the tip up tothe matching vector was excised. Underthese criteria we removed 2.6 Mbp of pos-sible contaminant and vector from thePhase 3 data, 61.0 Mbp from the Phase 1and 2 data, and 16.1 Mbp from the Phase 0data (Table 2). This left us with a total of4363.7 Mbp of PFP sequence data 20%finished, 75% rough-draft (Phase 1 and 2),and 5% single sequencing reads (Phase 0).An additional 104,018 BAC end-sequencemate pairs were also downloaded and in-cluded in the data sets for both assemblyprocesses (18).

2.2 Assembly strategiesTwo different approaches to assembly werepursued. The first was a whole-genome as-sembly process that used Celera data and thePFP data in the form of additional syntheticshotgun data, and the second was a compart-mentalized assembly process that first parti-tioned the Celera and PFP data into setslocalized to large chromosomal segments andthen performed ab initio shotgun assembly oneach set. Figure 4 gives a schematic of theoverall process flow.

For the whole-genome assembly, the PFPdata was first disassembled or “shredded” into asynthetic shotgun data set of 550-bp reads thatform a perfect 23 covering of the bactigs. Thisresulted in 16.05 million “faux” reads that weresufficient to cover the genome 2.963 becauseof redundancy in the BAC data set, withoutincorporating the biases inherent in the PFPassembly process. The combined data set of43.32 million reads (83), and all associatedmate-pair information, were then subjected toour whole-genome assembly algorithm to pro-duce a reconstruction of the genome. Neitherthe location of a BAC in the genome nor itsassembly of bactigs was used in this process.Bactigs were shredded into reads because wefound strong evidence that 2.13% of them weremisassembled (40). Furthermore, BAC location

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) andinternally derived reads from five different individuals (black lines) are combined to produce acontig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by usingmate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star)physical map information.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1309

Page 7: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

information was ignored because some BACswere not correctly placed on the PFP physicalmap and because we found strong evidence that

at least 2.2% of the BACs contained sequencedata that were not part of the given BAC (41),possibly as a result of sample-tracking errors

(see below). In short, we performed a true, abinitio whole-genome assembly in which wetook the expedient of deriving additional se-quence coverage, but not mate pairs, assembledbactigs, or genome locality, from some exter-nally generated data.

In the compartmentalized shotgun assembly(CSA), Celera and PFP data were partitionedinto the largest possible chromosomal segmentsor “components” that could be determined withconfidence, and then shotgun assembly was ap-plied to each partitioned subset wherein thebactig data were again shredded into faux readsto ensure an independent ab initio assembly ofthe component. By subsetting the data in thisway, the overall computational effort was re-duced and the effect of interchromosomal dupli-cations was ameliorated. This also resulted in areconstruction of the genome that was relativelyindependent of the whole-genome assembly re-sults so that the two assemblies could be com-pared for consistency. The quality of the parti-tioning into components was crucial so thatdifferent genome regions were not mixed to-gether. We constructed components from (i) thelongest scaffolds of the sequence from eachBAC and (ii) assembled scaffolds of data uniqueto Celera’s data set. The BAC assemblies wereobtained by a combining assembler that used thebactigs and the 53 Celera data mapped to thosebactigs as input. This effort was undertaken asan interim step solely because the more accurateand complete the scaffold for a given sequencestretch, the more accurately one can tile thesescaffolds into contiguous components on thebasis of sequence overlap and mate-pair infor-mation. We further visually inspected and cu-rated the scaffold tiling of the components tofurther increase its accuracy. For the final CSAassembly, all but the partitioning was ignored,and an independent, ab initio reconstruction ofthe sequence in each component was obtainedby applying our whole-genome assembly algo-rithm to the partitioned, relevant Celera data andthe shredded, faux reads of the partitioned, rel-evant bactig data.

2.3 Whole-genome assemblyThe algorithms used for whole-genome as-sembly (WGA) of the human genome wereenhancements to those used to produce thesequence of the Drosophila genome reportedin detail in (28).

The WGA assembler consists of a pipelinecomposed of five principal stages: Screener,Overlapper, Unitigger, Scaffolder, and RepeatResolver, respectively. The Screener findsand marks all microsatellite repeats with lessthan a 6-bp element, and screens out allknown interspersed repeat elements, includ-ing Alu, Line, and ribosomal DNA. Markedregions get searched for overlaps, whereasscreened regions do not get searched, but canbe part of an overlap that involves unscreenedmatching segments.

Table 2. GenBank data input into assembly.

Center StatisticsCompletion phase sequence

0 1 and 2 3

Whitehead Institute/ Number of accession records 2,825 6,533 363MIT Center for Number of contigs 243,786 138,023 363Genome Research, Total base pairs 194,490,158 1,083,848,245 48,829,358USA Total vector masked (bp) 1,553,597 875,618 2,202

Total contaminant masked(bp)

13,654,482 4,417,055 98,028

Average contig length (bp) 798 7,853 134,516

Washington University, Number of accession records 19 3,232 1,300USA Number of contigs 2,127 61,812 1,300

Total base pairs 1,195,732 561,171,788 164,214,395Total vector masked (bp) 21,604 270,942 8,287Total contaminant masked

(bp)22,469 1,476,141 469,487

Average contig length (bp) 562 9,079 126,319

Baylor College of Number of accession records 0 1,626 363Medicine, USA Number of contigs 0 44,861 363

Total base pairs 0 265,547,066 49,017,104Total vector masked (bp) 0 218,769 4,960Total contaminant masked

(bp)0 1,784,700 485,137

Average contig length (bp) 0 5,919 135,033

Production Sequencing Number of accession records 135 2,043 754Facility, DOE Joint Number of contigs 7,052 34,938 754Genome Institute, Total base pairs 8,680,214 294,249,631 60,975,328USA Total vector masked (bp) 22,644 162,651 7,274

Total contaminant masked(bp)

665,818 4,642,372 118,387

Average contig length (bp) 1,231 8,422 80,867

The Institute of Physical Number of accession records 0 1,149 300and Chemical Number of contigs 0 25,772 300Research (RIKEN), Total base pairs 0 182,812,275 20,093,926Japan Total vector masked (bp) 0 203,792 2,371

Total contaminant masked (bp) 0 308,426 27,781Average contig length (bp) 0 7,093 66,978

Sanger Centre, UK Number of accession records 0 4,538 2,599Number of contigs 0 74,324 2,599Total base pairs 0 689,059,692 246,118,000Total vector masked (bp) 0 427,326 25,054Total contaminant masked (bp) 0 2,066,305 374,561Average contig length (bp) 0 9,271 94,697

Others* Number of accession records 42 1,894 3,458Number of contigs 5,978 29,898 3,458Total base pairs 5,564,879 283,358,877 246,474,157Total vector masked (bp) 57,448 279,477 32,136Total contaminant masked

(bp)575,366 1,616,665 1,791,849

Average contig length (bp) 931 9,478 71,277

All centers combined† Number of accession records 3,021 21,015 9,137Number of contigs 258,943 409,628 9,137Total base pairs 209,930,983 3,360,047,574 835,722,268Total vector masked (bp) 1,655,293 2,438,575 82,284Total contaminant masked

(bp)14,918,135 16,311,664 3,365,230

Average contig length (bp) 811 8,203 91,466

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center;Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE;Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; LawrenceLivermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuerMolekulare, Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for GenomicResearch; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of TexasSouthwestern Medical Center, University of Washington. †The 4,405,700,825 bases contributed by all centers wereshredded into faux reads resulting in 2.963 coverage of the genome.

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1310

Page 8: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

The Overlapper compares every readagainst every other read in search of completeend-to-end overlaps of at least 40 bp and withno more than 6% differences in the match.Because all data are scrupulously vector-trimmed, the Overlapper can insist on com-plete overlap matches. Computing the set ofall overlaps took roughly 10,000 CPU hourswith a suite of four-processor Alpha SMPswith 4 gigabytes of RAM. This took 4 to 5days in elapsed time with 40 such machinesoperating in parallel.

Every overlap computed above is statisti-cally a 1-in-1017 event and thus not a coinci-dental event. What makes assembly combi-natorially difficult is that while many over-laps are actually sampled from overlappingregions of the genome, and thus imply thatthe sequence reads should be assembled to-gether, even more overlaps are actually fromtwo distinct copies of a low-copy repeatedelement not screened above, thus constitutingan error if put together. We call the former“true overlaps” and the latter “repeat-inducedoverlaps.” The assembler must avoid choos-ing repeat-induced overlaps, especially earlyin the process.

We achieve this objective in the Unitig-ger. We first find all assemblies of reads thatappear to be uncontested with respect to allother reads. We call the contigs formed fromthese subassemblies unitigs (for uniquely as-sembled contigs). Formally, these unitigs arethe uncontested interval subgraphs of thegraph of all overlaps (42). Unfortunately, al-though empirically many of these assembliesare correct (and thus involve only true over-laps), some are in fact collections of readsfrom several copies of a repetitive elementthat have been overcollapsed into a singlesubassembly. However, the overcollapsedunitigs are easily identified because their av-erage coverage depth is too high to be con-sistent with the overall level of sequencecoverage. We developed a simple statisticaldiscriminator that gives the logarithm of theodds ratio that a unitig is composed of uniqueDNA or of a repeat consisting of two or morecopies. The discriminator, set to a sufficientlystringent threshold, identifies a subset of theunitigs that we are certain are correct. Inaddition, a second, less stringent thresholdidentifies a subset of remaining unitigs verylikely to be correctly assembled, of which weselect those that will consistently scaffold(see below), and thus are again almost certainto be correct. We call the union of these twosets U-unitigs. Empirically, we found from a63 simulated shotgun of human chromosome22 that we get U-unitigs covering 98% of thestretches of unique DNA that are .2 kbplong. We are further able to identify theboundary of the start of a repetitive elementat the ends of a U-unitig and leverage this sothat U-unitigs span more than 93% of all

singly interspersed Alu elements and other100-to 400-bp repetitive segments.

The result of running the Unitigger wasthus a set of correctly assembled subcontigscovering an estimated 73.6% of the humangenome. The Scaffolder then proceeded touse mate-pair information to link these to-gether into scaffolds. When there are two ormore mate pairs that imply that a given pairof U-unitigs are at a certain distance andorientation with respect to each other, theprobability of this being wrong is againroughly 1 in 1010, assuming that mate pairsare false less than 2% of the time. Thus, onecan with high confidence link together allU-unitigs that are linked by at least two 2- or10-kbp mate pairs producing intermediate-sized scaffolds that are then recursivelylinked together by confirming 50-kbp matepairs and BAC end sequences. This processyielded scaffolds that are on the order ofmegabase pairs in size with gaps betweentheir contigs that generally correspond to re-petitive elements and occasionally to smallsequencing gaps. These scaffolds reconstructthe majority of the unique sequence within agenome.

For the Drosophila assembly, we engagedin a three-stage repeat resolution strategywhere each stage was progressively more

aggressive and thus more likely to make amistake. For the human assembly, we contin-ued to use the first “Rocks” substage whereall unitigs with a good, but not definitive,discriminator score are placed in a scaffoldgap. This was done with the condition thattwo or more mate pairs with one of theirreads already in the scaffold unambiguouslyplace the unitig in the given gap. We estimatethe probability of inserting a unitig into anincorrect gap with this strategy to be less than1027 based on a probabilistic analysis.

We revised the ensuing “Stones” substageof the human assembly, making it more likethe mechanism suggested in our earlier work(43). For each gap, every read R that is placedin the gap by virtue of its mated pair M beingin a contig of the scaffold and implying R’splacement is collected. Celera’s mate-pairinginformation is correct more than 99% of thetime. Thus, almost every, but not all, of thereads in the set belong in the gap, and whena read does not belong it rarely agrees withthe remainder of the reads. Therefore, wesimply assemble this set of reads within thegap, eliminating any reads that conflict withthe assembly. This operation proved muchmore reliable than the one it replaced for theDrosophila assembly; in the assembly of asimulated shotgun data set of human chromo-

Fig. 4. Architecture of Celera’s two-pronged assembly strategy. Each oval denotes a computationprocess performing the function indicated by its label, with the labels on arcs between ovalsdescribing the nature of the objects produced and/or consumed by a process. This figuresummarizes the discussion in the text that defines the terms and phrases used.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1311

Page 9: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

some 22, all stones were placed correctly.The final method of resolving gaps is to

fill them with assembled BAC data that coverthe gap. We call this external gap “walking.”We did not include the very aggressive “Peb-bles” substage described in our Drosophilawork, which made enough mistakes so as toproduce repeat reconstructions for long inter-spersed elements whose quality was only99.62% correct. We decided that for the hu-man genome it was philosophically better notto introduce a step that was certain to produceless than 99.99% accuracy. The cost was asomewhat larger number of gaps of some-what larger size.

At the final stage of the assembly process,and also at several intermediate points, aconsensus sequence of every contig is pro-duced. Our algorithm is driven by the princi-ple of maximum parsimony, with quality-value–weighted measures for evaluating eachbase. The net effect is a Bayesian estimate ofthe correct base to report at each position.Consensus generation uses Celera data when-ever it is present. In the event that no Celeradata cover a given region, the BAC datasequence is used.

A key element of achieving a WGA of thehuman genome was to parallelize the Overlap-per and the central consensus sequence–con-structing subroutines. In addition, memory wasa real issue—a straightforward application ofthe software we had built for Drosophila would

have required a computer with a 600-gigabyteRAM. By making the Overlapper and Unitiggerincremental, we were able to achieve the samecomputation with a maximum of instantaneoususage of 28 gigabytes of RAM. Moreover, theincremental nature of the first three stages al-lowed us to continually update the state of thispart of the computation as data were deliveredand then perform a 7-day run to complete Scaf-folding and Repeat Resolution whenever de-sired. For our assembly operations, the totalcompute infrastructure consists of 10 four-pro-cessor SMPs with 4 gigabytes of memory percluster (Compaq’s ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytesof memory (Compaq’s GS160, Wildfire). Thetotal compute for a run of the assembler wasroughly 20,000 CPU hours.

The assembly of Celera’s data, togetherwith the shredded bactig data, produced a set ofscaffolds totaling 2.848 Gbp in span and con-sisting of 2.586 Gbp of sequence. The chaff, orset of reads not incorporated in the assembly,numbered 11.27 million (26%), which is con-sistent with our experience for Drosophila.More than 84% of the genome was covered byscaffolds .100 kbp long, and these averaged91% sequence and 9% gaps with a total of2.297 Gbp of sequence. There were a total of93,857 gaps among the 1637 scaffolds .100kbp. The average scaffold size was 1.5 Mbp,the average contig size was 24.06 kbp, and theaverage gap size was 2.43 kbp, where the dis-

tribution of each was essentially exponential.More than 50% of all gaps were less than 500bp long, .62% of all gaps were less than 1 kbplong, and no gap was .100 kbp long. Similar-ly, more than 65% of the sequence is in contigs.30 kbp, more than 31% is in contigs .100kbp, and the largest contig was 1.22 Mbp long.Table 3 gives detailed summary statistics forthe structure of this assembly with a directcomparison to the compartmentalized shotgunassembly.

2.4 Compartmentalized shotgunassemblyIn addition to the WGA approach, we pur-sued a localized assembly approach that wasintended to subdivide the genome into seg-ments, each of which could be shotgun as-sembled individually. We expected that thiswould help in resolution of large interchro-mosomal duplications and improve the statis-tics for calculating U-unitigs. The compart-mentalized assembly process involved clus-tering Celera reads and bactigs into large,multiple megabase regions of the genome,and then running the WGA assembler on theCelera data and shredded, faux reads ob-tained from the bactig data.

The first phase of the CSA strategy was toseparate Celera reads into those that matchedthe BAC contigs for a particular PFP BACentry, and those that did not match any publicdata. Such matches must be guaranteed to

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Scaffold size

All .30 kbp .100 kbp .500 kbp .1000 kbp

Compartmentalized shotgun assembly

No. of bp in scaffolds 2,905,568,203 2,748,892,430 2,700,489,906 2,489,357,260 2,248,689,128(including intrascaffold gaps)

No. of bp in contigs 2,653,979,733 2,524,251,302 2,491,538,372 2,320,648,201 2,106,521,902No. of scaffolds 53,591 2,845 1,935 1,060 721No. of contigs 170,033 112,207 107,199 93,138 82,009No. of gaps 116,442 109,362 105,264 92,078 81,288No. of gaps #1 kbp 72,091 69,175 67,289 59,915 53,354Average scaffold size (bp) 54,217 966,219 1,395,602 2,348,450 3,118,848Average contig size (bp) 15,609 22,496 23,242 24,916 25,686Average intrascaffold gap size

(bp)2,161 2,054 1,985 1,832 1,749

Largest contig (bp) 1,988,321 1,988,321 1,988,321 1,988,321 1,988,321% of total contigs 100 95 94 87 79

Whole-genome assembly

No. of bp in scaffolds(including intrascaffold gaps)

2,847,890,390 2,574,792,618 2,525,334,447 2,328,535,466 2,140,943,032

No. of bp in contigs 2,586,634,108 2,334,343,339 2,297,678,935 2,143,002,184 1,983,305,432No. of scaffolds 118,968 2,507 1,637 818 554No. of contigs 221,036 99,189 95,494 84,641 76,285No. of gaps 102,068 96,682 93,857 83,823 75,731No. of gaps #1 kbp 62,356 60,343 59,156 54,079 49,592Average scaffold size (bp) 23,938 1,027,041 1,542,660 2,846,620 3,864,518Average contig size (bp) 11,702 23,534 24,061 25,319 25,999Average intrascaffold gap size

(bp)2,560 2,487 2,426 2,213 2,082

Largest contig (bp) 1,224,073 1,224,073 1,224,073 1,224,073 1,224,073% of total contigs 100 90 89 83 77

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1312

Page 10: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

properly place a Celera read, so all reads werefirst masked against a library of commonrepetitive elements, and only matches of atleast 40 bp to unmasked portions of the readconstituted a hit. Of Celera’s 27.27 millionreads, 20.76 million matched a bactig andanother 0.62 million reads, which did nothave any matches, were nonetheless identi-fied as belonging in the region of the bactig’sBAC because their mate matched the bactig.Of the remaining reads, 2.92 million werecompletely screened out and so could not bematched, but the other 2.97 million reads hadunmasked sequence totaling 1.189 Gbp thatwere not found in the GenBank data set.Because the Celera data are 5.113 redundant,we estimate that 240 Mbp of unique Celerasequence is not in the GenBank data set.

In the next step of the CSA process, acombining assembler took the relevant 53Celera reads and bactigs for a BAC entry, andproduced an assembly of the combined datafor that locale. These high-quality sequencereconstructions were a transient result whoseutility was simply to provide more reliableinformation for the purposes of their tilinginto sets of overlapping and adjacent scaffoldsequences in the next step. In outline, thecombining assembler first examines the set ofmatching Celera reads to determine if thereare excessive pileups indicative of un-screened repetitive elements. Wherever theseoccur, reads in the repeat region whose mateshave not been mapped to consistent positionsare removed. Then all sets of mate pairs thatconsistently imply the same relative positionof two bactigs are bundled into a link andweighted according to the number of mates inthe bundle. A “greedy” strategy then attemptsto order the bactigs by selecting bundles ofmate-pairs in order of their weight. A selectedmate-pair bundle can tie together two forma-tive scaffolds. It is incorporated to form asingle scaffold only if it is consistent with themajority of links between contigs of the scaf-fold. Once scaffolding is complete, gaps arefilled by the “Stones” strategy describedabove for the WGA assembler.

The GenBank data for the Phase 1 and 2BACs consisted of an average of 19.8 bactigsper BAC of average size 8099 bp. Applica-tion of the combining assembler resulted inindividual Celera BAC assemblies being puttogether into an average of 1.83 scaffolds(median of 1 scaffold) consisting of an aver-age of 8.57 contigs of average size 18,973 bp.In addition to defining order and orientationof the sequence fragments, there were 57%fewer gaps in the combined result. For Phase0 data, the average GenBank entry consistedof 91.52 reads of average length 784 bp.Application of the combining assembler re-sulted in an average of 54.8 scaffolds consist-ing of an average of 58.1 contigs of averagesize 873 bp. Basically, some small amount of

assembly took place, but not enough Celeradata were matched to truly assemble the 0.53to 13 data set represented by the typicalPhase 0 BACs. The combining assemblerwas also applied to the Phase 3 BACs forSNP identification, confirmation of assem-bly, and localization of the Celera reads. Thephase 0 data suggest that a combined whole-genome shotgun data set and 13 light-shot-gun of BACs will not yield good assembly ofBAC regions; at least 33 light-shotgun ofeach BAC is needed.

The 5.89 million Celera fragments notmatching the GenBank data were assembledwith our whole-genome assembler. The as-sembly resulted in a set of scaffolds totaling442 Mbp in span and consisting of 326 Mbpof sequence. More than 20% of the scaffoldswere .5 kbp long, and these averaged 63%sequence and 27% gaps with a total of 302Mbp of sequence. All scaffolds .5 kbp wereforwarded along with all scaffolds producedby the combining assembler to the subse-quent tiling phase.

At this stage, we typically had one or twoscaffolds for every BAC region constitutingat least 95% of the relevant sequence, and acollection of disjoint Celera-unique scaffolds.The next step in developing the genome com-ponents was to determine the order and over-lap tiling of these BAC and Celera-uniquescaffolds across the genome. For this, weused Celera’s 50-kbp mate-pairs information,and BAC-end pairs (18) and sequence taggedsite (STS) markers (44 ) to provide long-range guidance and chromosome separation.Given the relatively manageable number ofscaffolds, we chose not to produce this tilingin a fully automated manner, but to computean initial tiling with a good heuristic and thenuse human curators to resolve discrepanciesor missed join opportunities. To this end, wedeveloped a graphical user interface that dis-played the graph of tiling overlaps and theevidence for each. A human curator couldthen explore the implication of mapped STSdata, dot-plots of sequence overlap, and avisual display of the mate-pair evidence sup-porting a given choice. The result of thisprocess was a collection of “components,”where each component was a tiled set ofBAC and Celera-unique scaffolds that hadbeen curator-approved. The process resultedin 3845 components with an estimated spanof 2.922 Gbp.

In order to generate the final CSA, weassembled each component with the WGAalgorithm. As was done in the WGA process,the bactig data were shredded into a synthetic23 shotgun data set in order to give theassembler the freedom to independently as-semble the data. By using faux reads ratherthan bactigs, the assembly algorithm couldcorrect errors in the assembly of bactigs andremove chimeric content in a PFP data entry.

Chimeric or contaminating sequence (fromanother part of the genome) would not beincorporated into the reassembly of the com-ponent because it did not belong there. Ineffect, the previous steps in the CSA processserved only to bring together Celera frag-ments and PFP data relevant to a large con-tiguous segment of the genome, wherein weapplied the assembler used for WGA to pro-duce an ab initio assembly of the region.

WGA assembly of the components result-ed in a set of scaffolds totaling 2.906 Gbp inspan and consisting of 2.654 Gbp of se-quence. The chaff, or set of reads not incor-porated into the assembly, numbered 6.17million, or 22%. More than 90.0% of thegenome was covered by scaffolds spanning.100 kbp long, and these averaged 92.2%sequence and 7.8% gaps with a total of 2.492Gbp of sequence. There were a total of105,264 gaps among the 107,199 contigs thatbelong to the 1940 scaffolds spanning .100kbp. The average scaffold size was 1.4 Mbp,the average contig size was 23.24 kbp, andthe average gap size was 2.0 kbp where eachdistribution of sizes was exponential. Assuch, averages tend to be underrepresentativeof the majority of the data. Figure 5 shows ahistogram of the bases in scaffolds of varioussize ranges. Consider also that more than49% of all gaps were ,500 bp long, morethan 62% of all gaps were ,1 kbp, and allgaps are ,100 kbp long. Similarly, more than73% of the sequence is in contigs . 30 kbp,more than 49% is in contigs .100 kbp, andthe largest contig was 1.99 Mbp long. Table 3provides summary statistics for the structureof this assembly with a direct comparison tothe WGA assembly.

2.5 Comparison of the WGA and CSAscaffoldsHaving obtained two assemblies of the hu-man genome via independent computationalprocesses (WGA and CSA), we comparedscaffolds from the two assemblies as anothermeans of investigating their completeness,consistency, and contiguity. From each as-sembly, a set of reference scaffolds contain-ing at least 1000 fragments (Celera sequenc-ing reads or bactig shreds) was obtained; thisamounted to 2218 WGA scaffolds and 1717CSA scaffolds, for a total of 2.087 Gbp and2.474 Gbp. The sequence of each referencescaffold was compared to the sequence of allscaffolds from the other assembly with whichit shared at least 20 fragments or at least 20%of the fragments of the smaller scaffold. Foreach such comparison, all matches of at least200 bp with at most 2% mismatch weretabulated.

From this tabulation, we estimated theamount of unique sequence in each assemblyin two ways. The first was to determine thenumber of bases of each assembly that were

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1313

Page 11: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

not covered by a matching segment in theother assembly. Some 82.5 Mbp of the WGA(3.95%) was not covered by the CSA, where-as 204.5 Mbp (8.26%) of the CSA was notcovered by the WGA. This estimate did notrequire any consistency of the assemblies orany uniqueness of the matching segments.Thus, another analysis was conducted inwhich matches of less than 1 kbp between apair of scaffolds were excluded unless theywere confirmed by other matches having aconsistent order and orientation. This givessome measure of consistent coverage: 1.982Gbp (95.00%) of the WGA is covered by theCSA, and 2.169 Gbp (87.69%) of the CSA iscovered by the WGA by this more stringentmeasure.

The comparison of WGA to CSA alsopermitted evaluation of scaffolds for structur-al inconsistencies. We looked for instances inwhich a large section of a scaffold from oneassembly matched only one scaffold from theother assembly, but failed to match over thefull length of the overlap implied by thematching segments. An initial set of candi-dates was identified automatically, and theneach candidate was inspected by hand. Fromthis process, we identified 31 instances inwhich the assemblies appear to disagree in anonlocal fashion. These cases are being fur-ther evaluated to determine which assemblyis in error and why.

In addition, we evaluated local inconsis-tencies of order or orientation. The followingresults exclude cases in which one contig inone assembly corresponds to more than oneoverlapping contig in the other assembly (aslong as the order and orientation of the latteragrees with the positions they match in theformer). Most of these small rearrangementsinvolved segments on the order of hundredsof base pairs and rarely .1 kbp. We found atotal of 295 kbp (0.012%) in the CSA assem-blies that were locally inconsistent with theWGA assemblies, whereas 2.108 Mbp(0.11%) in the WGA assembly were incon-sistent with the CSA assembly.

The CSA assembly was a few percentagepoints better in terms of coverage and slightlymore consistent than the WGA, because itwas in effect performing a few thousand shot-gun assemblies of megabase-sized problems,whereas the WGA is performing a shotgunassembly of a gigabase-sized problem. Whenone considers the increase of two-and-a-halforders of magnitude in problem size, the in-formation loss between the two is remarkablysmall. Because CSA was logistically easier todeliver and the better of the two results avail-able at the time when downstream analysesneeded to be begun, all subsequent analysiswas performed on this assembly.

2.6 Mapping scaffolds to the genomeThe final step in assembling the genome was toorder and orient the scaffolds on the chromo-somes. We first grouped scaffolds together onthe basis of their order in the components fromCSA. These grouped scaffolds were reorderedby examining residual mate-pairing data be-tween the scaffolds. We next mapped the scaf-fold groups onto the chromosome using physi-cal mapping data. This step depends on havingreliable high-resolution map information suchthat each scaffold will overlap multiple mark-ers. There are two genome-wide types of mapinformation available: high-density STS mapsand fingerprint maps of BAC clones developedat Washington University (45). Among the ge-nome-wide STS maps, GeneMap99 (GM99)has the most markers and therefore was mostuseful for mapping scaffolds. The two differentmapping approaches are complementary to oneanother. The fingerprint maps should have bet-ter local order because they were built by com-parison of overlapping BAC clones. On theother hand, GM99 should have a more reliablelong-range order, because the framework mark-ers were derived from well-validated geneticmaps. Both types of maps were used as areference for human curation of the compo-nents that were the input to the regional assem-bly, but they did not determine the order ofsequences produced by the assembler.

In order to determine the effectiveness ofthe fingerprint maps and GM99 for mappingscaffolds, we first examined the reliability ofthese maps by comparison with large scaf-folds. Only 1% of the STS markers on the 10largest scaffolds (those .9 Mbp) weremapped on a different chromosome onGM99. Two percent of the STS markers dis-agreed in position by more than five frame-work bins. However, for the fingerprintmaps, a 2% chromosome discrepancy wasobserved, and on average 23.8% of BAClocations in the scaffold sequence disagreedwith fingerprint map placement by more thanfive BACs. When further examining thesource of discrepancy, it was found that mostof the discrepancy came from 4 of the 10scaffolds, indicating this there is variation inthe quality of either the map or the scaffolds.All four scaffolds were assembled, as well asthe other six, as judged by clone coverageanalysis, and showed the same low discrep-ancy rate to GM99, and thus we concludedthat the fingerprint map global order in thesecases was not reliable. Smaller scaffolds hada higher discordance rate with GM99 (4.21%of STSs were discordant by more than fiveframework bins), but a lower discordance ratewith the fingerprint maps (11% of BACsdisagreed with fingerprint maps by more thanfive BACs). This observation agrees with theclone coverage analysis (46 ) that Celera scaf-fold construction was better supported bylong-range mate pairs in larger scaffolds thanin small scaffolds.

We created two orderings of Celera scaf-folds on the basis of the markers (BAC orSTS) on these maps. Where the order ofscaffolds agreed between GM99 and theWashU BAC map, we had a high degree ofconfidence that that order was correct; thesescaffolds were termed “anchor scaffolds.”Only scaffolds with a low overall discrepancyrate with both maps were considered anchorscaffolds. Scaffolds in GM99 bins were al-lowed to permute in their order to matchWashU ordering, provided they did not vio-late their framework orders. Orientation ofindividual scaffolds was determined by thepresence of multiple mapped markers withconsistent order. Scaffolds with only onemarker have insufficient information to as-sign orientation. We found 70.1% of the ge-nome in anchored scaffolds, more than 99%of which are also oriented (Table 4). BecauseGM99 is of lower resolution than the WashUmap, a number of scaffolds without STSmatches could be ordered relative to the an-chored scaffolds because they included se-quence from the same or adjacent BACs onthe WashU map. On the other hand, becauseof occasional WashU global ordering dis-crepancies, a number of scaffolds determinedto be “unmappable” on the WashU map couldbe ordered relative to the anchored scaffolds

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of totalsequence is indicated.

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1314

Page 12: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

with GM99. These scaffolds were termed“ordered scaffolds.” We found that 13.9% ofthe assembly could be ordered by these ad-ditional methods, and thus 84.0% of the ge-nome was ordered unambiguously.

Next, all scaffolds that could be placed,but not ordered, between anchors were as-signed to the interval between the anchoredscaffolds and were deemed to be “bound-ed” between them. For example, small scaf-folds having STS hits from the same Gene-Map bin or hitting the same BAC cannot beordered relative to each other, but can beassigned a placement boundary relative toother anchored or ordered scaffolds. Theremaining scaffolds either had no localiza-tion information, conflicting information,or could only be assigned to a genericchromosome location. Using the above ap-proaches, ;98% of the genome was an-chored, ordered, or bounded.

Finally, we assigned a location for eachscaffold placed on the chromosome byspreading out the scaffolds per chromosome.We assumed that the remaining unmappedscaffolds, constituting 2% of the genome,were distributed evenly across the genome.By dividing the sum of unmapped scaffoldlengths with the sum of the number ofmapped scaffolds, we arrived at an estimateof interscaffold gap of 1483 bp. This gap wasused to separate all the scaffolds on eachchromosome and to assign an offset in thechromosome.

During the scaffold-mapping effort, we en-countered many problems that resulted in addi-tional quality assessment and validation analy-sis. At least 978 (3% of 33,173) BACs werebelieved to have sequence data from more thanone location in the genome (47). This is con-sistent with the bactig chimerism analysis re-ported above in the Assembly Strategies sec-tion. These BACs could not be assigned tounique positions within the CSA assembly andthus could not be used for ordering scaffolds.Likewise, it was not always possible to assignSTSs to unique locations in the assembly be-cause of genome duplications, repetitive ele-ments, and pseudogenes.

Because of the time required for an ex-haustive search for a perfect overlap, CSAgenerated 21,607 intrascaffold gaps wherethe mate-pair data suggested that the contigsshould overlap, but no overlap was found.These gaps were defined as a fixed 50 bp inlength and make up 18.6% of the total116,442 gaps in the CSA assembly.

We chose not to use the order of exonsimplied in cDNA or EST data as a way ofordering scaffolds. The rationale for not us-ing this data was that doing so would havebiased certain regions of the assembly byrearranging scaffolds to fit the transcript dataand made validation of both the assembly andgene definition processes more difficult.

2.7 Assembly and validation analysisWe analyzed the assembly of the genomefrom the perspectives of completeness(amount of coverage of the genome) andcorrectness (the structural accuracy of theorder and orientation and the consensus se-quence of the assembly).

Completeness. Completeness is defined asthe percentage of the euchromatic sequencerepresented in the assembly. This cannot beknown with absolute certainty until the eu-chromatin sequence has been completed.However, it is possible to estimate complete-ness on the basis of (i) the estimated sizes ofintrascaffold gaps; (ii) coverage of the twopublished chromosomes, 21 and 22 (48, 49);and (iii) analysis of the percentage of anindependent set of random sequences (STSmarkers) contained in the assembly. Thewhole-genome libraries contain heterochro-matic sequence and, although no attempt hasbeen made to assemble it, there may be in-stances of unique sequence embedded in re-gions of heterochromatin as were observed inDrosophila (50, 51).

The sequences of human chromosomes 21and 22 have been completed to high qualityand published (48, 49). Although this se-quence served as input to the assembler, thefinished sequence was shredded into a shot-gun data set so that the assembler had theopportunity to assemble it differently fromthe original sequence in the case of structuralpolymorphisms or assembly errors in theBAC data. In particular, the assembler mustbe able to resolve repetitive elements at thescale of components (generally multimega-base in size), and so this comparison revealsthe level to which the assembler resolvesrepeats. In certain areas, the assembly struc-ture differs from the published versions ofchromosomes 21 and 22 (see below). Theconsequence of the flexibility to assemble“finished” sequence differently on the basisof Celera data resulted in an assembly withmore segments than the chromosome 21 and22 sequences. We examined the reasons whythere are more gaps in the Celera sequencethan in chromosomes 21 and 22 and expectthat they may be typical of gaps in otherregions of the genome. In the Celera assem-bly, there are 25 scaffolds, each containing atleast 10 kb of sequence, that collectively span94.3% of chromosome 21. Sixty-two scaf-folds span 95.7% of chromosome 22. Thetotal length of the gaps remaining in theCelera assembly for these two chromosomesis 3.4 Mbp. These gap sequences were ana-lyzed by RepeatMasker and by searchingagainst the entire genome assembly (52).About 50% of the gap sequence consisted ofcommon repetitive elements identified by Re-peatMasker; more than half of the remainderwas lower copy number repeat elements.

A more global way of assessing complete-

ness is to measure the content of an independentset of sequence data in the assembly. We com-pared 48,938 STS markers from Genemap99(51) to the scaffolds. Because these markerswere not used in the assembly processes, theyprovided a truly independent measure of com-pleteness. ePCR (53) and BLAST (54) wereused to locate STSs on the assembled genome.We found 44,524 (91%) of the STSs in themapped genome. An additional 2648 markers(5.4%) were found by searching the unas-sembled data or “chaff.” We identified 1283STS markers (2.6%) not found in either Celerasequence or BAC data as of September 2000,raising the possibility that these markers maynot be of human origin. If that were the case,the Celera assembled sequence would represent93.4% of the human genome and the unas-sembled data 5.5%, for a total of 98.9% cover-age. Similarly, we compared CSA against36,678 TNG radiation hybrid markers (55a)using the same method. We found that 32,371markers (88%) were located in the mappedCSA scaffolds, with 2055 markers (5.6%)found in the remainder. This gave a 94% cov-erage of the genome through another genome-wide survey.

Correctness. Correctness is defined as thestructural and sequence accuracy of the as-sembly. Because the source sequences for theCelera data and the GenBank data are fromdifferent individuals, we could not directlycompare the consensus sequence of the as-

Table 4. Summary of scaffold mapping. Scaffoldswere mapped to the genome with different levelsof confidence (anchored scaffolds have the highestconfidence; unmapped scaffolds have the lowest).Anchored scaffolds were consistently ordered bythe WashU BAC map and GM99. Ordered scaf-folds were consistently ordered by at least one ofthe following: the WashU BAC map, GM99, orcomponent tiling path. Bounded scaffolds had or-der conflicts between at least two of the externalmaps, but their placements were adjacent to aneighboring anchored or ordered scaffold. Un-mapped scaffolds had, at most, a chromosomeassignment. The scaffold subcategories are givenbelow each category.

Mappedscaffoldcategory

Number Length (bp)%

Totallength

Anchored 1,526 1,860,676,676 70Oriented 1,246 1,852,088,645 70Unoriented 280 8,588,031 0.3

Ordered 2,001 369,235,857 14Oriented 839 329,633,166 12Unoriented 1,162 39,602,691 2

Bounded 38,241 368,753,463 14Oriented 7,453 274,536,424 10Unoriented 30,788 94,217,039 4

Unmapped 11,823 55,313,737 2Known 281 2,505,844 0.1

chromosomeUnknown

chromosome11,542 52,807,893 2

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1315

Page 13: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

sembly against other finished sequence fordetermining sequencing accuracy at the nu-cleotide level, although this has been done foridentifying polymorphisms as described inSection 6. The accuracy of the consensussequence is at least 99.96% on the basis of astatistical estimate derived from the qualityvalues of the underlying reads.

The structural consistency of the assemblycan be measured by mate-pair analysis. In acorrect assembly, every mated pair of se-quencing reads should be located on the con-sensus sequence with the correct separationand orientation between the pairs. A pair istermed “valid” when the reads are in thecorrect orientation and the distance betweenthem is within the mean 6 3 standard devi-ations of the distribution of insert sizes of thelibrary from which the pair was sampled. Apair is termed “misoriented” when the readsare not correctly oriented, and is termed “mis-separated” when the distance between thereads is not in the correct range but the readsare correctly oriented. The mean 6 the stan-dard deviation of each library used by theassembler was determined as describedabove. To validate these, we examined allreads mapped to the finished sequence ofchromosome 21 (48) and determined howmany incorrect mate pairs there were as aresult of laboratory tracking errors and chi-merism (two different segments of the ge-nome cloned into the same plasmid), and howtight the distribution of insert sizes was for

those that were correct (Table 5). The stan-dard deviations for all Celera libraries werequite small, less than 15% of the insertlength, with the exception of a few 50-kbplibraries. The 2- and 10-kbp libraries con-tained less than 2% invalid mate pairs, where-as the 50-kbp libraries were somewhat higher(;10%). Thus, although the mate-pair infor-mation was not perfect, its accuracy was suchthat measuring valid, misoriented, and mis-separated pairs with respect to a given assem-bly was deemed to be a reliable instrumentfor validation purposes, especially when sev-eral mate pairs confirm or deny an ordering.

The clone coverage of the genome was393, meaning that any given base pair was,on average, contained in 39 clones or, equiv-alently, spanned by 39 mate-paired reads.Areas of low clone coverage or areas with ahigh proportion of invalid mate pairs wouldindicate potential assembly problems. Wecomputed the coverage of each base in theassembly by valid mate pairs (Table 6). Insummary, for scaffolds .30 kbp in length,less than 1% of the Celera assembly was inregions of less than 33 clone coverage. Thus,more than 99% of the assembly, includingorder and orientation, is strongly supportedby this measure alone.

We examined the locations and number ofall misoriented and misseparated mates. Inaddition to doing this analysis on the CSAassembly (as of 1 October 2000), we alsoperformed a study of the PFP assembly as of

5 September 2000 (30, 55b). In this lattercase, Celera mate pairs had to be mapped tothe PFP assembly. To avoid mapping errorsdue to high-fidelity repeats, the only pairsmapped were those for which both readsmatched at only one location with less than6% differences. A threshold was set such thatsets of five or more simultaneously invalidmate pairs indicated a potential breakpoint,where the construction of the two assembliesdiffered. The graphic comparison of the CSAchromosome 21 assembly with the publishedsequence (Fig. 6A) serves as a validation ofthis methodology. Blue tick marks in thepanels indicate breakpoints. There were asimilar (small) number of breakpoints onboth chromosome sequences. The exceptionwas 12 sets of scaffolds in the Celera assem-bly (a total of 3% of the chromosome lengthin 212 single-contig scaffolds) that weremapped to the wrong positions because theywere too small to be mapped reliably. Figures6 and 7 and Table 6 illustrate the mate-pairdifferences and breakpoints between the twoassemblies. There was a higher percentage ofmisoriented and misseparated mate pairs inthe large-insert libraries (50 kbp and BACends) than in the small-insert libraries in bothassemblies (Table 6). The large-insert librar-ies are more likely to identify discrepanciessimply because they span a larger segment ofthe genome. The graphic comparison be-tween the two assemblies for chromosome 8(Fig. 6, B and C) shows that there are many

Table 5. Mate-pair validation. Celera fragment sequences were mapped tothe published sequence of chromosome 21. Each mate pair uniquelymapped was evaluated for correct orientation and placement (number

of mate pairs tested). If the two mates had incorrect relative orienta-tion or placement, they were considered invalid (number of invalid matepairs).

Librarytype

Libraryno.

Chromosome 21 Genome

Meaninsertsize(bp)

SD(bp)

SD/mean(%)

No. ofmatepairs

tested

No. ofinvalidmatepairs

%invalid

Meaninsert

size (bp)

SD(bp)

SD/mean(%)

2 kbp 1 2,081 106 5.1 3,642 38 1.0 2,082 90 4.32 1,913 152 7.9 28,029 413 1.5 1,923 118 6.13 2,166 175 8.1 4,405 57 1.3 2,162 158 7.3

10 kbp 4 11,385 851 7.5 4,319 80 1.9 11,370 696 6.15 14,523 1,875 12.9 7,355 156 2.1 14,142 1,402 9.96 9,635 1,035 10.7 5,573 109 2.0 9,606 934 9.77 10,223 928 9.1 34,079 399 1.2 10,190 777 7.6

50 kbp 8 64,888 2,747 4.2 16 1 6.3 65,500 5,504 8.49 53,410 5,834 10.9 914 170 18.6 53,311 5,546 10.4

10 52,034 7,312 14.1 5,871 569 9.7 51,498 6,588 12.811 52,282 7,454 14.3 2,629 213 8.1 52,282 7,454 14.312 46,616 7,378 15.8 2,153 215 10.0 45,418 9,068 20.013 55,788 10,099 18.1 2,244 249 11.1 53,062 10,893 20.514 39,894 5,019 12.6 199 7 3.5 36,838 9,988 27.1

BES 15 48,931 9,813 20.1 144 10 6.9 47,845 4,774 10.016 48,130 4,232 8.8 195 14 7.2 47,924 4,581 9.617 106,027 27,778 26.2 330 16 4.8 152,000 26,600 17.518 160,575 54,973 34.2 155 8 5.2 161,750 27,000 16.719 164,155 19,453 11.9 642 44 6.9 176,500 19,500 11.05

Sum 102,894 2,768 2.7(mean 5 2.7)

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1316

Page 14: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

more breakpoints for the PFP assembly thanfor the Celera assembly. Figure 7 shows thebreakpoint map (blue tick marks) for bothassemblies of each chromosome in a side-by-side fashion. The order and orientation ofCelera’s assembly shows substantially fewerbreakpoints except on the two finished chro-mosomes. Figure 7 also depicts large gaps(.10 kbp) in both assemblies as red tickmarks. In the CSA assembly, the size of allgaps have been estimated on the basis of themate-pair data. Breakpoints can be caused bystructural polymorphisms, because the twoassemblies were derived from different hu-man genomes. They also reflect the unfin-ished nature of both genome assemblies.

3 Gene Prediction and AnnotationSummary. To enumerate the gene inventory,we developed an integrated, evidence-basedapproach named Otto. The evidence used toincrease the likelihood of identifying genesincludes regions conserved between themouse and human genomes, similarity toESTs or other mRNA-derived data, or simi-larity to other proteins. A comparison of Otto(combined Otto-RefSeq and Otto homology)with Genscan, a standard gene-prediction al-gorithm, showed greater sensitivity (0.78 ver-sus 0.50) and specificity (0.93 versus 0.63) ofOtto in the ability to define gene structure.Otto-predicted genes were complementedwith a set of genes from three gene-predictionprograms that exhibited weaker, but still sig-nificant, evidence that they may be ex-pressed. Conservative criteria, requiring atleast two lines of evidence, were used todefine a set of 26,383 genes with good con-fidence that were used for more detailed anal-ysis presented in the subsequent sections.Extensive manual curation to establish pre-cise characterization of gene structure will benecessary to improve the results from thisinitial computational approach.

3.1 Automated gene annotationA gene is a locus of cotranscribed exons. Asingle gene may give rise to multiple tran-scripts, and thus multiple distinct proteinswith multiple functions, by means of alterna-

tive splicing and alternative transcription ini-tiation and termination sites. Our cells areable to discern within the billions of basepairs of the genomic DNA the signals forinitiating transcription and for splicing to-gether exons separated by a few or hundredsof thousands of base pairs. The first step incharacterizing the genome is to define thestructure of each gene and each transcriptionunit.

The number of protein-coding genes inmammals has been controversial from theoutset. Initial estimates based on reassocia-tion data placed it between 30,000 to 40,000,whereas later estimates from the brain were.100,000 (56 ). More recent data from boththe corporate and public sectors, based onextrapolations from EST, CpG island, andtranscript density–based extrapolations, havenot reduced this variance. The highest recentnumber of 142,634 genes emanates from areport from Incyte Pharmaceuticals, and isbased on a combination of EST data and theassociation of ESTs with CpG islands (57 ).In stark contrast are three quite different, andmuch lower estimates: one of ;35,000 genesderived with genome-wide EST data andsampling procedures in conjunction withchromosome 22 data (58); another of 28,000to 34,000 genes derived with a comparativemethodology involving sequence conserva-tion between humans and the puffer fish Te-traodon nigroviridis (59); and a figure of35,000 genes, which was derived simply byextrapolating from the density of 770 knownand predicted genes in the 67 Mbp of chro-mosomes 21 and 22, to the approximately3-Gbp euchromatic genome.

The problem of computational identifica-tion of transcriptional units in genomic DNAsequence can be divided into two phases. Thefirst is to partition the sequence into segmentsthat are likely to correspond to individualgenes. This is not trivial and is a weakness ofmost de novo gene-finding algorithms. It isalso critical to determining the number ofgenes in the human gene inventory. The sec-ond challenge is to construct a gene modelthat reflects the probable structure of thetranscript(s) encoded in the region. This can

be done with reasonable accuracy when afull-length cDNA has been sequenced or ahighly homologous protein sequence isknown. De novo gene prediction, althoughless accurate, is the only way to find genesthat are not represented by homologous pro-teins or ESTs. The following section de-scribes the methods we have developed toaddress these problems for the prediction ofprotein-coding genes.

We have developed a rule-based expert sys-tem, called Otto, to identify and characterizegenes in the human genome (60). Otto attemptsto simulate in software the process that a humanannotator uses to identify a gene and refine itsstructure. In the process of annotating a regionof the genome, a human curator examines theevidence provided by the computational pipe-line (described below) and examines how var-ious types of evidence relate to one another. Acurator puts different levels of confidence indifferent types of evidence and looks forcertain patterns of evidence to support geneannotation. For example, a curator may ex-amine homology to a number of ESTs andevaluate whether or not they can be connect-ed into a longer, virtual mRNA. The curatorwould also evaluate the strength of the simi-larity and the contiguity of the match, inessence asking whether any ESTs crosssplice-junctions and whether the edges ofputative exons have consensus splice sites.This kind of manual annotation process wasused to annotate the Drosophila genome.

The Otto system can promote observedevidence to a gene annotation in one of twoways. First, if the evidence includes a high-quality match to the sequence of a knowngene [here defined as a human gene repre-sented in a curated subset of the RefSeqdatabase (61)], then Otto can promote this toa gene annotation. In the second method, Ottoevaluates a broad spectrum of evidence anddetermines if this evidence is adequate tosupport promotion to a gene annotation.These processes are described below.

Initially, gene boundaries are predicted onthe basis of examination of sets of overlap-ping protein and EST matches generated by acomputational pipeline (62). This pipelinesearches the scaffold sequences against pro-tein, EST, and genome-sequence databases todefine regions of sequence similarity andruns three de novo gene-prediction programs.

To identify likely gene boundaries, re-gions of the genome were partitioned by Ottoon the basis of sequence matches identifiedby BLAST. Each of the database sequencesmatched in the region under analysis wascompared by an algorithm that takes intoaccount both coordinates of the matching se-quence, as well as the sequence type (e.g.,protein, EST, and so forth). The results wereused to group the matches into bins of relatedsequences that may define a gene and identify

Table 6. Genome-wide mate pair analysis of compartmentalized shotgun (CSA) and PFP assemblies.*

Genomelibrary

CSA PFP

%valid

%mis-

oriented

%mis-

separated†

%valid

%mis-

oriented

%mis-

separated†

2 kbp 98.5 0.6 1.0 95.7 2.0 2.310 kbp 96.7 1.0 2.3 81.9 9.6 8.650 kbp 93.9 4.5 1.5 64.2 22.3 13.5BES 94.1 2.1 3.8 62.0 19.3 18.8Mean 97.4 1.0 1.6 87.3 6.8 5.9

*Data for individual chromosomes can be found in Web fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1. †Mates are misseparated if their distance is .3 SD from the mean library size.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1317

Page 15: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

gene boundaries. During this process, multiplehits to the same region were collapsed to acoherent set of data by tracking the coverage ofa region. For example, if a group of bases wasrepresented by multiple overlapping ESTs, theunion of these regions matched by the set ofESTs on the scaffold was marked as beingsupported by EST evidence. This resulted in aseries of “gene bins,” each of which was be-lieved to contain a single gene. One weakness ofthis initial implementation of the algorithm wasin predicting gene boundaries in regions of tan-demly duplicated genes. Gene clusters frequent-ly resulted in homologous neighboring genes

being joined together, resulting in an annotationthat artificially concatenated these gene models.

Next, known genes (those with exact match-es of a full-length cDNA sequence to the ge-nome) were identified, and the region corre-sponding to the cDNA was annotated as apredicted transcript. A subset of the curat-ed human gene set RefSeq from the Nation-al Center for Biotechnology Information(NCBI) was included as a data set searched inthe computational pipeline. If a RefSeq tran-script matched the genome assembly for at least50% of its length at .92% identity, then theSIM4 (63) alignment of the RefSeq transcript to

the region of the genome under analysis waspromoted to the status of an Otto annotation.Because the genome sequence has gaps andsequence errors such as frameshifts, it was notalways possible to predict a transcript thatagrees precisely with the experimentally deter-mined cDNA sequence. A total of 6538 genesin our inventory were identified and transcriptspredicted in this way.

Regions that have a substantial amount ofsequence similarity, but do not match knowngenes, were analyzed by that part of the Ottosystem that uses the sequence similarity in-formation to predict a transcript. Here, Otto

Fig. 6. Comparison of the CSA and the PFP assembly.(A) All of chromosome 21, (B) all of chromosome 8,and (C) a 1-Mb region of chromosome 8 representinga single Celera scaffold. To generate the figure, Celerafragment sequences were mapped onto each assem-bly. The PFP assembly is indicated in the upper thirdof each panel; the Celera assembly is indicated in thelower third. In the center of the panel, green linesshow Celera sequences that are in the same order andorientation in both assemblies and form the longestconsistently ordered run of sequences. Yellow linesindicate sequence blocks that are in the same orien-tation, but out of order. Red lines indicate sequenceblocks that are not in the same orientation. Forclarity, in the latter two cases, lines are only drawnbetween segments of matching sequence that are atleast 50 kbp long. The top and bottom thirds of eachpanel show the extent of Celera mate-pair violations(red, misoriented; yellow, incorrect distance betweenthe mates) for each assembly grouped by library size.(Mate pairs that are within the correct distance, asexpected from the mean library insert size, are omit-ted from the figure for clarity.) Predicted breakpoints,corresponding to stacks of violated mate pairs of thesame type, are shown as blue ticks on each assemblyaxis. Runs of more than 10,000 Ns are shown as cyanbars. Plots of all 24 chromosomes can be seen in Webfig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1318

Page 16: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

evaluates evidence generated by the compu-tational pipeline, corresponding to conserva-tion between mouse and human genomicDNA, similarity to human transcripts (ESTs

and cDNAs), similarity to rodent transcripts(ESTs and cDNAs), and similarity of thetranslation of human genomic DNA to knownproteins to predict potential genes in the hu-

man genome. The sequence from the regionof genomic DNA contained in a gene bin wasextracted, and the subsequences supported byany homology evidence were marked (plus 100

Fig. 7. Schematic view of the distribution of breakpoints and large gapson all chromosomes. For each chromosome, the upper pair of linesrepresent the PFP assembly, and the lower pair of lines represent Celera’s

assembly. Blue tick marks represent breakpoints, whereas red tick marksrepresent a gap of larger than 10,000 bp. The number of breakpoints perchromosome is indicated in black, and the chromosome numbers in red.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1319

Page 17: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

bases flanking these regions). The other basesin the region, those not covered by any homol-ogy evidence, were replaced by N’s. This se-quence segment, with high confidence regionsrepresented by the consensus genomic se-quence and the remainder represented by N’s,was then evaluated by Genscan to see if aconsistent gene model could be generated. Thisprocedure simplified the gene-prediction taskby first establishing the boundary for the gene(not a strength of most gene-finding algo-rithms), and by eliminating regions with nosupporting evidence. If Genscan returned aplausible gene model, it was further evaluatedbefore being promoted to an “Otto” annotation.The final Genscan predictions were often quitedifferent from the prediction that Genscan re-turned on the same region of native genomicsequence. A weakness of using Genscan torefine the gene model is the loss of valid, smallexons from the final annotation.

The next step in defining gene structuresbased on sequence similarity was to compareeach predicted transcript with the homology-based evidence that was used in previous stepsto evaluate the depth of evidence for each exonin the prediction. Internal exons were consid-ered to be supported if they were covered byhomology evidence to within 610 bases oftheir edges. For first and last exons, the internaledge was required to be within 10 bases, but theexternal edge was allowed greater latitude toallow for 59 and 39 untranslated regions(UTRs). To be retained, a prediction for amulti-exon gene must have evidence such thatthe total number of “hits,” as defined above,divided by the number of exons in the predic-tion must be .0.66 or must correspond to aRefSeq sequence. A single-exon gene must becovered by at least three supporting hits (610bases on each side), and these must cover thecomplete predicted open reading frame. Fora single-exon gene, we also required thatthe Genscan prediction include both a startand a stop codon. Gene models that did notmeet these criteria were disregarded, and

those that passed were promoted to Ottopredictions. Homology-based Otto predic-tions do not contain 39 and 59 untranslatedsequence. Although three de novo gene-findingprograms [GRAIL, Genscan, and FgenesH(63)] were run as part of the computationalanalysis, the results of these programs were notdirectly used in making the Otto predictions.Otto predicted 11,226 additional genes bymeans of sequence similarity.

3.2 Otto validationTo validate the Otto homology-based processand the method that Otto uses to define thestructures of known genes, we compared tran-scripts predicted by Otto with their correspond-ing (and presumably correct) transcript from aset of 4512 RefSeq transcripts for which therewas a unique SIM4 alignment (Table 7). Inorder to evaluate the relative performance ofOtto and Genscan, we made three comparisons.The first involved a determination of the accu-racy of gene models predicted by Otto withonly homology data other than the correspond-ing RefSeq sequence (Otto homology in Table7). We measured the sensitivity (correctly pre-dicted bases divided by the total length of thecDNA) and specificity (correctly predictedbases divided by the sum of the correctly andincorrectly predicted bases). Second, we exam-ined the sensitivity and specificity of the Ottopredictions that were made solely with the Ref-Seq sequence, which is the process that Ottouses to annotate known genes (Otto-RefSeq).And third, we determined the accuracy of theGenscan predictions corresponding to theseRefSeq sequences. As expected, the alignmentmethod (Otto-RefSeq) was the most accurate,and Otto-homology performed better than Gen-scan by both criteria. Thus, 6.1% of true RefSeqnucleotides were not represented in the Otto-refseq annotations and 2.7% of the nucleotidesin the Otto-RefSeq transcripts were not con-tained in the original RefSeq transcripts. Thediscrepancies could come from legitimatedifferences between the Celera assemblyand the RefSeq transcript due to polymor-phisms, incomplete or incorrect data in theCelera assembly, errors introduced by Sim4during the alignment process, or the pres-ence of alternatively spliced forms in thedata set used for the comparisons.

Because Otto uses an evidence-based ap-proach to reconstruct genes, the absence ofexperimental evidence for intervening exonsmay inadvertantly result in a set of exons thatcannot be spliced together to give rise to atranscript. In such cases, Otto may “split genes”when in fact all the evidence should be com-bined into a single transcript. We also examinedthe tendency of these methods to incorrectlysplit gene predictions. These trends are shownin Fig. 8. Both RefSeq and homology-basedpredictions by Otto split known genes into few-er segments than Genscan alone.

3.3 Gene numberRecognizing that the Otto system is quiteconservative, we used a different gene-pre-diction strategy in regions where the ho-mology evidence was less strong. Here theresults of de novo gene predictions wereused. For these genes, we insisted that apredicted transcript have at least two of thefollowing types of evidence to be includedin the gene set for further analysis: protein,human EST, rodent EST, or mouse genomefragment matches. This final class of pre-dicted genes is a subset of the predictionsmade by the three gene-finding programsthat were used in the computational pipe-line. For these, there was not sufficientsequence similarity information for Otto toattempt to predict a gene structure. Thethree de novo gene-finding programs re-sulted in about 155,695 predictions, ofwhich ;76,410 were nonredundant (non-overlapping with one another). Of these,57,935 did not overlap known genes orpredictions made by Otto. Only 21,350 ofthe gene predictions that did not overlapOtto predictions were partially supportedby at least one type of sequence similarityevidence, and 8619 were partially support-ed by two types of evidence (Table 8).

The sum of this number (21,350) and thenumber of Otto annotations (17,764), 39,114,is near the upper limit for the human genecomplement. As seen in Table 8, if the re-quirement for other supporting evidence ismade more stringent, this number drops rap-idly so that demanding two types of evidencereduces the total gene number to 26,383 anddemanding three types reduces it to ;23,000.Requiring that a prediction be supported byall four categories of evidence is too stringentbecause it would eliminate genes that encodenovel proteins (members of currently unde-scribed protein families). No correction forpseudogenes has been made at this point inthe analysis.

In a further attempt to identify genes thatwere not found by the autoannotation processor any of the de novo gene finders, we ex-amined regions outside of gene predictionsthat were similar to the EST sequence, andwhere the EST matched the genomic se-quence across a splice junction. After correct-ing for potential 39 UTRs of predicted genes,about 2500 such regions remained. Additionof a requirement for at least one of the fol-lowing evidence types—homology to mousegenomic sequence fragments, rodent ESTs,or cDNAs—or similarity to a known proteinreduced this number to 1010. Adding this tothe numbers from the previous paragraphwould give us estimates of about 40,000,27,000, and 24,000 potential genes in thehuman genome, depending on the stringencyof evidence considered. Table 8 illustrates thenumber of genes and presents the degree of

Table 7. Sensitivity and specificity of Otto andGenscan. Sensitivity and specificity were calculat-ed by first aligning the prediction to the publishedRefSeq transcript, tallying the number (N) ofuniquely aligned RefSeq bases. Sensitivity is theratio of N to the length of the published RefSeqtranscript. Specificity is the ratio of N to thelength of the prediction. All differences are signif-icant (Tukey HSD; P , 0.001).

Method Sensitivity Specificity

Otto (RefSeq only)* 0.939 0.973Otto (homology)† 0.604 0.884Genscan 0.501 0.633

*Refers to those annotations produced by Otto using onlythe Sim4-polished RefSeq alignment rather than an evi-dence-based Genscan prediction. †Refers to thoseannotations produced by supplying all available evidenceto Genscan.

T H E H U M A N G E N O M E

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org1320

Page 18: The Sequence of the Human Genome1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the

confidence based on the supporting evidence.Transcripts encoded by a set of 26,383 geneswere assembled for further analysis. This setincludes the 6538 genes predicted by Otto onthe basis of matches to known genes, 11,226transcripts predicted by Otto based on homol-ogy evidence, and 8619 from the subset oftranscripts from de novo gene-prediction pro-grams that have two types of supporting ev-idence. The 26,383 genes are illustrated alongchromosome diagrams in Fig. 1. These are avery preliminary set of annotations and aresubject to all the limitations of an automatedprocess. Considerable refinement is still nec-essary to improve the accuracy of these tran-script predictions. All the predictions anddescriptions of genes and the associated evi-dence that we present are the product ofcompletely computational processes, not ex-pert curation. We have attempted to enumer-ate the genes in the human genome in such away that we have different levels of confi-dence based on the amount of supportingevidence: known genes, genes with good pro-tein or EST homology evidence, and de novogene predictions confirmed by modest ho-mology evidence.

3.4 Features of human genetranscriptsWe estimate the average span for a “typi-cal” gene in the human DNA sequence tobe about 27,894 bases. This is based on theaverage span covered by RefSeq tran-scripts, used because it represents our high-est confidence set.

The set of transcripts promoted to geneannotations varies in a number of ways. Ascan be seen from Table 8 and Fig. 9, tran-scripts predicted by Otto tend to be longer,having on average about 7.8 exons, whereasthose promoted from gene-prediction pro-grams average about 3.7 exons. The largestnumber of exons that we have identified in atranscript is 234 in the titin mRNA. Table 8compares the amounts of evidence that sup-

port the Otto and other predicted transcripts.For example, one can see that a typical Ottotranscript has 6.99 of its 7.81 exons supportedby protein homology evidence. As would beexpected, the Otto transcripts generally havemore support than do transcripts predicted bythe de novo methods.

4 Genome StructureSummary. This section describes several ofthe noncoding attributes of the assembledgenome sequence and their correlations withthe predicted gene set. These include an anal-ysis of G1C content and gene density in thecontext of cytogenetic maps of the genome,an enumerative analysis of CpG islands, anda brief description of the genome-wide repet-itive elements.

4.1 Cytogenetic mapsPerhaps the most obvious, and certainly themost visible, element of the structure ofthe genome is the banding pattern producedby Giemsa stain. Chromosomal bandingstudies have revealed that about 17% to20% of the human chromosome comple-ment consists of C-bands, or constitutiveheterochromatin (64 ). Much of this hetero-chromatin is highly polymorphic and con-sists of different families of alpha satelliteDNAs with various higher order repeatstructures (65). Many chromosomes havecomplex inter- and intrachromosomal du-plications present in pericentromeric re-gions (66 ). About 5% of the sequence readswere identified as alpha satellite sequences;these were not included in the assembly.

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the textfor criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solelyon Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced bysupplying all available evidence to Genscan) were tallied. These data show the degree to whichmultiple Genscan predictions and/or Otto annotations were associated with a single RefSeqtranscript. The zero class for the Otto-homology predictions shown here indicates that theOtto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto callwas made because of insufficient evidence.

Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicatethe gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

TotalTypes of evidence No. of lines of evidence*

Mouse Rodent Protein Human $1 $2 $3 $4

Otto Number oftranscripts

17,969 17,065 14,881 15,477 16,374 17,968† 17,501 15,877 12,451

Number ofexons

141,218 111,174 89,569 108,431 118,869 140,710 127,955 99,574 59,804

De novo Number oftranscripts

58,032 14,463 5,094 8,043 9,220 21,350 8,619 4,947 1,904

Number ofexons

319,935 48,594 19,344 26,264 40,104 79,148 31,130 17,508 6,520

No. of exons per Otto 7.84 5.77 6.01 6.99 7.24 7.81 7.19 6.00 4.28transcript De novo 5.53 3.17 3.80 3.27 4.36 3.7 3.56 3.42 3.16

*Four kinds of evidence (conservation in 33 mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) wereconsidered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript. †Thisnumber includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.

T H E H U M A N G E N O M E

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1321