Proc. Natl. Acad. Sci. USA
Vol. 93, pp. 1694-1698, February 1996
Molecular Biology

Sequence scanning: A method for rapid sequence acquisition from large-fragment DNA clones
(DNA sequencing/physical mapping/sequence-tagged sites)

DMITRY I. NURMINSKY AND DANIEL L. HARTL
Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138

Communicated by Walter Gilbert, Harvard University, Cambridge, MA, November 8, 1995 (received for review June 26, 1995)

ABSTRACT A strategy of "sequence scanning" is proposed for rapid acquisition of sequence from clones such as bacteriophage P1 clones, cosmids, or yeast artificial chromosomes. The approach makes use of a special vector, called LambdaScan, that reliably yields subclones with inserts in the size range 8-12 kb. A number of subclones, typically 96 or 192, are chosen at random, and the ends of the inserts are sequenced using vector-specific primers. Then long-range spectrum PCR is used to order and orient the clones. This combination of shotgun and directed sequencing results in a high-resolution physical map suitable for the identification of coding regions or for comparison of sequence organization among genomes. Computer simulations indicate that, for a target clone of 100 kb, the scanning of 192 subclones with sequencing reads as short as 350 bp results in an approximate ratio of 1:2:1 of regions of double-stranded sequence, single-stranded sequence, and gaps. Longer sequencing reads tip the ratio strongly toward increased double-stranded sequence.

The low coding density of many complex genomes presents special problems in the efficient identification of coding regions present in a background of largely noncoding DNA. Even when the region of interest is present in overlapping clones, conventional physical maps or maps of sequence-tagged sites (STSs) provide insufficient sequence information to deduce a putative intron/exon structure.
On the other hand, complete genomic sequencing of a large region may be impractical, either because the low coding density makes it cost ineffective or because suitable high-throughput sequencing technology is not available on site. Similar issues arise in exploring the sequence organization of a genomic region coding for a known cDNA or in comparative genomic analysis when it is desirable to compare, among diverse species, a region whose sequence organization is known in a related organism. Physical mapping yields too little information; genomic sequencing yields too much.

To solve such problems, it would be desirable to develop a sequencing strategy that enables rapid but light coverage of lengthy regions of DNA present in large-fragment DNA clones such as P1 clones, cosmids, or yeast artificial chromosomes. The method we propose here combines the convenience and cost effectiveness of shotgun sequencing with the benefits of a directed strategy. This method is called sequence scanning. The sequence coverage is "light" because not all target sequences are included in double-stranded sequence: some regions are covered only by single-stranded sequence, and some regions are not covered at all. In a pure shotgun approach, such a strategy would be fatal because the sequenced regions could not be ordered and oriented relative to each other. However, in sequence scanning, the directed strategy allows the sequenced regions to be assembled and the lengths and positions of the gaps to be identified. The result of sequence scanning is therefore a "framework sequence" consisting of an ordered, ultra-high-resolution physical map of single-stranded sequence, double-stranded sequence, and gaps of known size and location. If efficiently implemented, sequence scanning can enable a steady-state output from three researchers of ~200 kb per week. At the end of the procedure, a DNA insert in an initial large-fragment clone is covered with 10- to 20-fold redundancy by a set of mapped, overlapping, template-quality plasmid subclones with inserts of ~8-12 kb. These plasmids may be used for gap closure if necessary or to provide material for subsequent studies. The expected average size of the gaps is typically 300-500 bp, and so the gaps may readily be closed by sequencing from suitable oligonucleotide primers.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviation: STS, sequence-tagged site.

MATERIALS AND METHODS

LambdaScan Cloning Vector. The sequence-scanning method makes use of sequencing templates in the size range 8-12 kb subcloned from the target sequence. To facilitate the construction of such libraries, a specialized λ replacement vector, called LambdaScan, was developed from the LambdaZAP (Stratagene) insertion vector. LambdaScan has a theoretical cloning capacity of 4-17 kb but, in practice, most of the clones have inserts in the size range 8-12 kb (see below).

To create LambdaScan, we first inserted a 9-kb stuffer fragment of DNA from Drosophila virilis into LambdaZAP, using the restriction enzymes Not I and Xho I. The stuffer fragment contains no BamHI sites. The DNA of the resulting phage was cleaved with BamHI, cutting off most of the phage right arm but leaving the pBlueScript sequence intact. DNA of λEMBL3 was cleaved with Sal I. From both digestions, the cohesive ends generated by BamHI and Sal I were half-filled with Klenow fragment, and the resulting DNA fragments were ligated and packaged in vitro in λ capsids. The NM539 strain of Escherichia coli, lysogenic for phage P2, was infected with the packaged mixture. A number of plaques with the Spi- phenotype appeared, which was expected of recombinant λs carrying the left arm of LambdaZAP and the right arm of λEMBL3; the resulting phage was called ZAMBL.

In the final step of construction, the 9-kb stuffer fragment from D. virilis was excised from ZAMBL with Sac I and Xho I and replaced with the 14-kb Sac I stuffer fragment from LambdaGEM-12 (Promega). To accomplish the exchange, the DNA of ZAMBL was first cleaved with Xho I, ligated with the phosphorylated octamer TCGAAGCT, and cleaved again with Xho I and Sac I, and the arms were purified from an agarose gel. At the same time, the DNA of LambdaGEM-12 was cleaved with Sfi I and Sac I, the cos (cohesive) sites were rendered blunt with Klenow fragment, and the central stuffer was purified from an agarose gel. The resulting DNA fragments were ligated and packaged in vitro in λ capsids. Recombinant phage were tested for the Spi- phenotype, and two were partially sequenced to verify the structure of the


polylinker regions and the junction between the LambdaZAP and λEMBL3 sequences. Both clones had identical sequences across these regions. One of the clones (LambdaScan) was chosen for subsequent experiments. The structure of LambdaScan is outlined in Fig. 1. The sequence of the junction J1 was obtained from the Applied Biosystems primer M13RP1 (the reverse M13 primer); it contains the portion of pBlueScript including the T3 RNA polymerase promoter and the LambdaGEM-12 polylinker, starting from the Sac I site. The sequence of the junction J2 was obtained from the M13 forward primer "-21"; it contains the LambdaGEM-12 polylinker, the Sac I:Xho I junction, and the pBlueScript sequence including the T7 RNA polymerase promoter. The junction J3 is the sequence of the junction between the LambdaZAP and λEMBL3 fragments. It was obtained using the primer 5'-CAGGCCAGTTATCTGGGCTTAAAAGCAGAA-3', which is homologous to the git gene near the polycloning site in the right arm of λEMBL3. The J3 junction contains a segment of λEMBL3, the Sal I:BamHI junction, and a segment of the int gene adjacent to the BamHI site in LambdaZAP.

Cloning in LambdaScan. To evaluate the performance of LambdaScan, an experiment was carried out in which a library was made from genomic DNA from Drosophila melanogaster. DNA from LambdaScan was cut with Xba I and Xho I, and the Xho I-produced cohesive ends were half-filled with Klenow fragment. The genomic DNA was partially digested with Sau3A, and the cohesive ends were also half-filled with Klenow fragment. The DNAs were ligated and packaged in vitro in λ capsids, and the packaged phage were plated on a layer of a P2 lysogenic strain of E. coli. The cloning efficiency was 10^6 clones per µg of genomic DNA. The phage library was converted into a pBlueScript library by in vivo excision using the ExAssist/SOLR system (Stratagene), and 40 clones were checked for size of insert by long-range PCR (1). Although the majority of clones contained inserts >7 kb, we also observed about 25% of clones with relatively small inserts (<4 kb). It seemed likely that the small clones originated from phage recombinants that could not be propagated as phages owing to their small size but could nevertheless produce pBlueScript clones. If this were the case, then one round of phage amplification should eliminate the small clones. Accordingly, we plated about 10^4 λ clones onto a 90-mm plate containing a P2 lysogenic strain of E. coli and, after overnight incubation, eluted the phages from the plate. This amplified library was converted into a pBlueScript library, and 40 clones were checked for size of insert by long-range PCR (1). All clones contained inserts >6 kb (data shown below), and the average size was in the range 9-10 kb.

Subcloning from P1 into LambdaScan. To check the efficiency of the LambdaScan system, subclones were prepared from a minipreparation of a clone of bacteriophage P1 containing an ~80-kb fragment of genomic DNA from D. melanogaster. The phage minilibrary was created from the P1 clone B27-41 (DS02537) (2). Various numbers of phages were amplified and converted into pBlueScript. When 10^3 phage clones per 90-mm plate were amplified, the resulting plasmid minilibrary was represented by a limited number of clones, many of them represented several times. However, only a 10-fold increase in the complexity of the phage library (10^4 phage clones amplified) was enough to produce a large number of plasmid clones. A sample of 48 clones from the latter amplification was analyzed. The smallest insert size was 6 kb and the average insert size was 10 kb. No duplicates were detected among the 48 clones as determined by restriction analysis of all clones and automated sequencing of both ends of 12 clones. One-sixth of the clones were derived from the P1 vector, which is expected based on its total size relative to the size of the Drosophila insert in the P1 clone.

DNA Sequencing. Direct sequencing of LambdaScan was performed with an Applied Biosystems model 373A DNA sequencing system and the Taq DyeDeoxy Terminator cycle-sequencing kit (Applied Biosystems). Sequencing of the ends of the plasmid subclones employed the Applied Biosystems "M13RP1/M13-21" DyePrimer reagents. The average length of reliable sequence was about 350 bp. Most sequencing reactions also included an additional 100-150 bp that could be read less reliably.

Long-Range Spectrum PCR. The PCR was performed in MJ Research (Watertown, MA) PTC-100 thermal cyclers in 96-well, V-bottomed, polycarbonate microtiter plates. The PCR was carried out in 20-µl reaction volumes of PCR cocktail containing 2 mM MgCl2 overlaid with mineral oil and subjected to 30 cycles of 10 sec at 99°C, 20 sec at the annealing temperature (optimized for each pair of oligonucleotides), and 15 min at 68°C. The reactions were terminated by holding at 72°C for 5 min and stored at 4°C. PCR products were fractionated by gel electrophoresis in 0.7% agarose. In sizing the inserts in LambdaScan with PCR, we used a ratio of Pfu:Taq of 1:100 (1).

The primers homologous to the pBlueScript plasmid adjacent to the polylinker are pBS-T3 (5'-CCTCACTAAAGGGAACAAAAGCTGG-3', near the "Reverse M13RP1" primer annealing site) and pBS-T7 (5'-ACTCACTATAGGGCGAATTGGGTA-3', near the "Forward-21" primer annealing site). In the demonstration of spectrum PCR using the P1 clone B27-41, the primer 135-55T+ is 5'-ACCGATATGATGGCCGAGA-3' and the primer 135-55T- is 5'-TAATTTGGGCGACCAGGAG-3'. The primers were chosen from the sequence at the T7 end of the P1 clone B55-92; the priming sites are located at a distance of 180 bp and are oriented in opposite directions.
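The statement that pBS-T3 and pBS-T7 lie adjacent to the polylinker can be checked textually against the junction sequences printed in Fig. 1. The following is a minimal sketch, assuming the primer and junction sequences as transcribed in this document; the helper names `reverse_complement` and `overlap` are hypothetical and are not part of the original method. It only tests string consistency and says nothing about annealing behavior.

```python
# Sequences copied from the text and from Fig. 1 of this document.
PBS_T3 = "CCTCACTAAAGGGAACAAAAGCTGG"
PBS_T7 = "ACTCACTATAGGGCGAATTGGGTA"
J1 = ("GGGAACAAAAGCTGGAGCTCGCGGCCGCGGATCCCGGGAATTCTCGAGTCGAC"
      "AAGCTTCTAGAGATCCC")
J2 = ("ATCTCTAGAAGCTTGTCGACTCGAGAATTCCCGGGATCCGCGGCCGCGAGCTTCGAG"
      "GGGGGGCCCGGTACCCAATTCGCCCTATA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


if __name__ == "__main__":
    # 3' end of pBS-T3 reading into the start of J1
    print("pBS-T3 / J1 overlap:", overlap(PBS_T3, J1))
    # end of J2 reading into the reverse complement of pBS-T7
    print("J2 / pBS-T7 overlap:", overlap(J2, reverse_complement(PBS_T7)))
```

A nonzero overlap in both cases is what one would expect if each primer sits at the vector edge of the corresponding junction.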

RESULTS

For sequence scanning a cloned region of a target genome, such as a P1 clone, first a library of ~10-kb subclones is produced.

FIG. 1. Structure of the LambdaScan vector along with the sequences of the junctions J1, J2, and J3. Features are listed in the order in which they are labeled along each sequence.

J1: GGGAACAAAAGCTGGAGCTCGCGGCCGCGGATCCCGGGAATTCTCGAGTCGACAAGCTTCTAGAGATCCC
    (T3 promoter, SacI, NotI, BamHI, SmaI, EcoRI, XhoI, SalI, HindIII, XbaI)

J2: ATCTCTAGAAGCTTGTCGACTCGAGAATTCCCGGGATCCGCGGCCGCGAGCTTCGAGGGGGGGCCCGGTACCCAATTCGCCCTATA
    (XbaI, HindIII, SalI, XhoI, EcoRI, SmaI, BamHI, NotI, SacI:XhoI, ApaI, KpnI, T7 promoter)

J3: TGAACACTCGTCCGAGAATAACGAGTGGATCTGGGTCGATCCGTCTACCTTTCACGAGTTGCGCAGTTTGTCTGCAA
    (git, SalI:BamHI, int)
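As a quick consistency check on the J1 sequence in Fig. 1, the sketch below scans it for the recognition sequences of the enzymes labeled in the figure. The recognition sequences are standard values supplied here as an assumption (the text does not list them), and the function name `map_sites` is hypothetical.

```python
# J1 as printed in Fig. 1 of this document.
J1 = ("GGGAACAAAAGCTGGAGCTCGCGGCCGCGGATCCCGGGAATTCTCGAGTCGAC"
      "AAGCTTCTAGAGATCCC")

# Standard recognition sequences (assumed; not given in the paper).
RECOGNITION_SITES = {
    "SacI": "GAGCTC",
    "NotI": "GCGGCCGC",
    "BamHI": "GGATCC",
    "SmaI": "CCCGGG",
    "EcoRI": "GAATTC",
    "XhoI": "CTCGAG",
    "SalI": "GTCGAC",
    "HindIII": "AAGCTT",
    "XbaI": "TCTAGA",
}


def map_sites(seq, sites=RECOGNITION_SITES):
    """Return (position, enzyme) for the first occurrence of each site, in 5'->3' order."""
    hits = []
    for enzyme, site in sites.items():
        pos = seq.find(site)  # 0-based index, -1 if the site is absent
        if pos != -1:
            hits.append((pos, enzyme))
    return sorted(hits)


if __name__ == "__main__":
    for pos, enzyme in map_sites(J1):
        print(f"{enzyme:8s} starts at position {pos}")
```

If the transcription of J1 is correct, the enzymes should come out in the same left-to-right order as the labels in the figure.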


A sample of these subclones is used as sequencing templates, and both ends are sequenced. This is the shotgun phase of the strategy. Then the order and orientation of the templates are determined by a PCR walking strategy using one of the end sequences as a priming site to carry out each successive step of the walk. This is the directed phase of the strategy. The steps in the procedure are described in detail below.

Size of Insert Recovered from LambdaScan. LambdaScan is a replacement vector with the following features: (i) it has a theoretical capacity of 4-17 kb, (ii) it contains multiple cloning sites (Xba I, Sal I, Xho I, EcoRI, BamHI, and Not I), (iii) it retains the convenient feature of Spi selection, (iv) it includes all pBlueScript sequences flanking the polylinker, and (v) it is structured so that the cloned insert can be excised as a plasmid by the ExAssist/SOLR system (Stratagene). As far as we are aware, only LambdaScan combines these features in a single vector.

One of the convenient features of LambdaScan is that sufficient numbers of high-quality template subclones can be produced from a minipreparation of phage P1 DNA. Furthermore, when a sufficient number of phage subclones are amplified and converted into plasmids, the amplification step does not allow a small proportion of faster-growing clones to become grossly overrepresented in the resulting plasmid library. The plasmids are a source of template-quality DNA suitable for automated DNA sequencing.

The distribution of size of insert in 40 pBlueScript plasmids obtained from a D. melanogaster genomic library in LambdaScan is shown in Fig. 2. The minimal size of insert is 6 kb, and only a minority of clones are so small. Most clones have inserts in the size range 8-12 kb, and the average is in the range 9-10 kb. A similar size distribution is observed in plasmids obtained from P1 clones that have been subcloned in LambdaScan (data not shown). From P1 clone B27-41 (DS02537), which contains D. melanogaster DNA from the chromosomal region 3A5-3A10, 48 pBlueScript plasmids were analyzed. The minimum size of insert was again about 6 kb and the average size of insert was 10 kb.

FIG. 2. Distribution of insert size in pBlueScript plasmids obtained from a LambdaScan library of D. melanogaster genomic DNA. [Histogram; horizontal axis: size of insert (kb).]

Ordering the Templates with Long-Range Spectrum PCR. The plasmids obtained from LambdaScan are a source of template-quality DNA suitable for automated DNA sequencing. Their template quality also makes them ideal substrates for amplification by long-range PCR (1). Given a set of plasmid subclones that cover a target clone with some predetermined level of redundancy, the templates can be reassembled in their correct order and orientation by a walking method denoted spectrum PCR (3) adapted, in the case of LambdaScan, to 8- to 10-kb inserts (1). In the first step of the walk, a small region of DNA sequence from one of the vector-insert junctions of the original large-DNA-fragment clone is used as an STS to create an oligonucleotide primer employed in PCR in combination with each of two different vector-specific primers flanking the cloning site in the LambdaScan subclones. Subclones containing an insert with the STS in the proper orientation relative to either vector primer will support amplification, and the size of the PCR product indicates the distance between the vector-insert junction and the STS. The subclone with the insert extending maximally in the direction of the walk is chosen to continue the walk, and a new STS derived from the farthest end of the insert in the subclone becomes the next STS to continue the walk. After the entire walk is completed, each member of the set of subclones is ordered and oriented. Progress can be accelerated by walking bidirectionally from both ends of the target sequence, or there can even be multiple internal starting points, since all of the subclone ends are sequenced in advance. The implementation of multiple starting points requires only a system of data handling sufficient to identify when the walks coalesce owing to the presence of two STS markers in a single subclone. (The walking procedure is illustrated in figure 7.6 on p. 130 in ref. 2.)

An Example Using P1 Clones. Fig. 3 is an example of mapping plasmid subclones from the P1 phage B27-41 that include the STS in the overlapping P1 clone B55-92. Clone B27-41 includes the D. melanogaster chromosomal region 3A5-3A10, and B55-92 includes 3A5-3A8. The P1 clones are shown at the top as horizontal solid bars. The letters "S" and "T" stand for the "SP6" and "T7" ends of the P1 clones. The STS used in the long-range spectrum PCR was generated from the T7 end of B55-92. In this example, 48 plasmid subclones of B27-41 were examined. The symbols 135-55T+ and 135-55T- refer to two oppositely oriented primers generated from the STS, which are separated by 180 bp. The symbols T7(F) and T3(R) refer to two primers annealing to the plasmid adjacent to the polylinker near the annealing sites for the "Forward" and "Reverse" M13 sequencing primers. After long-range spectrum PCR, the reaction products were separated in a 0.7% agarose gel. The results are shown in the photographs at the bottom. In the photographs, the lanes with multiple bands contain a 1-kb DNA ladder. The other lanes in each panel are numbered in reading order, left to right, top to bottom. Each lane contains the products, if any, resulting from PCR amplification of the subclone of the same number using the primer pairs 135-55T+ and T7(F) (panel A), 135-55T- and T3(R) (panel D), 135-55T+ and T3(R) (panel B), and 135-55T- and T7(F) (panel C). Clones of the required type contain the STS numbered 135-55 and support amplification yielding a single product in both A and D or in both B and C; the sum of the sizes of both products, which equals the size of the insert in the subclone, must also be within the desired range (7-17 kb). Panels A and D identify the clones pB6, pB22, pB33, pB38, and pB48; panels B and C identify the clones pB21 and pB37. Each panel also includes 0-4 clones that support amplification of a fragment not matched by a fragment of the appropriate size from the primer oriented in the opposite direction.
The use of two primers from each STS allows such apparent artifacts to be identified. Another positive control (not shown) is afforded by use of the STS-derived oligonucleotides as a primer pair in a separate PCR assay, because only subclones containing the STS will support amplification of the expected 180-bp product. The subclones of interest, with their insert sizes indicated in parentheses, are diagrammed immediately above the photographs. In this case, any of the clones pB21, pB38, or pB37 would be suitable choices to continue the walk from left to right.
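The selection rule used in a single step of the walk can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' data-handling system: subclones are given hypothetical coordinates on the target, the 180-bp spacing between the two STS primers is ignored, and the names `Subclone`, `predicted_products`, and `choose_next` are invented for the example. A subclone qualifies when the STS lies inside its insert, so that one product is expected from each STS primer with the vector primer it faces, and when the two product sizes sum to an insert size in the 7-17 kb range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subclone:
    name: str
    start: int          # left insert boundary on the target, bp (hypothetical)
    end: int            # right insert boundary on the target, bp (hypothetical)
    t7_at_right: bool   # True if the T7(F) vector primer flanks the right end


def predicted_products(sub, sts_pos):
    """Map (STS primer, vector primer) -> expected product size in bp.

    Returns an empty dict when the STS is not inside the insert.
    """
    if not (sub.start < sts_pos < sub.end):
        return {}
    if sub.t7_at_right:
        right_vec, left_vec = "T7(F)", "T3(R)"
    else:
        right_vec, left_vec = "T3(R)", "T7(F)"
    return {
        ("STS+", right_vec): sub.end - sts_pos,    # rightward STS primer
        ("STS-", left_vec): sts_pos - sub.start,   # leftward STS primer
    }


def choose_next(subclones, sts_pos, min_insert=7_000, max_insert=17_000):
    """Pick the qualifying subclone extending farthest to the right, or None."""
    best = None
    for sub in subclones:
        products = predicted_products(sub, sts_pos)
        if len(products) != 2:
            continue
        if not (min_insert <= sum(products.values()) <= max_insert):
            continue
        if best is None or sub.end > best.end:
            best = sub
    return best


if __name__ == "__main__":
    library = [
        Subclone("a", 0, 10_000, True),
        Subclone("b", 4_000, 15_000, False),
        Subclone("c", 20_000, 30_000, True),
    ]
    step = choose_next(library, sts_pos=5_000)
    print(step.name if step else "no subclone contains the STS")
```

In a real walk the product sizes come from the gel rather than from known coordinates; the sketch only shows how the paired-product test and the "extends maximally" rule fit together.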

Theoretical Coverage from Sequence Scanning. In sequence scanning, all of the vector-insert junctions of the subclones are sequenced; this is the "shotgun" component of the procedure, and it is maximally efficient because virtually all of the sequence is useful. For example, 350 bp of sequence from each end of 96 subclones totals 67.2 kb, which is equivalent to coverage of an 80-kb target sequence with a redundancy of 0.84; increasing the length of the sequencing runs to, say, 550 bp increases the redundancy to 1.32. Shotgun sequencing is still very efficient at these levels of redundancy. Additional coverage of the target sequence may be obtained by sequencing the ends of 2 x 96 templates and ordering them by spectrum PCR, but at the still moderate cost of some increase in the redundancy of the shotgun sequencing.

FIG. 3. Walking strategy applied to clones derived from bacteriophage P1 clone B27-41. The oligonucleotide primers 135-55T- and 135-55T+, used in conjunction with the vector primers T7(F) and T3(R), enable overlapping clones to be identified and ordered.

The expected level of sequence acquisition from sequence scanning of a 100-kb P1 clone with 10-kb templates is summarized in Table 1. The percentages are the percentage of the total sequence (100 kb) covered by sequence in both strands (double-stranded), one or the other strand (single-stranded), or missing altogether (gaps). The data are based on the average of 100 computer simulations of each condition. The main variables in Table 1 are (i) length of read from each end of each template (350 or 550 bp) and (ii) number of templates (96, 192, or 288). The main effect of increasing the number of templates is an increase in the total amount of double-stranded sequence with a concomitant decrease in the proportion of gaps. Perhaps surprisingly, the proportion of single-stranded sequence is virtually independent of the number of templates over these combinations of parameters. At 350 bp per sequencing run, the total sequence obtained from 96, 192, and 288 templates is 67 kb, 134 kb, and 202 kb, respectively, corresponding to a redundancy of about 0.7, 1.3, and 2.0 in the shotgun sequencing phase. It appears from Table 1 that 96 templates is too few unless the reads are substantially longer than 350 bp. Whether 192 templates is adequate in practice will be determined by the extent to which the templates are truly random across the P1 clone.

Table 1. Expected coverage from sequence scanning

Read         No. of      Coverage, %                                Gaps,
length, bp   templates   Double-stranded  Single-stranded  Total    %
350          1 x 96       7.8             40.5             48.3     51.7
350          2 x 96      22.4             50.2             72.6     27.4
350          3 x 96      36.4             48.8             85.2     14.8
550          1 x 96      16.2             48.2             64.4     35.6
550          2 x 96      38.6             47.5             86.1     13.9
550          3 x 96      55.5             38.5             94.0      6.0

Size Distribution and Number of Gaps. Table 1 indicates that, with 192 templates, the proportion of the total sequence remaining in gaps is roughly 10-25% depending on the length of the sequencing reads. However, computer simulations also indicate that most of the gaps are quite short (data not shown).


In minimizing the size of the gaps, there is a large gain in increasing from 96 to 192 templates but diminished returns in increasing from 192 to 288. With 192 templates, 95% of the gaps are <1 kb. The average size of the gaps is about 500 bp for 96 templates, 300 bp for 192 templates, and 200 bp for 288 templates. A surprising result is that the distribution of length of gaps does not depend dramatically on whether the total read length is 350 bp or 550 bp (data not shown). However, the longer reads decrease the average number of gaps by a factor of 2-3.
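The kind of computer simulation summarized in Table 1 can be sketched as follows. The paper does not describe how its simulations were set up, so everything here is an assumption made for illustration: inserts have a fixed length of 10 kb, their start positions are drawn uniformly so that each insert lies wholly within a linear 100-kb target, the read from the left end covers one strand and the read from the right end covers the other, and the function name `simulate_scan` is hypothetical.

```python
import random


def simulate_scan(target_bp=100_000, insert_bp=10_000, n_templates=192,
                  read_bp=350, n_runs=20, seed=0):
    """Return average (double, single, gap) fractions of the target.

    double: positions read on both strands
    single: positions read on exactly one strand
    gap:    positions not read at all
    """
    rng = random.Random(seed)
    totals = [0, 0, 0]  # double, single, gap
    mark = b"\x01" * read_bp
    for _ in range(n_runs):
        top = bytearray(target_bp)      # strand read from left insert ends
        bottom = bytearray(target_bp)   # strand read from right insert ends
        for _ in range(n_templates):
            start = rng.randrange(target_bp - insert_bp + 1)
            end = start + insert_bp
            top[start:start + read_bp] = mark
            bottom[end - read_bp:end] = mark
        for t, b in zip(top, bottom):
            totals[2 - (t + b)] += 1    # 2 strands -> index 0, 1 -> 1, 0 -> 2
    denom = target_bp * n_runs
    return tuple(count / denom for count in totals)


if __name__ == "__main__":
    for n in (96, 192, 288):
        double, single, gap = simulate_scan(n_templates=n)
        print(f"{n:3d} templates: double {double:.1%}, "
              f"single {single:.1%}, gaps {gap:.1%}")
```

Under these assumptions the 192-template, 350-bp case should come out near the 1:2:1 ratio described in the text, although exact values depend on the seed, the number of runs, and how insert positions and sizes are modeled.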

DISCUSSION

The strategy of sequence scanning proposed here is an approach for large-scale acquisition of sequence from clones such as bacteriophage P1 clones, cosmids, or yeast artificial chromosomes. The approach makes use of a special LambdaScan vector that reliably yields subclones in the range 8-12 kb. A number of subclones in this size range, typically 96 or 192, are chosen at random, the ends of the inserts are sequenced using vector-specific primers, and long-range spectrum PCR is used to order and orient the subclones. Computer simulations indicate that, even when each sequencing reaction yields only 350 bp, the scanning of 192 subclones would result in an approximate ratio of 1:2:1 of regions sequenced two times (double-stranded sequence), one time (single-stranded sequence), or not at all (gaps). Longer reads tip the ratio strongly toward more double-stranded sequence.

The use of long-range spectrum PCR to order and orient the scanning templates adds considerably to the utility of the procedure. First, the subclones have 10-kb inserts, calling for the use of long-range PCR (1). Second, because the subclones are longer, fewer are needed to cover a target clone with suitable redundancy; for example, 96 subclones with inserts of 10 kb are needed to cover an 80-kb P1 insert with 12-fold redundancy. The smaller number of subclones eliminates the need for pooling, and so the PCR amplifications can be carried out with single templates, thus reducing the likelihood of artifacts. Third, in spectrum PCR as originally envisaged (3), subclones not used for walking are discarded; in sequence scanning, the STS at each end of each clone is ordered and spaced relative to neighboring ends in the process of walking, and this information is incorporated into the finished framework sequence. Finally, the turnaround time for each step in the walk in sequence scanning is abbreviated because all possible sequences needed for primer selection are already available.
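
Two different redundancies appear in the text, and it may help to write them side by side. The 12-fold figure refers to coverage of the target by subclone inserts, whereas the values of about 0.7-2.0 quoted for the shotgun phase refer to coverage by sequencing reads. The symbols below are introduced here for illustration only: N is the number of subclones, I the insert length, L the read length from each end, and G the length of the target clone.

```latex
\begin{align*}
c_{\text{clone}} &= \frac{N I}{G}
  = \frac{96 \times 10\ \text{kb}}{80\ \text{kb}} = 12, \\
c_{\text{seq}} &= \frac{2 N L}{G}
  = \frac{2 \times 192 \times 350\ \text{bp}}{100\ \text{kb}} \approx 1.3 .
\end{align*}
```

The first value governs how densely the insert ends tile the target for ordering by long-range spectrum PCR; the second governs how much of the target is actually read.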

In sequence scanning, although some regions of the target clone remain unsequenced, most of the gaps are small and the location and size of each gap are known from the long-range spectrum PCR. For example, given a 100-kb target sequence (the size of a typical P1 clone), sequence scanning of 192 subclones with an average read length of 350 bp results in a framework sequence containing an expected 80-90 gaps averaging ~300 bp. Longer sequence reads serve primarily to reduce the number of gaps. Completing the sequence can be accomplished by sequencing from the STS primers used in the long-range spectrum PCR, by longer reads off the vector ends of critical templates, or by primer walking. It may even be cost-effective to include a somewhat greater redundancy of shotgun sequencing at the beginning, or as the scanning proceeds, either with additional 8- to 12-kb subclones or with shorter double-stranded or single-stranded sequencing templates. In any case, the 192 ordered subclones with sequenced ends afford a framework sequence of the target that is likely to be of great utility in identifying coding regions and in comparisons between genomes.
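
A minimal Monte Carlo sketch of this kind of scan is given below. It is not the program used for the simulations reported here, and several choices are assumptions made for illustration: insert lengths are drawn uniformly from 8-12 kb, each subclone lies entirely within the target and is placed uniformly at random, the read from the left end of an insert is assigned to one strand and the read from the right end to the other, and "double-stranded" is taken to mean positions read on both strands. All function and parameter names are hypothetical.

```python
import random

def simulate_scan(target_len=100_000, n_templates=192, read_len=350,
                  insert_range=(8_000, 12_000), seed=0):
    """Simulate one scan and summarize coverage and gaps."""
    rng = random.Random(seed)
    fwd = [False] * target_len   # positions read on the forward strand
    rev = [False] * target_len   # positions read on the reverse strand
    for _ in range(n_templates):
        insert = rng.randint(*insert_range)            # inclusive bounds
        start = rng.randrange(0, target_len - insert + 1)
        end = start + insert
        for i in range(start, start + read_len):       # left-end read
            fwd[i] = True
        for i in range(end - read_len, end):           # right-end read
            rev[i] = True

    both = single = 0
    gaps = []        # lengths of runs of unread positions
    run = 0
    for f, r in zip(fwd, rev):
        if f or r:
            if run:
                gaps.append(run)
                run = 0
            if f and r:
                both += 1
            else:
                single += 1
        else:
            run += 1
    if run:
        gaps.append(run)

    return {
        "double": both / target_len,
        "single": single / target_len,
        "gap": sum(gaps) / target_len,
        "n_gaps": len(gaps),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }

def average_scans(n_reps=20, **kwargs):
    """Average the summary statistics over n_reps independent scans."""
    runs = [simulate_scan(seed=s, **kwargs) for s in range(n_reps)]
    return {k: sum(r[k] for r in runs) / n_reps for k in runs[0]}
```

Calling `average_scans(n_reps=100, n_templates=192, read_len=350)` mirrors the design of 100 replicates per condition. No output is quoted here because the numbers depend on the assumptions listed above, in particular on how subclones near the ends of the target are handled.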

We gratefully acknowledge the Stratagene Corporation and the Promega Corporation for their permission to adapt the vectors LambdaZAP and LambdaGEM-12, respectively, to create LambdaScan. Many thanks to Benjamin Kirkup for the computer simulations. We are also grateful to Bruce Roe, Walter Gilbert, and Gerald Rubin for helpful discussions. This work was supported by National Institutes of Health Grant HG01250.

1. Barnes, W. M. (1994) Proc. Natl. Acad. Sci. USA 91, 2216-2220.
2. Hartl, D. L. & Lozovskaya, E. R. (1995) The Drosophila Genome Map: A Practical Guide (R. G. Landes, Austin, TX).
3. Yoshida, K., Strathmann, M. P., Mayeda, C. A., Martin, C. H. & Palazzolo, M. J. (1993) Nucleic Acids Res. 21, 3553-3562.
