SCALEs: multiscale analysis of library enrichment


Michael D Lynch, Tanya Warnecke & Ryan T Gill

We report a genome-wide, multiscale approach to simultaneously measure the effect that the increased copy of each gene and/or operon has on a desired trait or phenotype. The method involves (i) growth selections on a mixture of several different plasmid-based genomic libraries of defined insert sizes or SCALEs, (ii) microarray studies of enriched plasmid DNA, and (iii) a mathematical multiscale analysis that precisely identifies the relevant genetic elements. This approach allows for identification of all single open reading frames and larger multigene fragments within a genomic library that alter the expression of a given phenotype. We have demonstrated this method in Escherichia coli by monitoring, in parallel, a population of >10^6 genomic library clones of different insert sizes, throughout continuous selections over a period of 100 generations.

Conventional genetic methods for identifying the basis of cellular phenotypes are often laborious, not always genome-wide or quantitative, and often not capable of studies involving cell populations. Advances in genomics technologies and associated applications have addressed several but not all of these limitations simultaneously (refs. 1–7). For example, although transcriptional profiling is genome-wide, now easy to perform and capable of single-gene resolution, profiling data only provide for the development of correlative, as opposed to causative, relationships between altered gene expression and a particular phenotype. Conventional methods for identifying such relationships typically involve the evaluation of insertional mutant or extrachromosomal genomic libraries. In the case of insertional mutant libraries, genome-wide approaches that provide high-resolution (single-gene) and quantitative data have begun to provide new insights into the population dynamics of mutants grown competitively in various environments (refs. 4,7–10). Although attempts have been made to develop similar approaches for application to extrachromosomal genomic libraries (refs. 5,6), such approaches have yet to provide the level of resolution required for quantitative studies. We report here the demonstration of such an approach.

Our approach involves simultaneous growth selections on mixtures of several plasmid libraries containing defined, yet different, insert sizes or scales. This is followed by a microarray and multiscale analysis. The multiscale analysis decomposes the microarray signal into scales corresponding to the signal contribution from each of the different, distinct-sized libraries (Fig. 1). We hypothesized that selections performed on such mixed libraries would produce unique signal intensity patterns along the genome that would indicate specific genes or regions required for altered growth. That is, for phenotypes resulting from the overexpression of short pieces of genomic DNA (that is, a single gene, small RNA or DNA-binding motif), enrichment of the insert DNA would occur in each of the libraries and result in a sharp signal-intensity peak corresponding to the gene of interest. In contrast, for those phenotypes dependent upon the overexpression of a larger region of genomic DNA (that is, an operon), enrichment would occur only in those libraries containing the largest insert DNA, leading to a broad signal-intensity peak. Additional scenarios provide for similarly unique signal-intensity patterns (that is, larger libraries that carry genes with antagonistic or synergistic effects would not be enriched or further enriched, respectively).

Using this approach, we monitored changes in the population of >10^6 clones in an E. coli genomic library throughout 100 generations of continuous culture. This approach accurately identifies the location of fitness-altering loci as well as the specific size of the relevant fragments, thus minimizing the need for subcloning. Additionally, comparisons among such scale-specific enrichment and dilution patterns can be used to gain insight into the identity of genes required for the maximum fitness benefit as well as genes that might have antagonistic effects when present at increased copy. We expect that this approach is readily extendable for studying increased-copy or mutation effects in other organisms for a broad range of phenotypes.

RESULTS

Time-course evaluation of continuous culture selections

To demonstrate the SCALEs method, we created five E. coli K12 genomic libraries in the pSMART-LC-Kan vector. Libraries constructed in these vectors have been shown to have better representation than those constructed in more conventional vectors. Each of these five libraries contained enough clones (>10^5–10^6 clones) to ensure with >99% probability that the entire genome was represented.
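As an illustration only, the clone numbers quoted above can be sanity-checked with the standard Clarke–Carbon estimate, N = ln(1 − P) / ln(1 − f), where f is the insert size divided by the genome size. The text does not say which estimate was used, so the sketch below is an assumption; the genome size (about 4.64 Mbp for E. coli K12) and the function name are introduced here and do not come from the article. The insert sizes are those listed in Methods.

```python
import math

ECOLI_K12_GENOME_BP = 4_640_000  # approximate genome size; an assumption, not from the text


def clones_needed(insert_bp: int, probability: float = 0.99,
                  genome_bp: int = ECOLI_K12_GENOME_BP) -> int:
    """Clarke-Carbon estimate of the clones needed so that any given locus
    is present in the library with the requested probability."""
    fraction = insert_bp / genome_bp  # share of the genome carried by one insert
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


# Clones required for 99% coverage at each nominal insert size (bp).
needed = {bp: clones_needed(bp) for bp in (500, 1_000, 2_000, 4_000, 8_000)}
```

Under these assumptions the required numbers are on the order of 10^3 to 10^5 clones, so libraries of 10^5 to 10^6 clones are consistent with the stated >99% coverage.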

We developed a continuous culture system with a working volume of 100 ml. For each duplicate culture, we transformed genomic libraries into E. coli MACH1-T1 cells, yielding greater than 10^7 clones for each library. We mixed all transformants, grew them to mid-log phase (OD600 nm = 0.5) and inoculated them into culture vessels containing MOPS minimal medium supplemented with 20 µg/ml kanamycin. The initial dilution rate was set at 0.9 h^-1, which is approximately the growth rate of MACH1-T1 cells. To maintain an environment that was not nutrient limited, it was necessary to gradually increase the selective pressure, or dilution rate, within the culture vessel to try to maintain a constant cell density.

RECEIVED 16 MAY; ACCEPTED 27 SEPTEMBER; PUBLISHED ONLINE 12 NOVEMBER 2006; DOI:10.1038/NMETH946

Department of Chemical and Biological Engineering, University of Colorado, ECCH 111, Campus Box 424, Boulder, Colorado 80309, USA. Correspondence should be addressed to R.T.G. ([email protected]).

NATURE METHODS | VOL.4 NO.1 | JANUARY 2007 | 87

After 24 h, we removed a sample from each culture every 12 h for preparation and hybridization to Affymetrix E. coli GeneChips (Fig. 2). Although the initial population inoculated into the two cultures may have been different, both trials selected for clones of the same regions of the genome. This demonstrates some reproducibility among the clones with the greatest fitness in the culture.

The information of primary interest here is the precise identity of the genomic regions for which increased copy number conferred a selective advantage. This selective advantage or fitness "reflects the propensity to leave descendants" (ref. 11), which in our studies was approximated by changes in clone or allele frequency within the population of the culture. Thus, the fitness of our clones can be thought of in terms of residence time, which is a complex function depending not only upon planktonic growth rate but also on wall adherence and other factors such as initial conditions and inter-clone interactions. To obtain this information, we first calculated position-specific relative enrichment (W′) values for each normalized microarray, subjected these values to an N-sieve–based multiscale analysis (Fig. 2), and then log-transformed these data to obtain position-specific fitness values (W) as described in Methods. The multiscale analysis provides such information as the fraction of total signal corresponding to each insert size present in our mixture of libraries. For example, the 4-kbp region centered over murC–ddlB has a higher fitness relative to other SCALEs centered over the same genes (Fig. 2d).
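A minimal sketch of the enrichment-to-fitness step, assuming that the frequency of a region is its estimated concentration divided by that of the vector backbone and that W is the logarithm of the frequency ratio between two consecutive samples. The natural logarithm, the numerical values and all names are assumptions made for illustration; the exact definitions are those given in Methods.

```python
import math


def allele_frequency(concentration: float, backbone_concentration: float) -> float:
    """Frequency of a genomic position relative to the vector backbone."""
    return concentration / backbone_concentration


def fitness(freq_later: float, freq_earlier: float) -> float:
    """Log of the relative enrichment W' = f(later) / f(earlier)."""
    return math.log(freq_later / freq_earlier)


# Hypothetical concentrations (pM) at three positions in the 24 h and 36 h samples.
conc_24h = [1.0e-4, 2.0e-4, 5.0e-5]
conc_36h = [2.0e-4, 2.0e-4, 2.5e-5]
backbone_24h = backbone_36h = 1.0e-2  # hypothetical backbone concentration

w = [fitness(allele_frequency(c36, backbone_36h), allele_frequency(c24, backbone_24h))
     for c24, c36 in zip(conc_24h, conc_36h)]
# Positive values mark enrichment, zero no change, negative values dilution.
```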

Multiscale analysis of clone-population dynamics

The biggest advantage of the SCALEs approach is that this analysis can be rapidly extended to the entire genome for each array and time point evaluated (Fig. 3). We demonstrated the dynamic nature of the library populations as a function of time and position, as well as the ability of SCALEs to track genome-wide enrichment and dilution patterns that can be decomposed to individual scales corresponding to the specific insert sizes used to create our genomic libraries.

Population dynamics

The analysis just described allows one to calculate specific fitness values for each time point, scale and genomic position evaluated. These values represent the relative enrichment of a particular region of the genome at each scale throughout the time course of our selections. These data provide information about changes in the distribution of fitness values occurring throughout the selection (Fig. 4). We observed that selection acted not only to reduce the total number of different clones within our population but also to spread the distribution in fitness values and increase the median fitness value. It should be noted that at the end of our selections the fittest clones made up a substantial portion of the overall population and that, in fact, variation across all individuals was reduced. A particularly useful aspect of the SCALEs approach is that we were able to identify the corresponding position, scale and associated genes that were enriched or diluted, which, when combined with functional information, provides a powerful method for dissecting selectable phenotypes.

Parallel selections

SCALEs uses libraries with defined yet different insert sizes and a multiscale analysis to decompose gene-chip signals into scale-specific contributions. We provide genome-wide, position-specific fitness values for each of the SCALEs contained within our libraries that were substantially enriched in both selections (Fig. 5).

Figure 1 | Overview of SCALEs. (a) Genomic DNA fragmented to several specific sizes is ligated into vectors creating several libraries with defined insert sizes. (b) These libraries are individually transformed into the cell line used for selections. (c) The pools of transformants are mixed and subjected to selection. Only clones bearing plasmids with insert increasing fitness survive. (d) Enriched plasmids are purified from the selected population, prepared for hybridization and applied to a microarray. (e) After analyzing the microarray signal, the processed signal is treated as a function of sequence position. (f) A nonlinear multiscale decomposition gives the signal not only as a function of position but also as a function of scale or library size.


Enrichment for a particular region was specific to only a few of the insert sizes encompassing the region, demonstrating the high-resolution (single-clone) nature of SCALEs, which accurately identifies not only the location of fitness-altering loci but also the specific size of the relevant fragments, thus minimizing the need for any additional subcloning. Comparison of enrichment patterns for each scale surrounding a specific region allows one to identify not only the gene(s) essential for expression of the phenotype (that is, those on the smallest scale) but also the adjacent genes that may have antagonistic or beneficial effects. Moreover, this comparison allows one to assess reproducibility among populations selected in parallel.

Figure 2 | Multiscale analysis. (a) Microarray images following the time course of the selection. Two continuous cultures were inoculated with a mixture of transformants from each size library. After 24 h, samples were taken every 12 h and applied to Affymetrix E. coli Antisense GeneChips. (b–e) An example of the multiscale analysis for the region of the E. coli K12 genome (97000–110000 bp). The original corrected probe signals and their corresponding positions along the genome (b). The de-noised concentration is plotted as a function of genomic position (c). The results of the multiscale analysis, with the SCALEs differentiated by color as indicated (d). Reconstruction of the original signal; the SCALEs are added along the genome to reconstruct the original signal (e).

[Figure image not reproduced. Panel a shows the genomic libraries and the 24, 36, 48 and 60 h samples for selections 1 and 2. Axes in the remaining panels: signal or concentration (pM) versus genomic position; genes shown: murD, ftsW, murG, murC, ddlB, ftsQ, ftsA, ftsZ, lpxC, secM and secA; scale legend: 0.75–1.5, 1.5–3.0, 3.0–5.0, 5.0–10 and 10–20 kbp.]

Figure 3 | Genome-wide multiscale analysis. (a) A genome-wide plot of the multiscale analysis of the fitness for one culture over time. For each time point, the fitness for each 125-bp position is plotted around the genome for each scale referred to in the legend. Time is increasing outward from the center circle. Circles i, ii, iii and iv correspond to the 36/24 h (that is, the 36-h frequency relative to 24-h frequency as described in Methods), 48/36 h, 60/48 h and cumulative fitness, respectively. The percentage of the E. coli genome is plotted clockwise around the circles. The SCALEs are differentiated by color as indicated in the legend. (b–e) An enlargement of the analysis of the genomic segment corresponding to 2207000–2217500 bp for the 36 h/24 h fitness calculation (b) and the 48 h/36 h, 60 h/48 h and cumulative fitness samples (c–e, respectively).

[Figure image not reproduced. Axes in the enlarged panels: fitness or cumulative fitness versus genomic position; genes shown: yehQ, yehR, yehS, yehT, yehU, mlrA, yehW, yehX, yehY and yehZ; scale legend: 0.75–1.5, 1.5–3.0, 3.0–5.0, 5.0–10 and 10–20 kbp.]


Validation of SCALEs approach

To validate the results generated by SCALEs, we sampled and sequenced a total of 75 individual clones from both cultures at all time points and made several comparisons. First, we determined that all 75 inserts identified by sequencing were also identified as present in the corresponding plasmid population using the microarray approach. Second, 48 of 50 inserts identified by sequencing at later time points mapped to genomic regions showing the highest levels of enrichment by SCALEs. By contrast, 0 of 10 clones randomly chosen from an unselected library contained inserts identified as present on the selected arrays, but 8 of these 10 clones did map to regions present in the control array experiment performed with the original library. Clones not observed in the arrays had growth rates lower than the corresponding dilution rate of the culture, which was 1.2 h^-1 at 24 h and 2.1 h^-1 at 60 h, and would not have been expected to be present in the selected population. The measured growth rates of MACH1-T1 and the vector control were 0.9 ± 0.04 h^-1 and 0.85 ± 0.06 h^-1, respectively. As maintenance in the culture is a function of many different parameters, including the ability to adhere to the glass wall of the culture vessel, we expected that, at high dilution rates, the array results corresponding to later stages of the selections would indicate genes involved in wall adherence and thus biofilm formation. In fact, after 60 h of selection, the majority of the signal mapped to five regions of the genome corresponding to five members of paralogous gene group-117 from E. coli K12, including the open reading frames yliF (>4 kbp), yddV (4 kbp), adrA (1–2 kbp), yeaP (1–2 kbp) and ydeH (2 kbp). These open reading frames all contain GGDEF domains, which have recently been shown to catalyze the formation of cyclic di-GMP, a key second messenger regulating biofilm formation (refs. 12–19). Overexpression of such genes has been previously shown to confer a biofilm phenotype in E. coli (ref. 17). Thus, to provide external validation of the SCALEs approach, we confirmed that clones isolated from our cultures that contain these regions at high copy number rapidly form biofilms that adhere to glass surfaces (unpublished data). It is important to note that for each of these five regions the SCALEs approach measured enrichment at different SCALEs, which further validates the ability of the SCALEs approach to provide high-resolution data. We also confirmed these results by sequencing, where insert sizes matched the sizes identified as enriched by SCALEs. Finally, clones constructed carrying deletions within the regions did not confer the biofilm phenotype to the same extent.

Figure 4 | Relative fitness distributions for the different clones remaining (W > 10^-6) in the culture during selection after 36, 48 and 60 h. Fitness for each position and scale (0.75–1.5, 1.5–3, 3–5, 5–10 and 10–20 kbp) within the genome is included if present above background on the microarray. The average relative fitness of the population in the culture is increasing over time while, as is shown in the inset, the number of different clones retained from the 24 h sample is decreasing.

[Figure image not reproduced. Axes: clone number versus cumulative fitness; inset: percentage of 24 h clones remaining at 24, 36, 48 and 60 h.]

Figure 5 | Overlap of enriched regions in replicate cultures. (a) Regions and scales with cumulative fitness greater than W = 0.5 in both cultures after 60 h. (b,c) Cumulative fitness data by scales for the region from position 2207000 to 2217500 bp in the E. coli genome resulting from the first (b) or second (c) selection. (d) The enrichment patterns do not match over the entire region but do overlap for the 2-kb scale centered over yehU–mlrA genes.

[Figure image not reproduced. Axes in b–d: cumulative fitness versus genomic position; scale legend: 0.75–1.5, 1.5–3.0, 3.0–5.0, 5.0–10 and 10–20 kbp.]

DISCUSSION

We expect that this approach is widely applicable for studying increased-copy or mutation effects in other organisms in a broad range of different selective environments, thus allowing well-controlled studies of microbial evolution. For example, because SCALEs allows for the quantitative tracking of individual clones within a large population of clones, SCALEs can be used to examine the reproducibility of parallel selections, the outcome of different selection strategies, or the role of different evolutionary mechanisms in selection experiments (see ref. 11 for a review). Finally, because SCALEs provides causative relationships between gene copy or overexpression and fitness, it can be combined with other genetic methods such as microarray-based insertional mutagenesis approaches to provide for the development and validation of genome-scale models of cellular behavior (refs. 20–22). SCALEs can have an important role in post-genomics efforts to improve understanding of the relationships that exist between genotype, phenotype and fitness.

There are several challenges to present and future applications of this method. The first is a requirement for truly representational extrachromosomal libraries. This relies not only on the numbers of clones, which is a measure of the ability to clone all regions of the genome (including difficult-to-clone genes), but also on the avoidance of repeat cloning of the same genomic regions within such libraries, which requires a bias-free stable cloning vector. For this reason, we have used vectors that contain strong, bidirectional transcriptional terminators flanking the multiple cloning site, which has been shown to ensure adequate representation and to minimize fitness defects associated with uncontrolled expression of cloned genes (ref. 23). A second challenge concerns the probe density of available microarrays. The current E. coli antisense arrays available from Affymetrix have a highly variable probe density along the genome. The limitation is that plasmids representing regions with a low number of probes may be misrepresented in the microarray signal. Fortunately, technologies exist to address both of these challenges and should be taken advantage of in future studies using the SCALEs approach (ref. 24; also see examples from Affymetrix and Nimblegen). Such studies might include the application of SCALEs to studies involving mutated libraries, libraries constructed from mutagenized and selected clones, or various growth conditions (nutrient limitations, antimicrobials) and selection strategies (serial transfer versus cultures). A challenge for future applications of SCALEs concerns the difficulties of designing selections for specific phenotypes when fitness is often a complex function of multiple phenotypes. As demonstrated here, SCALEs is well suited for improving fundamental understanding of this issue, which should allow efforts to develop functional knowledge of the rapidly expanding list of fully sequenced microbial genomes.

METHODS

Bacteria, plasmids, media and library construction. We used wild-type Escherichia coli K12 (ATCC # 29425) for the preparation of genomic DNA. We grew cultures for library construction in Luria-Bertani (LB) medium at 37 °C. We constructed genomic libraries of insert sizes 500, 1,000, 2,000, 4,000 and >8,000 base pairs of E. coli strain K12 genomic DNA in the pSMART LC-Kan vector (low-copy) according to the manufacturer's instructions (Lucigen). We obtained greater than 10^6 and 10^5 clones for libraries with insert sizes less than 4,000 base pairs and greater than 8,000 base pairs, respectively. Detailed protocols are available in Supplementary Methods online.

Continuous cultures. We introduced purified plasmid DNA from each library into MACH1-T1R (Invitrogen) by electroporation. We prepared electrocompetent MACH1-T1R cells by standard glycerol washes on ice to a final concentration of 10^11 cells/ml (ref. 25). We plated 1/1,000 volume of the original transformations on LB with kanamycin in triplicate to determine transformation efficiency and numbers of transformants. We combined the original cultures, diluted them to 100 ml with MOPS minimal medium and incubated them at 37 °C for 6 h or until the cultures reached an OD600 of 0.50. We then introduced this mixture into a 100-ml culture vessel for continuous culture studies. We recorded the OD600 of the culture every 6 h and adjusted the dilution rate according to the growth. We added MOPS minimal medium with kanamycin at a controlled volumetric flow rate by use of a peristaltic pump. Similarly, volume was maintained by an outlet pump set to a maximal flow rate at a given depth in the culture vessel. The culture was continuously agitated using a stir plate, maintained at 37 °C, and aerated using filtered house air.
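For orientation, the dilution rate D of a continuous culture is the volumetric flow rate divided by the working volume, and a planktonic clone whose growth rate stays below D is gradually washed out. The sketch below works through that arithmetic with the volume and rates quoted in this article; the function names are assumptions, and the washout rule is the textbook idealization that ignores wall adherence.

```python
def flow_rate_ml_per_h(dilution_rate_per_h: float, working_volume_ml: float) -> float:
    """Feed flow needed to hold a given dilution rate (D = F / V)."""
    return dilution_rate_per_h * working_volume_ml


def is_washed_out(growth_rate_per_h: float, dilution_rate_per_h: float) -> bool:
    """Idealized washout test: planktonic cells growing slower than D are lost."""
    return growth_rate_per_h < dilution_rate_per_h


initial_flow = flow_rate_ml_per_h(0.9, 100.0)    # initial setting, about 90 ml/h
control_lost_at_24h = is_washed_out(0.85, 1.2)   # vector control vs. the 24 h dilution rate
```

With the reported dilution rates of 1.2 h^-1 and 2.1 h^-1, a clone growing at the vector-control rate would not persist by planktonic growth alone, which is consistent with the enrichment of adherence-related regions reported in Results.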

Sampling. After 24 h of growth, every 12 h we inoculated 100 ml of LB with kanamycin with a 100 µl sample collected from the outlet stream. We plated 10 µl of the 100-ml culture on LB with kanamycin to obtain colonies for sequencing and subsequent growth studies. We incubated the remainder of the culture at 37 °C for 12 h, with shaking at 225 r.p.m. We amplified plasmids from these cultures by growing the cells in medium containing chloramphenicol at 37 °C for 30 min and collected the cells by centrifugation. We extracted plasmid DNA using a HiSpeed Plasmid Midi kit (Qiagen).

Microarray studies: hybridizations. For each array, we mixed 7.5 µg of sample plasmid DNA with the following control plasmid DNA, which was similarly purified: 1,000 ng pGIBS-DAP (ATCC#87486), 100 ng pGIBS-THR (ATCC#87484), 10 ng pGIBS-TRP (ATCC#87485) and 1 ng pGIBS-PHE (ATCC#87483). We digested the plasmid mixture at 37 °C overnight with 10 units each of AluI and RsaI (Invitrogen) in a reaction containing 50 mM Tris-HCl (pH 8.0) and 10 mM MgCl2. We heat-inactivated the enzymes in these reactions at 70 °C for 15 min. Then we added 10× One Phor All buffer (Amersham Pharmacia Biotech) to these reactions to a final 1× concentration, two units of RQDNAse I (Fisher) and 200 units of Exonuclease III (Fisher). We incubated these reactions at 37 °C for 30 min and then heat-inactivated the enzymes at 98 °C for 20 min. We labeled the resulting fragmented single-stranded DNA with biotinylated ddUTP using the Enzo BioArray Terminal Labeling kit (Enzo Life Sciences) following the manufacturer's protocol.

Affymetrix E. coli Antisense GeneChip arrays were handled at the University of Colorado DNA Microarray facility according to manufacturer's specifications using a GeneChip hybridization oven, GeneChip fluidics station, GeneArray scanner and GeneChip Operating Software v1.1 (Affymetrix).

Microarray studies: low-level probe analysis. We extracted probe-level signals from the Affymetrix .cel file. We calculated the background for each probe according to the algorithm used by the Microarray Suite 5.0 (MAS 5.0) software from Affymetrix. We partitioned background-corrected probes into groups of 25 probe pairs (50 probes), each group having a similar affinity. Each pair consists of a perfect match and a mismatch probe. The predicted affinities used to make these groupings were taken from the literature (ref. 26). For each group, we estimated nonspecific signal by a robust regression of the perfect-match probe signal against the difference between the perfect-match and mismatch probe signals. We performed the robust regression by fitting a repeat median line (ref. 27). The intercept of this line was used as an estimate of nonspecific signal for the probe group and subtracted from each perfect-match probe signal. We set the minimum signal allowable for any probe to 1. We then corrected these perfect-match signals for brightness by dividing them by their predicted affinities, resulting in final corrected values (ref. 26).
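The robust fit described above can be sketched with Siegel's repeated median estimator: the slope is the median, over points, of the median pairwise slope through each point, and the intercept is the median of the offsets that remain. This is one common formulation and is offered as an assumption about the fit, not as the exact procedure used here; the function name and the toy data are hypothetical.

```python
from statistics import median


def repeated_median_line(x, y):
    """Return (slope, intercept) of a repeated median line through (x, y)."""
    n = len(x)
    per_point_slopes = []
    for i in range(n):
        # Slopes from point i to every other point with a different x value.
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for j in range(n) if j != i and x[j] != x[i]]
        if slopes:
            per_point_slopes.append(median(slopes))
    slope = median(per_point_slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Toy probe group: x = PM - MM difference, y = PM signal, with one gross outlier.
diff = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
pm = [10.0, 12.0, 14.0, 16.0, 18.0, 500.0]
slope, nonspecific = repeated_median_line(diff, pm)
# Subtract the intercept and apply the floor of 1 described in the text.
corrected = [max(signal - nonspecific, 1.0) for signal in pm]
```

The single outlier leaves the fitted line unchanged in this toy case, which is the property that makes a median-based fit attractive for probe-level data.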

NATURE METHODS | VOL.4 NO.1 | JANUARY 2007 | 91

Microarray studies: signal summarization and de-noising. The corrected probe signals can be mapped to their positions in the genome. We calculated the signal at any given position as the Tukey biweight (weighted average) of the closest 25 probe signals to that position27. We then applied to the signal a median filter with a window length of 1,000 bp, which served to remove any signal spikes resulting from SCALEs smaller than 500 bp.
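A minimal sketch of these two steps follows, under stated assumptions. The biweight is the one-step form with tuning constants c = 5 and eps = 1e-4, which are assumed here and not given in the text. The median filter works on a uniformly sampled signal with the window expressed in samples rather than base pairs. All function names are hypothetical.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den

def signal_at(position, probe_positions, probe_signals, k=25):
    """Biweight of the k probe signals closest to a genomic position."""
    order = sorted(range(len(probe_positions)),
                   key=lambda i: abs(probe_positions[i] - position))
    return tukey_biweight([probe_signals[i] for i in order[:k]])

def median_filter(signal, window):
    """Sliding median; windows are truncated at the ends of the signal."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]
```

A median filter of window w suppresses features narrower than about w/2 and preserves step edges. This is consistent with a 1,000-bp window removing spikes from fragments shorter than 500 bp.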

Multi-scale analysis. We performed a nonlinear multiscale decomposition to decompose the signal into SCALEs corresponding to the signal contribution from each of the different, distinct-sized libraries. This was done using an N-Sieve decomposition grouping the signal into sieves or windows. This analysis is capable of discriminating all possible SCALEs. For our purposes, we grouped SCALEs into sizes surrounding the insert size of our libraries (300–600 bp, 600–1,200 bp, 1,200–2,400 bp, 2,400–5,000 bp, 5,000–10,000 bp and 10,000–20,000 bp)28,29.
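The idea of separating a signal by feature width can be sketched with a cascade of median filters of increasing window size. This is a deliberately simplified stand-in and not the N-Sieve implementation used here. It assumes a uniformly sampled signal and windows given in samples, and the function names are hypothetical. Each band holds the features removed when moving to the next, wider window.

```python
from statistics import median

def median_filter(signal, window):
    """Sliding median; windows are truncated at the ends of the signal."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]

def median_cascade(signal, windows):
    """Split a signal into width bands using successively wider medians.

    Returns (bands, residual). bands[k] contains what the k-th window
    removed from the output of the previous stage, so the bands plus the
    residual sum back to the original signal.
    """
    bands = []
    current = list(signal)
    for w in sorted(windows):
        smoothed = median_filter(current, w)
        bands.append([c - s for c, s in zip(current, smoothed)])
        current = smoothed
    return bands, current
```

With windows chosen at roughly twice each library's upper insert size, a narrow peak would fall in an early band and a broad multigene plateau in a later one.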

Normalization. We normalized the signal values for the SCALEs obtained after the multiscale analysis using the information from the positive control probes. We analyzed these probes similarly to the genomic data using a single scale corresponding to the length of the associated positive control target gene. This analysis allowed for a sigmoidal relationship between signal intensity and molar concentration of the form given in equation 1, which can be fit to the control probe data and used to estimate molar concentration for the remaining summarized signals on a given array. S, processed signal from array analysis for a given element; Amax, maximal processed signal; C, concentration of a given genetic element; a and b, fitted parameters.

$$S = \frac{A_{\max}}{(a/C) + b} \qquad (1)$$
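One way to fit and invert equation 1 is sketched below, as an illustration and not the fitting procedure used here. Taking reciprocals of S = Amax/((a/C) + b) gives 1/S = (a/Amax)(1/C) + b/Amax, which is linear in 1/C. Only the ratios a/Amax and b/Amax are identifiable from the data, so the sketch fits those two numbers by ordinary least squares. Unweighted fitting on reciprocals is an assumption made for simplicity, and the function names are hypothetical.

```python
def fit_calibration(conc, signal):
    """Least-squares line 1/S = slope * (1/C) + offset.

    slope corresponds to a/Amax and offset to b/Amax in equation 1.
    """
    x = [1.0 / c for c in conc]
    y = [1.0 / s for s in signal]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset

def concentration_from_signal(s, slope, offset):
    """Invert the fitted curve: C = slope / (1/S - offset)."""
    return slope / (1.0 / s - offset)
```

The inversion is only meaningful for signals below the fitted plateau 1/offset. Concentrations come out in whatever units the control amounts were expressed in.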

Fitness calculations and statistics. To calculate the fitness, for each array we divided the estimated concentration or normalized signal as a function of genomic position by the estimated concentration of the ROP (repressor of primer) gene, which is on the backbone of the vector used for library construction. This allowed us to estimate allele frequency, f, according to equation 2.

$$f_{i,n} = \frac{C_{i,n}}{C_{\mathrm{ROP},n}} \qquad (2)$$

where C is concentration, i is genomic position, and n is microarray sample. With an estimate of allele frequency it was then possible to determine the change in frequency for different samples (that is, $W'_{i,36} = f_{i,36} / f_{i,24}$). We then subjected these values to the N-sieve multiscale analysis, as described above. Finally, we log-transformed the decomposed position-specific values to produce $W = \log(W')$. For the analysis presented in Figure 4, we considered values of $W < 10^{-6}$ to be absent from the culture. We defined cumulative fitness as $W_{\mathrm{cum}} = W_1 + W_2 + W_3$. We determined significance using a t-test comparing clones with $W_{\mathrm{cum}} > 0$ in both cultures to clones with $W_{\mathrm{cum}} < 0$.
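The arithmetic of equation 2 and the fitness values can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes per-position concentration lists for each sample and a natural logarithm, since the base is not stated. It omits the N-sieve step that is applied to the frequency ratios before the log transform.

```python
import math

def allele_frequency(conc, conc_rop):
    """Equation 2: per-position concentration divided by the ROP concentration."""
    return [c / conc_rop for c in conc]

def log_fitness(freq_early, freq_late):
    """W = log(W'), where W' is the ratio of late to early allele frequency."""
    return [math.log(late / early)
            for early, late in zip(freq_early, freq_late)]

def cumulative_fitness(*interval_fitness):
    """W_cum: position-wise sum of W over successive sampling intervals."""
    return [sum(values) for values in zip(*interval_fitness)]
```

For example, a position whose ROP-normalized frequency doubles between two samples gets W = log 2 for that interval, and one that halves gets W = -log 2.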

Accession codes. ArrayExpress: E-TABM-143.

Note: Supplementary information is available on the Nature Methods website.

ACKNOWLEDGMENTS
This work was supported by US National Institutes of Health grants R21 AI055773-01 and K25 AI064338 and National Science Foundation grant BES0228584. M.D.L. was supported by a National Institutes of Health F31 award A1056687. T.W. was supported by a US Department of Education Graduate Assistantship in Areas of National Need fellowship. We thank H. Marshall at the University of Colorado Microarray Facility, and P.D. Bevins for his help with this work.

COMPETING INTERESTS STATEMENT
The authors declare competing financial interests (see the Nature Methods website for details).

Published online at http://www.nature.com/naturemethods/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. DeRisi, J.L., Iyer, V.R. & Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686 (1997).

2. Fodor, S.P. et al. Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773 (1991).

3. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).

4. Badarinarayana, V. et al. Selection analyses of insertional mutants using subgenic resolution arrays. Nat. Biotechnol. 19, 1060–1064 (2001).

5. Cho, R.J. et al. Parallel analysis of genetic selections using whole genome oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 95, 3752–3757 (1998).

6. Gill, R.T., Wildt, S., Yang, Y.T., Ziesman, S. & Stephanopoulos, G. Genome-wide screening for trait conferring genes using DNA microarrays. Proc. Natl. Acad. Sci. USA 99, 7033–7038 (2002).

7. Winzeler, E.A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901–906 (1999).

8. Shoemaker, D.D., Lashkari, D., Morris, D., Mittmann, M. & Davis, R. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 14, 450–456 (1996).

9. Karlyshev, A.V. et al. Application of high-density array-based signature-tagged mutagenesis to discover novel Yersinia virulence-associated genes. Infect. Immun. 69, 7810–7819 (2001).

10. Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387–391 (2002).

11. Elena, S.F. & Lenski, R. Evolution experiments with microorganisms: the dynamics and genetic bases of adaptation. Nat. Rev. Genet. 4, 457–469 (2003).

12. Garcia, B. et al. Role of the GGDEF protein family in Salmonella cellulose biosynthesis and biofilm formation. Mol. Microbiol. 54, 264–277 (2004).

13. Kirillina, O., Fetherston, J.D., Bobrov, A.G., Abney, J. & Perry, R.D. HmsP, a putative phosphodiesterase, and HmsT, a putative diguanylate cyclase, control Hms-dependent biofilm formation in Yersinia pestis. Mol. Microbiol. 54, 75–88 (2004).

14. Simm, R., Fetherston, J., Kader, A., Romling, U. & Perry, R. Phenotypic convergence mediated by GGDEF-domain-containing proteins. J. Bacteriol. 187, 6816–6823 (2005).

15. Simm, R., Morr, M., Kader, A., Nimtz, M. & Romling, U. GGDEF and EAL domains inversely regulate cyclic di-GMP levels and transition from sessility to motility. Mol. Microbiol. 53, 1123–1134 (2004).

16. Brown, P.K. et al. MlrA, a novel regulator of curli (AgF) and extracellular matrix synthesis by Escherichia coli and Salmonella enterica serovar Typhimurium. Mol. Microbiol. 41, 349–363 (2001).

17. Brombacher, E., Dorel, C., Zehnder, A. & Landini, P. The curli biosynthesis regulator CsgD co-ordinates the expression of both positive and negative determinants for biofilm formation in Escherichia coli. Microbiology 149, 2847–2857 (2003).

18. Hickman, J.W., Tifrea, D. & Harwood, C. A chemosensory system that regulates biofilm formation through modulation of cyclic diguanylate levels. Proc. Natl. Acad. Sci. USA 102, 14422–14427 (2005).

19. Jenal, U. Cyclic di-guanosine-monophosphate comes of age: a novel secondary messenger involved in modulating cell surface structures in bacteria? Curr. Opin. Microbiol. 7, 185–191 (2004).

20. Covert, M.W., Knight, E.M., Reed, J.L., Herrgard, M.J. & Palsson, B.O. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 92–96 (2004).

21. Edwards, J.S. & Palsson, B.O. The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97, 5528–5533 (2000).

22. Ibarra, R.U., Edwards, J.S. & Palsson, B.O. Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature 420, 186–189 (2002).

23. Godiska, R., Patterson, M., Schoenfeld, T. & Mead, D. Beyond pUC: vectors for cloning unstable DNA. In DNA Sequencing: Optimizing the Process and Analysis (ed., Kieleczawa, J.) 55–75 (Jones and Bartlett, Boston, 2004).

24. Lynch, M.D. & Gill, R.T. Broad host range vectors for stable genomic library construction. Biotechnol. Bioeng. 94, 151–158 (2006).

25. Sambrook, J., Fritsch, E.F. & Maniatis, T. Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 1989).

26. Naef, F. & Magnasco, M.O. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys. Rev. E 68, 011906 (2003).

27. Hoaglin, D.C., Mosteller, F. & Tukey, J.W. Understanding robust and exploratory data analysis (John Wiley & Sons Inc., New York, 1983).

28. Bangham, J., Chardaire, P., Pye, J. & Ling, P. Multiscale nonlinear decomposition: the sieve decomposition theorem. IEEE Trans. Pattern Anal. Mach. Intell. 18, 529–539 (1996).

29. Bangham, J., Ling, P. & Harvey, R. Scale-space from nonlinear filters. IEEE Trans. Pattern Anal. Mach. Intell. 18, 520–529 (1996).
