SCALEs: multiscale analysis of library enrichment
Michael D Lynch, Tanya Warnecke & Ryan T Gill
We report a genome-wide, multiscale approach to simultaneously
measure the effect that the increased copy of each gene and/or
operon has on a desired trait or phenotype. The method involves
(i) growth selections on a mixture of several different plasmid-
based genomic libraries of defined insert sizes or SCALEs,
(ii) microarray studies of enriched plasmid DNA, and
(iii) a mathematical multiscale analysis that precisely identifies
the relevant genetic elements. This approach allows for
identification of all single open reading frames and larger
multigene fragments within a genomic library that alter the
expression of a given phenotype. We have demonstrated this
method in Escherichia coli by monitoring, in parallel, a
population of >10⁶ genomic library clones of different
insert sizes, throughout continuous selections over a period
of 100 generations.
Conventional genetic methods for identifying the basis of cellular phenotypes are often laborious, not always genome-wide or quantitative, and often not capable of studies involving cell populations. Advances in genomics technologies and associated applications have addressed several but not all of these limitations simultaneously (refs. 1–7). For example, although transcriptional profiling is genome-wide, now easy to perform and capable of single-gene resolution, profiling data only provide for the development of correlative, as opposed to causative, relationships between altered gene expression and a particular phenotype. Conventional methods for identifying such relationships typically involve the evaluation of insertional mutant or extrachromosomal genomic libraries. In the case of insertional mutant libraries, genome-wide approaches that provide high-resolution (single-gene) and quantitative data have begun to provide new insights into the population dynamics of mutants grown competitively in various environments (refs. 4,7–10). Although attempts have been made to develop similar approaches for application to extrachromosomal genomic libraries (refs. 5,6), such approaches have yet to provide the level of resolution required for quantitative studies. We report here the demonstration of such an approach.
Our approach involves simultaneous growth selections on mixtures of several plasmid libraries containing defined, yet different, insert sizes or scales. This is followed by a microarray and multiscale analysis. The multiscale analysis decomposes the microarray signal into scales corresponding to the signal contribution from each of the different, distinct-sized libraries (Fig. 1). We hypothesized that selections performed on such mixed libraries would produce unique signal intensity patterns along the genome that would indicate specific genes or regions required for altered growth. That is, for phenotypes resulting from the overexpression of short pieces of genomic DNA (that is, a single gene, small RNA or DNA-binding motif), enrichment of the insert DNA would occur in each of the libraries and result in a sharp signal intensity peak corresponding to the gene of interest. In contrast, for those phenotypes dependent upon the overexpression of a larger region of genomic DNA (that is, an operon), enrichment would occur only in those libraries containing the largest insert DNA, leading to a broad signal-intensity peak. Additional scenarios provide for similarly unique signal-intensity patterns (that is, larger libraries that carry genes with antagonistic or synergistic effects would not be enriched or further enriched, respectively).
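The expected peak shapes can be illustrated with a small sketch. It is not the authors' code: it assumes idealized libraries in which every insert of a given nominal size that fully contains the required region is enriched equally, and the function names, the nominal 8,000-bp size for the largest library and the example coordinates are all hypothetical.

```python
def enrichment_profile(region_start, region_end, insert_sizes, positions):
    """Expected coverage per position from inserts that fully contain the region.

    Inserts are half-open intervals [start, start + size). Each library
    contributes at most 1.0 at any position (assumption: uniform insert starts).
    """
    profile = [0.0] * len(positions)
    region_len = region_end - region_start
    for size in insert_sizes:
        if size < region_len:
            continue  # this library cannot carry the whole region
        first_start = region_end - size          # leftmost start that still covers the region
        n_starts = region_start - first_start + 1
        for k, x in enumerate(positions):
            lo = max(first_start, x - size + 1)  # insert must also cover position x
            hi = min(region_start, x)
            if hi >= lo:
                profile[k] += (hi - lo + 1) / n_starts
    return profile


def width_at_half_max(profile, step):
    """Width (in bp) of the part of the profile at or above half its maximum."""
    peak = max(profile)
    return step * sum(1 for v in profile if v >= peak / 2)


STEP = 100
POSITIONS = list(range(0, 30001, STEP))
LIBRARIES = [500, 1000, 2000, 4000, 8000]  # nominal insert sizes in bp (assumed)

single_gene = enrichment_profile(14500, 15500, LIBRARIES, POSITIONS)  # 1-kb gene
operon = enrichment_profile(12000, 18000, LIBRARIES, POSITIONS)       # 6-kb operon

print("gene peak width:", width_at_half_max(single_gene, STEP))
print("operon peak width:", width_at_half_max(operon, STEP))
```

Under these assumptions the single gene is carried by every library large enough to hold it, giving a tall, narrow peak, whereas the operon is carried only by the largest library, giving a lower, broader one.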
Using this approach, we monitored changes in the population of >10⁶ clones in an E. coli genomic library throughout 100 generations of continuous culture. This approach accurately identifies the location of fitness-altering loci as well as the specific size of the relevant fragments, thus minimizing the need for subcloning. Additionally, comparisons among such scale-specific enrichment and dilution patterns can be used to gain insight into the identity of genes required for the maximum fitness benefit as well as genes that might have antagonistic effects when present at increased copy. We expect that this approach is readily extendable for studying increased-copy or mutation effects in other organisms for a broad range of phenotypes.
RESULTS
Time-course evaluation of continuous culture selections
To demonstrate the SCALEs method, we created five E. coli K12 genomic libraries in the pSMART-LC-Kan vector. Libraries constructed in these vectors have been shown to have better representation than those constructed in more conventional vectors. Each of these five libraries contained enough clones (>10⁵–10⁶ clones) to ensure with >99% probability that the entire genome was represented.
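As a rough check on library sizes of this order, the standard Clarke–Carbon estimate gives the number of clones needed for a given locus to be present with probability P. The sketch below is illustrative only: the text does not state which calculation the authors used, the estimate concerns a single locus rather than the whole genome at once, and the genome length of about 4.64 Mbp for E. coli K-12 is an assumed input.

```python
import math


def clones_needed(genome_bp, insert_bp, probability):
    """Clarke-Carbon estimate: N = ln(1 - P) / ln(1 - insert/genome)."""
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


GENOME_BP = 4_639_221  # assumed E. coli K-12 genome length

for insert_bp in (500, 1000, 2000, 4000, 8000):  # nominal insert sizes (assumed)
    print(insert_bp, clones_needed(GENOME_BP, insert_bp, 0.99))
```

For every insert size considered here the estimate falls well below 10⁵ clones, so libraries of 10⁵–10⁶ clones leave a wide margin for cloning bias.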
We developed a continuous culture system with a working volume of 100 ml. For each duplicate culture, we transformed genomic libraries into E. coli MACH1-T1 cells, yielding greater than 10⁷ clones for each library. We mixed all transformants, grew them to mid-log phase (OD600 nm = 0.5) and inoculated them into culture vessels containing MOPS minimal medium supplemented with 20 µg/ml kanamycin. The initial dilution rate was set at 0.9 h⁻¹, which is approximately the growth rate of MACH1-T1 cells. To maintain an environment that was not nutrient limited, it was necessary to gradually increase the selective pressure, or dilution rate, within the culture vessel to try to maintain a constant cell density.
RECEIVED 16 MAY; ACCEPTED 27 SEPTEMBER; PUBLISHED ONLINE 12 NOVEMBER 2006; DOI:10.1038/NMETH946
Department of Chemical and Biological Engineering, University of Colorado, ECCH 111, Campus Box 424, Boulder, Colorado 80309, USA. Correspondence should be addressed to R.T.G. ([email protected]).
NATURE METHODS | VOL.4 NO.1 | JANUARY 2007 | 87
ARTICLES
After 24 h, we removed a sample from each culture every 12 h for preparation and hybridization to Affymetrix E. coli GeneChips (Fig. 2). Although the initial population inoculated into the two cultures may have been different, both trials selected for clones of the same regions of the genome. This demonstrates some reproducibility among the clones with the greatest fitness in the culture.
The information of primary interest here is the precise identity of the genomic regions for which increased copy number conferred a selective advantage. This selective advantage or fitness “reflects the propensity to leave descendants” (ref. 11), which in our studies was approximated by changes in clone or allele frequency within the population of the culture. Thus, the fitness of our clones can be thought of in terms of residence time, which is a complex function depending not only upon planktonic growth rate but also on wall adherence and other factors such as initial conditions and inter-clone interactions. To obtain this information, we first calculated position-specific relative enrichment (W′) values for each normalized microarray, subjected these values to an N-sieve–based multiscale analysis (Fig. 2), and then log-transformed these data to obtain position-specific fitness values (W) as described in Methods. The multiscale analysis provides such information as the fraction of total signal corresponding to each insert size present in our mixture of libraries. For example, the 4-kbp region centered over murC–ddlB has a higher fitness relative to other SCALEs centered over the same genes (Fig. 2d).
Multiscale analysis of clone-population dynamics
The biggest advantage of the SCALEs approach is that this analysis can be rapidly extended to the entire genome for each array and time point evaluated (Fig. 3). We demonstrated the dynamic nature of the library populations as a function of time and position, as well as the ability of SCALEs to track genome-wide enrichment and dilution patterns that can be decomposed to individual scales corresponding to the specific insert sizes used to create our genomic libraries.
Population dynamics
The analysis just described allows one to calculate specific fitness values for each time point, scale and genomic position evaluated. These values represent the relative enrichment of a particular region of the genome at each scale throughout the time course of our selections. These data provide information about changes in the distribution of fitness values occurring throughout the selection (Fig. 4). We observed that selection acted not only to reduce the total number of different clones within our population but also to spread the distribution in fitness values and increase the median fitness value. It should be noted that at the end of our selections the fittest clones made up a substantial portion of the overall population and that, in fact, variation across all individuals was reduced. A particularly useful aspect of the SCALEs approach is that we were able to identify the corresponding position, scale and associated genes that were enriched or diluted, which when combined with functional information provides a powerful method for dissecting selectable phenotypes.
Parallel selections
SCALEs uses libraries with defined yet different insert sizes and a multiscale analysis to decompose gene-chip signals into scale-specific contributions. We provide genome-wide, position-specific fitness values for each of the SCALEs contained within our libraries that were substantially enriched in both selections (Fig. 5).
[Figure 1 graphic: panels a–f; plots show signal intensity versus genomic position, with the decomposed signal also shown as a function of scale.]
Figure 1 | Overview of SCALEs. (a) Genomic DNA fragmented to several
specific sizes is ligated into vectors creating several libraries with defined
insert sizes. (b) These libraries are individually transformed into the cell line
used for selections. (c) The pools of transformants are mixed and subjected to
selection. Only clones bearing plasmids with inserts that increase fitness survive.
(d) Enriched plasmids are purified from the selected population, prepared for
hybridization and applied to a microarray. (e) After analyzing the microarray
signal, the processed signal is treated as a function of sequence position.
(f) A nonlinear multiscale decomposition gives the signal not only as a
function of position but also as a function of scale or library size.
Enrichment for a particular region was specific to only a few of the insert sizes encompassing the region, demonstrating the high-resolution (single-clone) nature of SCALEs, which accurately identifies not only the location of fitness-altering loci but also the specific size of the relevant fragments, thus minimizing the need for any additional subcloning. Comparison of enrichment patterns for each scale surrounding a specific region allows one to identify not only the gene(s) essential for expression of the phenotype (that is, those on the smallest scale) but also the adjacent genes that may have antagonistic or beneficial effects. Moreover, this comparison allows one to assess reproducibility among populations selected in parallel.
[Figure 2 graphic: panel a shows microarray images for the genomic libraries and for selections 1 and 2 at 24, 36, 48 and 60 h; panels b–e plot signal or concentration (pM, 0 to 3.0 × 10⁻⁴) against genomic position (97,000–110,000 bp) across murD, ftsW, murG, murC, ddlB, ftsQ, ftsA, ftsZ, lpxC, secM and secA. Scale legend: 0.75–1.5 kbp, 1.5–3.0 kbp, 3.0–5.0 kbp, 5.0–10 kbp, 10–20 kbp.]
Figure 2 | Multiscale analysis. (a) Microarray images following the time course of the selection. Two continuous cultures were inoculated with a mixture of
transformants from each size library. After 24 h, samples were taken every 12 h and applied to Affymetrix E. coli Antisense Gene Chips. (b–e) An example of the
multiscale analysis for the region of the E. coli K12 genome (97000–110000 bp). The original corrected probe signals and their corresponding positions along the
genome (b). The de-noised concentration is plotted as a function of genomic position (c). The results of the multiscale analysis, with the SCALEs differentiated
by color as indicated (d). Reconstruction of the original signal; the SCALEs are added along the genome to reconstruct the original signal (e).
[Figure 3 graphic: panel a is a circular genome-wide plot with circles i–iv and the percentage of the genome marked clockwise (0–100); panels b–e plot fitness or cumulative fitness (2.0–8.0) against genomic position (2,207,000–2,217,500 bp) across the labeled genes yehQ, yehR, yehS, yehT, yehU, mlrA, yehW, yehX, yehY and yehZ, with a 4.0-kbp marker. Scale legend: 0.75–1.5 kbp, 1.5–3.0 kbp, 3.0–5.0 kbp, 5.0–10 kbp, 10–20 kbp.]
Figure 3 | Genome-wide multiscale analysis. (a) A genome-wide plot of the multiscale analysis of the fitness for one culture over time. For each time point, the
fitness for each 125-bp position is plotted around the genome for each scale referred to in the legend. Time is increasing outward from the center circle. Circles
i, ii, iii and iv correspond to the 36/24 h (that is, the 36-h frequency relative to 24-h frequency as described in Methods), 48/36 h, 60/48 h and cumulative
fitness, respectively. The percentage of the E. coli genome is plotted clockwise around the circles. The SCALEs are differentiated by color as indicated in the
legend. (b–e) An enlargement of the analysis of the genomic segment corresponding to 2207000–2217500 bp for the 36h/24h fitness calculation (b) and the
48h/36h, 60h/48h and cumulative fitness samples (c–e, respectively).
Validation of SCALEs approach
To validate the results generated by SCALEs, we sampled and sequenced a total of 75 individual clones from both cultures at all time points and made several comparisons. First, we determined that all 75 inserts identified by sequencing were also identified as present in the corresponding plasmid population using the microarray approach. Second, 48 of 50 inserts identified by sequencing at later time points mapped to genomic regions showing the highest levels of enrichment by SCALEs. By contrast, 0 of 10 clones randomly chosen from an unselected library contained inserts identified as present on the selected arrays, but 8 of these 10 clones did map to regions present in the control array experiment performed with the original library. Clones not observed in the arrays had growth rates lower than the corresponding dilution rate of the culture, which was 1.2 h⁻¹ at 24 h and 2.1 h⁻¹ at 60 h, and would not have been expected to be present in the selected population. The measured growth rates of MACH1-T1 and the vector control were 0.9 ± 0.04 h⁻¹ and 0.85 ± 0.06 h⁻¹, respectively. As maintenance in the culture is a function of many different parameters, including the ability to adhere to the glass wall of the culture, we expected, with high dilution rates, that the array results corresponding to later stages of the selections would indicate genes involved in wall adherence and, thus, biofilm formation. In fact, after 60 h of selection, the majority of the signal mapped to five regions of the genome corresponding to five members of paralogous gene group-117 from E. coli K12, including the open reading frames yliF (>4 kbp), yddV (4 kbp), adrA (1–2 kbp), yeaP (1–2 kbp) and ydeH (2 kbp). These open reading frames all contain GGDEF domains, which have recently been shown to catalyze the formation of cyclic di-GMP, a key second messenger regulating biofilm formation (refs. 12–19). Overexpression of such genes has been previously shown to confer a biofilm phenotype in E. coli (ref. 17). Thus, to provide external validation of the SCALEs approach, we confirmed that clones isolated from our cultures that contain these regions at high copy number rapidly form biofilms that adhere to glass surfaces (unpublished data). It is important to note that for each of these five regions the SCALEs approach measured enrichment at different SCALEs, which further validates the ability of the SCALEs approach to provide high-resolution data. We also confirmed these results by sequencing, where insert sizes matched the sizes identified as enriched by SCALEs. Finally, clones constructed carrying deletions within the regions did not confer the biofilm phenotype to the same extent.
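The washout argument used above can be made concrete with a minimal sketch. It assumes an ideal, well-mixed continuous culture with a constant dilution rate and no wall adherence, so that a clone's abundance follows dN/dt = (μ − D)N; the function name is hypothetical and the model deliberately ignores the adherence effects discussed in the text.

```python
import math


def fraction_remaining(growth_rate, dilution_rate, hours):
    """Relative abundance after `hours` for dN/dt = (mu - D) * N (rates in 1/h)."""
    return math.exp((growth_rate - dilution_rate) * hours)


# Growth rate reported for MACH1-T1 (0.9 per h) against the 24-h dilution rate (1.2 per h).
print(fraction_remaining(0.9, 1.2, 12))
```

In this idealized model a planktonic clone growing at the host rate would fall to a few percent of its starting abundance within one 12-h sampling interval, which is consistent with the expectation that persistence at high dilution rates requires either faster growth or wall adherence.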
DISCUSSION
We expect that this approach is widely applicable for studying increased copy or mutation effects in other organisms in a broad range of different selective environments, thus allowing well-controlled studies of microbial evolution. For example, because SCALEs allows for the quantitative tracking of individual clones within a large population of clones, SCALEs can be used to examine the reproducibility of parallel selections, outcome of different selection strategies, or the role of different evolutionary mechanisms in selection experiments (see ref. 11 for a review). Finally, because SCALEs provides causative relationships between gene copy or overexpression and fitness, it can be combined with other genetic methods such as microarray-based insertional mutagenesis approaches to provide for the development and validation of genome-scale models of cellular behavior (refs. 20–22). SCALEs can have an important role in
[Figure 4 graphic: histogram of clone number (0–8,000) against cumulative fitness (−5.6 to 4.8) for the 36, 48 and 60 h samples; inset shows the percentage of 24 h clones remaining (0–100) at 24, 36, 48 and 60 h.]
Figure 4 | Relative fitness distributions for the different clones remaining
(W > 10⁻⁶) in the culture during selection after 36, 48 and 60 h. Fitness
for each position and scale (0.75–1.5, 1.5–3, 3–5, 5–10 and 10–20 kbp)
within the genome is included if present above background on the microarray.
The average relative fitness of the population in the culture is increasing over
time while, as is shown in the inset, the number of different clones retained
from the 24 h sample is decreasing.
[Figure 5 graphic: panel a is a circular genome-wide plot with the percentage of the genome marked clockwise (0–100); panels b–d plot cumulative fitness (2.0–8.0) against genomic position (2,207,000–2,217,500 bp) across the labeled genes yehQ, yehR, yehS, yehT, yehU, mlrA, yehW, yehX, yehY and yehZ. Scale legend: 0.75–1.5 kbp, 1.5–3.0 kbp, 3.0–5.0 kbp, 5.0–10 kbp, 10–20 kbp.]
Figure 5 | Overlap of enriched regions in replicate cultures. (a) Regions and scales with cumulative
fitness greater than W = 0.5 in both cultures after 60 h. (b,c) Cumulative fitness data by scales for the
region from position 2207000 to 2217500 bp in the E. coli genome resulting from the first (b) or second
(c) selection. (d) The enrichment patterns do not match over the entire region but do overlap for the
2-kb scale centered over yehU–mlrA genes.
post-genomics efforts to improve understanding of the relationships that exist between genotype, phenotype and fitness.
There are several challenges to present and future applications of this method. The first is a requirement for truly representational extrachromosomal libraries. This relies not only on the numbers of clones, which is a measure of the ability to clone all regions of the genome (including difficult-to-clone genes), but also on the avoidance of repeat cloning of the same genomic regions within such libraries, which requires a bias-free stable cloning vector. For this reason, we have used vectors that contain strong, bidirectional transcriptional terminators flanking the multiple cloning site, which has been shown to ensure adequate representation and to minimize fitness defects associated with uncontrolled expression of cloned genes (ref. 23). A second challenge concerns the probe density of available microarrays. The current E. coli antisense arrays available from Affymetrix have a highly variable probe density along the genome. The limitation is that plasmids representing regions with a low number of probes may be misrepresented in the microarray signal. Fortunately, technologies exist to address both of these challenges and should be taken advantage of in future studies using the SCALEs approach (ref. 24; also see examples from Affymetrix and Nimblegen). Such studies might include the application of SCALEs to studies involving mutated libraries, libraries constructed from mutagenized and selected clones, or various growth conditions (nutrient limitations, antimicrobials) and selection strategies (serial transfer versus cultures). A challenge for future applications of SCALEs concerns the difficulties of designing selections for specific phenotypes when fitness is often a complex function of multiple phenotypes. As demonstrated here, SCALEs is well suited for improving fundamental understanding of this issue, which should allow efforts to develop functional knowledge of the rapidly expanding list of fully sequenced microbial genomes.
METHODS
Bacteria, plasmids, media and library construction. We used wild-type Escherichia coli K12 (ATCC # 29425) for the preparation of genomic DNA. We grew cultures for library construction in Luria-Bertani (LB) medium at 37 °C. We constructed genomic libraries of insert sizes 500, 1,000, 2,000, 4,000 and >8,000 base pairs of E. coli strain K12 genomic DNA in the pSMART LC-Kan vector (low-copy) according to the manufacturer’s instructions (Lucigen). We obtained greater than 10⁶ or 10⁵ clones for libraries with insert sizes less than 4,000 base pairs or greater than 8,000 base pairs, respectively. Detailed protocols are available in Supplementary Methods online.
Continuous cultures. We introduced purified plasmid DNA from each library into MACH1-T1R (Invitrogen) by electroporation. We prepared electrocompetent MACH1-T1R cells by standard glycerol washes on ice to a final concentration of 10¹¹ cells/ml (ref. 25). We plated 1/1,000 volume of the original transformations on LB with kanamycin in triplicate to determine transformation efficiency and numbers of transformants. We combined the original cultures, diluted them to 100 ml with MOPS minimal medium and incubated them at 37 °C for 6 h or until the cultures reached an OD600 of 0.50. We then introduced this mixture into a 100-ml culture vessel for continuous culture studies. We recorded the OD600 of the culture every 6 h and adjusted the dilution rate according to the growth. We added MOPS minimal medium with kanamycin at a controlled volumetric flow rate by use of a peristaltic pump. Similarly, volume was maintained by an outlet pump set to a maximal flow rate at a given depth in the culture vessel. The culture was continuously agitated using a stir plate, maintained at 37 °C, and aerated using filtered house air.
Sampling. After 24 h of growth, every 12 h we inoculated 100 ml of LB with kanamycin with a 100 µl sample collected from the outlet stream. We plated 10 µl of the 100-ml culture on LB with kanamycin to obtain colonies for sequencing and subsequent growth studies. We incubated the remainder of the culture at 37 °C for 12 h, with shaking at 225 r.p.m. We amplified plasmids from these cultures by growing the cells in medium containing chloramphenicol at 37 °C for 30 min and collected the cells by centrifugation. We extracted plasmid DNA using a HiSpeed Plasmid Midi kit (Qiagen).
Microarray studies: hybridizations. For each array, we mixed 7.5 µg of sample plasmid DNA with the following control plasmid DNA, which was similarly purified: 1,000 ng pGIBS-DAP (ATCC#87486), 100 ng pGIBS-THR (ATCC#87484), 10 ng pGIBS-TRP (ATCC#87485) and 1 ng pGIBS-PHE (ATCC#87483). We digested the plasmid mixture at 37 °C overnight with 10 units each of AluI and RsaI (Invitrogen) in a reaction containing 50 mM Tris-HCl (pH 8.0) and 10 mM MgCl2. We heat-inactivated the enzymes in these reactions at 70 °C for 15 min. Then we added 10× One Phor All buffer (Amersham Pharmacia Biotech) to these reactions to a final 1× concentration, two units of RQDNAse I (Fisher) and 200 units of Exonuclease III (Fisher). We incubated these reactions at 37 °C for 30 min and then heat-inactivated the enzymes at 98 °C for 20 min. We labeled the resulting fragmented single-stranded DNA with biotinylated ddUTP using the Enzo BioArray Terminal Labeling kit (Enzo Life Sciences) following the manufacturer’s protocol.
Affymetrix E. coli Antisense GeneChip arrays were handled atthe University of Colorado DNA Microarray facility accordingto manufacturer’s specifications using a GeneChip hybridizationoven, GeneChip fluidics station, GeneArray scanner andGeneChip Operating Software v1.1 (Affymetrix).
Microarray studies: low-level probe analysis. We extracted probe-level signals from the Affymetrix .cel file. We calculated the background for each probe according to the algorithm used by the Microarray Suite 5.0 (MAS 5.0) software from Affymetrix. We partitioned background-corrected probes into groups of 25 probe pairs, each having a similar affinity. Each pair consists of a perfect match and mismatch probe. We included 25 pairs, or 50 probes, in each group. The predicted affinities used to make these groupings were taken from the literature (ref. 26). For each group, we estimated nonspecific signal by a robust regression of the perfect-match probe signal against the difference between the perfect match and mismatch probe signals. We performed the robust regression by fitting a repeat median line (ref. 27). The intercept of this line was used as an estimate of nonspecific signal for the probe group and subtracted from each perfect match probe signal. We set the minimum signal allowable for any probe to 1. We then corrected these perfect match signals for brightness by dividing them by their predicted affinities, resulting in final corrected values (ref. 26).
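The nonspecific-signal step can be sketched as follows. This is an illustrative implementation of a repeated median line in its commonly used form (median over points of the median pairwise slope, then the median residual as intercept), not the authors' code; the function names and the toy probe values are hypothetical, and the affinity correction that follows in the text is not shown.

```python
from statistics import median


def repeated_median_line(x, y):
    """Repeated median regression: returns (slope, intercept)."""
    n = len(x)
    per_point_slopes = []
    for i in range(n):
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for j in range(n) if j != i and x[j] != x[i]]
        if slopes:
            per_point_slopes.append(median(slopes))
    slope = median(per_point_slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


def subtract_nonspecific(pm, mm, floor=1.0):
    """Regress PM on (PM - MM); subtract the intercept from PM; floor at 1."""
    differences = [p - m for p, m in zip(pm, mm)]
    _, nonspecific = repeated_median_line(differences, pm)
    return [max(floor, p - nonspecific) for p in pm]


# Toy probe group (hypothetical values): nonspecific signal of 50 on every probe.
pm_signals = [50, 60, 70, 90, 130]
mm_signals = [50, 52, 54, 58, 66]
print(subtract_nonspecific(pm_signals, mm_signals))
```

The intercept is the perfect-match signal predicted where the perfect-match and mismatch signals coincide, which is why it serves as the nonspecific estimate; the repeated median keeps that estimate stable when a minority of probes in the group are outliers.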
Microarray studies: signal summarization and de-noising. The corrected probe signals can be mapped to their positions in the genome. We calculated the signal at any given position as the Tukey biweight (weighted average) of the closest 25 probe signals to that position (ref. 27). We then applied to the signal a median filter with a window length of 1,000 bp, which served to remove any signal spikes resulting from SCALEs smaller than 500 bp.
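A minimal sketch of the summarization and de-noising steps is given below. It assumes the one-step Tukey biweight in the form used by MAS 5.0 (tuning constant c = 5, small ε to avoid division by zero), which may differ in detail from the authors' implementation; the function names, the truncated windows at the ends of the signal and the conversion of the 1,000-bp window into a number of grid samples are likewise assumptions.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean centered on the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    weighted_sum = weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + eps)
        w = (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0  # distant values get weight 0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


def signal_at(position, probe_positions, probe_signals, k=25):
    """Summarize the k probes nearest to `position` with the Tukey biweight."""
    order = sorted(range(len(probe_positions)),
                   key=lambda i: abs(probe_positions[i] - position))
    return tukey_biweight([probe_signals[i] for i in order[:k]])


def median_filter(signal, window):
    """Running median over `window` samples (window truncated at the ends)."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


# On a 125-bp grid (assumed), a 1,000-bp window is about 9 samples.
print(median_filter([0, 0, 0, 0, 9, 0, 0, 0, 0, 0], 9))
```

A running median removes features narrower than roughly half its window while leaving wider plateaus intact, which matches the stated purpose of suppressing spikes from features smaller than 500 bp with a 1,000-bp window.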
Multi-scale analysis. We performed a nonlinear multiscale decomposition to decompose the signal into SCALEs corresponding to the signal contribution from each of the different, distinct-sized libraries. This was done using an N-sieve decomposition grouping the signal into sieves or windows. This analysis is capable of discriminating all possible SCALEs. For our purposes, we grouped SCALEs into sizes surrounding the insert size of our libraries (300–600 bp, 600–1,200 bp, 1,200–2,400 bp, 2,400–5,000 bp, 5,000–10,000 bp and 10,000–20,000 bp) (refs. 28,29).
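The idea of a sieve decomposition can be illustrated with a simplified stand-in. The sketch below substitutes a running median at increasing window widths for the N-sieve operator of refs. 28,29, so it is not the decomposition used in this work; the function names, window widths and toy signal are hypothetical. What it does share with a sieve is the structure: each stage removes features below a given width, the removed part is kept as that scale's contribution, and the contributions plus the final residual sum back to the input.

```python
from statistics import median


def running_median(signal, window):
    """Running median over `window` samples (window truncated at the ends)."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


def sieve_decompose(signal, windows):
    """Return (granules, residual); granules[k] is what the k-th window removed."""
    granules = []
    current = list(signal)
    for window in sorted(windows):
        smoothed = running_median(current, window)
        granules.append([c - s for c, s in zip(current, smoothed)])
        current = smoothed
    return granules, current


# Toy signal: a narrow peak (3 samples, height 2) on top of a broad one (11 samples, height 1).
toy = [0.0] * 60
for i in range(24, 35):
    toy[i] += 1.0
for i in range(28, 31):
    toy[i] += 2.0

toy_granules, toy_residual = sieve_decompose(toy, [7, 25])
print(toy_granules[0][27:32])  # narrow-scale contribution around the sharp peak
print(toy_granules[1][23:36])  # broad-scale contribution
```

In this toy case the narrow window isolates the sharp peak and the wide window isolates the broad plateau, mirroring how a sharp peak superimposed on a broad one is separated into contributions at different scales.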
Normalization. We normalized the signal values for the SCALEs obtained after the multiscale analysis using the information from the positive control probes. We analyzed these probes similarly to the genomic data using a single scale corresponding to the length of the associated positive control target gene. This analysis allowed for a sigmoidal relationship between signal intensity and molar concentration of the form given in equation 1, which can be fit to the control probe data and used to estimate molar concentration for the remaining summarized signals on a given array. $S$, processed signal from array analysis for a given element; $A_{\max}$, maximal processed signal; $C$, concentration of a given genetic element; $a$ and $b$, fitted parameters.
$$S = \frac{A_{\max}}{(a/C) + b} \qquad (1)$$
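Estimating concentration from a summarized signal requires inverting equation 1, which rearranges to C = a / (Amax/S − b). The sketch below shows only this algebra; the fitting of a and b to the control probes is not shown, and the function names and parameter values are hypothetical.

```python
def signal_from_concentration(conc, a_max, a, b):
    """Equation 1: S = A_max / ((a / C) + b)."""
    return a_max / ((a / conc) + b)


def concentration_from_signal(signal, a_max, a, b):
    """Inverse of equation 1: C = a / (A_max / S - b)."""
    denominator = a_max / signal - b
    if denominator <= 0:
        # Signals at or above A_max / b lie on the saturated plateau.
        raise ValueError("signal is at or above the saturation level A_max / b")
    return a / denominator


# Illustrative parameters only (not fitted values from this study).
params = dict(a_max=1000.0, a=1e-3, b=1.0)
s = signal_from_concentration(2.5e-4, **params)
print(s, concentration_from_signal(s, **params))
```

The inverse is defined only for signals below Amax/b, so signals near saturation cannot be converted reliably to concentrations.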
Fitness calculations and statistics. To calculate the fitness, for each array we divided the estimated concentration or normalized signal as a function of genomic position by the estimated concentration of the ROP (repressor of primer) gene, which is on the backbone of the vector used for library construction. This allowed us to estimate allele frequency, $f$, according to equation 2.
$$f_{i,n} = \frac{C_{i,n}}{C_{\mathrm{ROP},n}} \qquad (2)$$
where $C$ is concentration, $i$ is genomic position, and $n$ is microarray sample. With an estimate of allele frequency it was then possible to determine the change in frequency for different samples (that is, $W'_{i,36} = f_{i,36} / f_{i,24}$). We then subjected these values to the N-sieve multiscale analysis, as described above. Finally, we log-transformed decomposed position-specific values to produce $W = \log(W')$. For the analysis presented in Figure 4, we considered values of $W < 10^{-6}$ to be absent from the culture. We defined cumulative fitness as $W_{\mathrm{cum}} = W_1 + W_2 + W_3$. We determined significance using a t-test comparing clones with $W_{\mathrm{cum}} > 0$ in both cultures to clones with $W_{\mathrm{cum}} < 0$.
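The fitness arithmetic for a single genomic position can be sketched as follows. The sketch omits the multiscale decomposition that is applied between the ratio and the log transform, and it assumes a natural logarithm because the base is not stated in the text; the function names and example values are hypothetical.

```python
import math


def allele_frequency(conc, conc_rop):
    """Equation 2: f = C / C_ROP for one position on one array."""
    return conc / conc_rop


def interval_fitness(f_later, f_earlier):
    """W = log(W'), with W' = f_later / f_earlier (natural log assumed)."""
    return math.log(f_later / f_earlier)


def cumulative_fitness(frequencies):
    """W_cum = W_1 + W_2 + W_3 over consecutive samples (24, 36, 48, 60 h)."""
    return sum(interval_fitness(later, earlier)
               for earlier, later in zip(frequencies, frequencies[1:]))


# Hypothetical frequencies of one region at 24, 36, 48 and 60 h.
freqs = [0.01, 0.02, 0.04, 0.08]
print(cumulative_fitness(freqs))
```

Without the intervening decomposition, the sum of interval fitness values telescopes to log(f60/f24), so the cumulative value reflects the overall change in frequency across the selection.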
Accession codes. ArrayExpress: E-TABM-143.
Note: Supplementary information is available on the Nature Methods website.
ACKNOWLEDGMENTS
This work was supported by US National Institutes of Health grants R21 AI055773-01 and K25 AI064338 and National Science Foundation grant BES0228584. M.D.L. was supported by a National Institutes of Health F31 award A1056687. T.W. was supported by a US Department of Education Graduate Assistantship in Areas of National Need fellowship. We thank H. Marshall at the University of Colorado Microarray Facility, and P.D. Bevins for his help with this work.
COMPETING INTERESTS STATEMENT
The authors declare competing financial interests (see the Nature Methods website for details).
Published online at http://www.nature.com/naturemethods/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/
1. DeRisi, J.L., Iyer, V.R. & Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686 (1997).
2. Fodor, S.P. et al. Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773 (1991).
3. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).
4. Badarinarayana, V. et al. Selection analyses of insertional mutants using subgenic resolution arrays. Nat. Biotechnol. 19, 1060–1064 (2001).
5. Cho, R.J. et al. Parallel analysis of genetic selections using whole genome oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 95, 3752–3757 (1998).
6. Gill, R.T., Wildt, S., Yang, Y.T., Ziesman, S. & Stephanopoulos, G. Genome-wide screening for trait conferring genes using DNA microarrays. Proc. Natl. Acad. Sci. USA 99, 7033–7038 (2002).
7. Winzeler, E.A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901–906 (1999).
8. Shoemaker, D.D., Lashkari, D., Morris, D., Mittmann, M. & Davis, R. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nat. Genet. 14, 450–456 (1996).
9. Karlyshev, A.V. et al. Application of high-density array-based signature-tagged mutagenesis to discover novel Yersinia virulence-associated genes. Infect. Immun. 69, 7810–7819 (2001).
10. Giaever, G. et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418, 387–391 (2002).
11. Elena, S.F. & Lenski, R. Evolution experiments with microorganisms: the dynamics and genetic bases of adaptation. Nat. Rev. Genet. 4, 457–469 (2003).
12. Garcia, B. et al. Role of the GGDEF protein family in Salmonella cellulose biosynthesis and biofilm formation. Mol. Microbiol. 54, 264–277 (2004).
13. Kirillina, O., Fetherston, J.D., Bobrov, A.G., Abney, J. & Perry, R.D. HmsP, a putative phosphodiesterase, and HmsT, a putative diguanylate cyclase, control Hms-dependent biofilm formation in Yersinia pestis. Mol. Microbiol. 54, 75–88 (2004).
14. Simm, R., Fetherston, J., Kader, A., Romling, U. & Perry, R. Phenotypic convergence mediated by GGDEF-domain-containing proteins. J. Bacteriol. 187, 6816–6823 (2005).
15. Simm, R., Morr, M., Kader, A., Nimtz, M. & Romling, U. GGDEF and EAL domains inversely regulate cyclic di-GMP levels and transition from sessility to motility. Mol. Microbiol. 53, 1123–1134 (2004).
16. Brown, P.K. et al. MlrA, a novel regulator of curli (AgF) and extracellular matrix synthesis by Escherichia coli and Salmonella enterica serovar Typhimurium. Mol. Microbiol. 41, 349–363 (2001).
17. Brombacher, E., Dorel, C., Zehnder, A. & Landini, P. The curli biosynthesis regulator CsgD co-ordinates the expression of both positive and negative determinants for biofilm formation in Escherichia coli. Microbiology 149, 2847–2857 (2003).
18. Hickman, J.W., Tifrea, D. & Harwood, C. A chemosensory system that regulates biofilm formation through modulation of cyclic diguanylate levels. Proc. Natl. Acad. Sci. USA 102, 14422–14427 (2005).
19. Jenal, U. Cyclic di-guanosine-monophosphate comes of age: a novel secondary messenger involved in modulating cell surface structures in bacteria? Curr. Opin. Microbiol. 7, 185–191 (2004).
20. Covert, M.W., Knight, E.M., Reed, J.L., Herrgard, M.J. & Palsson, B.O. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 92–96 (2004).
21. Edwards, J.S. & Palsson, B.O. The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97, 5528–5533 (2000).
22. Ibarra, R.U., Edwards, J.S. & Palsson, B.O. Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth. Nature 420, 186–189 (2002).
23. Godiska, R., Patterson, M., Schoenfeld, T. & Mead, D. Beyond pUC: vectors for cloning unstable DNA. In DNA Sequencing: Optimizing the Process and Analysis (ed. Kieleczawa, J.) 55–75 (Jones and Bartlett, Boston, 2004).
24. Lynch, M.D. & Gill, R.T. Broad host range vectors for stable genomic library construction. Biotechnol. Bioeng. 94, 151–158 (2006).
25. Sambrook, J., Fritsch, E.F. & Maniatis, T. Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 1989).
26. Naef, F. & Magnasco, M.O. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys. Rev. E 68, 011906 (2003).
27. Hoaglin, D.C., Mosteller, F. & Tukey, J.W. Understanding robust and exploratory data analysis (John Wiley & Sons Inc., New York, 1983).
28. Bangham, J., Chardaire, P., Pye, J. & Ling, P. Multiscale nonlinear decomposition: the sieve decomposition theorem. IEEE Trans. Pattern Anal. Mach. Intell. 18, 529–539 (1996).
29. Bangham, J., Ling, P. & Harvey, R. Scale-space from nonlinear filters. IEEE Trans. Pattern Anal. Mach. Intell. 18, 520–529 (1996).