Adaptive gene expression divergence inferred from molecular population genomics

Alisha K. Holloway1*, Mara K. N. Lawniczak2, Jason G. Mezey3, David J. Begun1, & Corbin D. Jones4

1 Section of Evolution and Ecology & Center for Population Biology, University of California, Davis, CA 95616
2 Department of Biology, University College London, London, UK WC1E6BT
3 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853
4 Department of Biology & Carolina Center for Genome Science, University of North Carolina, Chapel Hill, NC 27599

*Corresponding author: Alisha K. Holloway, Section of Evolution and Ecology, Center for Population Biology, University of California-Davis, One Shields Avenue, Davis, CA 95616. TEL: (530) 754-9551. FAX: (530) 752-1449. Email: [email protected]

Alisha Holloway designed experiments, analyzed data, and wrote the paper. Mara Lawniczak analyzed data. Jason Mezey collected and analyzed data. Corbin Jones and David Begun helped design experiments and edited the paper.

Detailed studies of individual genes have shown that gene expression divergence often

results from adaptive evolution of regulatory sequence. Genome-wide analyses, however,

have yet to unite patterns of gene expression with polymorphism and divergence to infer

population genetic mechanisms underlying expression evolution. Here, we combined

genomic expression data—analyzed in a phylogenetic context—with whole genome

light-shotgun sequence data from six Drosophila simulans lines and reference sequences

from D. melanogaster and D. yakuba. These data allowed us to use molecular population

genetics to test for neutral vs. adaptive gene expression divergence on a genomic scale.

We identified recent and recurrent adaptive evolution along the D. simulans lineage by

contrasting sequence polymorphism within D. simulans to divergence from D.

melanogaster and D. yakuba. Genes that evolved higher levels of expression in D.

simulans have experienced adaptive evolution of the associated 3' flanking and amino

acid sequence. Concomitantly, those genes are also decelerating in their rates of protein

evolution, which is in agreement with the finding that highly expressed genes evolve

slowly. Interestingly, adaptive evolution in 5’ cis-regulatory regions did not correspond

strongly with expression evolution. Our results provide a genomic view of the intimate

link between selection acting on a phenotype and associated genic evolution.

Author Summary

Changes in patterns of gene expression likely contribute greatly to phenotypic

differences between closely related organisms. However, the evolutionary mechanisms

underlying differences in patterns of expression, such as Darwinian selection and random

genetic drift, are only now being understood on a genomic level. We combine

measurements of gene expression and whole genome sequence data to investigate the

relationship between the forces driving sequence evolution and expression divergence

between closely related fruit flies. We find that Darwinian selection acting on regions

that may control gene expression is associated with increases in expression levels.

Investigation of the functional consequences of adaptive evolution on regulating gene

expression is clearly warranted. The genetic tools available in Drosophila make

functional experiments possible and will shed light on how closely related species have

responded to reproductive, pathogenic, and environmental pressures.

Introduction

Changes in gene expression are governed primarily by the evolution of cis-acting

elements and trans-acting factors. Several single-gene studies have combined data on

expression, protein abundance, function, and sequence evolution to make powerful

statements about the role of adaptive evolution in effecting phenotypic change [1,2].

These case studies of single genes focused on well-described pathways that were known,

a priori, to have remarkable expression differences. As such, they may provide a biased

view of the population genetic mechanisms controlling gene expression evolution. Thus,

the question remains as to which forces, neutral or adaptive, predominate on a genomic

level to bring about changes in gene expression.

Recent studies have tried to discern the causes of genome-wide expression

evolution solely from patterns of gene expression variation within and between species

[3-5]. Patterns of constant expression levels across several species combined with

significantly elevated or reduced expression in a single species have been taken as

evidence of lineage-specific adaptive evolution [3,4]. Alternatively, low levels of

within-population variation in expression compared to divergence in expression between species

have also been taken as evidence of adaptive evolution [5-7]. As these studies are based

strictly on phenotypic data—expression variation—they are indirect indicators of the

underlying genetic and population genetic phenomena. For example, elevated

lineage-specific expression divergence can be explained equally well by directional selection or

by reduced constraint. These studies highlight the importance of direct tests of the

mechanisms of evolution. For example, Good et al. [8] used statistical inferences of

adaptive protein evolution along with expression evolution to investigate the connection

between the two. Their highly conservative test suggested that no significant connection

existed. In an attempt to unite population genetic inference with expression data,

Khaitovich et al. [9] found a positive correlation between linkage disequilibrium and

expression divergence in genes expressed in the human brain. This result is consistent

with recent adaptive evolution of cis-acting regulatory elements associated with

brain-expressed genes, but could also be due to selection on protein function.

A global understanding of the population genetic processes acting on expression

phenotypes requires both genomic expression data and genomic sequence variation and

divergence data. Combining these data allows for the use of molecular population genetic

tests to identify the underlying evolutionary mechanism. To this end, we combined

expression data from three closely related species, D. simulans, D. melanogaster, and D.

yakuba [6,10], with population genomic sequence data from D. simulans [11], and

genome sequence data from D. melanogaster [12] and D. yakuba [11]. These data allow

us to polarize both expression and sequence evolution to particular lineages.

Additionally, we used the sequence data to mask expression probes (which were

developed using the D. melanogaster reference) with sequence mismatches in D.

simulans and D. yakuba. This approach has the critical advantage that it does not

confound expression divergence with sequence evolution across lineages.

DNA polymorphism and divergence data allow one to directly test for both recent

and recurrent directional selection on genes and non-coding regions associated with rapid

changes in expression. If expression evolution were due to recent directional selection on

cis-acting elements, we predict a reduction in the DNA heterozygosity to divergence ratio

in flanking regions of genes showing expression evolution relative to genomic averages

[13]. Alternatively, if recurrent directional selection has acted on cis-regulatory

sequences controlling expression levels, one might observe excess fixations at regulatory

sites relative to nearby “neutrally” evolving sites [14]. Finally, if gene expression

diverges primarily due to trans-acting factors or neutral processes at cis-acting sites, one

would expect no evidence of directional selection on non-coding sequences near genes

showing expression divergence.

Here, we use population genomic and gene expression data from Drosophila to

address the following questions: Is expression evolution associated with adaptive

evolution of cis regions? Are genes with modified expression patterns also evolving

modified protein function under directional selection? Are genes that change expression

over short time scales clustered into distinct functional groups?

Results and Discussion

Expression Analysis

We reanalyzed previously collected expression data from adult male D.

melanogaster, D. simulans, and D. yakuba from the Drosophila v1 Affymetrix GeneChip

Array [6,10]. Sequence divergence of probe targets in D. simulans and D. yakuba could

confound expression analysis [15], so mismatched probes were masked before analysis.

After masking procedures, 4427 probe sets remained, with an average of 3.81 (SE±1.01)

probes per set. We defined genes that are increasing and decreasing in expression in D.

simulans as those in the 5% tails of expression divergence from the D. melanogaster–D.

simulans ancestor (see Methods).

Adaptive 3’ cis-regulatory Evolution Associated with Expression Divergence

Cis-regulatory element evolution directly affects transcription and mRNA half-life

(see [16,17]). Cis-acting elements, such as core promoters, that regulate transcription

are predominantly located in 5’ regions and those that control mRNA stability and

degradation are primarily located in 3’ regions [16,17], although there is considerable

variation among genes. We tested for evidence of an association between recent and

recurrent directional selection in 5’ and 3’ flanking regions (which include UTRs and

putative regulatory regions) and significant changes in expression levels.

Reductions in polymorphism relative to divergence indicate the action of recent

directional selection [13]. Flanking regions with polymorphism to divergence ratios in the

lowest 5% tail of the distribution were taken as having evidence of recent selective

sweeps. Figure 1 depicts mean levels of polymorphism and divergence in 5’ and 3’

non-coding sequence. Flanking regions and UTRs have lower levels of polymorphism and

divergence than silent sites, which is in agreement with previous findings that non-coding

regions are under greater constraint than silent sites [11,25,26]. Genes with increased

expression levels show more variability in levels of polymorphism and divergence over

different features, but no strong pattern emerges. There is no evidence of hitchhiking

effects in either 5’ or 3’ UTR or flanking regions in association with changes in

expression (Figure 2, Table S1).

Using an extension of the McDonald-Kreitman test [14] for noncoding sites, we

compared flanking polymorphic and fixed sites to synonymous sites of the corresponding

gene to infer the action of recurrent directional selection. Genes with significant

expression evolution show more evidence of recurrent directional selection in 3’ UTRs

and 3’ flanking regions than expected by chance (Figure 2, Table S1). Genes with

increases in expression drive this relationship. Although genes with reduced expression

have more 3’ UTR and flanking region divergence than genes with no change in

expression, the tests provide no strong evidence of recurrent adaptation associated with

reduced gene expression (Figure 2, Table S1). The 5’ regulatory regions of genes with

increased expression show the same trend, but again the result is not statistically

significant (Figure 2, Table S1). Thus, recurrent adaptive evolution of 3’ cis-regulatory

regions likely plays a critical role in adaptive expression increases.

The 3’ regulatory regions are bound by elements, such as microRNAs, that can

stabilize or destabilize mRNA (see [18]). Given the linkage between adaptive evolution

of 3’ regulatory regions and expression evolution, we hypothesized that microRNAs may

be co-evolving with their target genes. We retrieved information on known microRNAs

and their targets in D. melanogaster from miRBase [19,20]. We found that microRNAs

that regulate a greater number of genes with changes in expression have faster, but not

significantly faster, rates of evolution (Spearman’s ρ = 0.2065, p = 0.1073). Rapid

evolution of microRNAs and adaptive expression divergence associated with 3’ regions

strongly motivate in-depth investigation of the 3’ flanking regions to uncover the

functional mechanisms for transcriptional regulation of genes with significant expression

evolution.

Increases in gene expression were more often associated with adaptive evolution

than decreases in expression (Figure 2). This observation does not appear to be due to a

bias in analysis of the data because expression changes are normally distributed and there

is no correlation between estimated ancestral divergence and change in expression (see

Methods). However, continually increasing expression levels cannot persist over long

evolutionary time scales. In fact, expression levels are typically under strong stabilizing

selection [5, and see Methods]. A speculative hypothesis for this observation relies on

relaxation of codon bias. Begun et al. [11] documented an accumulation of fixations for

unpreferred codons in D. simulans. If these unpreferred codons are slightly deleterious

and reduce translational efficiency, regulatory regions may be under directional selection

to compensate for this phenomenon by making more transcript available for translation.

Rapid Protein Evolution Accompanies Rapid Gene Expression Divergence

As seen in previous research [6,8], genes with greater absolute levels of

expression divergence evolve faster at the protein level (mean dN ± SE 0.0046±0.0003

and 0.0034±0.0001, for genes changing in expression and not changing, respectively;

Wilcoxon: p < 0.0001; Table S2). Genes with rapid expression evolution are also

represented by fewer expression probes per set (mean number of probes ± SE:

2.98±0.076 versus 3.90±0.033; Wilcoxon: p < 0.0001). A rapid rate of sequence

evolution would lead to more probe mismatch, which explains the observed pattern. This

also renders our expression divergence analysis conservative, as our power to detect a

significant expression difference is reduced for the most rapidly evolving genes.

Interestingly, even though genes with significant increases in expression in D. simulans

have higher average dN, they show decelerating dN in D. simulans relative to D.

melanogaster and D. yakuba (resampling test: p=0.023; method for relative rates

described in Begun et al. [11]). The same is not true of genes with decreasing expression

(p=0.861). While higher average rates of amino acid evolution in genes with expression

divergence could have been indicative of relaxed purifying selection, the deceleration in

dN argues against that hypothesis. Previous work showed that high levels of

expression correlate with lower rates of protein evolution [21-23], which may reflect

selection for translational robustness [23] or translational accuracy [22]. The deceleration

in protein evolution of genes with increases in expression is consistent with the idea of

stronger translational selection on highly expressed genes, but overall, we see only a

weak relationship between expression level and protein divergence (Spearman’s

ρ = -0.1821, p<0.0001).

Coding Sequence Evolution Associated with Expression Divergence

Genes adaptively evolving modified expression patterns may also be adaptively

evolving modified protein function. We estimated the proportion of genes in each

expression class—increasing, decreasing, and no change—with evidence for recurrent

directional selection using the McDonald-Kreitman test [14]. For all genes in this

analysis, the proportion undergoing recurrent adaptive evolution was similar to the

genome-wide estimate [11]. The prevalence of recurrent adaptive evolution was not

significantly different for genes showing expression evolution versus those showing no

expression evolution (p=0.4438; Figure 2, Table S1).

We also tested for evidence of recent directional selection as measured by a

reduction in the ratio of silent polymorphism to silent divergence [13]. Coding regions

with ratios in the lowest 5% tail of the distribution were taken to have evidence for recent

selective sweeps. A higher proportion of genes showing expression evolution have

significantly reduced ratios of silent site polymorphism to divergence, which is consistent

with recent selective sweeps (p=0.0445; Figure 2, Table S1). Genes with increased

expression levels explain more of this relationship than genes with decreased expression

(increase p=0.0328, decrease p=0.2530), although both sets have greater reductions of

silent polymorphism to divergence ratios than genes that are not changing in expression.

The targets of these putative hitchhiking events may have been nearby regulatory

regions in an intron or upstream or downstream of the protein coding region.

Alternatively, one possible explanation for the association between up-regulation and

recent selection on coding regions is codon bias. Gene expression is positively correlated

with codon bias [22]. Given this association, hitchhiking effects of preferred codons

might increase with increasing levels of expression due to stronger selection for

translational accuracy [22]. While there is a higher ratio of preferred to unpreferred

polymorphisms and fixations in genes evolving increases in expression versus those that

show no expression evolution, the difference is not statistically significant (Fisher’s Exact

Test: p >> 0.05 for both tests; Table 1). There may be a time lag between expression

evolution and the fine-tuning of translation via codon bias. Thus, our data might mean

that genes with the most extreme expression differences have recently increased

expression. Alternatively, the hitchhiking events may result from adaptive evolution

acting on one or a few amino acids or on nearby regulatory regions.

Gene Ontology Analysis

We used gene ontology information from Flybase and from the generic Gene

Ontology Slim set of terms to determine whether certain functional classes of genes were

more likely to evolve expression differences. Six ontology terms are significantly

enriched for genes both with significant increases and decreases in expression (Table S3).

Two of those terms, chymotrypsin and trypsin activity, have completely overlapping

genes and are part of a larger category, serine-type endopeptidase activity. These genes

have many functions, including reproduction, digestion, and immunity [24]. Three other

categories, courtship behavior, negative regulation of transcription, and sex determination,

appear to be unrelated on the surface, but closer inspection of the genes in these

categories reveals that all are involved in regulation of transcription or chromatin

remodeling. These functions frequently evinced adaptive protein evolution in the

genome-wide analysis of adaptive evolution in D. simulans [11]. This suggests that there

may be a connection between adaptive protein evolution and expression divergence for

some biological functions.

Because adaptive evolution of 3’ cis-regulatory regions may be driving expression

divergence, at least for genes with increased expression, we examined the classes of

genes associated with genes that have both evidence for adaptive 3’ evolution and

significant expression divergence (Tables S4 and S5). We also investigated ontology

terms associated with genes showing evidence of hitchhiking events and significant

expression divergence (Table S6). Generally, genes with adaptive 3’ or protein evolution

are found in the cytoplasm or are integral to the membrane. Their molecular functions are

predominately protein binding, nucleic acid binding, and translation related. The most

common biological processes are related to response to stimuli, RNA regulation (binding,

splicing, degradation), and metabolism.

Conclusions

In this study, we link adaptive sequence evolution to phenotypic change on a

genome-wide scale. Several recent studies have illustrated the importance of adaptive

evolution acting on non-coding DNA [11,25,26] and our data reinforce this point. More

critically, we show that adaptive evolution of cis-acting elements in 3’ regions is clearly

associated with and may be driving lineage-specific increases in expression that lead to

phenotypic differences between species. Recent work suggests that genes with certain 5’

promoter elements show an increased interspecies variability in expression in yeast as

well as Drosophila [27]. In contrast, our data imply that 3' regulatory regions are

playing a more critical role in adaptive expression divergence. Functional genomic

investigation of these 3’ cis-regulatory regions is clearly warranted. The question now

becomes, how and why do genes involved in important processes such as chromatin

remodeling change their expression patterns through 3’ cis-acting regulatory adaptive

evolution?

Materials and Methods

RNA expression data. We reanalyzed expression data from 3 day old virgin

adult males of one isogenic line of D. melanogaster, 10 isogenic lines of D. simulans, and

one isogenic line of D. yakuba [6,10]. Three replicate chips for each line were used. All

data were collected at the same location under standard conditions using the Affymetrix

GeneChip Arrays (Drosophila 1.0), which contain 13,966 features representing the

genome of D. melanogaster. Because the D. melanogaster gene annotation has been

updated since the array was developed, we compared probe sequences to the D.

melanogaster genome to determine which genes were targeted with each probe set.

Masking approach. The probes representing features on the Affymetrix

GeneChip Arrays are constructed for D. melanogaster and are not expected to perfectly

match other species. Prior research suggests that such imperfect matches cause incorrect

measures of expression due to poor hybridization [10,15,28]. To account for the

confounding effect of probe sequence divergence between species on gene expression

measures, only probes that were identical matches to the genome sequences of D.

melanogaster, D. simulans, and D. yakuba were included in analyses. Probes showing

any divergence between the probe sequence on the array and the genome sequence of the

three species were masked. Probe sets with fewer than two probes remaining after

masking (out of the original 14) were removed before downstream analyses. Finally,

probe sets that bound to overlapping genes or homologous sequence of multiple genes

were also removed, as the signal could not be attributed to a single gene.
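
The masking rules above can be sketched in a few lines of Python. This is an illustration only, not the pipeline used in the study: the function name `mask_probe_sets`, the dictionary layout, and the use of plain substring search against toy genome strings are all assumptions made for the example (a real analysis would align probes to the assemblies).

```python
def mask_probe_sets(probe_sets, genomes, min_probes=2):
    """Keep only probes that occur verbatim in every genome.

    probe_sets: dict mapping probe-set ID -> list of probe sequences.
    genomes: dict mapping species name -> genome sequence (toy strings here).
    Probe sets left with fewer than `min_probes` probes are dropped.
    """
    kept = {}
    for set_id, probes in probe_sets.items():
        # A probe survives only if it is an identical match in all species.
        perfect = [p for p in probes
                   if all(p in seq for seq in genomes.values())]
        if len(perfect) >= min_probes:
            kept[set_id] = perfect
    return kept


# Hypothetical toy sequences, not real probes or genomes.
genomes = {
    "mel": "ACGTACGTTTGACCA",
    "sim": "ACGTACGTTTGACGA",  # differs from mel at one site
    "yak": "ACGTACGTTTGACCA",
}
probe_sets = {
    "setA": ["ACGTAC", "CGTTTG", "GACCA"],  # third probe mismatches sim
    "setB": ["TTGACC", "ACGTAC"],           # first probe mismatches sim
}
masked = mask_probe_sets(probe_sets, genomes)
```

In this toy case `setA` keeps two probes and is retained, while `setB` is left with a single probe and is removed, mirroring the two-probe minimum described above.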

Expression analysis. After probe masking procedures, all chips were normalized

and expression intensities were calculated using gcrma from the affy package available in

Bioconductor [29,30]. The mean of the log2 expression intensity for each probe set was

then calculated for each species. Probe sets for which the log2 mean intensity of at least

one species was not greater than three were considered absent. Of the original 195,944

probes from 13,996 probe sets, 16,850 probes representing 4,427 probe sets remained

after masking and removing probe sets with no detectable expression in either D.

melanogaster or D. simulans (all expression data are in Table S7). The distribution of

expression intensities was highly similar between species (Figure S1) and probe set

intensities were highly correlated between species (Spearman’s ρ = 0.92 between D.

simulans and D. melanogaster and ρ = 0.89 between D. simulans and D. yakuba).

However, probe sets with fewer probes have higher coefficients of variation in D.

simulans and in D. melanogaster (Kruskal-Wallis tests: p<0.0001 for all four tests). We

tested whether probe sets with fewer probes gave reliable estimates of mean expression

intensity. We randomly sampled four probes from probe sets that had all 14 probes

remaining after masking. The mean expression intensity of the sample was highly

correlated with the mean intensity estimated from all 14 probes (Spearman’s ρ = 0.869).

The mean expression level varied by +/- 7%, and the variance in expression among

replicates increased by 22%.

Ancestral expression states were reconstructed with AncML v1.0 [31] from the

average of normalized log2 expression values for each species. Expression divergence

was calculated as follows:

$$\Delta E_{sim} = \left( \frac{E_{sim} - E_{Anc_{mel\text{-}sim}}}{E_{Anc_{mel\text{-}sim}}} \right)$$

where $E_{sim}$ is the expression level of D. simulans and $E_{Anc_{mel\text{-}sim}}$ is the estimated

expression level of the D. simulans/melanogaster ancestor. Figure S2 depicts the

distribution of expression change along the D. simulans lineage. The distribution does not

differ significantly from a normal distribution. Additionally, there is no correlation

between change in expression along the D. simulans branch and the expression level of

the inferred ancestor (Figure S3). The conical nature of Figure S3 reflects the negative

correlation between expression level and expression divergence over short evolutionary

time scales. We defined genes that are increasing and decreasing in expression in D.

simulans as those in the 5% tails of expression divergence from the D. melanogaster-D.

simulans ancestor. We calculated confidence intervals (CI) around the expression values

for D. simulans and determined whether the D. melanogaster expression estimate fell

within the D. simulans CI. Intraspecific expression divergence values in the tails are not

normally distributed, so we calculated CIs in R using bias correction and acceleration

[32]. One probe set (of 221) with increasing expression and four probe sets (of 221) with

decreasing expression along the D. simulans lineage had mean intensities in D.

melanogaster within the 95% confidence intervals of D. simulans.
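
As a rough illustration of the expression divergence measure and the 5% tail classification described in this section, the following Python sketch computes the relative change from the inferred ancestor and labels probe sets in each tail. The function names, the integer rounding of the tail size, and the toy intensities are assumptions for the example; the ancestral values would in practice come from the AncML reconstruction.

```python
def expression_divergence(e_sim, e_anc):
    """Relative expression change along the D. simulans lineage."""
    return (e_sim - e_anc) / e_anc


def classify_tails(divergence, tail_percent=5):
    """Label each probe set by whether it falls in a tail of the distribution.

    divergence: dict mapping probe-set ID -> divergence value.
    Each tail holds floor(tail_percent % of the number of probe sets) entries.
    """
    ranked = sorted(divergence, key=divergence.get)
    k = len(ranked) * tail_percent // 100
    decreasing = set(ranked[:k])
    increasing = set(ranked[len(ranked) - k:])
    labels = {}
    for probe_set in ranked:
        if probe_set in increasing:
            labels[probe_set] = "increase"
        elif probe_set in decreasing:
            labels[probe_set] = "decrease"
        else:
            labels[probe_set] = "no change"
    return labels


# Hypothetical log2 intensities: ancestor fixed at 8.0, D. simulans varies.
divergence = {f"ps{i:02d}": expression_divergence(6.0 + 0.25 * i, 8.0)
              for i in range(20)}
labels = classify_tails(divergence)
```

With 4,427 probe sets, truncating 5% to an integer gives 221 per tail, which agrees with the tail sizes quoted in this section, although the exact rounding rule used in the study is not stated.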

Analysis of syntenic assembly. Drosophila simulans and D. yakuba syntenic

assemblies are described in Begun et al. [11] and information on the D. yakuba genome

project can be found at http://genome.wustl.edu. From light-shotgun sequencing of six

lines of D. simulans, a total of 109 Mbp of euchromatic sequence were covered by at

least one of the six lines. Each line had 43-90% coverage of that 109 Mbp with an

average of 3.6 alleles per site. However, coverage of genic regions was somewhat higher

at 3.9 alleles per site.

Genes and Affymetrix probes were localized using the Flybase v.4.2 annotation

(http://flybase.org/annot). Genes included were from two categories. The first set

maintained the gene model of D. melanogaster meaning that, in D. simulans, they have

canonical translation initiation codons (or codons that matched the D. melanogaster

non-canonical codon), canonical splice junctions at the same position as D. melanogaster (or

non-canonical splice junctions that were identical to the D. melanogaster nucleotides at

splice sites), no premature termination, and a canonical termination codon. The second

set was less conservative in that the gene could have a different gene model with respect

to only one of the aforementioned criteria (i.e. either a non-canonical translation initiation

codon at the D. melanogaster initiation site, or non-canonical splice junctions, or lack a

termination codon at the D. melanogaster termination). Additionally, genes with

premature terminations in the last exon were included. There were very few genes with

imperfect models in any of the expression groups (10/212 with increased expression,

14/210 with decreased expression, and 173/3814 with no change in expression). Only

gold collection UTRs (i.e. those with completely sequenced cDNAs) were used in

analyses (http://www.fruitfly.org/EST/gold_collection.shtml). Flanking regions

consisted of sequence 1000 bases upstream and downstream of any annotated UTR

sequence for each gene (or initiation/termination codons for genes without annotated

UTRs). Flanking sequence was truncated if the coding sequence of a neighboring gene

was within the 1000 bases. We also investigated 300 bases upstream of the 5’ UTR (see

Table S1), which would target core promoter regions, and recovered the same results as

with 1000 bases upstream.
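
The flanking-region definition can be sketched as simple interval arithmetic. The sketch below is hypothetical: it assumes half-open coordinates on a single strand, treats "upstream" as the lower-coordinate side, and assumes neighboring coding sequences do not overlap the feature itself. The function name `flanking_intervals` is invented for the example.

```python
def flanking_intervals(feature_start, feature_end, neighbor_cds, flank=1000):
    """Return (upstream, downstream) half-open intervals around a feature.

    feature_start, feature_end: bounds of the UTR-inclusive gene region.
    neighbor_cds: list of (start, end) half-open CDS intervals of other genes.
    Each flank is truncated where a neighboring CDS intrudes into it.
    """
    up_start = max(feature_start - flank, 0)
    down_end = feature_end + flank
    for cds_start, cds_end in neighbor_cds:
        # A neighboring CDS ending inside the upstream window truncates it.
        if up_start < cds_end <= feature_start:
            up_start = cds_end
        # A neighboring CDS starting inside the downstream window truncates it.
        if feature_end <= cds_start < down_end:
            down_end = cds_start
    return (up_start, feature_start), (feature_end, down_end)


# Toy coordinates: one neighbor 400 bp upstream, another 300 bp downstream.
truncated = flanking_intervals(5000, 7000, [(4200, 4600), (7300, 9000)])
untouched = flanking_intervals(5000, 7000, [])
```

Passing `flank=300` would give the shorter upstream window mentioned above for core promoter regions.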

Statistical tests and parameter estimation. Some statistical tests were

performed using JMP IN v5.1 (SAS Institute, Inc.). PERL scripts for calculations of

estimated nucleotide diversity (π), McDonald-Kreitman tests, and resampling tests were

written by and can be obtained from AKH. Nucleotide diversity was estimated as in

Begun et al. [11] for each genomic feature (exon, intron, UTRs, flanking) that had a

minimum number of nucleotides represented [i.e. n (n-1) * s >= 100, where n=average

number of alleles sampled and s=number of sites]. The measure of nucleotide diversity,

π, is the coverage-weighted average expected heterozygosity of nucleotide variants and is

therefore an unbiased estimate of polymorphism. For coding regions, the numbers of

silent and replacement sites were counted using the method of Nei and Gojobori [33].

The pathway between two codons was calculated as the average number of silent and

replacement changes from all possible paths between the pair. Estimates of π on the X

chromosome were corrected for sample size [π w = π * (4/3)] under the assumption that

males and females have equal population sizes. Lineage-specific divergence was

estimated by maximum likelihood using PAML v3.14 [34] and was reported as a

weighted average over each D. simulans line with greater than 50 aligned sites in the

segment being analyzed. PAML was run in batch mode using a BioPerl wrapper [35].

For noncoding regions, we used baseml with HKY as the model of evolution to account

for transition/transversion bias and unequal base frequencies [36] and for coding regions

we used codeml with codon frequencies estimated from the data. For all genes, 0.001 was

added to heterozygosity and divergence values so that we could calculate ratios for genes

with entries of zero. We did not analyze genes with zero values for both heterozygosity

and divergence. Even after correction for smaller effective population sizes,

heterozygosity at silent sites is significantly lower on the X chromosome than on

autosomes (Kruskal-Wallis test: p<0.0001, Tukey’s HSD shows X is different from all

autosomes), so we defined significantly low heterozygosity/divergence ratios separately

for the X and autosomes. For each feature, genes in the lowest 5% tail of silent site

heterozygosity/divergence ratios were defined as being significantly low and therefore

showing evidence of a recent selective sweep. Those ratios defined as having evidence

of recent selective sweeps were at least 10-fold lower than the mean ratio for all features.

Drosophila simulans-specific accelerations/decelerations in protein evolution were

calculated as described in Begun et al. [11].
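
The ratio calculation and tail cutoff described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used for the analysis: the function names, the tuple layout of the input, applying the 4/3 correction before adding 0.001, and rounding the tail size down are all assumptions made for the example.

```python
PSEUDOCOUNT = 0.001        # added to heterozygosity and divergence, as in the text
X_CORRECTION = 4.0 / 3.0   # scales X-linked pi, as in the text
TAIL = 0.05                # lowest 5% tail

def hd_ratio(pi, divergence, on_x):
    """Return the heterozygosity/divergence ratio, or None if both are zero."""
    if pi == 0 and divergence == 0:
        return None  # genes with zero values for both are not analyzed
    if on_x:
        pi *= X_CORRECTION  # assumption: correct before adding the pseudocount
    return (pi + PSEUDOCOUNT) / (divergence + PSEUDOCOUNT)

def low_tail(ratios, tail=TAIL):
    """Return the names of genes in the lowest `tail` fraction of ratios."""
    ranked = sorted(ratios, key=ratios.get)
    n_low = int(len(ranked) * tail)  # assumption: round the tail size down
    return set(ranked[:n_low])

def sweep_candidates(genes):
    """genes: iterable of (name, pi, divergence, on_x) tuples (hypothetical layout)."""
    by_class = {True: {}, False: {}}  # X-linked and autosomal genes kept apart
    for name, pi, divergence, on_x in genes:
        ratio = hd_ratio(pi, divergence, on_x)
        if ratio is not None:
            by_class[on_x][name] = ratio
    # The cutoff is defined separately for the X and the autosomes.
    return low_tail(by_class[True]) | low_tail(by_class[False])
```

Keeping X-linked and autosomal genes in separate dictionaries mirrors the decision to define the cutoff separately for the two chromosome classes.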

Polarized MK tests minimized the numbers of nonsynonymous substitutions and

required that D. melanogaster and D. yakuba share the same codon to ensure that

fixations and polymorphisms were attributable to evolution along the D. simulans

lineage. We used a derivative of the McDonald-Kreitman test [14] to evaluate evidence

for recurrent directional selection in noncoding regions. Polymorphic and fixed sites of

noncoding DNA were compared to polymorphic and fixed silent sites of the gene. Again,

we only analyzed sites where D. melanogaster and D. yakuba shared the same nucleotide.
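
The polarization rule for single nucleotide sites can be illustrated with a short Python sketch. It assumes that each alignment column supplies the D. simulans alleles plus one base each from D. melanogaster and D. yakuba; the function names, the treatment of missing data (any character other than A, C, G, or T is ignored), and the category labels are hypothetical and are not taken from the original analysis.

```python
VALID = set("ACGT")

def classify_site(sim_alleles, mel, yak):
    """Classify one aligned site along the D. simulans lineage.

    Returns 'skip', 'invariant', 'polymorphic', or 'fixed'.
    """
    # Only sites where both outgroups share the same nucleotide are polarized.
    if mel not in VALID or mel != yak:
        return "skip"
    observed = set(sim_alleles) & VALID  # assumption: ignore missing data
    if not observed:
        return "skip"
    if len(observed) > 1:
        return "polymorphic"
    return "invariant" if observed == {mel} else "fixed"

def count_sites(columns):
    """columns: iterable of (sim_alleles, mel, yak); returns MK-style counts."""
    counts = {"polymorphic": 0, "fixed": 0}
    for sim_alleles, mel, yak in columns:
        kind = classify_site(sim_alleles, mel, yak)
        if kind in counts:
            counts[kind] += 1
    return counts
```

Counts produced this way for a noncoding region and for the silent sites of the associated gene would fill the two rows of the 2x2 table.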

With very few polymorphisms and fixations there is little power to detect the

action of directional selection. Therefore, we imposed a minimum row and column count

for tests to be included in downstream analyses. We required that each row and column

in the 2x2 table have a sum of at least 5 observations. To calculate the proportion of

genes experiencing recurrent directional selection, we also removed any tests that were

significant but had a neutrality index greater than one, which indicates an excess of amino

acid or noncoding polymorphism rather than directional selection [37].

All data for D. simulans heterozygosity, lineage-specific divergence, and MK

tests are listed in Table S8. Substitutions to preferred and unpreferred codons were

estimated by a parsimony method developed by Y-P. Poh [11].
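
The filtering of MK tests described above can be sketched as follows. The example assumes a two-sided Fisher's exact test on the 2x2 table, which the text does not specify for the MK tests, and computes the neutrality index as (Pn/Ps)/(Dn/Ds). The function names and status labels are hypothetical.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def neutrality_index(dn, pn, ds, ps):
    """NI = (Pn/Ps) / (Dn/Ds); values above one indicate excess polymorphism."""
    denom = ps * dn
    return float("inf") if denom == 0 else (pn * ds) / denom

def evaluate_mk(dn, pn, ds, ps, min_margin=5, alpha=0.05):
    """Return (status, p, NI) for fixed (d) and polymorphic (p) counts.

    n = replacement or noncoding sites, s = silent sites.
    """
    margins = (dn + pn, ds + ps, dn + ds, pn + ps)
    if min(margins) < min_margin:
        return "excluded_low_counts", None, None
    p = fisher_two_sided(dn, pn, ds, ps)
    ni = neutrality_index(dn, pn, ds, ps)
    if p > alpha:
        return "not_significant", p, ni
    if ni > 1:
        return "excluded_excess_polymorphism", p, ni
    return "directional_selection", p, ni
```

Checking the margins before running the test keeps low-power tables out of the denominator when the proportion of significant genes is calculated.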

Resampling tests. For each category of interest (e.g. increasing or decreasing

expression levels) we calculated the proportion of genes with a significant test result (MK

tests were considered significant at p ≤ 0.05; heterozygosity/divergence ratios were considered

significant if they fell in the lowest 5% tail). We then tested whether this proportion was

significantly greater than the random expectation using resampling tests. We randomly drew n

p-values from the set of all genes, where n is the number of genes in the category. We repeated this

procedure 10,000 times to obtain the empirical distribution of the proportion of genes with

significant tests.
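
A minimal Python sketch of this resampling procedure is given below. It assumes that each test result has already been reduced to a boolean (significant or not), that genes are drawn without replacement, and that the empirical p-value is the fraction of resampled proportions at least as large as the observed one. None of these details is stated in the text, and the function name and fixed seed are hypothetical.

```python
import random

def resampling_test(category_sig, all_sig, n_reps=10_000, seed=0):
    """Compare a category's proportion of significant tests with random draws.

    category_sig, all_sig: lists of booleans (True = significant test).
    Returns (observed proportion, empirical one-sided p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(category_sig)
    observed = sum(category_sig) / n
    hits = 0
    for _ in range(n_reps):
        # Assumption: draw n genes without replacement from all genes.
        draw = rng.sample(all_sig, n)
        if sum(draw) / n >= observed:
            hits += 1
    return observed, hits / n_reps
```

Drawing exactly n genes per replicate matches the size of the category being tested, so the resampled proportions have the same sampling variance as the observed one.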

Gene ontology. We obtained cellular component, molecular function, and

biological process ontology terms from FlyBase

(http://flybase.org/genes/lk/function) in combination with the generic Gene Ontology

Slim set of ontology terms (http://geneontology.org/GO.slims.shtml#avail). The

proportion of genes with significant expression evolution was calculated for each

ontology term. We determined whether each ontology term had a higher proportion of

genes with significant D. simulans expression divergence than would be expected from

the empirical distribution. We derived the empirical distribution for each ontology term

by randomly drawing, from all genes with expression data, the same number of genes as were

assigned to the term. We then calculated the proportion in the resampled data set with significant

expression evolution. We used 10,000 resampled data sets to derive the empirical

distribution for each term.

Acknowledgments

We thank Ariel Chernomoretz for providing the R scripts used to mask probes and much

help adapting them for our purposes and Eric Blanc, Gene Schuster, and Bregje

Wertheim for their valuable help with R and Bioconductor packages. We also thank

three anonymous reviewers and Mia Levine for thoughtful and critical assessment of this

work and suggestions for improvement. This work was funded by an NSF Postdoctoral

Fellowship in Biological Informatics to AKH and by funds from the Carolina Center for

Genome Sciences and the NSF to CDJ.

References

1. Crawford DL, Powers DA (1989) Molecular basis of evolutionary adaptation at the lactate dehydrogenase-B locus in the fish Fundulus heteroclitus. Proc Natl Acad Sci U S A 86: 9365-9369.

2. Odgers WA, Aquadro CF, Coppin CW, Healy MJ, Oakeshott JG (2002) Nucleotide polymorphism in the Est6 promoter, which is widespread in derived populations of Drosophila melanogaster, changes the level of Esterase 6 expressed in the male ejaculatory duct. Genetics 162: 785-797.

3. Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, et al. (2004) A neutral model of transcriptome evolution. PLoS Biol 2: E132.

4. Gilad Y, Oshlack A, Smyth GK, Speed TP, White KP (2006) Expression profiling in primates reveals a rapid evolution of human transcription factors. Nature 440: 242-245.

5. Rifkin SA, Kim J, White KP (2003) Evolution of gene expression in the Drosophila melanogaster subgroup. Nat Genet 33: 138-144.

6. Nuzhdin SV, Wayne ML, Harmon KL, McIntyre LM (2004) Common pattern of evolution of gene expression level and protein sequence in Drosophila. Mol Biol Evol 21: 1308-1317.

7. Meiklejohn CD, Parsch J, Ranz JM, Hartl DL (2003) Rapid evolution of male-biased gene expression in Drosophila. Proc Natl Acad Sci U S A 100: 9894-9899.

8. Good JM, Hayden CA, Wheeler TJ (2006) Adaptive protein evolution and regulatory divergence in Drosophila. Mol Biol Evol 23: 1101-1103.

9. Khaitovich P, Tang K, Franz H, Kelso J, Hellmann I, et al. (2006) Positive selection on gene expression in the human brain. Curr Biol 16: R356-358.

10. Mezey JG, Ye F, Nuzhdin SV, Jones CD (In Review) Coordinated evolution of co-expression gene clusters in the Drosophila transcriptome. BMC Bioinformatics.

11. Begun DJ, Holloway AK, Stevens KS, Hillier L, Poh Y-P, et al. (In Review) Population genomics of Drosophila simulans. PLoS Biology.

12. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. (2000) The genome sequence of Drosophila melanogaster. Science 287: 2185-2195.

13. Maynard Smith J, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genet Res 23: 23-35.

14. McDonald JH, Kreitman M (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652-654.

15. Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP (2005) Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res 15: 674-680.

16. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, et al. (2003) The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 20: 1377-1419.

17. Ross J (1996) Control of messenger RNA stability in higher eukaryotes. Trends Genet 12: 171-175.

18. Ambros V (2003) MicroRNA pathways in flies and worms: growth, death, fat, stress, and timing. Cell 113: 673-676.

19. Griffiths-Jones S (2004) The microRNA Registry. Nucleic Acids Res 32: D109-111.

20. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ (2006) miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34: D140-144.

21. Pal C, Papp B, Hurst LD (2001) Highly expressed genes in yeast evolve slowly. Genetics 158: 927-931.

22. Akashi H (2003) Translational selection and yeast proteome evolution. Genetics 164: 1291-1303.

23. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH (2005) Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102: 14338-14343.

24. Ross J, Jiang H, Kanost MR, Wang Y (2003) Serine proteases and their homologs in the Drosophila melanogaster genome: an initial analysis of sequence conservation and phylogenetic relationships. Gene 304: 117-131.

25. Andolfatto P (2005) Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149-1152.

26. Kohn MH, Fang S, Wu CI (2004) Inference of positive and negative selection on the 5' regulatory regions of Drosophila genes. Mol Biol Evol 21: 374-383.

27. Tirosh I, Weinberger A, Carmi M, Barkai N (2006) A genetic signature of interspecies variations in gene expression. Nat Genet 38: 830-834.

28. Ranz JM, Castillo-Davis CI, Meiklejohn CD, Hartl DL (2003) Sex-dependent gene expression and evolution of the Drosophila transcriptome. Science 300: 1742-1745.

29. Gautier L, Cope L, Bolstad BM, Irizarry RA (2004) affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20: 307-315.

30. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5: R80.

31. Schluter D, Price T, Mooers AO, Ludwig D (1997) Likelihood of ancestor states in adaptive radiation. Evolution 51: 1699-1711.

32. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap; Cox DR, Hinkley DV, Reid N, Rubin DB, Silverman BW, editors. Boca Raton, FL: Chapman & Hall/CRC.

33. Nei M, Gojobori T (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3: 418-426.

34. Yang Z, Goldman N, Friday A (1994) Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol Biol Evol 11: 316-324.

35. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12: 1611-1618.

36. Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22: 160-174.

37. Rand DM, Kann LM (1996) Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol Biol Evol 13: 735-748.

38. Akashi H (1994) Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics 136: 927-935.

Table 1. No evidence for codon bias with increased expression.

Fixation        Preferred   Unpreferred   P:U      p-value
↑               453         597           0.7588   0.8983
nc              8689        11551         0.7522

Polymorphism    Preferred   Unpreferred   P:U      p-value
↑               545         1443          0.3777   0.4205
nc              10779       29774         0.3620

Codon preference was obtained from D. melanogaster [38]. The codon with the highest frequency was used in the counts of preferred and unpreferred polymorphisms. Key: nc = no significant change in expression; ↑ = increase in expression; p-values from Fisher’s Exact Test.

Supplementary Figures and Tables

Figure S1. Distribution of expression intensities in D. melanogaster, D. simulans, and D. yakuba.

Figure S2. Distribution of expression divergence along the D. simulans lineage.

Figure S3. Relationship between change in expression and estimated ancestral expression levels.

Table S1. Recurrent and recent selection on coding regions and cis-regulatory regions.

Table S2. D. simulans heterozygosity and lineage-specific divergence for protein coding and cis-regulatory regions.

Table S3. Ontology categories with enrichment of genes with both significant increases and decreases in expression.

Table S4. Gene Ontology information for genes with increases in expression and evidence for adaptive evolution in the 3’ UTR.

Table S5. Gene Ontology information for genes with increases in expression and evidence for adaptive evolution in 3’ flanking regions.

Table S6. Gene Ontology information for genes with increases in expression and evidence for recent adaptive evolution in the coding region.

Table S7. Gene expression data.

Table S8. Heterozygosity, divergence, and counts of polymorphic and fixed sites for each feature.