
A Survey of the Molecular Evolutionary Dynamics of Twenty-Five Multigene Families from Four Grass Taxa

Liqing Zhang,1 Sergei Kosakovsky Pond,2 Brandon S. Gaut1

1 Department of Ecology and Evolutionary Biology, 321 Steinhaus Hall, University of California, Irvine, Irvine, CA 92697-2525, USA

2 Program in Applied Mathematics, Department of Mathematics, University of Arizona, Tucson, AZ 85721, USA

Received: 25 May 2000 / Accepted: 16 October 2000

Abstract. We surveyed the molecular evolutionary characteristics of 25 plant gene families, with the goal of better understanding general processes in plant gene family evolution. The survey was based on 247 GenBank sequences representing four grass species (maize, rice, wheat, and barley). For each gene family, orthology and paralogy relationships were uncertain. Recognizing this uncertainty, we characterized the molecular evolution of each gene family in four ways. First, we calculated the ratio of nonsynonymous to synonymous substitutions (dN/dS) both on branches of gene phylogenies and across codons. Our results indicated that the dN/dS ratio was statistically heterogeneous across branches in 17 of 25 (68%) gene families. The vast majority of dN/dS estimates were <<1.0, suggestive of selective constraint on amino acid replacements, and no estimates were >1.0, either across phylogenetic lineages or across codons. Second, we tested separately for nonsynonymous and synonymous molecular clocks. Sixty-eight percent of gene families rejected a nonsynonymous molecular clock, and 52% of gene families rejected a synonymous molecular clock. Thus, most gene families in this study deviated from clock-like evolution at either synonymous or nonsynonymous sites. Third, we calculated the effective number of codons and the proportion of G+C synonymous sites for each sequence in each gene family. One or both quantities vary significantly within 18 of 25 gene families. Finally, we tested for gene conversion, and only six gene families provided evidence of gene conversion events. Altogether, evolution for these 25 gene families is marked by selective constraint that varies among gene family members, a lack of molecular clock at both synonymous and nonsynonymous sites, and substantial variation in codon usage.

Key words: Maize; Rice; Wheat; Barley; Molecular evolutionary dynamics; Plant multigene families

Introduction

Most, if not all, plant nuclear genes are members of gene families, i.e., genes of common origin that encode products of similar function. Gene families can vary in size from a few (e.g., Clegg et al. 1997) to a few hundred (e.g., Meyers et al. 1999; Waters 1995) genes. In addition, the number of loci within a single gene family can be plastic. For example, there are at least five classes of genes encoding small heat-shock proteins, and both the number of classes and the number of gene copies within each class vary among species (Waters 1995). Despite variation in copy number, it is clear that many individual gene family members fill critical functional roles.

Multigene family members are produced by gene duplication. Traditionally, it has been thought that one of two duplicate gene copies evolves under relaxed selective constraint (Ohno 1970; Kimura 1983), and many theoretical models predict the eventual loss of one gene copy as a pseudogene (Nei and Roychoudhury 1973; Takahata and Maruyama 1979; Walsh 1995). However, empirical studies suggest that more duplicate genes remain functional than predicted by theory (Ferris and Whitt 1977; Force et al. 1999), and it is thus unclear to what extent duplicate genes evolve under relaxed constraint (Hughes 1994; Hughes and Hughes 1993). Other possible fates for duplicated genes include retention of original function (Ohno 1970), evolution of new or altered expression patterns (Force et al. 1999; Lynch and Force 2000), and development of new function (Kimura and Ohta 1974; Ohno 1970; Tropf et al. 1994).

Correspondence to: Brandon S. Gaut; e-mail: [email protected]

J Mol Evol (2001) 52:144–156 DOI: 10.1007/s002390010143

© Springer-Verlag New York Inc. 2001

Plant gene families act as potential sources of molecular adaptation, and hence it is important to characterize their pattern of evolution. Yet surprisingly few studies have compared evolutionary patterns among plant gene families. The purpose of this paper is to characterize the molecular evolution of several multigene families and to generalize, as best as possible, their evolutionary dynamics. We focus on 25 gene families from four grass species: maize (Zea mays ssp. mays), wheat (Triticum aestivum), rice (Oryza sativa), and barley (Hordeum vulgare). These four species represent a range of genome organization. For example, rice has a small diploid genome, with 0.9 pg of DNA per haploid genome (Bennett and Leitch 1995), whereas maize and barley (5.7 and 10.9 pg DNA, respectively) have larger diploid genomes (Bennett and Leitch 1995). Wheat is hexaploid and has an even larger genome of 33 pg of DNA per haploid genome (Bennett and Leitch 1995). The four species also represent diverse evolutionary lineages. Wheat and barley are the most closely related of the four species, and both are members of the tribe Triticeae. However, maize, rice, and the Triticeae likely diverged early in the history of the grass family (Bennetzen and Kellogg 1997), roughly 65 million years ago (Stebbins 1987; Thomasson 1987). In short, there has been ample time for these species potentially to accumulate differences in the size and evolutionary dynamics of gene families.

We characterize the molecular evolution of each of the 25 gene families in four ways. The first way is to measure the ratio of nonsynonymous to synonymous substitution (dN/dS). The dN/dS ratio provides insight into the level of selective constraint acting on proteins. When dN/dS << 1.0, the protein is under selective constraint; when dN/dS = 1.0, the protein evolves without constraint on amino acid replacements; and when dN/dS > 1.0, there is evidence that positive selection has acted to promote amino acid replacement (Hughes and Nei 1988). If selective constraint varies among members of a gene family, as has been suggested by some studies (Li and Gojobori 1983), then dN/dS ratios should vary among phylogenetic lineages of a gene family.

The interpretation of the dN/dS ratio as a measure of selective constraint often relies implicitly on the assumption that synonymous substitution rates adhere to a molecular clock. It has been shown that plant sequences often do not evolve according to a molecular clock (Bousquet et al. 1992; Eyre-Walker and Gaut 1997; Gaut et al. 1992; dePamphilis et al. 1997), but there is as yet no clear consensus as to the behavior of clocks within gene families. Some studies have detected deviations from molecular clocks within gene families (Alba et al. 2000; Small et al. 1998; Waters 1995), but in most cases it is not clear if deviations are due to synonymous rates, nonsynonymous rates, or both. Thus, the second way that we characterize the evolution of plant multigene families is to test separately for synonymous and nonsynonymous molecular clocks.

The third way we characterize multigene families is to compare G+C content and codon usage among gene family members. Because G+C bias and codon usage are correlated with levels of gene expression (Bulmer 1990; Fennoy and Bailey-Serres 1993; Sharp and Li 1986), shifts in G+C content among gene family members may provide clues about evolutionary shifts in gene function. G+C content has been measured for many plant nuclear genes (Carels et al. 1998; Jansson et al. 1994), but there has been little effort to determine whether G+C content commonly varies among gene family members. Some data indicate that gene family members can vary widely in G+C content (for example, maize glyceraldehyde-3-P-dehydrogenase genes differ in their third position G+C content by 39%; Fennoy and Bailey-Serres 1993), but this issue has not been addressed broadly.

The fourth and final way we characterize gene families is by testing for gene conversion. Gene conversion homogenizes sequences and thus retards gene family diversification. Gene conversion plays a prominent role in the evolution of some highly repetitive multigene families, such as the 18S, 5.8S, and 26S rDNA genes (Zimmer et al. 1988), but is apparently infrequent in other gene families, such as the 5S rRNA of wheat (Kellogg and Appels 1995) and the adh gene family of peonies (Sang et al. 1997). Here we determine whether the signature of conversion events is commonly detectable among gene family members.

Materials and Methods

GenBank Searches

We focused on rice, maize, barley, and wheat because they are well characterized at the DNA sequence level. We searched initially for gene families in the literature and GenBank. We found 31 gene families with one or several sequences from one of the four species. One of these (adh) was not included in analyses because it was studied recently (Gaut et al. 1999). Beginning with one or more sequences from each of the remaining 30 gene families, we identified additional GenBank sequences with BLAST, based on data released before November 1998. Each sequence identified by BLAST was used, in turn, as a new query sequence to perform an additional BLAST search. This process was repeated until no additional sequences from rice, maize, wheat, or barley were identified. After BLAST searches, six gene families contained only three sequences. These six gene families were not included in further analyses, leaving 24 gene families. One of these gene families was studied as two distinct gene families (see below), resulting in a total of 25 gene families. During the search, we encountered a labeled pseudogene sequence in three gene families; these pseudogene sequences were discarded because they could not be included in analyses that require separate consideration of nonsynonymous and synonymous sites.
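The iterative search described above amounts to expanding a seed set until no query returns anything new. The following is a minimal sketch of that loop, not the authors' procedure: `search` is a hypothetical callable standing in for one BLAST query restricted to the four grass species, and the identifiers in the toy example are made up.

```python
from collections import deque

def collect_family(seeds, search):
    """Expand a seed set until no search returns a new accession.

    `search` is a hypothetical stand-in for one BLAST query: it takes an
    accession and returns an iterable of hit accessions.
    """
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:      # each new hit becomes a query in turn
                found.add(hit)
                queue.append(hit)
    return found

# Toy hit graph (made-up identifiers, not GenBank accessions).
toy_hits = {"A": ["B"], "B": ["A", "C"], "C": [], "D": ["A"]}
family = collect_family(["A"], lambda acc: toy_hits.get(acc, []))
print(sorted(family))
```

Tracking visited accessions in a set guarantees termination, because each sequence is used as a query at most once.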

Introns were removed from nucleotide sequences, and exons were translated. Protein sequences were aligned with ClustalW and adjusted manually. Nucleotide sequence alignments were adjusted to conform to the protein alignment, and all subsequent analyses were based on the DNA sequence alignments. The resulting alignments covered over 80% of the coding length of all gene families except the transcriptional activator (R-gene) family (Table 1), for which only 30% of the full coding length was analyzed.

Phylogenetic trees were constructed for all the multigene families by the Neighbor-Joining (NJ) method (Saitou and Nei 1987) with Tamura–Nei (TN) (1993) distances, as implemented in PAUP* 4.0b4a. Sequences from the same species that clustered together in a phylogenetic tree and had more than 95% synonymous similarity were discarded to avoid including multiple alleles from the same locus (Clegg et al. 1997). All reported analyses were conducted on data sets lacking “allelic” sequences. The sequence alignments of all 25 gene families are available at http://bgbox.bio.uci.edu.

The Ratio of Nonsynonymous to Synonymous Substitutions

Each data set was analyzed with two types of dN/dS analyses, as implemented in PAML version 2.0k (Yang 1997). The first type of analysis was estimation of dN/dS along branches of the multigene family phylogeny, and the second type of analysis was estimation of dN/dS across codon sites. For all dN/dS analyses, we used tree topologies produced by NJ with TN distances (but see Discussion).

Two models were initially applied to the data for the analysis of dN/dS across branches. The first model (model 1) estimated a dN/dS value for every branch in a phylogenetic tree, whereas the second model (model 0) constrained all the branches to one maximum-likelihood dN/dS value. A likelihood-ratio (LR) comparison between model 0 and model 1 tested the null hypothesis that the dN/dS ratio is homogeneous across all branches of the gene phylogeny. The dN/dS values were statistically heterogeneous when the null hypothesis was rejected. A third model (model 2) permitted a test of dN/dS > 1.0 and thus a test for positive selection. Model 2 was applied in cases where model 1 produced branches with dN/dS ratios higher than 1.0. For these branches, the dN/dS value was constrained to equal 1.0 in model 2, and an LR comparison between model 1 and model 2 tested for positive selection (against the null hypothesis of neutral evolution) along a single branch of the phylogeny.
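The model 0 versus model 1 comparison can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the PAML implementation: it assumes an unrooted, fully resolved tree with 2n−3 branches, so that model 1 has 2n−4 more free ratios than model 0, and it assumes the LR statistic is compared to a chi-square distribution with that many degrees of freedom. The function names and log-likelihood values are made up for illustration.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, exact closed form for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def branch_heterogeneity_test(lnl_model0, lnl_model1, n_sequences):
    """LR test of one dN/dS ratio (model 0) against one ratio per branch (model 1)."""
    n_branches = 2 * n_sequences - 3      # unrooted, fully resolved tree
    df = n_branches - 1                   # extra free ratios in model 1 (assumed df)
    lr = 2.0 * (lnl_model1 - lnl_model0)
    return lr, df, chi2_sf_even_df(max(lr, 0.0), df)

# Made-up log-likelihoods for a 6-sequence family.
lr, df, p = branch_heterogeneity_test(-2510.4, -2498.1, 6)
print(f"LR = {lr:.1f}, df = {df}, p = {p:.4f}")
```

Because 2n−4 is always even, the closed-form tail probability avoids any dependency on a statistics library; a general chi-square routine would be needed for odd degrees of freedom.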

The analysis of dN/dS across codon sites determined whether any codon residues evolved under positive selection. For this analysis, both the neutral model and the positive selection model of Nielsen and Yang (1998) were applied to each of the 25 data sets. The neutral model divided all codon sites of a gene region into two categories: neutral codon sites, with dN/dS equal to 1.0, and unvarying codon sites, with dN/dS equal to 0.0. The positive selection model assumed three categories of sites: the two codon categories assumed by the neutral model and a third category for positively selected codons, for which dN/dS > 1.0. The two models were compared with an LR statistic; a significant statistic indicated that codons evolved under positive selection.

Tests of Molecular Clocks

For each of the 25 data sets, we tested whether the group of sequences followed either a nonsynonymous or a synonymous molecular clock. To perform these tests, we tested for clocks over the entire gene phylogeny with HYPHY (http://peppercat.stat.ncsu.edu/~hyphy). The clock-testing procedure was identical to that of DNAmlk in PHYLIP (Felsenstein 1990; http://www.washington.edu/), with the important difference that our tests used a codon model that separates synonymous from nonsynonymous substitution rates (Muse and Gaut 1994).

The test for a molecular clock used an LR statistic based on a comparison of the likelihood of a rooted tree (Lr) constrained to follow a molecular clock to the likelihood of an unrooted and unconstrained tree (Lu). The need for a rooted tree introduced a problem for our analyses, because we did not have an outgroup for any of our data sets. We employed a conservative approach to get around this problem. We used the unrooted NJ topology to estimate Lu. To estimate Lr, we produced 2n−3 rooted topologies, where n is the number of sequences, from the unrooted topology. The 2n−3 rooted topologies represent all possible rootings of the tree. Lr was estimated for each rooted version of the phylogeny, and the maximum Lr (Lr,max) was compared to Lu to generate an LR statistic. This approach is conservative because Lr,max either equals or exceeds the Lr based on the true (but unknown, in this case) root, and thus the LR statistic can only be either underestimated or estimated correctly. The LR statistic is χ² distributed with n−2 degrees of freedom. This procedure (i.e., estimating Lu and Lr,max to form an LR statistic) was performed for both nonsynonymous and synonymous rates for each gene family.
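The conservative rooting procedure can be sketched as follows, assuming the clock-constrained log-likelihood has already been obtained for each of the 2n−3 rootings by some external program. The function name and the numbers are illustrative only; the statistic is taken to be twice the difference in log-likelihoods, the usual LR form.

```python
def clock_lr_statistic(lnl_unrooted, lnl_rooted_by_root, n_sequences):
    """Conservative clock test: compare Lu with the best clock-constrained rooting."""
    expected = 2 * n_sequences - 3        # one candidate root per branch
    if len(lnl_rooted_by_root) != expected:
        raise ValueError(f"expected {expected} rootings")
    lnl_r_max = max(lnl_rooted_by_root)   # Lr,max over all possible rootings
    lr = 2.0 * (lnl_unrooted - lnl_r_max)
    df = n_sequences - 2                  # degrees of freedom stated in the text
    return lr, df

# Made-up log-likelihoods for n = 4 sequences (5 possible rootings).
lr, df = clock_lr_statistic(-1200.0, [-1210.5, -1207.2, -1204.0, -1209.9, -1206.3], 4)
print(lr, df)
```

Taking the maximum over rootings yields the smallest possible LR statistic, which is why the test can only err toward failing to reject the clock.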

G+C Content and Codon Bias

We measured G+C content as the percentage of G+C at third position synonymous sites (GCS) and codon bias as the effective number of codons (ENC) (Wright 1990). We chose to study ENC as a measure of codon use because it is not biased by gene length, given a certain minimum length, or by amino acid composition (Comeron and Aguade 1998; Wright 1990). ENC values range from 20 to 61. An ENC value of 20 represents extreme bias in which just one codon is used for any amino acid, whereas a value of 61 indicates that synonymous codon use is random.
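The two extremes of the ENC range follow directly from Wright's formula, which combines the mean codon homozygosity F of each degeneracy class (9 twofold, 1 threefold, 5 fourfold, and 3 sixfold amino acids, plus 2 for Met and Trp). The sketch below only evaluates that formula from class averages supplied by the caller; estimating F from codon counts is omitted, and the function name is an illustrative choice.

```python
def effective_number_of_codons(f2, f3, f4, f6):
    """Wright's ENC from the mean codon homozygosity of each degeneracy class."""
    return 2 + 9 / f2 + 1 / f3 + 5 / f4 + 3 / f6

# One codon per amino acid: homozygosity is 1 in every class.
extreme_bias = effective_number_of_codons(1.0, 1.0, 1.0, 1.0)
# Equal use of all synonymous codons: homozygosity is 1/k for k-fold classes.
no_bias = effective_number_of_codons(1 / 2, 1 / 3, 1 / 4, 1 / 6)
print(extreme_bias, round(no_bias, 6))
```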

We tested for homogeneity of GCS and ENC among sequences by a permutation procedure. The procedure for testing GCS is outlined below, and the procedure was analogous for ENC. First, aligned sequences were translated, and the codons that encoded polymorphic amino acids were removed from the data set. The removal of these codons ensured that tests for homogeneity reflected variation in codon usage among sequences rather than differences in amino acid composition. All GCS and ENC values reported are based on data for which polymorphism was removed. Second, the GCS proportion was calculated for all sequences within a data set, and the variance of GCS (varGCS) among sequences was computed from these values. Third, the data were permuted by randomizing codons (and thus codon usage) among sequences. Permutation retained the position of codons; for example, the first codon was randomized among all n sequences within a data set, then the second codon was randomized among n sequences, and so on, until the last codon. The data were permuted 10,000 times, and varGCS was calculated for each permuted data set. Finally, the varGCS based on GenBank data was compared to the distribution of varGCS based on 10,000 permuted data sets. When the varGCS from GenBank data was greater than 95% of the varGCS based on permuted data, then observed values were significantly more heterogeneous than expected under the null hypothesis of homogeneity. The homogeneity tests were implemented in a C-program that is available from the authors.
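The permutation steps above can be sketched in a few lines. This is an illustrative approximation, not the authors' C program: it assumes a gap-free, in-frame, uppercase alignment, uses the standard genetic code, and simplifies GCS to the fraction of retained codons ending in G or C. All names and the toy alignment are made up.

```python
import random
from statistics import pvariance

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def gc3(codons):
    """Fraction of codons ending in G or C (a simplified stand-in for GCS)."""
    return sum(c[2] in "GC" for c in codons) / len(codons)

def gcs_permutation_test(seqs, n_perm=10000, seed=0):
    """Return (observed variance of GC3 among sequences, permutation p value)."""
    rng = random.Random(seed)
    rows = [[s[i:i + 3] for i in range(0, len(s), 3)] for s in seqs]
    # Keep only alignment columns where every sequence encodes the same amino acid.
    cols = [list(col) for col in zip(*rows)
            if len({CODE[c] for c in col}) == 1]
    observed = pvariance([gc3(row) for row in zip(*cols)])
    hits = 0
    for _ in range(n_perm):
        # Shuffle codons among sequences within each column; positions are kept.
        shuffled = [rng.sample(col, len(col)) for col in cols]
        if pvariance([gc3(row) for row in zip(*shuffled)]) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy alignment: four sequences that all encode Ala-Leu-Gly.
toy = ["GCCCTGGGC", "GCGCTCGGG", "GCTTTAGGT", "GCATTGGGA"]
observed, p_value = gcs_permutation_test(toy, n_perm=1000)
print(observed, p_value)
```

A small p value means the observed variance exceeds that of most permuted data sets, which corresponds to the 95% criterion described in the text.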

Gene Conversion

We used Sawyer’s (1989) method to test for gene conversion. Sawyer’s method does not require phylogenetic assumptions and is more sensitive than other tests of gene conversion in some cases (Drouin et al. 1999). Following Sawyer (1989), we used only synonymous variable sites in the permutation statistics. Two statistics were used to test for conversion: SSCF and SSUF. These two statistics differ in that SSCF is less influenced by mutation hotspots, but SSUF probably has greater power in detecting gene conversion in the absence of hotspots (Sawyer 1989). If an entire data set demonstrated evidence of gene conversion, we identified converted sequences by two methods. First, we examined the distribution of two other statistics [MCF and MUF (see Sawyer, 1989)], and, second, we applied gene-conversion tests to intraspecific sequence sets. A C-program for Sawyer’s method is available from the authors.

Table 1. Gene families (with pseudonyms in parentheses), lengths of aligned sequences, taxa, and GenBank accession numbers. Taxa are maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare)

| Gene family | Length (bp) | Taxa (number of sequences) | Accession nos. |
| --- | --- | --- | --- |
| Actin | 1020 | Z. mays (8), O. sativa (4) | U60507, U60508, U60509, U60510, U60511, U60513, U60514, J01238, X15864, X16280, X15862, X15863 |
| α-Amylase (amylase) | 1308 | Z. mays (1), O. sativa (7), T. aestivum (1), H. vulgare (5) | L25805, M24286, M74177, M59352, M24287, M24941, M17126, M17128, X56336, X56338, X05809, X15227, Y11277, Y11276 |
| Chlorophyll a/b binding protein (Cab) | 666 | Z. mays (5), O. sativa (4), T. aestivum (2), H. vulgare (4) | X68682, X14794, Y00379, X55892, X63205, AF022739, U74295, X13909, X13908, U73218, M10144, X89023, X12735, X63197, X63052 |
| Calmodulin | 447 | Z. mays (5), O. sativa (6), T. aestivum (8), H. vulgare (1) | Y13974, X77397, AF031482, X74490, X77396, Z12827, Z12828, AF042839, X65016, AF042840, U37936, U48692, U48689, U48688, U48693, U49103, U49104, U49105, U48690, M27303 |
| Catalase | 1500 | Z. mays (4), O. sativa (2), T. aestivum (2), H. vulgare (2) | X60135, X12539, Z54358, L05934, D29966, D26484, D86327, X94352, U20778, U20777 |
| Chalcone synthase (CHS) | 1218 | Z. mays (2), O. sativa (2), H. vulgare (2) | X60205, X60204, X89859, D50576, Y09233, X58339 |
| Chitinase | 729 | Z. mays (3), O. sativa (5), H. vulgare (5) | L00973, M84164, M84165, L37289, U02286, Z29961, Z29962, X56787, M62904, U02287, L34211, X78672, X78671 |
| Glyceraldehyde-3-phosphate dehydrogenase (Gapd) | 915 | Z. mays (4), O. sativa (1), H. vulgare (2) | X15596, X73151, U45857, U45856, U31676, X60343, M36650 |
| Glutelin | 1512 | O. sativa (5) | M17513, Y00687, X54313, X54192, X14568 |
| Glutamine synthetase | 1098 | Z. mays (6), O. sativa (3), H. vulgare (2) | X65926, X65927, X65928, X65929, D14578, X65931, X14244, X14245, X14246, X69087, X53580 |
| GTP-binding protein | 513 | Z. mays (6), O. sativa (5) | U22432, U22433, X77795, X63278, D31905, D31906, S66160, D13152, D13758, X59276, L35845 |
| Small heat-shock protein (heatshock) | 435 | Z. mays (1), O. sativa (6), T. aestivum (2), H. vulgare (2) | X65725, U83671, X75616, X60820, U81385, U83669, M80186, X13431, X64618, Y07844, X64560 |
| Histone H2B | 468 | Z. mays (5), T. aestivum (4) | X69961, X57312, X57313, X69960, U08226, D37942, D37943, D37944, X59873 |
| Histone H3 | 378 | Z. mays (4), O. sativa (3), T. aestivum (3), H. vulgare (1) | M36658, M13379, M35388, X84377, M15664, X13678, U77296, X00937, U38423, U38422, U38420 |
| Histone H4 | 309 | Z. mays (2), T. aestivum (2) | M13377, X84376, M12277, X00043 |
| Manganese superoxide dismutase (Mn-dismutase) | 705 | Z. mays (2), O. sativa (1), T. aestivum (1) | X12540, L19462, L34039, AF092524 |
| Peroxidase | 882 | Z. mays (1), O. sativa (8), T. aestivum (4), H. vulgare (3) | Y13905, X66125, D16442, AF014468, AF014469, D49551, D14481, AF019743, D14997, X85230, X85227, X85228, X56011, L36093, M73234, Z23131 |
| Profilin | 393 | Z. mays (4), T. aestivum (2), H. vulgare (1) | AF032370, X73279, X73280, X73281, X89825, X89827, U49505 |
| Transcriptional activator (R-gene) | 426 | Z. mays (2), O. sativa (4) | X57276, X60706, U39860, U39863, U39868, U39866 |
| Ribulose-1,5-bisphosphate carboxylase (Rubp) | 513 | Z. mays (2), O. sativa (4), T. aestivum (2), H. vulgare | Y09214, X06535, L22155, AF052305, AF017363, D00644, M37477, M37328, U43493 |
| Sucrose synthase | 2433 | Z. mays (2), O. sativa (3), T. aestivum (2), H. vulgare (2) | L29418, X02400, L03366, X59046, X64770, AJ000513, AJ001117, X69931, X65871 |
| Thionin | 429 | T. aestivum (4), H. vulgare (4) | AF004018, X76861, X70665, X61670, X05576, X05589, L36883, M19046 |
| Ubiquitin | 465 | Z. mays (3), O. sativa (1), H. vulgare (2) | U29160, U29161, X92422, L31941, M60175, M60176 |
| Zein1 | 810 | Z. mays (12) | V01480, X55724, V01475, V01478, M86591, L34340, X55661, X61085, X55722, X55723, X55726, X14334 |
| Zein2 | 729 | Z. mays (10) | M12144, M29627, M60836, M60837, X53582, X58700, V01470, X02450, X05911, X67203 |

Results

Sequence Alignments and Phylogenetic Trees

The GenBank search resulted in 24 gene families represented by 495 sequences. The phylogeny of the zein multigene family resolved two disparate groups of sequences, and we analyzed the two groups separately. After removing sequences with >95% synonymous identity and splitting the zein multigene family, the complete data consisted of 247 sequences in 25 gene families (Table 1). The average level of DNA sequence identity within each family ranged from 63 to 93% (Table 2).

There are three important points to make about these data. First, the level of sequence divergence suggests that many, if not all, of the gene families originated before the grass family. We base this conclusion on the fact that two adh paralogues that diverged near the time of the origin of the grasses have sequence identities of ≈80% (Gaut et al. 1999). Thus, sequences with <80% identity may have diverged before the origin of the grass family, and 18 of 25 gene families have sequences that fall into this category (Table 2). Second, because the sequences likely predate the grasses, gene phylogenies cannot be rooted easily by including a nongrass sequence. The rooting of all 25 phylogenies is therefore uncertain, but we devised methods to eliminate the need for rooted topologies in our analyses. Finally, it is difficult to define orthologues in any of the data sets (Fig. 1), and this limits our ability to contrast the evolutionary dynamics of orthologues and paralogues.

dN/dS Analysis

The purpose of dN/dS analyses was to determine, first, whether dN/dS commonly varies among evolutionary lineages of multigene families and, second, whether positive selection governs gene family diversification. We applied dN/dS analyses along phylogenetic lineages. Among 25 multigene families, 17 families showed significant heterogeneity in dN/dS across branches at p < 0.05, and 11 of the 17 results remained significant after correction for multiple tests (p < 0.002; Table 2). The overall conclusion from these tests was that dN/dS varies commonly during the evolution of plant multigene families (Fig. 2).
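The counts of 17 and 11 can be checked against the p values for dN/dS heterogeneity reported in Table 2. In the sketch below the values are transcribed from that table, with entries reported as "<0.001" stored at their upper bound of 0.001 (an assumption that does not change either count); the Bonferroni threshold is 0.05 divided by the 25 tests.

```python
# p values for dN/dS heterogeneity across branches, transcribed from Table 2;
# entries reported as "<0.001" are stored at their upper bound of 0.001.
p_values = {
    "Actin": 0.001, "alpha-Amylase": 0.001, "Cab": 0.011, "Calmodulin": 0.001,
    "Catalase": 0.001, "CHS": 0.101, "Chitinase": 0.006, "Gapd": 0.001,
    "Glutelin": 0.212, "Glutamine synthetase": 0.001, "GTP-binding protein": 0.001,
    "Heatshock": 0.058, "Histone H2B": 0.015, "Histone H3": 0.001,
    "Histone H4": 0.029, "Mn-dismutase": 0.003, "Peroxidase": 0.001,
    "Profilin": 0.001, "R-gene": 0.741, "Rubp": 0.079, "Sucrose synthase": 0.001,
    "Thionin": 0.031, "Ubiquitin": 0.224, "Zein1": 0.087, "Zein2": 0.305,
}
alpha = 0.05
threshold = alpha / len(p_values)          # 0.05 / 25 = 0.002
nominal = [g for g, p in p_values.items() if p < alpha]
corrected = [g for g, p in p_values.items() if p < threshold]
print(len(nominal), len(corrected))
```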

Seventeen of the gene families contained branches with estimated dN/dS values greater than 1.0 (e.g., Fig. 2). We tested whether these dN/dS values were significantly greater than 1.0 (and thus indicative of positive selection for amino acid replacements) or statistically indistinguishable from 1.0 (and thus consistent with no constraint on amino acid replacements). None of the dN/dS values were significantly greater than 1.0 (data not shown), and hence this method did not detect evidence of positive selection.

Table 2. Percentage sequence identity among sequences, results of dN/dS heterogeneity tests across branches, and results of molecular clock analyses (a)

| Gene family | Avg % sequence identity (range) | p value: dN/dS across branches | p value: nonsynonymous clock | p value: synonymous clock |
| --- | --- | --- | --- | --- |
| Actin | 81.2 (76.4–97.1) | <0.001** | <0.001** | 0.010* |
| α-Amylase | 78.4 (70.4–94.0) | <0.001** | <0.001** | 0.001** |
| Cab | 80.8 (59.0–97.0) | 0.011* | <0.001** | 0.001** |
| Calmodulin | 87.7 (75.8–98.6) | <0.001** | <0.001** | 0.001** |
| Catalase | 74.8 (63.8–99.3) | <0.001** | <0.001** | 0.146 |
| CHS | 72.7 (49.0–94.4) | 0.101 | 0.001** | 0.006* |
| Chitinase | 71.7 (54.2–92.2) | 0.006* | <0.001** | 0.304 |
| Gapd | 84.8 (80.6–96.2) | <0.001** | 0.408 | 0.650 |
| Glutelin | 75.9 (67.3–94.6) | 0.212 | 0.904 | 0.137 |
| Glutamine synthetase | 78.1 (67.7–95.9) | <0.001** | <0.001** | 0.001** |
| GTP-binding protein | 63.0 (52.4–95.3) | <0.001** | <0.001** | 0.028* |
| Heatshock | 85.1 (73.7–97.6) | 0.058 | <0.001** | 0.001** |
| Histone H2B | 90.9 (83.1–97.2) | 0.015* | 0.047* | 0.001** |
| Histone H3 | 90.4 (84.7–96.6) | <0.001** | <0.001** | 0.229 |
| Histone H4 | 92.5 (90.3–94.5) | 0.029* | 0.377 | 0.218 |
| Mn-dismutase | 87.8 (85.5–94.4) | 0.003* | 0.178 | 0.029* |
| Peroxidase | 64.2 (53.1–90.9) | <0.001** | <0.001** | 0.001** |
| Profilin | 81.5 (74.1–95.2) | <0.001** | 0.713 | 0.623 |
| R-gene | 70.7 (63.2–92.7) | 0.741 | 0.168 | 0.160 |
| Rubp | 84.6 (78.1–95.7) | 0.079 | 0.804 | 0.402 |
| Sucrose synthase | 81.4 (75.4–96.2) | <0.001** | <0.001** | 0.001** |
| Thionin | 72.2 (55.7–98.6) | 0.031* | <0.001** | 0.148 |
| Ubiquitin | 92.1 (89.5–96.7) | 0.224 | 0.140 | 0.507 |
| Zein1 | 93.1 (82.9–97.2) | 0.087 | <0.001** | 0.015* |
| Zein2 | 85.4 (69.5–96.7) | 0.305 | <0.001** | 0.872 |

(a) Gene family names are given in Table 1. *p < 0.05; **p < 0.002 (Bonferroni correction for a significance level of 0.05 over 25 independent tests).

Fig. 1. The NJ topology of two gene families, with bootstrap support >50% indicated. The scale bar represents the number of nucleotide substitutions per site indicated. These trees demonstrate difficulties in defining orthologues with gene family data. For example, it is difficult to determine if chitinase sequences L37289, X56787, and U02286 represent duplications in rice after the divergence of rice and barley (represented by L34211) or, alternatively, whether barley orthologues to L37289, X56787, and U02286 are missing. Maize chitinase sequences M84165 and M84164 appear to be quite diverged from other sequences, suggesting that rice and barley may have orthologues to these sequences that have yet to be sampled. Similar ambiguities apply to the amylase tree and other data sets. Because of difficulties in defining orthology relationships, analyses did not rely on such definitions. These trees are also examples of the relatively high bootstrap support for the NJ topologies of the 25 gene families.

Failure to detect branches with a dN/dS value greater than 1.0 could reflect a lack of power to detect positive selection. This is especially true because positive selection is likely to affect a small subset of amino acids, but the analysis of dN/dS across branches averages the effects of selection over all codon positions. We therefore applied the method of Nielsen and Yang (1998) to determine whether individual codon sites exhibit evidence of positive selection. None of the gene families contained codons with dN/dS ratios significantly greater than 1.0 (data not shown). Thus, neither codon tests nor branch tests provided evidence of positive selection during the diversification of the gene families in our data set.

Molecular Clocks

We applied tests of nonsynonymous and synonymousmolecular clocks to all 25 multigene families. Despitethe fact that our clock test is conservative with respect to

Fig. 2. An example of gene phylogenies of two gene families with heterogeneousdN/dS estimates on phylogenetic branches. Glutamine synthetasehas two branches with MLdN/dS estimates >1.0; the two branches were tested separately and neither differs statistically fromdN/dS 4 1.0.


the rooting of the topology, 17 of 25 (68%) gene families exhibited deviation from clock-like nonsynonymous rates. Sixteen of the nonsynonymous tests remained significant after Bonferroni correction (Table 2). A synonymous clock was rejected in 13 of 25 (52%) gene families; 8 rejections remained significant after Bonferroni correction (Table 2). Altogether, nonsynonymous and synonymous clocks were rejected for a substantial proportion of the gene families.

G+C and Codon Bias

The mean GCS among gene families ranged from 40 to 98%, with zein2 and histone H3 at the low and high extremes, respectively (Table 3). GCS also varied within gene families. Based on the permutation approach, 18 of the 25 gene families demonstrated significant heterogeneity in GCS and 15 remained significant after Bonferroni correction. For those gene families with sufficient sampling, it is clear that much of the GCS variation is encompassed among paralogues within species (Fig. 3). Together with previous work showing that G+C contents are similar across grass species (Carels and Bernardi 2000), these data suggest that most variation in GCS within gene families is due to divergence of paralogues rather than GCS differences among species.

We also applied homogeneity tests for ENC. Like GCS, ENC varied substantially among gene families; the range of ENC among gene families was 27.5 (histone H2B) to 61.0 (zein1) (Table 3). ENC also varied among sequences within gene families. Nine families had significant ENC heterogeneity, and four remained significant after Bonferroni correction (Table 3). Fewer gene families demonstrated significant ENC heterogeneity than GCS heterogeneity, probably reflecting differences in the statistical power to detect heterogeneity with these two statistics. Heterogeneity in ENC, like GCS, appears to be a function primarily of divergence among paralogues rather than different codon usage among species (Fig. 4). Altogether, our results suggest that shifts in codon preference are a common and perhaps important feature of the diversification of paralogues.
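As a rough illustration of the GCS statistic, the sketch below computes the proportion of G or C at third codon positions, skipping Met and Trp codons (which have no synonymous alternatives) and stop codons. This is one common reading of "G+C at synonymous sites" and is an assumption here; the function name `gc3s` and the example sequence are hypothetical, and the exact definition used in the study may differ.

```python
# Codons left out of the count: Met (ATG) and Trp (TGG) have no synonymous
# alternatives, and stop codons do not encode an amino acid.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(cds: str) -> float:
    """Proportion of G or C at the third position of the counted codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Keep only unambiguous codons that are not in the excluded set.
    counted = [c for c in codons if c not in EXCLUDED and set(c) <= set("ACGT")]
    if not counted:
        raise ValueError("no codons with synonymous alternatives")
    return sum(c[2] in "GC" for c in counted) / len(counted)


example = "GCCGCGATGAAATGA"  # hypothetical sequence: GCC GCG ATG AAA TGA
print(f"GC3s = {gc3s(example):.3f}")
```

In the example, three codons are counted (GCC, GCG, AAA) and two of them end in G or C.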

Gene Conversion

We applied Sawyer’s (1989) permutation tests to each data set. Six gene families showed significance for either one or both of two statistics (Table 4), with three remaining significant after Bonferroni correction. For each of the six gene families, we tried to identify converted sequences both by applying Sawyer’s tests to intraspecific sequence data and by examining two additional metrics [MCF and MUF (Sawyer 1989)]. For five of the six gene families, we were able to identify converted sequences.

Table 3. ENC and GCS values, with results of heterogeneity tests^a

| Gene family | ENC mean | ENC range | ENC p value | GCS mean (%) | GCS range (%) | GCS p value |
|---|---|---|---|---|---|---|
| Actin | 47.5 | 41.0–53.2 | 0.077 | 57.4 | 43.3–68.2 | 0.000** |
| Amylase | 34.4 | 31.1–38.3 | 0.125 | 90.2 | 82.6–95.5 | 0.001** |
| Cab | 34.6 | 23.1–50.6 | 0.023* | 89.7 | 70.6–100 | 0.000** |
| Calmodulin | 44.7 | 32.5–61.0 | 0.017* | 72.8 | 51.5–92.8 | 0.000** |
| Catalase | 38.7 | 29.2–52.0 | 0.000** | 77.7 | 47.4–96.0 | 0.000** |
| CHS | 38.9 | 30.4–57.4 | 0.000** | 86.6 | 59.5–98.3 | 0.000** |
| Chitin | NA | NA | NA | 96.3 | 84.8–100 | 0.006* |
| Gapd | 43.6 | 40.7–46.6 | 0.610 | 61.2 | 57.0–69.1 | 0.001** |
| Glutelin | 53.3 | 50.9–59.0 | 0.062 | 41.1 | 36.2–45.7 | 0.059 |
| Glutamine synthetase | 41.3 | 32.3–60.0 | 0.000** | 72.4 | 41.5–86.0 | 0.000** |
| Gtp-binding protein | 53.2 | 37.7–61.0 | 0.162 | 58.6 | 36.2–74.5 | 0.000** |
| Heatshock | 46.9 | 25.1–61.0 | 0.193 | 95.8 | 86.3–98.6 | 0.000** |
| Histone H2B | 27.5 | 23.7–33.4 | 0.043* | 96.1 | 86.8–100 | 0.000** |
| Histone H3 | 30.0 | 26.6–35.0 | 0.291 | 97.7 | 91.6–100 | 0.004* |
| Histone H4 | 30.6 | 29.0–32.0 | 0.838 | 94.1 | 90.7–96.9 | 0.231 |
| Mn-dismutase | 56.2 | 53.6–58.0 | 0.716 | 63.8 | 60.1–68.4 | 0.068 |
| Peroxidase | 41.8 | 33.1–55.1 | 0.998 | 86.8 | 64.9–98.2 | 0.000** |
| Profilin | 56.2 | 46.3–61.0 | 0.129 | 73.8 | 64.3–86.9 | 0.000** |
| R-gene | 50.0 | 32.9–61.0 | 0.040* | 75.4 | 0.65–0.85 | 0.016* |
| Rubp | 38.3 | 27.6–46.2 | 0.001* | 84.0 | 75.0–97.9 | 0.000** |
| Sucrose synthase | 48.6 | 43.7–52.3 | 0.000** | 63.4 | 55.5–76.8 | 0.000** |
| Thionin | 60.0 | 53.3–61.0 | 0.076 | 59.5 | 55.3–68.4 | 0.231 |
| Ubiquitin | 29.6 | 28.2–31.8 | 0.914 | 94.8 | 91.9–97.0 | 0.197 |
| Zein1 | 61.0 | 61.0–61.0 | 0.460 | 43.6 | 40.6–45.8 | 0.431 |
| Zein2 | 52.8 | 30.3–59.1 | 0.673 | 40.4 | 37.0–43.0 | 0.916 |

^a NA: not available, because data without amino acid variation consists of <61 codons. *p < 0.05; **p < 0.002 (Bonferroni correction for a significance level of 0.05 over 25 independent tests).
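The Bonferroni threshold in the footnote follows from dividing the 0.05 significance level by the 25 tests (0.05 / 25 = 0.002). The short sketch below applies both thresholds to the GCS p values transcribed from Table 3 and recovers the counts reported in the text; the variable names are illustrative only, and a tabulated value of 0.000 is taken at face value although it presumably means p < 0.0005.

```python
# GCS heterogeneity p values transcribed from Table 3 (25 gene families).
gcs_p = {
    "Actin": 0.000, "Amylase": 0.001, "Cab": 0.000, "Calmodulin": 0.000,
    "Catalase": 0.000, "CHS": 0.000, "Chitin": 0.006, "Gapd": 0.001,
    "Glutelin": 0.059, "Glutamine synthetase": 0.000,
    "Gtp-binding protein": 0.000, "Heatshock": 0.000, "Histone H2B": 0.000,
    "Histone H3": 0.004, "Histone H4": 0.231, "Mn-dismutase": 0.068,
    "Peroxidase": 0.000, "Profilin": 0.000, "R-gene": 0.016, "Rubp": 0.000,
    "Sucrose synthase": 0.000, "Thionin": 0.231, "Ubiquitin": 0.197,
    "Zein1": 0.431, "Zein2": 0.916,
}

alpha = 0.05
threshold = alpha / len(gcs_p)  # Bonferroni: 0.05 / 25 = 0.002

nominal = [g for g, p in gcs_p.items() if p < alpha]
corrected = [g for g, p in gcs_p.items() if p < threshold]

print(f"significant at {alpha}: {len(nominal)} of {len(gcs_p)}")
print(f"significant after Bonferroni: {len(corrected)} of {len(gcs_p)}")
```

With these values the counts are 18 and 15, matching the GCS results stated in the text.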


For example, gene conversion within the actin gene family was localized to two sequences from maize; this conversion event was easily visible by eye and was the only conversion event identified in our study that had been reported previously (Moniz de Sa and Drouin 1996). Similarly, gene conversion within the glutamine synthetase gene family was localized to a pair of sequences in rice and a pair of sequences in maize. We could not identify converted sequences for the α-amylase multigene family both because intraspecific tests for conversion were not significant and because MCF and MUF statistics did not identify putative intraspecific sequence pairs.
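To make the logic of a fragment-based permutation test concrete, the sketch below gives a simplified reading of a sum-of-squares statistic on condensed fragments in the spirit of Sawyer (1989). It assumes the input already contains only the polymorphic silent sites (one character per site), treats a fragment as a maximal run of sites at which two sequences agree, and permutes the order of the sites to build a null distribution. The function names, the toy sequences, and the exact fragment definition are assumptions made for illustration, not the implementation used in the study.

```python
import random
from itertools import combinations


def condensed_fragments(a, b):
    """Lengths of maximal runs of sites at which sequences a and b agree."""
    runs, current = [], 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def sscf(seqs):
    """Sum of squared fragment lengths over all pairs of sequences."""
    return sum(n * n
               for a, b in combinations(seqs, 2)
               for n in condensed_fragments(a, b))


def permutation_p(seqs, n_perm=999, seed=0):
    """Share of site-order permutations whose statistic is >= the observed one."""
    rng = random.Random(seed)
    observed = sscf(seqs)
    columns = list(zip(*seqs))  # one tuple per alignment column
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(columns)
        shuffled = ["".join(row) for row in zip(*columns)]
        if sscf(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # count the observed ordering as well


# Hypothetical condensed alignment (polymorphic sites only).
toy = ["ACGTACGTAC", "ACGTACGTTG", "TGCATGCAAC"]
print(sscf(toy), permutation_p(toy))
```

Long shared runs inflate the statistic because fragment lengths are squared, which is why unusually long identical stretches between two paralogues stand out against the permuted orderings.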

Discussion

Multigene families are large components of plant genomes and ultimately serve as the template for molecular adaptation. Despite the importance of plant gene families, there is generally little detailed knowledge of their molecular evolution (Clegg et al. 1997). To our knowledge, this is the only molecular evolutionary study that has investigated more than three plant gene families, and it is the first to apply these molecular evolutionary analyses to a broad array of plant multigene family data.

Most of the 25 gene families in this study exhibit

Fig. 3. Graph of the range of GCS for all sequences within a gene family and for all sequences from a gene family within a particular species. Within-species ranges are based on species for which three or more sequences from the gene family are available. The graph illustrates that the GCS range is often encompassed within a single species and also shows that GCS usually overlaps among species.

Fig. 4. Graph of the range of ENC for all sequences within a gene family and for all sequences from a gene family within a species. Within-species ranges are based on species for which three or more sequences are available from the gene family. The graph illustrates that the range of ENC is often encompassed among paralogues within a single species, and it also shows that the ENC range overlaps among species. Chitinase is not shown because the ENC was incalculable, and zein1 is not shown because there was no variation in ENC among sequences.


significant heterogeneity in dN/dS ratios among evolutionary lineages (Table 1). There are four possible reasons for such heterogeneity. The first reason is that the phylogenetic assumptions used in the analyses are misleading. Nielsen and Yang’s (1998) method requires a single phylogeny for each gene family, and hence our dN/dS results could be incorrect if the NJ phylogenies are incorrect. To investigate this issue, we estimated maximum-parsimony (MP) trees and compared MP and NJ trees. The MP and NJ topologies were identical for 12 of 25 data sets, and hence we believe that our phylogenies are reasonable for these 12 data sets. For 5 of the remaining 13 data sets we repeated the dN/dS analyses using the MP tree. The results from the NJ and the MP topology were qualitatively identical for all five cases, suggesting that changes in tree topology do not dramatically affect dN/dS results. Altogether, then, it does not appear that heterogeneity in dN/dS is an artifact of inaccurate phylogenies.

The second potential reason for heterogeneity of dN/dS ratios is that some gene copies are under selective constraint, while others lack constraint entirely. Levels of selective constraint can be inferred from dN/dS ratios; when dN/dS is similar to 1.0, a gene sequence is evolving without selective constraint on nonsynonymous substitutions relative to synonymous substitutions. A lack of selective constraint is consistent with eventual pseudogenization (Walsh 1995). To determine whether many sequences are evolving in the absence of constraint, we counted the number of phylogenetic branches with dN/dS values close to 1.0. Of a total of 471 branches on 25 phylogenies, only 44 of the branches have dN/dS estimates greater than 0.50. Given the low frequency (9.3%) of branches with dN/dS estimates greater than 0.50, it is clear that few of the sequences in our data set have evolved in the absence of selective constraint (e.g., Fig. 2). Therefore heterogeneity in dN/dS values does not appear to be fueled solely by differences between genes that are evolving with and without selective constraint.
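The 9.3% figure quoted above is simply the share of branches whose dN/dS estimate exceeds 0.50. The few lines below reproduce that arithmetic from the two counts given in the text; the variable names are illustrative.

```python
total_branches = 471  # branches across the 25 phylogenies (from the text)
above_half = 44       # branches with a dN/dS estimate greater than 0.50

share = above_half / total_branches
at_or_below = total_branches - above_half

print(f"{share:.1%} of branches have dN/dS > 0.50")
print(f"{at_or_below} branches have dN/dS <= 0.50")
```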

It is tempting to try to generalize from these results by stating that plant gene duplications are rarely followed by strictly neutral evolution, but the data in this study are subject to ascertainment bias, in that most GenBank data probably represent genes that were sequenced because of known function. Thus, it may not be surprising that we find little evidence for sequences that lack selective constraint. Continued research on the evolution of dN/dS values in gene families needs to be pursued, particularly with gene family data that lack ascertainment biases.

The third potential reason for heterogeneous dN/dS ratios is diversifying selection, but this explanation seems unlikely because we found no evidence that positive selection has affected any of the 25 gene families. This result should be interpreted cautiously because the test of dN/dS > 1.0 detects only a subset of adaptive evolutionary events. Nonetheless, we have applied a powerful method to test for dN/dS > 1.0, and our results suggest that strong diversifying selection is rare in these plant gene families. In this respect, our results parallel the adh gene family of grasses, where the dN/dS ratio was heterogeneous among lineages but not >1.0 (Gaut et al. 1999). To our knowledge, the only plant genes that have been documented to experience diversifying selection are disease defense genes (Meyers et al. 1998; Parniske et al. 1997; Wang et al. 1998) and genes for self-incompatibility (Ioerger et al. 1990; Richman and Kohn 1999). To summarize, heterogeneity in dN/dS among evolutionary lineages appears to be a function of neither completely relaxed selection (i.e., dN/dS = 1.0) nor positive selection (i.e., dN/dS > 1.0).

The final possibility is that variation in dN/dS ratios is fueled by synonymous rate heterogeneity. This must be true to some extent, because synonymous clocks do not hold for 13 of the 25 gene families in this study (Table 2). Synonymous rates for plant nuclear genes have not been studied in great detail (Gaut 1998), and hence the forces contributing to synonymous rate heterogeneity are unclear. In some cases the generation time may affect synonymous substitution rates (Wu and Li 1985; Gaut et al. 1992; Eyre-Walker and Gaut 1997), but the generation time does not adequately explain the rate variation in sequences from these four annual plant species. Other explanations, such as speciation rates (Bousquet et al. 1992), also seem untenable for these data. It is possible that synonymous rate heterogeneity is caused by shifts in mutation rates. Such shifts could occur either in different evolutionary lineages, which could lead to rate differences between orthologues (dePamphilis et al. 1997), or in different genomic regions, which could lead to rate differences between paralogues.

Codon usage likely also contributes to synonymous rate heterogeneity within gene families. Evolutionary rate and codon usage are correlated; highly biased genes generally evolve more slowly, primarily because there is constraint on synonymous nucleotide substitution (Eyre-Walker and Bulmer 1995; Sharp and Li 1987). This correlation has not been examined in detail in plants, and

Table 4. Tests for gene conversion^a

| Gene family | SSCF p value | SSUF p value | Sequences identified by MCF or MUF (species) |
|---|---|---|---|
| Actin | <0.001** | <0.001** | U60513–U60514 (maize) |
| Glutamine synthetase | 0.0142* | 0.076 | D14578–X65926 (maize); X14244–X14245 (rice) |
| GTP-binding protein | <0.001** | 0.159 | X59276–L35845 (rice) |
| Profilin | 0.0635 | 0.0362* | X59276–D13758 (rice) |
| Rubp | 0.0501 | 0.0113* | Y09214–X06535 (maize); M37477–M37328 (wheat); L22155–AF052305 (rice) |
| Amylase | 0.0019** | 0.038 | None identified |

^a See Table 3, footnote a.


such examination requires sequences for which orthology and paralogy are well defined. Nonetheless, it is known that the correlation is imperfect. For example, GCS differs between two grass adh paralogues, but GCS differences were not associated with differences in synonymous substitution rates (Gaut et al. 1996).

In both prokaryotes and eukaryotes, including maize (Fennoy and Bailey-Serres 1993), codon usage is also correlated with gene expression. Correlated shifts in expression and codon usage are likely driven by natural selection for efficient translation of highly expressed genes (Morton 1993; Akashi 1997), but shifts in codon usage could also be driven by mutation pressure among paralogues. We cannot discriminate between mutation and selection as the forces fueling diversification in these gene families. However, our data (Fig. 3), coupled with the fact that maize, rice, and barley have very similar G+C profiles (Carels and Bernardi 2000), suggest that variation in codon usage is due primarily to differences among paralogues within species rather than wholesale G+C differences between species. We conclude that divergence of codon usage is a common phenomenon in plant gene family evolution.

Although gene families deviate from synonymous clock-like evolution, heterogeneity in dN/dS is also a function of nonsynonymous rate variation. Few of the data sets in this study adhere to a nonsynonymous clock. Nonsynonymous rate variation has also been documented in phytochrome genes (Alba et al. 2000; Mathews and Sharrock 1996), adh genes (Gaut et al. 1996; Small et al. 1998), heat-shock proteins (Waters 1995), and MADS-box genes (Purugganan et al. 1995), to name a few. Our results further document that nonsynonymous rate variation is common among gene family members and suggest that dN/dS heterogeneity is a function of both synonymous rate heterogeneity and differing levels of selective constraint on gene family members.

Our final analyses focused on gene conversion. Despite the fact that gene conversion is considered an important aspect of gene family evolution (Basten and Ohta 1992), we found little evidence of gene conversion in our data set. Our results contribute to a growing picture in which gene conversion is common in some high-copy, clustered gene families like rDNA (Zimmer et al. 1988) but either difficult to detect or uncommon in many multigene families. Gene conversion seems to occur primarily in gene clusters. For example, the rbcS gene family is known to undergo gene conversion (Meagher et al. 1989), probably in a hierarchical pattern wherein genes that are close physically undergo conversion more often than physically separated genes (Clegg et al. 1997). Little is known, however, about the physical length over which conversion is effective. In addition, gene conversion may be rare even when it is detectable. For example, the actin gene family contains clear examples of gene conversion between sequences, yet only two gene conversion events were found among a sample of 53 sequences (Moniz de Sa and Drouin 1996).

Care must be exercised before broad generalizations can be made about gene family evolution, because a complete characterization ultimately calls for many types of data (e.g., estimates of copy number among species, physical mapping, knowledge of pseudogenes, and comparative gene expression data) in addition to molecular evolutionary analyses like those performed here. Nonetheless, this survey has identified four general features of the molecular evolution of plant gene families that have not been described in detail previously. First, dN/dS varies commonly during the evolution of gene families. Such variation does not appear to be a consequence of either positive selection or a complete lack of selective constraint. Second, variation in dN/dS among lineages reflects variation in both nonsynonymous and synonymous rates. Third, paralogues within species vary in codon use for some gene families, but it remains to be seen whether such variation is fueled by selection or mutation. Finally, few gene families retain evidence of gene conversion events, suggesting that gene conversion is either infrequent or difficult to detect.

Acknowledgments. The authors would like to thank L. Eguiarte, P. Tiffin, M. Le Theirry d’Ennequin, M. Sawkins, S.V. Muse, and A. Peek for comments, and we are particularly grateful for the comments of two anonymous reviewers. This work was supported by the NSF (DBI-9872631 and NSF DEB-9996118) and the USDA (98-35301-6153).

References

Akashi H (1997) Codon bias evolution in Drosophila: Population genetics of mutation-selection-drift. Gene 205:269–278

Alba R, Kelmenson PM, Cordonnier-Pratt M-M, Pratt LH (2000) The phytochrome gene family in tomato and the rapid differential evolution of this family in angiosperms. Mol Biol Evol 17:362–373

Basten CJ, Ohta T (1992) Simulation study of a multigene family, with special reference to the evolution of compensatory advantageous mutations. Genetics 132:247–252

Bennett MD, Leitch IJ (1995) Nuclear DNA amounts in angiosperms. Ann Bot 76:113–176

Bennetzen JL, Kellogg EA (1997) Do plants have a one-way ticket to genomic obesity? Plant Cell 9:1509–1514

Bousquet J, Strauss SH, Doerksen AH, Price RA (1992) Extensive variation in evolutionary rates of rbcL gene sequences among seed plants. Proc Natl Acad Sci USA 89:7844–7848

Bulmer M (1990) The effect of context on synonymous codon usage in genes with low codon usage bias. Nucleic Acids Res 18:2869–2873

Carels N, Bernardi G (2000) Two classes of genes in plants. Genetics 154:1819–1825

Carels N, Hatey P, Jabbari K, Bernardi G (1998) Compositional properties of homologous coding sequences from plants. J Mol Evol 46:45–53

Clegg MT, Cummings MP, Durbin ML (1997) The evolution of plant nuclear genes. Proc Natl Acad Sci USA 94:7791–7798

Comeron JM, Aguade M (1998) An evaluation of measures of synonymous codon usage bias. J Mol Evol 47:268–274

dePamphilis CW, Young ND, Wolfe AD (1997) Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: Many losses of photosynthesis and complex patterns of rate variation. Proc Natl Acad Sci USA 94:7367–7372

Drouin G, Prat F, Ell M, Clarke GDP (1999) Detecting and characterizing gene conversions between multigene family members. Mol Biol Evol 16:1369–1390

Eyre-Walker A, Bulmer M (1995) Synonymous substitution rates in enterobacteria. Genetics 140:1407–1412

Eyre-Walker A, Gaut BS (1997) Correlated rates of synonymous site evolution among plant genomes. Mol Biol Evol 14:455–460

Felsenstein J (1990) PHYLIP manual. University Herbarium, University of California, Berkeley

Fennoy SL, Bailey-Serres J (1993) Synonymous codon usage in Zea mays L. nuclear genes is varied by levels of C and G-ending codons. Nucleic Acids Res 21:5294–5300

Ferris S, Whitt G (1977) Loss of duplicated gene expression after polyploidization. Nature 265:258–260

Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J (1999) Preservation of duplicate genes by complementary degenerative mutations. Genetics 151:1531–1545

Gaut BS (1998) Molecular clocks and nucleotide substitution rates in higher plants. Evol Biol 30:93–120

Gaut BS, Muse SV, Clark WD, Clegg MT (1992) Relative rates of nucleotide substitution at the rbcL locus of monocotyledonous plants. J Mol Evol 35:292–303

Gaut BS, Morton BR, McCaig BM, Clegg MT (1996) Substitution rate comparisons between grasses and palms: Synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc Natl Acad Sci USA 93:10274–10279

Gaut BS, Peek AS, Morton BR, Clegg MT (1999) Patterns of genetic diversification within the Adh gene family in the grasses (Poaceae). Mol Biol Evol 16:1086–1097

Hughes AL (1994) The evolution of functionally novel proteins after gene duplication. Proc Roy Soc Lond B Biol 256:119–124

Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167–170

Hughes MK, Hughes AL (1993) Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol 10:1360–1369

Ioerger TR, Clark AG, Kao T-H (1990) Polymorphism at the self-incompatibility locus in Solanaceae predates speciation. Proc Natl Acad Sci USA 87:9732–9735

Jansson S, Meyer-Gauen G, Cerff R, Martin W (1994) Nucleotide distribution in gymnosperm nuclear sequences suggests a model for GC-content change in land plant nuclear genomes. J Mol Evol 39:34–46

Kellogg EA, Appels R (1995) Intraspecific and interspecific variation in 5S RNA genes are decoupled in diploid wheat relatives. Genetics 140:325–343

Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, Cambridge

Kimura M, Ohta T (1974) On some principles governing molecular evolution. Proc Natl Acad Sci USA 71:2848–2852

Li WH, Gojobori T (1983) Rapid evolution of goat and sheep globin genes following gene duplication. Mol Biol Evol 1:94–108

Lynch M, Force A (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473

Mathews S, Sharrock RA (1996) The phytochrome gene family in grasses (Poaceae): A phylogeny and evidence that grasses have a subset of loci found in dicot angiosperms. Mol Biol Evol 13:1141–1150

Meagher RB, Berry-Lowe S, Rice K (1989) Molecular evolution of the small subunit of ribulose bisphosphate carboxylase: nucleotide substitution and gene conversion. Genetics 123:845–863

Meyers BC, Shen KA, Rohani P, Gaut BS, Michelmore RW (1998) Receptor-like genes in the major resistance locus of lettuce are subject to divergent selection. Plant Cell 10:1833–1846

Meyers BC, Dickerman AW, Michelmore RW, Sivaramakrishnan S, Sobral BW, Young ND (1999) Plant disease resistance genes encode members of an ancient and diverse protein family within the nucleotide-binding superfamily. Plant J 20:317–332

Moniz de Sa M, Drouin G (1996) Phylogeny and substitution rates of angiosperm actin genes. Mol Biol Evol 13:1198–1212

Morton BR (1993) Chloroplast DNA codon use: Evidence for selection at the psbA locus based on transfer RNA availability. J Mol Evol 37:273–280

Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11:715–724

Nei M, Roychoudhury AK (1973) Probability of fixation of nonfunctional genes at duplicate loci. Am Nat 107:362–372

Nielsen R, Yang ZH (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936

Ohno S (1970) Evolution by gene duplication. Springer-Verlag, Heidelberg

Parniske M, Hammond-Kosack KE, Golstein C, Thomas CM, Jones DA, Harrison K, Wulff BBH, Jones JDG (1997) Novel disease resistance specificities result from sequence exchange between tandemly repeated genes at the Cf-4/9 locus of tomato. Cell 91:821–832

Purugganan MD, Rounsley SD, Schmidt RJ, Yanofsky MF (1995) Molecular evolution of flower development: diversification of the plant MADS-box regulatory gene family. Genetics 139:345–356

Richman AD, Kohn JR (1999) Self-incompatibility alleles from Physalis: Implications for historical inference from balanced genetic polymorphisms. Proc Natl Acad Sci USA 96:168–172

Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

Sang T, Donoghue MJ, Zhang D (1997) Evolution of alcohol dehydrogenase genes in peonies (Paeonia): Phylogenetic relationships of putative nonhybrid species. Mol Biol Evol 14:994–1007

Sawyer S (1989) Statistical tests for detecting gene conversion. Mol Biol Evol 6:526–538

Sharp PM, Li WH (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24:28–38

Sharp PM, Li W-H (1987) The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol 4:222–230

Small RL, Ryburn JA, Cronn RC, Seelanan T, Wendel JF (1998) The tortoise and the hare: choosing between noncoding plastome and nuclear adh sequences for phylogeny reconstruction in a recently diverged plant group. Am J Bot 85:1301–1315

Stebbins GL (1987) Grass systematics and evolution: past, present and future. In: Soderstrom TR, Hilu KH, Campbell CS, Barkworth ME (eds) Grass systematics and evolution. Smithsonian Institution Press, Washington, DC, pp 359–367

Takahata N, Maruyama T (1979) Polymorphism and loss of duplicate gene expression: A theoretical study with application to tetraploid fish. Proc Natl Acad Sci USA 76:4521–4525

Tamura K, Nei M (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10:512–526

Thomasson JR (1987) Fossil grasses: 1820–1987. In: Soderstrom TR, Hilu KH, Campbell CS, Barkworth ME (eds) Grass systematics and evolution. Smithsonian Institution Press, Washington, DC, pp 159–167

Tropf S, Lanz T, Rensing SA, Schroder J, Schroder G (1994) Evidence that stilbene synthases have developed from chalcone synthases several times in the course of evolution. J Mol Evol 38:610–618

Walsh JB (1995) How often do duplicated genes evolve new function? Genetics 139:439–444

Wang G-L, Ruan D-L, Song W-Y, Sideris S, Chen L, Pi L-Y, Zhang S, Zhang Z, Fauquet C, Gaut BS, Whalen MC, Ronald PC (1998) Xa21D encodes a receptor-like molecule with a leucine rich repeat domain that determines race-specific recognition and is subject to adaptive evolution. Plant Cell 10:765–779

Waters ER (1995) The molecular evolution of the small heat-shock proteins in plants. Genetics 141:785–795

Wright F (1990) The ‘effective number of codons’ used in a gene. Gene 87:23–29

Wu C-I, Li W-H (1985) Evidence for higher rates of nucleotide substitution in rodents than in man. Proc Natl Acad Sci USA 82:1741–1745

Yang Z (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556

Zimmer EA, Jupe ER, Walbot V (1988) Ribosomal gene structure, variation and inheritance in maize and its ancestors. Genetics 120:1125–1136
