
www.sciencemag.org/cgi/content/full/science.1203810/DC1

Supporting Online Material for

The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants

Jo Ann Banks,* Tomoaki Nishiyama, Mitsuyasu Hasebe, John L. Bowman, Michael Gribskov, Claude dePamphilis, Victor A. Albert, Naoki Aono, Tsuyoshi Aoyama, Barbara A. Ambrose, Neil W. Ashton, Michael J. Axtell, Elizabeth Barker, Michael S. Barker, Jeffrey L. Bennetzen, Nicholas D. Bonawitz, Clint Chapple, Chaoyang Cheng, Luiz Gustavo Guedes Correa, Michael Dacre, Jeremy DeBarry, Ingo Dreyer, Marek Elias, Eric M. Engstrom, Mark Estelle, Liang Feng, Cédric Finet, Sandra K. Floyd, Wolf B. Frommer, Tomomichi Fujita, Lydia Gramzow, Michael Gutensohn, Jesper Harholt, Mitsuru Hattori, Alexander Heyl, Tadayoshi Hirai, Yuji Hiwatashi, Masaki Ishikawa, Mineko Iwata, Kenneth G. Karol, Barbara Koehler, Uener Kolukisaoglu, Minoru Kubo, Tetsuya Kurata, Sylvie Lalonde, Kejie Li, Ying Li, Amy Litt, Eric Lyons, Gerard Manning, Takeshi Maruyama, Todd P. Michael, Koji Mikami, Saori Miyazaki, Shin-ichi Morinaga, Takashi Murata, Bernd Mueller-Roeber, David R. Nelson, Mari Obara, Yasuko Oguri, Richard G. Olmstead, Naoko Onodera, Bent Larsen Petersen, Birgit Pils, Michael Prigge, Stefan A. Rensing, Diego Mauricio Riaño-Pachón, Alison W. Roberts, Yoshikatsu Sato, Henrik Vibe Scheller, Burkhard Schulz, Christian Schulz, Eugene V. Shakirov, Nakako Shibagaki, Naoki Shinohara, Dorothy E. Shippen, Iben Sørensen, Ryo Sotooka, Nagisa Sugimoto, Mamoru Sugita, Naomi Sumikawa, Milos Tanurdzic, Gunter Theißen, Peter Ulvskov, Sachiko Wakazuki, Jing-Ke Weng, William W.G.T. Willats, Daniel Wipf, Paul G. Wolf, Lixing Yang, Andreas D. Zimmer, Qihui Zhu, Therese Mitros, Uffe Hellsten, Dominique Loqué, Robert Otillar, Asaf Salamov, Jeremy Schmutz, Harris Shapiro, Erika Lindquist, Susan Lucas, Daniel Rokhsar, and Igor V. Grigoriev

*To whom correspondence should be addressed. E-mail: [email protected]

Published 5 May 2011 on Science Express

DOI: 10.1126/science.1203810

This PDF file includes:

SOM Text
Figs. S1 to S14
Tables S1 to S8
References


Table of Contents

Sequencing and Assembly
Genomic DNA sequencing and assembly
Table S1. Assembly statistics of the Selaginella genome
Table S1.1. Inputs to the assembly
Table S1.2. Inputs to the assembly
cDNA library construction, sequencing and EST assembly
Taxonomy mapping using MEGAN
Figure S1. MEGAN analysis of the Selaginella genome
Genome Annotation
Intron analysis
Table S2. Summary of Selaginella, Arabidopsis and Physcomitrella genome features
Transposable element discovery and description
Table S3. Summary of transposable element copy number, coverage and occupation by their superfamily in Selaginella
Polyploidy and Duplicate Gene Analyses
Figure S2. Age distribution of Selaginella duplicate genes
Table S4. Fitted values of age distribution of gene duplication with mixture model
Small RNAs
Table S5. Annotation of Selaginella MIRNAs
Figure S3. Orphan small RNAs
Figure S4. Hotspots of orphan sRNA accumulation in Selaginella
Selaginella homologs of small RNA biogenesis and function genes
Figure S5. Phylogenies of sRNA biogenesis and function genes
Finding Gene Families by Gene Clustering
Phylogenetic analyses of genes involved in Arabidopsis development
Table S6. The number of land plant orthologues in Arabidopsis, rice, Selaginella and Physcomitrella
Metabolic pathways
P450s
Table S7. The number of genes per P450 clan in Physcomitrella, Selaginella and Arabidopsis
Figure S6. Phylogeny of Selaginella CYP51 clan P450s
Figure S8. Phylogenies of Selaginella CYP72 clan (left) and CYP74 clan (right) P450s
Figure S10. Phylogeny of Selaginella CYP86 clan P450s
Figure S11. Phylogeny of Selaginella CYP97 clan P450s
BAHD acyltransferases
Figure S12. Cladogram of BAHD acyltransferases
Terpene Synthases
Figure S13. Phylogeny of plant terpene synthases
The plastome
Figure S14. Map of the Selaginella plastid genome
PPR gene family
Table S8. PPR gene numbers and number of RNA editing sites in organellar genomes
References


Sequencing and Assembly

Genomic DNA sequencing and assembly.

DNA was isolated from purified nuclei (27) of clonally propagated Selaginella plants grown in soil in a greenhouse under 65% shadecloth. The original Selaginella ramet was obtained from Plant Delights nursery (http://www.plantdelights.com/). Raw sequences were generated on ABI3730 sequencers from several whole genome shotgun libraries (Tables S1.1-S1.2) prepared as described in (28). The Selaginella genome sequence assembly v1.0 was built with the Arachne assembler from whole genome shotgun paired-end sequencing reads (29) (Tables S1.1-S1.2). Two haplotypes showing substantial DNA polymorphism (>1.5%) were evident after assembly, with many genes represented by two alleles. For this reason, a 14-fold redundant coverage of the haploid genome was produced and stringently assembled to separate the haplotypes. Assembly statistics are shown in Table S1.

The assembly was built with Arachne V.2007201 with parameters “correct1_passes=0, maxcliq1=100, BINGE_AND_PURGE=True”. The aim of these parameters was to treat Selaginella as though it were a haploid genome and to split the haplotypes into individual scaffolds, instead of interleaving the haplotypes for the same alleles within the same scaffold. To achieve this, we turned off error correction (which can converge the read sequences into one haplotype) and turned on binging and purging, which cleans out unanchored repetitive read pairs and then places these read pairs back in the best-anchored location. For repeat sequences, this sorts discrepant copies of repeats into their appropriate locations and ensures that unanchored repetitive sequence is removed from the assembly. For Selaginella, the remaining Arachne parameters were left at their defaults, which is the same procedure we use to assemble haploid genomes with Arachne.

Table S1. Assembly statistics of the Selaginella genome.

Nuclear genome assembly size (Mbp)*       212.6
Contig sequence total (Mbp) / % gap*      208.8 / 1.9% gap
Sequencing read coverage depth            7.0x
# of contigs                              5,169
# of nuclear scaffolds                    768
# of nuclear scaffolds >2 Kbp             726
Nuclear scaffold N/L50                    38 / 1.7 Mbp
Contig N/L50                              515 / 119.8 Kb

*While the estimated genome size is 110 Mbp, the total assembly composed of two divergent haplotypes represents 212 Mbp of genomic sequence. The statistics above include only nuclear scaffolds 1 kb or longer.
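The "N/L50" entries in Table S1 can be illustrated with a minimal sketch. It assumes the convention that the table's values imply: N50 is the number of sequences, taken longest first, needed to reach half of the total length, and L50 is the length of the last sequence counted. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n_l50(lengths):
    """Return (N50 count, L50 length) for a list of sequence lengths.

    Assumed convention: sort longest first and accumulate until at least
    half of the total length is covered.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths in bp (hypothetical, not real assembly data).
toy_lengths = [500, 400, 300, 200, 100]
print(n_l50(toy_lengths))
```

Under this convention, a small N50 count paired with a large L50 length indicates that a few long scaffolds carry most of the assembly.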


Table S1.1. Inputs to the assembly.

Library  Insert size (bp)  Reads      Failed (%)      Vector  Unpaired (%)   Paired (%)         Goodp20  Stddev
ASXW     3,200             791,953    39,800 (5.0)    10,665  35,288 (4.5)   706,200 (89.2)     782.11   +/-118.14
BWOG     3,200             468,938    55,643 (11.9)   10,525  30,326 (6.5)   372,444 (79.4)     717.57   +/-137.39
BXZC     5,900             357,317    33,867 (9.5)    25,041  12,911 (3.6)   285,498 (79.9)     703.62   +/-110.63
BHBS     6,500             87,639     5,986 (6.8)     1,011   1,692 (1.9)    78,950 (90.1)      759.67   +/-113.48
ASXX     6,800             703,194    40,549 (5.8)    19,553  13,320 (1.9)   629,772 (89.6)     745.38   +/-113.05
BGZC     8,200             107,429    10,671 (9.9)    2,363   3,741 (3.5)    90,654 (84.4)      741.79   +/-122.96
ASXY     36,200            171,840    19,545 (11.4)   1,995   12,720 (7.4)   137,580 (80.1)     682.55   +/-154.76
FAHY     39,200            43,776     9,940 (22.7)    345     2,805 (6.4)    30,686 (70.1)      679.88   +/-150.12
Total    N/A               2,732,086  216,001 (7.9)   71,498  112,803 (4.1)  2,331,784 (85.3)   742.06   +/-126.54

Table S1.2. Inputs to the assembly.

Min scaffold length  # Scaffolds  # Contigs  Total scaffold length  Total contig length  Scaffold contig coverage
All                  768          5,169      212,761,159            208,782,310          98.13%
1 kb                 751          5,152      212,749,482            208,770,633          98.13%
2.5 kb               728          5,128      212,712,244            208,734,126          98.13%
5 kb                 510          4,794      211,849,451            207,914,181          98.14%
10 kb                345          4,525      210,696,261            206,797,504          98.15%
25 kb                242          4,338      209,234,724            205,372,357          98.15%
50 kb                217          4,259      208,287,987            204,522,614          98.19%
100 kb               190          4,113      206,392,259            202,812,135          98.27%
250 kb               157          3,888      200,551,682            197,152,372          98.31%
500 kb               118          3,513      186,475,478            183,472,480          98.39%
1 Mbp                70           2,715      150,746,097            148,565,091          98.55%
2.5 Mbp              16           1,023      62,107,177             61,354,061           98.79%
5 Mbp                2            228        13,133,933             13,005,222           99.02%

cDNA library construction, sequencing and EST assembly. Total RNA was isolated from pooled Selaginella roots, shoots and strobili with the RNeasy Plant Mini Kit (QIAGEN). Poly A+ RNA was isolated from total RNA using the Absolutely mRNA Purification kit (Stratagene). cDNA synthesis and cloning followed a modified procedure based on the “SuperScript plasmid system with Gateway technology for cDNA synthesis and cloning” (Invitrogen). 1-2 μg of poly A+ RNA, reverse transcriptase SuperScript II (Invitrogen) and oligo dT-NotI primer (5' GACTAGTTCTAGATCGCGAGCGGCCGCCC(T)15VN 3') were used to synthesize first strand cDNA. Second strand synthesis was performed with E. coli DNA ligase, polymerase I and RNaseH, followed by end repair using T4 DNA polymerase. The SalI adaptor (5' TCGACCCACGCGTCCG and 5' CGGACGCGTGGG) was ligated to the cDNA, which was then digested with NotI (NEB) and size selected by gel electrophoresis (1.1% agarose). The cDNA inserts were directionally ligated into the SalI and NotI digested vector pCMVsport6 (Invitrogen) or pMCL200cDNA. The ligation was transformed into ElectroMAX T1 DH10B cells (Invitrogen).

Library quality was first assessed by randomly selecting 24 clones and PCR amplifying the cDNA inserts with the primers M13-F (5' GTAAAACGACGGCCAGT) and M13-R (5' AGGAAACAGCTATGACCAT) to determine the fraction of insertless clones. Colonies from each library were plated onto agarose plates (254 mm plates [Teknova, Hollister, CA]) at a density of approximately 1,000 colonies per plate. Plates were grown at 37°C for 18 hours, then individual colonies were picked and each used to inoculate a well containing LB media with appropriate antibiotic in a 384 well plate. Clones in 384 well plates were grown at 37°C for 18 hours. Plasmid DNA for sequencing was produced by rolling circle amplification (30) with Templiphi (GE Healthcare). Subclone inserts were sequenced from both ends using primers complementary to the flanking vector sequence (for pCMVsport6, Fw: 5' ATTTAGGTGACACTATAGAA, Rv: 5' TAATACGACTCACTATAGGG; for pMCL200cDNA, Fw: 5' AGGAAACAGCTATGACCAT, Rv: 5' GTTTTCCCAGTCACGACGTTGTA) and Big Dye terminator chemistry, then run on ABI 3730 instruments (ABI, Foster City, CA). A total of 108,287 expressed sequence tags (ESTs) were sequenced.

The ESTs (108,287 from JGI plus 2,187 from Genbank) were processed through the JGI EST pipeline (ESTs were generated in pairs, a 5' and a 3' end read from each cDNA clone). The Phred software (31, 32) was used to call the bases and generate quality scores. Vector, linker, adapter, poly-A/T and other artifact sequences were removed using the Cross_match software (31, 32) and an internally developed short pattern finder. Low quality regions of each read were identified using internally developed software, masking regions with a combined quality score of less than 15. The longest high quality region of each read was considered the EST. ESTs shorter than 150 bp were removed from the data set. ESTs containing common contaminants such as E. coli, common vectors and sequencing standards were also removed from the data set.

EST clustering was performed ab initio, based on alignments between pairs of trimmed, high quality ESTs. Pairwise EST alignments were generated using internally developed alignment software. ESTs sharing an alignment of at least 98% identity were assigned to the same cluster. Furthermore, ESTs not sharing alignments were assigned to the same cluster if they were derived from the same cDNA clone. Clusters of ESTs were assembled into consensus sequences, contigs or singlets using CAP3 (33). A total of 19,871 assembled consensus sequences were generated.
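The two clustering rules stated above (a shared alignment of at least 98% identity, or a shared cDNA clone) amount to single-linkage clustering. The following is a minimal sketch using a union-find structure; it is not the internally developed JGI software, and the input layout, function name and EST identifiers are hypothetical. Every EST named in an alignment is assumed to also appear in `est_clone`.

```python
def cluster_ests(est_clone, alignments, min_identity=98.0):
    """Single-linkage clustering of ESTs.

    est_clone  -- dict mapping EST id -> cDNA clone id (assumed input layout)
    alignments -- iterable of (est_a, est_b, percent_identity)
    Returns a sorted list of sorted clusters.
    """
    parent = {est: est for est in est_clone}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Rule 1: ESTs sharing an alignment at or above the identity threshold.
    for est_a, est_b, identity in alignments:
        if identity >= min_identity:
            union(est_a, est_b)

    # Rule 2: ESTs derived from the same cDNA clone.
    first_seen = {}
    for est, clone in est_clone.items():
        if clone in first_seen:
            union(first_seen[clone], est)
        else:
            first_seen[clone] = est

    clusters = {}
    for est in est_clone:
        clusters.setdefault(find(est), set()).add(est)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical example: two end reads of clone c1, one of which aligns to e2.
ests = {"e1.5p": "c1", "e1.3p": "c1", "e2.5p": "c2", "e3.5p": "c3"}
hits = [("e1.3p", "e2.5p", 99.1), ("e2.5p", "e3.5p", 95.0)]
print(cluster_ests(ests, hits))
```

Because linkage is transitive, the 5' read of clone c1 joins the same cluster as e2 even though the two reads share no alignment.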


Taxonomy mapping using MEGAN. Genomic scaffolds were cut in silico into 2000 nt subsequences and then subjected to BLAST (34). MEGAN (35) was then used to map the BLAST hits onto the NCBI taxonomy to summarize and order the results (Fig. S1). The sequences yielding hits to Metazoa and Fungi are Selaginella transposable element regions, and such hits are also observed in Physcomitrella. There are significant hits to Bacteria, especially Bacillus selenitireducens, with many hits to two proteins (Genbank: EDP81118.1 and EDP81119.1). These two proteins are located on a single contig (Genbank: ABHZ01000025) in the B. selenitireducens genome draft. The corresponding regions in the Selaginella assembly appear to be always intergenic and related to LTR retrotransposons. The fact that only two putative B. selenitireducens proteins account for all of these ~2000 hits suggests that the B. selenitireducens genome draft in fact contains Selaginella sequences. Taken together, these results suggest that there is no obvious bacterial contamination in the Selaginella genome assembly.
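The in silico cutting step can be sketched as follows. This is an illustration only: the text does not state whether windows overlapped or how a final piece shorter than 2000 nt was handled, so the sketch assumes non-overlapping windows and keeps the short remainder. The function name and identifier format are hypothetical.

```python
def split_scaffold(name, sequence, size=2000):
    """Yield (subsequence id, subsequence) for consecutive windows of `size` nt.

    Assumptions: windows do not overlap, and the final window may be shorter.
    Identifiers use 1-based inclusive coordinates.
    """
    for start in range(0, len(sequence), size):
        chunk = sequence[start:start + size]
        yield f"{name}:{start + 1}-{start + len(chunk)}", chunk


# Hypothetical 4,500 nt scaffold.
for piece_id, piece in split_scaffold("scaffold_demo", "A" * 4500):
    print(piece_id, len(piece))
```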

Figure S1. MEGAN analysis of the Selaginella genome.

BLAST hits of genomic scaffolds against NCBI Genpept were mapped to NCBI taxonomy using MEGAN. The sizes of the circles are proportional to the number of sequences assigned to the corresponding taxon. The numbers on the right side are the number of Selaginella subsequences yielding a significant hit.


Genome Annotation

The diploid Selaginella genome was annotated using the JGI Annotation pipeline, which combines several gene prediction, annotation and analysis tools. Gene predictors used for annotation include Fgenesh, Fgenesh+ (36) and Genewise (37). Over 110,000 Selaginella ESTs were clustered, converted into putative full-length (FL) genes and directly mapped to genomic sequences or used to extend predicted gene models into FL genes by adding 5' and/or 3' UTRs to the models. Since multiple gene models were generated for each locus, a single representative model was chosen based on homology and EST support and used for further analysis. All predicted gene models were functionally annotated by homology to annotated genes in the NCBI non-redundant set. Each gene model was analyzed for domain content and structure using InterproScan (38) and classified according to Gene Ontology (39), eukaryotic orthologous groups (KOGs) (40) and KEGG metabolic pathways (41). These predictions and annotations are available from the JGI Genome Portal at http://www.jgi.doe.gov/Selaginella.

In total, 34,292 gene models (referred to as JGI Filtered Models2) were predicted from the Selaginella assembly, 37% of which are supported by ESTs. To enable whole-genome analysis of the genes of a single haplotype, we used the Dagchainer program (42) to identify duplicated regions in the Selaginella genome. From the list of duplicated regions (scaffolds), gene models with pairwise percent identity of >90% were selected. Gene models from the larger scaffold of each pair were extracted to yield a set of 22,285 non-redundant models, which represents a single haplotype and is collectively referred to as JGI Filtered Models3. 33% of the non-redundant genes are supported by ESTs. All gene models were deposited at NCBI so that both alleles can be retrieved by blast searches.

A blast search limited to the 22,285 JGI Filtered Models3 gene models can be performed by accessing the JGI website http://genome.jgi-psf.org/cgi-bin/runAlignment?db=Selmo1&advanced=1 and choosing from the database dropdown menu “Selaginella moellendorffii V1.0 non-redundant filtered model proteins (Selmo1_GeneModels_FilteredModels3_aa)” for blastx, blastp or blatp searches, or “Selaginella moellendorffii V1.0 non-redundant filtered model transcripts (Selmo1_GeneModels_FilteredModels3_nt)” for blast, blastn or tblastn searches. A summary comparison of genome features in Selaginella, Arabidopsis and Physcomitrella is shown in Table S2.
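The reduction from all gene models to a single-haplotype set can be sketched as below. This is a simplified illustration of the stated rule (for allelic pairs with >90% identity, keep the model on the larger scaffold), not the JGI procedure itself. The data layout and names are hypothetical, and the sketch assumes that models without a qualifying partner are retained.

```python
def non_redundant_models(model_scaffold, scaffold_len, pairs, min_identity=90.0):
    """Pick one model per allelic pair, preferring the larger scaffold.

    model_scaffold -- dict model id -> scaffold id (assumed layout)
    scaffold_len   -- dict scaffold id -> scaffold length in bp
    pairs          -- iterable of (model_a, model_b, percent_identity) taken
                      from duplicated (syntenic) regions
    """
    dropped = set()
    for model_a, model_b, identity in pairs:
        if identity <= min_identity:
            continue  # not treated as alleles of the same gene
        len_a = scaffold_len[model_scaffold[model_a]]
        len_b = scaffold_len[model_scaffold[model_b]]
        # Ties keep model_a; the source does not say how ties were resolved.
        dropped.add(model_b if len_a >= len_b else model_a)
    return sorted(m for m in model_scaffold if m not in dropped)


# Hypothetical example with two scaffolds carrying alternative haplotypes.
models = {"g1": "scafA", "g1_alt": "scafB", "g2": "scafA", "g3": "scafB"}
lengths = {"scafA": 5_000_000, "scafB": 1_000_000}
allelic = [("g1", "g1_alt", 97.5), ("g2", "g3", 85.0)]
print(non_redundant_models(models, lengths, allelic))
```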

Intron analysis.

There are on average 5.7 introns per spliced gene, with a median of 4 per spliced gene, a minimum of 1, and a maximum of 78. The introns are A+T and pyrimidine rich compared to exons: introns were 57% A+T and 54% pyrimidine, while exons were 48% A+T and 47% pyrimidine. Both were biased compared to the full genomic assembly, which was 50% A+T and 50% pyrimidine. Of the 102,360 gaps between exons, 99.4% (101,715/102,360) had canonical splice donor-acceptor sequences (GT-AG). 0.14% (142 cases) had CT-AC, the reverse complement of GT-AG, suggesting gene models with the wrong orientation. 0.31% (318 cases) had GC-AG, a signal known to exist as a low-frequency splice signal variant in other eukaryotes. All other possible pairs had < 10 occurrences.
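A tally of donor-acceptor pairs like the one reported above can be sketched in a few lines. The sketch takes the first two and last two bases of each intron as the donor and acceptor, and also shows why a CT-AC pair points to a model on the wrong strand: it is what a GT-AG intron looks like when read from the opposite strand. The function names and toy sequences are illustrative and are not from the annotation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string containing only A, C, G and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def splice_site_counts(introns):
    """Count donor-acceptor dinucleotide pairs across intron sequences."""
    counts = Counter()
    for seq in introns:
        seq = seq.upper()
        if len(seq) >= 4:  # need two bases at each end
            counts[f"{seq[:2]}-{seq[-2:]}"] += 1
    return counts


# Toy introns (hypothetical): two canonical, one GC-AG, one reversed GT-AG.
toy_introns = ["GTAAAG", "gtccag", "GCTTAG", revcomp("GTAAAG")]
for pair, n in splice_site_counts(toy_introns).most_common():
    print(pair, n)
```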

Table S2. Summary of Selaginella, Arabidopsis and Physcomitrella genome features.

Feature                                   Selaginella   Arabidopsis   Physcomitrella
1C genome size (Mbp)                      110           134 (43)      480 (28)
Protein encoding gene number              22,285        26,207 (44)   27,949 (45)
Average intron length (bp)                103           164 (44)      310 (28)
Median 3' UTR length (bp)                 143           205           351
Average CDS length (bp)                   1,276         1,211         1,137
Gene density (kb per gene)                4.94          4.50 (44)     13.33
1N chromosome number                      10 (46)       5 (44)        27 (47)
Whole genome duplications (paleoploidy)   0             3 (48, 49)    1 (50)
Transposable element content              37.5%         15% (43)      50% (28)

Transposable element discovery and description.

Methods. Due to the lack of a close relative among the sequenced and annotated plant genomes, and the rapid rate of TE evolution and sequence change, the search for transposons in Selaginella emphasized structural rather than homology criteria. Long terminal repeat (LTR) retrotransposons were sought using the programs LTR_STRUC (51) and LTR_FINDER (52). The LTR retrotransposon candidates from these two programs were combined to remove redundancy. An HMMer (http://hmmer.janelia.org) search for typical retrotransposon protein domains (GAG, PR, INT and RT) was used to further filter candidates. The intact element candidates with at least one retrotransposon protein domain were judged true LTR retrotransposons. The 573 candidate LTR retrotransposons without any identified LTR retrotransposon protein domain were manually inspected for the presence of appropriate termini and the presence of a PBS (primer binding site) and/or polypurine tract (PPT) (53). This manual analysis confirmed that 524 of these 573 candidates were true LTR retrotransposons. Once intact elements were identified and further confirmed by inspection of their termini for appropriate target site duplications (TSDs), the predicted intact elements were employed in homology searches to find fragments of elements. The approximate insertion times for largely intact LTR retrotransposons were calculated (54) using an estimated sequence divergence rate of 1.3 × 10^-8 nucleotide changes per bp per year (55).

Helitrons, a class of DNA transposons (56), were sought using a structure-based approach described previously (57). Once candidate Helitrons were identified, all members of each predicted family (defined by a unique 3' end sequence) were manually aligned to confirm their Helitron nature. Hence, only families with two or more members could be confirmed as Helitrons by this structural approach. Gene fragments within Helitrons were sought by BLASTX search. Candidate gene fragments were only considered real if they showed high similarity (an Expect value <1E-8) to known genes in the NCBI nr protein database (BLASTX).

Two classes of retroelements, LINEs and SINEs, and the cut-and-paste DNA TEs were sought by a combination of homology criteria (using very low required homology values) and investigations of aligned repeats identified by the RECON program (58). For the RECON pipeline, an all-versus-all comparison was first performed using WU-BLASTN 2.0 (http://blast.wustl.edu) with options M=5 N=-11 Q=22 R=11 -kap E=0.00001 wordmask=dust wordmask=seg maskextra=20 -hspmax 5000. As a comparison, we also ran BLASTN searches with a required Expect value of <1E-10. The BLAST output files were then filtered by three criteria. The first criterion was the removal of “self-hits” (i.e., where the query sequence name is the same as the subject sequence name). Second, only those repetitive sequences with a >225 bit score (59) and over 100 bp of homology were considered to be meaningful for this repetitive sequence analysis. Third, only those repeats with more than 10 copies were chosen.

Homology of the cut-and-paste DNA transposons was sought to the hAT, CACTA, MULE, Tc1/mariner, PIF/Harbinger and P element superfamilies (53). Transposase protein and nucleotide sequences were collected and all protein-encoding sequences aligned with ClustalW version 2.0 (60) to find signatures of conserved domains. For TBLASTN and BLASTP searches, all transposase sequences from plants were used as the initial query against the Selaginella assembled scaffolds and proteins (Filtered Models 3, http://genome.jgi-psf.org/Selmo1/Selmo1.download.ftp.html) using default parameters. In these searches, the most homologous sequence was investigated, even if that homology was very low (e.g., bit scores (59) as low as 26). The output file was processed to eliminate redundancy coming from different query sequences, and aligned homologies with a known transposase were extracted. The extracted sequences were extended in both directions to construct multiple sequence alignments with ClustalW (60). TIRs and TSDs were sought with the GCG package BESTFIT program (Madison, WI) as well as by manual alignment. If none of the three most-homologous sequences had the structure of a TE of that family, then the homology-based component of the search was judged a failure.
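The insertion-time estimate for LTR retrotransposons mentioned above can be illustrated with a short sketch. It assumes the commonly used relation T = K / (2r), where K is the sequence divergence between the two LTRs of one element and r is the substitution rate. The text gives the rate but does not spell out the formula or any distance correction, so both are assumptions here, and the divergence value in the example is hypothetical.

```python
RATE = 1.3e-8  # nucleotide changes per bp per year, as stated in the Methods


def insertion_time(ltr_divergence, rate=RATE):
    """Approximate years since insertion, assuming T = K / (2 * r).

    ltr_divergence -- substitutions per site between the 5' and 3' LTRs of
                      a single element (K). The two LTRs are identical at
                      insertion and diverge independently afterwards, hence
                      the factor of 2.
    """
    if ltr_divergence < 0:
        raise ValueError("divergence cannot be negative")
    return ltr_divergence / (2 * rate)


# Hypothetical element whose LTRs differ at 2.6% of sites.
print(f"{insertion_time(0.026):,.0f} years")
```

With the stated rate, each 1% of LTR divergence corresponds to roughly 385,000 years under this relation.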


MITE candidates were identified by looking for obvious structural characteristics (TIRs, TSDs, short length) using FINDMITE (61) with defined parameters (at least 11 bp length of the TIR, 1 bp or less allowed mismatch in TIRs, 30-700 bp distance between the inverted repeats). TIRs solely composed of A/T strings, C/G strings, or simple repeats were filtered out. All MITE candidates were aligned and clustered by BLAST to identify redundant elements. To remove identical repeats, the sequences were clustered to a non-redundant set with an Expect value of 1E-3 or less.

The copy numbers of intact and fragmented elements were determined by searching with the fullest-length TE of each family across the entire assembled sequence using RepeatMasker (version 3.1.19, 01-18-2008, http://www.repeatmasker.org/) with WU-blast as the search engine under default settings. To filter out short sequences that might inflate both copy number and genome occupation analyses, several criteria were employed to parse the output generated by RepeatMasker. If the transposon query was 500 bp or less, the length of the subject from RepeatMasker had to be no less than 50 bp to be counted. If the transposon query was longer than 500 bp, the length of the subject had to be no less than 100 bp to be counted. In addition, the percent contribution of each element family was determined by searching with the fullest-length TE across the shotgun sequence data itself.

Results. In total, 59,748 LTR retrotransposons, 46 LINEs, 5,394 Helitrons, 2,386 MITEs, 11 hAT elements from one family (named DTA-sel1) and 37 DNA transposons from one family of unknown origin (named Qute) were discovered (Table S3). CACTA, Tc1/mariner, PIF/Harbinger and P elements were not detected and/or confirmed by the criteria employed, nor were SINEs or MULEs. In the assembled genomic sequences, all of these identified TEs combined make up 37.5% of the genome, while they comprise 42.7% of the raw genomic shotgun sequence data.

A total of 2,093 full-length LTR elements were identified, of which 1,369 elements (65%) were from the gypsy superfamily and 125 elements (6%) were from the copia superfamily, along with 599 elements (29%) that are currently uncategorized. By AAARF analysis (62) of shotgun sequence data, the greatest genome occupation was found for four LTR retrotransposon families that were named Tuje, Kutag, Vode and Sufi. Each of these elements comprises 2-4% of the Selaginella genome. The most abundant family, Tuje, was found in 1,868 copies in the assembled genome, of which 201 are largely intact, 653 are solo LTRs and 1,014 are fragmented. In the assembled data, Tuje contributed 6.7 Mbp of sequence, or ~3.1% of the assembled genome, compared to its ~4.2% representation in the shotgun sequence data. Only one LINE family was discovered; 46 copies were found in the assembled Selaginella genome.
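The length filter applied to RepeatMasker output in the Methods (subject at least 50 bp for queries of 500 bp or less, at least 100 bp for longer queries) can be sketched as follows. The tuple layout, function names and toy hits are hypothetical. The sketch does not parse real RepeatMasker output and does not merge overlapping hits, so its coverage totals are only indicative.

```python
def keep_hit(query_len, subject_len):
    """Apply the length thresholds described in the Methods."""
    min_len = 50 if query_len <= 500 else 100
    return subject_len >= min_len


def summarize(hits):
    """Tally copy number and covered bp per family for hits passing the filter.

    hits -- iterable of (family, query_len, subject_len) tuples (assumed layout)
    Returns {family: (copy_number, covered_bp)}.
    """
    totals = {}
    for family, query_len, subject_len in hits:
        if keep_hit(query_len, subject_len):
            copies, covered = totals.get(family, (0, 0))
            totals[family] = (copies + 1, covered + subject_len)
    return totals


# Toy hits: a short MITE-like query and a long LTR-like query.
toy_hits = [
    ("mite_demo", 168, 160),
    ("mite_demo", 168, 42),    # dropped: shorter than 50 bp
    ("ltr_demo", 6000, 99),    # dropped: shorter than 100 bp
    ("ltr_demo", 6000, 1500),
]
print(summarize(toy_hits))
```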


Nineteen intact Helitrons, including 8 putative autonomous and 11 non-autonomous members, were detected in the Selaginella genome. These intact members belong to 4 Helitron families and 6 subfamilies. The average size of these elements is 5,143 bp. All of the intact elements harbor fragments from different standard non-Selaginella genes with high similarity (E-value < 1 × 10^-8) by BLASTX search.

Only three cut-and-paste DNA transposon types were identified in the Selaginella genome. Most classes of these elements, like CACTAs and MULEs, were not found at all. The most abundant DNA transposon observed was a highly diverged MITE, with a copy number of over 2,000 and an average element size of 168 bp. The other two cut-and-paste DNA transposon types, named DTA-sel1 and Qute, were initially found as repeated sequences with inverted repeat ends and identifiable TSDs. Eleven intact hAT elements were found in the Selaginella genome, and one contained a full-length Tuje LTR retrotransposon inserted into it.

Table S3. Summary of transposable element copy number, coverage and occupation by their superfamily in Selaginella.

Order               Superfamily   Copy number  Coverage (kb)  Genome occupation  Methods and tools  Author and website
                                                                                 Oligomer count     J Estill, http://dawgpaws.sourceforge.net/
                                  58,030       62,404.3       28.80%             Vmatch             S Kurtz, http://www.vmatch.de/
Repeated elements                 39,463       45,997.8       21.20%             RECON              Z Bao, http://selab.janelia.org/recon.html

Class I (retrotransposons)
                                                                                 LTR_STRUC          McCarthy & McDonald, http://www.genetics.uga.edu/retrolab/data/LTR_Struc.html3
LTR                 Copia         3,450        5,653          2.70%              LTR_FINDER         Xu & Wang, http://tlife.fudan.edu.cn/ltr_finder/
                    Gypsy         32,346       44,921         21.10%             HMMER              R. Durbin, http://hmmer.janelia.org/
                    unclassified  23,952       25,237         11.80%             tRNAscan           Lowe & Eddy, http://lowelab.ucsc.edu/tRNAscan-SE/

Class II (DNA transposons)-subclass I
TIR                 hAT           38           50             0.02%
                    MITE          2,386        401            0.19%              FINDMITE           Z Tu, http://jaketu.biochem.vt.edu/dl_software.htm
                                                                                 MAK                G Yang, http://wesslercluster.plantbio.uga.edu/mak06.html

Class II (DNA transposons)-subclass II
Helitron            Helitron      5,394        3,322          1.57%              Helsearch          Yang & Bennetzen, http://sourceforge.net/projects/helsearch/


Polyploidy and Duplicate Gene Analyses

Methods. Duplicate gene pairs were identified from Selaginella transcripts (JGI Filtered Models3) and their divergence, in terms of substitutions per synonymous site (Ks), was calculated. Duplicate pairs were identified as sequences that demonstrated 40% sequence similarity over at least 300 base pairs from a discontiguous MegaBLAST (63, 64). Reading frames for duplicate pairs were identified by comparison to available plant protein sequences. Each duplicated gene was searched against all plant proteins available on GenBank (65) using BLASTX (34). Best hit proteins were paired with each gene at a minimum cutoff of 30% sequence similarity over at least 150 sites. Genes that did not have a best-hit protein at this level were removed before further analyses. To determine reading frame and generate estimated amino acid sequences, each gene was aligned against its best hit protein by Genewise 2.2.2 (66). Using the highest scoring Genewise DNA-protein alignments, custom Perl scripts were used to remove stop and “N” containing codons and produce estimated amino acid sequences for each gene. Amino acid sequences for each duplicate pair were then aligned using MUSCLE 3.6 (67). The aligned amino acids were subsequently used to align their corresponding DNA sequences using RevTrans 1.4 (68). Ks values for each duplicate pair were calculated using the maximum likelihood method implemented in codeml of the PAML package (69) under the F3x4 model (70).

Further cleaning of the data set was conducted to remove duplication events that could bias the results. All duplicate pairs containing identifiable transposable elements were removed from the analysis because duplication resulting from transposition may obscure a signal from paleopolyploidy. To reduce the multiplicative effects of multicopy gene families on Ks values, simple hierarchical clustering was used to construct phylogenies for each gene family (71), identified as single-linked clusters, and to calculate the node Ks values. Node Ks values < 2 were used in subsequent analyses.

Three statistical tests were employed to identify significant features in the age distribution. The bootstrapped K-S goodness of fit test of Cui et al. (72) was used to assess whether the overall age distributions deviated from a simulated null. Taxa that deviated significantly from the null were then analyzed with SiZer (73) to identify significant features (α = 0.05) in the age distributions. SiZer uses changes in the first derivative of a range of kernel density estimates to find significant slope increases or decreases, and the combination may be used to identify peaks and their ranges. EMMIX (74) was used to fit a mixture model of normal distributions to our data by maximum likelihood. Peaks produced by paleopolyploidy are expected to be approximately Gaussian (75, 76), and this mixture model test identifies the number of normal distributions and their position(s) that could produce our observed age distributions. For our analyses, 1-10 normal distributions were fitted to the data with 1000 random starts and 100 k-mean starts. The Bayesian Information Criterion (BIC) was used to select the best model fit to the data.
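The BIC-based choice among mixture models can be illustrated with a small sketch. It uses the textbook definition BIC = k ln(n) - 2 ln(L), counting k = 3g - 1 free parameters for a mixture of g univariate normals with unequal variances, and it selects the model with the lowest value. EMMIX may report the criterion with a different sign or scaling, and the log-likelihoods below are hypothetical, so this shows only the selection logic and not the analysis itself.

```python
import math


def bic(log_likelihood, n_components, n_obs):
    """Textbook BIC for a univariate normal mixture with unequal variances.

    Free parameters: g means + g variances + (g - 1) mixing proportions.
    """
    k = 3 * n_components - 1
    return k * math.log(n_obs) - 2.0 * log_likelihood


def best_model(log_likelihoods, n_obs):
    """Return the number of components with the lowest BIC.

    log_likelihoods -- dict mapping number of components -> maximized
                       log-likelihood of the fitted mixture
    """
    return min(log_likelihoods,
               key=lambda g: bic(log_likelihoods[g], g, n_obs))


# Hypothetical fits to 1,000 Ks values: the third component adds too little.
fits = {1: -1500.0, 2: -1400.0, 3: -1395.0}
for g in sorted(fits):
    print(g, round(bic(fits[g], g, 1000), 1))
print("selected components:", best_model(fits, 1000))
```

The penalty term grows with each added component, so an extra normal distribution is accepted only when it improves the likelihood enough to offset three additional parameters.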

Results. The age distribution of duplicate genes did not reveal any evidence of ancient whole genome duplications in the history of Selaginella. No significant peaks were apparent in the age distribution of 4,164 gene family duplications (Fig. S2) and SiZer analyses of this distribution did not identify any significant peaks. Considering the paucity of extant polyploid species in the Selaginellaceae and their relatively low chromosome numbers among the lycophytes, this result is not entirely surprising. Chromosome numbers in the genus range from 7 to 12, with S. moellendorffii and other members of its clade possessing an x = 10 cytotype (46, 77). Species immediately outside of the S. moellendorffii clade possess a larger diversity of chromosome counts, including x = 8 and 9 cytotypes. If the x = 10 count of S. moellendorffii and its relatives is the result of chromosomal duplication rather than fission, we expect to observe relatively recent, small peaks in their age distribution. Consistent with this pattern, the K‐S goodness of fit test (72) rejected the null, indicating a deviation from a purely exponential distribution. Subsequent analysis with a maximum likelihood mixture model found five normal distributions in the data (Table S4, Fig. S2). The youngest of these distributions corresponds to the zero class of recent duplicates, and the second youngest is likely a result of heterozygous alleles in our data set. However, two of the remaining distributions, centered at Ks = 0.14 and 0.34, are possibly segmental or chromosomal duplications. Future analyses of fully assembled sequence scaffolds should reveal whether or not these mixture components are indeed segmental or chromosomal duplications, and if they are responsible for the x = 10 cytotype observed in Selaginella. Full output of the duplicate analysis is located at http://msbarker.com/selmo/pamloutput_selmo.

Figure S2. Age distribution of Selaginella duplicate genes.

Black lines indicate normal fits based on mixture model analysis. No statistically significant peaks consistent with paleopolyploidy are observed, although more recent mixture components may correspond to segmental duplications.

Table S4. Fitted values of age distribution of gene duplication with mixture model.

Mixture Median (Ks)    Mixture Proportion (%)
0.019    18.6
0.056    25.2
0.14    26.5
0.34    22.4
0.87    7.2

Small RNAs

A previously described experiment sampled the Selaginella small RNAs (sRNAs) from shoots via 454/pyrosequencing and annotated MIRNA loci based upon local assemblies of the whole‐genome shotgun traces available at the time (78). The genome assembly allowed expanded study of Selaginella sRNAs. Specifically, we analyzed: 1) the genomic locations of microRNA‐encoding loci (MIRNA loci); 2) genes likely involved in Selaginella sRNA biogenesis and function; and 3) the non‐miRNA component of the Selaginella sRNA population, which we identified and mapped. To annotate MIRNA loci, the previously described MIRNA hairpin consensus sequences (78) were obtained from miRBase (version 10) (79) and mapped to the Selaginella genome assembly using megablast. A list of predicted miRNA targets was also developed and is available at http://homes.bio.psu.edu/people/faculty/Axtell/AxtellLab/Home.html or upon request. Selaginella homologs of sRNA biogenesis and function genes were identified by BLASTp queries against Selaginella proteins using known Dicer‐Like (DCL), Argonaute (AGO), RNA‐dependent RNA polymerase (RDR) and DNA‐Dependent RNA Polymerase largest sub‐units from Arabidopsis and Physcomitrella. Proteins or protein segments were aligned with MUSCLE (67) using default parameters, and the resulting alignments were used to infer evolutionary relationships using the Neighbor‐Joining method as implemented in the MEGA4 (80) program. To identify non‐miRNA, 'orphan' sRNAs, previously sequenced sRNAs from Selaginella leaves (NCBI GEO GSE7320) (78) were first mapped to the genome assembly, sans plastid and mitochondrial contigs, using SHRiMP 2.0.2 (81) under default settings except the following: ‐o 10,000 ‐‐strata. Mapped sRNAs were then filtered to retain the best mapping location(s) for each read, provided that a) the alignment contained no insertions or gaps and b) the alignment was either perfect or contained one mismatch.
Small RNAs mapped to at least one genomic location corresponding to rRNAs (defined by BLASTn similarity to manually curated 26S, 18S, 5.8S, and 5S rRNAs at <= 1E‐15), tRNAs (defined by tRNAscanSE 1.21 (82) using default settings), or MIRNA hairpins (defined above) were then removed; the remaining mapped sRNAs were designated ‘orphan’ sRNAs. Global patterns of orphan sRNA accumulation were determined by comparison to various genome annotations.
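The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the alignment record layout (dictionaries with read, scaffold, pos, score, mismatches and gaps keys), the function names, and the overlaps_known predicate standing in for the rRNA, tRNA and MIRNA hairpin annotations are all hypothetical.

```python
from collections import defaultdict


def filter_alignments(alignments):
    """Keep each read's best-scoring location(s) that are gap-free with <= 1 mismatch."""
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln["read"]].append(aln)
    kept = []
    for hits in by_read.values():
        best = max(h["score"] for h in hits)
        for h in hits:
            if h["score"] == best and h["gaps"] == 0 and h["mismatches"] <= 1:
                kept.append(h)
    return kept


def orphan_alignments(kept, overlaps_known):
    """Drop every read with at least one location on a known rRNA, tRNA or MIRNA hairpin.

    overlaps_known is a hypothetical callable taking one alignment record
    and returning True if that location overlaps an annotated feature.
    """
    flagged = {h["read"] for h in kept if overlaps_known(h)}
    return [h for h in kept if h["read"] not in flagged]
```

Note that a read is removed entirely if any one of its retained locations overlaps an annotated feature, matching the "at least one genomic location" wording above.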

‘Hotspots’ of orphan sRNA accumulation were determined by tabulating the repeat‐normalized abundance of orphan sRNAs in static 1kb windows across the genome.

MIRNA loci. In most cases, two allelic variants for each previously described MIRNA hairpin were identified (Table S5). The allele located on the lower‐numbered scaffold was designated “‐1”, while the allele present on the higher‐numbered scaffold was designated “‐2”. Similar to other plants, most MIRNA hairpins did not overlap annotated protein‐coding genes, suggesting that most Selaginella MIRNAs are expressed via independent promoters. Some MIRNA loci were clustered in apparent tandem repeats: For instance, the four MIR156 loci were found in two clusters of two loci each. This genomic arrangement suggests expression from a single poly‐cistronic primary transcript, as has been described for maize MIR156 loci (83) and in other plant genomes (78).

Table S5. Annotation of Selaginella MIRNAs.

Locus/Allele Scaffold # start stop Overlapping Annotation (FM2)

smo‐MIR1080‐1 7 3,148,542 3,148,442

smo‐MIR1080‐2 36 635,634 635,734

smo‐MIR1081‐1 0 2,076,074 2,076,441

smo‐MIR1081‐2 8 3,269,158 3,268,804

smo‐MIR1082a‐1 19 434,365 434,524

smo‐MIR1082a‐2 61 665,694 665,849

smo‐MIR1082b‐1 75 337,656 337,755

smo‐MIR1082b‐2 111 288,422 288,323

smo‐MIR1083‐1 1 5,923,936 5,924,027 exon antisense

smo‐MIR1083‐2 5 2,156,652 2,156,746 exon antisense

smo‐MIR1084‐1 55 433,640 433,756

smo‐MIR1084‐2 113 450,541 450,657

smo‐MIR1085‐1 7 933,957 933,857

smo‐MIR1085‐2 53 293,706 293,606

smo‐MIR1086‐1 26 62,773 62,874

smo‐MIR1086‐2 92 176,513 176,614

smo‐MIR1087 0 1,121,179 1,120,915 exon‐intron‐exon sense

smo‐MIR1088‐1 6 1,929,987 1,929,892 intron sense

smo‐MIR1088‐2 56 1,171,160 1,171,065 intron sense

smo‐MIR1089‐1 35 449,198 449,375

smo‐MIR1089‐2 41 1,070,507 1,070,325

smo‐MIR1090‐1 2 110,347 110,444

smo‐MIR1090‐2 2 2,267,333 2,267,234

smo‐MIR1091 106 253,390 253,156

smo‐MIR1092 106 203,117 203,272

smo‐MIR1093‐1 10 737,313 737,402

smo‐MIR1093‐2 14 2,179,026 2,179,115

smo‐MIR1094a‐1 11 1,011,973 1,012,076 exon sense

smo‐MIR1094a‐2 110 215,857 215,959

smo‐MIR1094b‐1 11 1,016,459 1,016,561 exon sense

smo‐MIR1094b‐2 110 220,610 220,714

smo‐MIR1094c‐1 11 1,023,887 1,023,785

smo‐MIR1094c‐2 110 228,314 228,212

smo‐MIR1095a‐1 52 1,028,658 1,028,498

smo‐MIR1095a‐2 130 325,584 325,424 exon antisense

smo‐MIR1095b‐1 1 2,650,195 2,650,475

smo‐MIR1095b‐2 16 1,494,378 1,494,098

smo‐MIR1096‐1 1 1,650,976 1,650,878

smo‐MIR1096‐2 16 2,291,467 2,291,562 exon‐intron sense

smo‐MIR1097 47 1,111,070 1,111,450 exon‐intron antisense

smo‐MIR1098‐1 11 2,537,001 2,536,912

smo‐MIR1098‐2 23 1,331,681 1,331,781 exon‐intron sense

smo‐MIR1099 6 3,691,710 3,691,443

smo‐MIR1100‐1 5 3,058,575 3,058,462

smo‐MIR1100‐2 33 123,589 123,476 intron‐antisense

smo‐MIR1101 6 3,069,913 3,069,818 exon‐sense

smo‐MIR1102‐1 1 1,301,572 1,301,675

smo‐MIR1102‐2 106 196,804 196,701

smo‐MIR1103 12 769,049 768,852

smo‐MIR1104‐1 3 492,915 492,803

smo‐MIR1104‐2 648 2,776 2,664

smo‐MIR1105 7 2,681,846 2,681,979 intron‐exon sense

smo‐MIR1106 5 901,022 901,136 exon sense

smo‐MIR1107‐1 0 5,848,114 5,848,203 exon‐antisense

smo‐MIR1107‐2 76 551,414 551,503

smo‐MIR1108‐1 7 2,769,280 2,769,529

smo‐MIR1108‐2 36 990,808 990,511 exon‐sense

smo‐MIR1109 36 48,855 48,739 exon‐intron sense

smo‐MIR1110 43 1,370,289 1,370,386

smo‐MIR1111 106 246,869 247,172

smo‐MIR1112 139 167,353 167,259 intron‐sense

smo‐MIR1113‐1 1 5,417,841 5,417,621

smo‐MIR1113‐2 5 1,637,992 1,637,772

smo‐MIR1114 31 1,290,112 1,290,010 intron‐sense

smo‐MIR1115 33 649,641 649,744

smo‐MIR156a‐1 7 431,619 431,720

smo‐MIR156a‐2 52 210,852 210,751

smo‐MIR156b‐1 7 433,216 433,327

smo‐MIR156b‐2 52 209,267 209,156

smo‐MIR156c‐1 6 1,931,908 1,931,809

smo‐MIR156c‐2 56 1,173,098 1,172,999

smo‐MIR156d‐1 6 1,927,325 1,927,223

smo‐MIR156d‐2 56 1,168,478 1,168,376

smo‐MIR159‐1 42 134,894 135,172

smo‐MIR159‐2 67 295,138 295,416

smo‐MIR160a‐1 40 79,296 79,191

smo‐MIR160a‐2 89 66,415 66,310

smo‐MIR160b‐1 24 1,776,218 1,776,325

smo‐MIR160b‐2 112 227,923 227,816

smo‐MIR166a‐1 11 2,732,317 2,732,461

smo‐MIR166a‐2 23 1,129,447 1,129,303

smo‐MIR166b‐1 5 3,390,944 3,391,075

smo‐MIR166b‐2 33 225,563 225,698

smo‐MIR166c‐1 47 1,188,176 1,188,039

smo‐MIR166c‐2 54 987,469 987,332

smo‐MIR171a‐1 47 1,208,846 1,208,980

smo‐MIR171a‐2 54 1,012,715 1,012,848

smo‐MIR171b‐1 18 1,217,793 1,217,691

smo‐MIR171b‐2 147 250,452 250,554

smo‐MIR171c‐1 6 321,599 321,813

smo‐MIR171c‐2 190 40,742 40,956

smo‐MIR171d‐1 18 1,210,767 1,210,665

smo‐MIR171d‐2 147 257,491 257,593

smo‐MIR319‐1 11 1,874,258 1,874,044

smo‐MIR319‐2 23 1,870,771 1,870,985

smo‐MIR396‐1 15 1,970,474 1,970,307

smo‐MIR396‐2 115 308,937 308,770

smo‐MIR408‐1 47 117,000 117,108

smo‐MIR408‐2 54 92,097 92,205

smo‐MIR536‐1 26 1,025,064 1,025,220 exon‐antisense

smo‐MIR536‐2 34 1,143,732 1,143,878

Orphan small RNAs. After mapping a previously described set of leaf and stem‐derived Selaginella small RNAs (NCBI GEO GSE7320) (78) to the genome, and removing rRNA and tRNA contaminants, nearly three‐quarters of the sRNAs were attributable to annotated MIRNA loci (Fig. S3A). The miRNAs were dominated by 21nt species, while the orphan sRNAs had a broader size distribution (Fig. S3B). As a group, the orphan sRNAs were not biased towards repeat‐masked regions of the genome; the percentage of orphan sRNAs which mapped to repeat‐masked genomic loci was approximately the same as the genome‐wide occupancy of repeat‐masked regions (Fig. S3C). Orphan sRNAs had a slight bias towards intergenic genomic regions, but were strikingly depleted at intact transposable elements (TEs) and at most specific TE classes. The exception was MITE elements, for which there was substantial enrichment of sRNAs relative to genome occupancy (Fig. S3C). Most of these sRNAs mapped to just three MITE elements (scaffold_51: 950001..951001, scaffold_182:39001..40001, and scaffold_93:547001..548001) in patterns suggestive of processing from a stem‐loop precursor.

Figure S3. Orphan small RNAs.

(A) Genome‐mapped sRNAs are mostly derived from known MIRNA loci. (B) Size distributions of MIRNA‐derived sRNAs and orphan sRNAs. (C) Scatterplot comparing the fraction of genome occupancy for indicated features to the fraction of orphan sRNAs mapping to those features.

The Selaginella orphan sRNA population lacked features of non‐miRNA sRNAs found in most other plants. In angiosperms, the dominant component of the sRNA population is typically 24nt short interfering RNAs (siRNAs) that are intergenic and frequently overlap repetitive elements (84, 85). In Physcomitrella, these intergenic siRNAs are instead a mixture of 21, 23, and 24nt siRNAs, and become most apparent when 'hotspots' of non‐MIRNA sRNA accumulation are examined (86). We therefore identified the top 100 1kb Selaginella genomic intervals in terms of orphan sRNA accumulation. These top 100 loci accounted for half (50.4%) of the orphan sRNA abundance. Examination of sRNA accumulation patterns and underlying genome annotations at the hotspots allowed them to be grouped into six categories. The first, most abundant class of hotspots was dubbed the 'scaffold_175' category, since they all derived from sections of scaffold_175. Small RNAs in these hotspots were abundant and mapped to only one strand of the genome, indicative of origins from single‐stranded precursors. Large regions of this scaffold have significant similarity to plastid ribosomal DNA from various plants, so we suspect that major portions of this scaffold are plastid‐derived contaminants. The size distribution of scaffold_175 sRNA hotspots was heterogeneous and thus inconsistent with DCL processing (Fig. S4A); this class of loci likely represents contamination with breakdown products derived from plastid rRNAs. Similarly, many other hotspots were composed of unidirectionally mapped sRNAs in the immediate vicinity of, but not overlapping, nuclear 18S, 5.8S, and 26S rRNA loci. We suspect that these are random breakdown products from the external and internal transcribed spacers of the 45S pre‐rRNA (Fig. S4B). Apparent siRNA clusters, where sRNAs mapped to both genomic strands were present, accounted for 16 of the top 100 sRNA hotspots.
These siRNA clusters were dominated by 21nt sRNAs, with a much smaller contribution from 23nt species (Fig. S4C). All 16 of these hotspots were within degenerate LTR retrotransposon remnants. Ten hotspots consisted of single, discrete, abundant sRNAs, most of which were 21nt in length (Fig. S4D). Many of these may correspond to un‐annotated MIRNA loci. Another ten hotspots consisted of unidirectionally mapped dispersed sRNA clusters, indicative of origins from single‐stranded precursors. These were dominated by 21nt sRNAs (Fig. S4E). Finally, a single very abundant hotspot consisted of 21nt sRNAs derived from a single MITE element (scaffold_51: 950001..951001; Fig. S4F).
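The hotspot tabulation underlying this analysis can be sketched as below. The sketch assumes that "repeat‐normalized abundance" means each read's count divided by its number of retained genomic locations, that positions are 0‐based, and that each mapped location is supplied as a (scaffold, position, read count, number of locations) tuple; the function names are hypothetical.

```python
from collections import defaultdict


def hotspot_windows(hits, window=1000, top_n=100):
    """Rank static windows by repeat-normalized sRNA abundance.

    hits: iterable of (scaffold, position, read_count, n_locations).
    Returns a list of ((scaffold, window_index), abundance), highest first.
    """
    totals = defaultdict(float)
    for scaffold, pos, count, n_loc in hits:
        # Assumption: a read mapping to n_loc places contributes count / n_loc to each.
        totals[(scaffold, pos // window)] += count / n_loc
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


def fraction_in_top(hits, window=1000, top_n=100):
    """Fraction of total normalized abundance captured by the top_n windows."""
    hits = list(hits)
    everything = hotspot_windows(hits, window=window, top_n=len(hits))
    total = sum(value for _, value in everything)
    top = sum(value for _, value in everything[:top_n])
    return top / total if total else 0.0
```

With real data, fraction_in_top(hits, top_n=100) would correspond to the share of orphan sRNA abundance accounted for by the top 100 intervals.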

Figure S4. Hotspots of orphan sRNA accumulation in Selaginella.

The top 100 1kb genomic intervals in terms of repeat‐normalized sRNA abundance were grouped into six categories. Length distributions of the small RNAs within each hotspot are shown.

Selaginella homologs of small RNA biogenesis and function genes. Dicer‐Like (DCL) proteins are endonucleases responsible for production of microRNAs (miRNAs) and short interfering RNAs (siRNAs) in plants. Many angiosperms produce four distinct DCL proteins, each of which specializes in the production of a subset of small RNAs. Selaginella expresses a clear DCL1 homolog (Fig. S5A); in other plants, DCL1 proteins are known to chiefly produce ~21nt miRNAs from stem‐loop precursors. Selaginella also appears to possess three DCL3 homologs; DCL3 genes in both moss and in angiosperms function to produce longer siRNAs (23‐24nts in moss; only 24nts in angiosperms) that associate with AGO4‐clade Argonaute proteins and direct DNA methylation to targeted genomic loci (86, 87). Interestingly, Selaginella appears to lack both DCL2 and DCL4; in angiosperms, DCL2 and DCL4 are both used to produce anti‐viral siRNAs, and DCL4 is further required to produce certain endogenous 21nt siRNAs (88, 89). Argonaute (AGO) proteins function to bind miRNAs and siRNAs, and to mediate regulation of their targets. Plant AGOs can be divided into three clades (90): the AGO1 clade, associated with miRNAs and ~21nt siRNAs, the AGO4 clade, associated with the longer ~24nt siRNA products of DCL3, and the AGO7 clade. Selaginella has four AGO1‐clade members, two of which are quite closely related to Arabidopsis AGO1 (Fig. S5B). There is also a single AGO4‐clade AGO in Selaginella, but no AGO7‐clade members. Physcomitrella also lacks AGO7‐clade AGOs, suggesting that the AGO7 clade is a derived feature specific to the angiosperms. Plant encoded RNA‐dependent RNA polymerases (RDRs) are required for the accumulation of both endogenous and viral‐derived siRNAs (89, 91). Selaginella possesses two RDR genes that are most closely related to Arabidopsis RDR1 and RDR2 (Fig. S5C). 
Selaginella conspicuously lacks a homolog of RDR6, which is known in both Arabidopsis and Physcomitrella to generate endogenous ~21nt secondary siRNAs (89, 92). Selaginella also contains an RDR gene in the RDR3/4/5 clade, which is of unknown function in any species. Taken together, the lack of evidence for DCL4, AGO7 and RDR6 presence in the Selaginella genome points to the trans‐acting siRNA pathway as either being acquired as an evolutionary novelty specific to the angiosperm lineage or lost from the lycopsids. In support of the latter idea, Selaginella also lacks miR390, which is conserved in both mosses and angiosperms, and whose role is to trigger secondary siRNA (ta‐siRNA) production. Angiosperms utilize two non‐canonical DNA‐dependent RNA polymerases, Pol IV and Pol V in the 24nt siRNA pathway (93‐95). Based on the identification of largest subunit homologs, Selaginella likely also contains an alternative DNA‐dependent RNA polymerase related to Arabidopsis Pol IV and Pol V (Fig. S5D). Consistent with the suggestion that separate Pol IV and Pol V complexes first arose specifically in the angiosperm lineage (96), only one Pol IV / V like largest subunit was found in Selaginella.

Figure S5. Phylogenies of sRNA biogenesis and function genes.

(A) Dicer‐Like (DCL) genes, (B) Argonaute (AGO) genes, (C) RNA‐dependent RNA polymerase (RDR) genes, (D) DNA‐dependent RNA polymerase largest sub‐unit genes. Neighbor‐joining trees based upon MUSCLE alignments of proteins are the consensus of 1,000 replicates. Numbered nodes show the percentage of bootstraps which supported that node: nodes with less than 50% support were collapsed; those with 100% support were not numbered. Scale bars indicate substitutions per amino acid site. Analyses of DNA‐dependent RNA polymerase largest sub‐units used only the catalytic domains (coordinates listed on the tree); all other analyses were based on full‐length protein alignments. At: Arabidopsis; Pp: Physcomitrella; Sm: Selaginella. Numbers refer to JGI protein IDs or TAIR9 accession numbers. For Selaginella genes, the ID numbers of both allelic variants are given (the first one listed was used in the analysis).

Surprisingly, Selaginella appears to lack all components required for the conserved TAS3 trans‐acting siRNA pathway. In angiosperms, miR390/AGO7 complexes initiate dsRNA production from a non‐coding RNA called TAS3 (97). The dsRNA, produced by RDR6, is processed into mostly 21nt siRNAs by DCL4. Two of the resulting siRNAs are conserved among flowering plants, and function to regulate the AUXIN RESPONSE FACTOR 3 (ARF3) and ARF4 mRNAs. This regulation of ARF3 and ARF4 via TAS3‐derived siRNAs is critical for phase change and leaf polarity in Arabidopsis (98, 99). Selaginella lacks key components of this pathway (miR390, AGO7, RDR6, and DCL4). The moss Physcomitrella, like angiosperms, makes miR390‐ and RDR6‐dependent siRNAs which target ARF genes, and contains a DCL4 homolog (78, 100). Therefore, we conclude that the absence of the TAS3 siRNA pathway reflects a secondary loss specific to the lycophyte lineage. A second surprising finding was the dearth of siRNAs in the 23‐24nt size range in the available sRNA sequence data. In all other plants examined to date, these siRNAs, produced using RDR2, DCL3, Pol IV/V, and AGO4 homologs, are readily evident in sporophytic tissues. Selaginella clearly retains the machinery required to produce these siRNAs (Fig. S5), and so we hypothesize that 23‐24nt siRNAs are produced in a tissue‐ or cell‐specific manner in Selaginella and were therefore not sampled in our leaf and stem‐derived sRNA sequencing sample.

Finding Gene Families by Gene Clustering

Methods. The gene families identified here are clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets. They rely on assignment of orthology by mutual‐best‐hit criteria and synteny between closely related organisms. Paralogs at each evolutionary node are, therefore, any genes within an organism that are more closely related to each other than to the orthologous genes. An ortholog‐based clustering algorithm was used to identify gene families from among all of the protein‐encoding genes in the genomes of Arabidopsis thaliana, Arabidopsis lyrata, Carica papaya, Populus trichocarpa, Medicago truncatula, Glycine max, Ricinus communis, Manihot esculenta, Cucumis sativus, Vitis vinifera, Sorghum bicolor, Zea mays, Oryza sativa, Brachypodium distachyon, Mimulus guttatus, Selaginella moellendorffii, Physcomitrella patens, and Chlamydomonas reinhardtii. All‐against‐all blastp alignments were first performed for all proteins to be clustered. The bit score per unit peptide length was used as the similarity metric between two peptides. Clustering was performed hierarchically, from the crown nodes to the root, creating in‐group paralogous clusters and merging ingroup and outgroup clusters across nodes. First, paralogous single‐organism ingroup clusters were constructed for each organism by comparing intra‐organism similarity against inter‐organism similarity; only those peptides more similar to each other than either is to any outgroup peptide were joined into clusters (additional filters were applied to avoid spurious creation of large paralogous clusters of weakly similar peptides). Then, clusters were merged across nodes via a mutual‐best‐hit criterion. Synteny was incorporated into ortholog and paralog identification in the following manner. For all nodes with significant synteny, orthologs were first assigned to syntenic segments.
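The similarity metric and the mutual‐best‐hit merge described above can be illustrated with a short sketch. It is not the clustering code used for this analysis. The text does not say which peptide length normalizes the bit score, so dividing by the longer of the two peptides is an assumption here, as are the function names and the input layout (a dictionary of bit scores keyed by query and subject, plus per‐protein length and species lookups).

```python
def similarity(bit_score, len_a, len_b):
    # Assumption: "bit score per unit peptide length" uses the longer peptide.
    return bit_score / max(len_a, len_b)


def mutual_best_hits(hits, lengths, species):
    """Return the set of cross-species protein pairs that are each other's best hit.

    hits:    {(query, subject): bit_score} from an all-against-all search
    lengths: {protein: peptide length}
    species: {protein: species name}
    """
    best = {}  # (query, subject species) -> (similarity, subject)
    for (query, subject), score in hits.items():
        if species[query] == species[subject]:
            continue  # within-species hits are handled by the paralog step
        sim = similarity(score, lengths[query], lengths[subject])
        key = (query, species[subject])
        if key not in best or sim > best[key][0]:
            best[key] = (sim, subject)
    pairs = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, species[query]))
        if back is not None and back[1] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Normalizing by length keeps a long, weakly matching peptide from outranking a shorter peptide that aligns well along its whole length.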
Syntenic segments were defined as regions bounded by two or more aligning genes with a maximum of 5 non‐aligning genes between pairs of aligning genes (aligning genes are defined as those with E‐value < 1e‐25 and coverage, defined as the length of the alignment divided by the length of the longer peptide, greater than 60%). The average 4DTv of these segments was examined to determine the relative timings of duplications/speciations; 4DTv is a measure of the rate of transversions at fourfold degenerate coding sites in the gene and provides a measure of the evolutionary distances between genes (101). Orthologs were assigned as 1:1 aligning genes that occur on syntenic segments from the correct 4DTv era for that node in which mutual‐best hits account for at least 20% of hits on that segment. If aligning genes had other than a 1:1 relationship between segments (e.g. tandem duplications), genes were assigned as orthologs only if they were mutual‐best hits. The methodology for identifying paralogs was the same as for orthologs, with the addition of tandem duplications (genes assorting in other than 1:1 across segments) and the requirement that they be more recent (lower 4DTv) than the maximum 4DTv for that node. This process continued down to the root node, Viridiplantae, with paralogous clusters merged via comparison of ingroup to outgroup similarity, and mutual best hits being used to merge clusters across nodes. Minimum coverage thresholds were used to minimize the clustering of multi‐domain proteins that may share only a single common domain, or the clustering of peptides from fragmentary gene predictions. Maximum 4DTv thresholds were used to eliminate pairwise homologies corresponding to divergence times more ancient than the node in question. Note that, by construction, every gene from an organism present at a particular node is in one and only one cluster at that node. Some clusters may contain only one extant gene (singletons). Singletons can come from "fast" evolution leading to so much sequence divergence that sequence‐similarity based clustering is confounded, from gene loss, or from gene calling errors. The gene families (clusters) were sorted into groups based on the following clades:

Clade: Taxa included or excluded in the clade
allGreen: Present in Chlamydomonas, Physcomitrella, Selaginella and at least three angiosperm species.
angiospermLoss: Absent in all angiosperms but present in at least two of the following: Chlamydomonas, Physcomitrella and Selaginella.
lowerPlant: Present only in Physcomitrella and Selaginella.
angiospermInnovation: Present only in at least three angiosperm species.
lowerPlantLoss: Present in Chlamydomonas and at least three angiosperm species; absent in Physcomitrella and Selaginella.
tracheophyteInnovation: Present in Selaginella and at least three angiosperm species; absent in Chlamydomonas and Physcomitrella.
SelaginellaLoss: Present in all but Selaginella.
PhyscoLoss: Present in all but Physcomitrella.
landPlantInnovation: Present in all but Chlamydomonas.
ambiguous: All families not meeting any of the above criteria.

A list of all of the gene families, and genes within each family, is available upon request. To test the validity of this method, flagella or basal body specific genes were identified from the 137 families within the angiospermLoss group having genes from Chlamydomonas, Physcomitrella and Selaginella. Because only these three plants have flagellated sperm and basal bodies (angiosperms lack both structures), we expected a high proportion of these 137 families to be related to flagella or basal body genes identified from other organisms. Selaginella genes from each of these families were used as BLASTX queries and those showing similarity (E‐value < 1e‐10) to flagella or basal body genes were selected; 32, or 23% of the total, were identified as being similar to flagella or basal body genes. A list of these gene families is available upon request. This result indicates that the clustering method employed is valid.

Phylogenetic analyses of genes involved in Arabidopsis development

Genes of interest were initially sought by searching the TAIR database (http://www.arabidopsis.org) using "development" as the keyword. This uncovered 1,534 Arabidopsis genes; 402 genes whose functions were relatively well characterized were selected and used as queries against various databases. We obtained Physcomitrella and Selaginella WGS sequences plus quality and ancillary data from ftp://ftp.ncbi.nih.gov/pub/TraceDB/. Sequences judged to be ESTs (ti starting with 1166, 1167, or 1181 followed by six digits, which are annotated as strategy: EST, trace_type_code: EST, source_type: GENOMIC) were excluded from the analysis. The NCBI nr data set was obtained from ftp://ftp.ncbi.nih.gov/blast/db/. Amino acid sequences of Selaginella data were obtained from Selmo1_GeneModels_FilteredModels2_aa.fasta.gz. The Chlamydomonas and Cyanidioschyzon merolae protein data sets were obtained from ftp://ftp.jgi‐psf.org/pub/JGI_data/Chlamy/v3.0/proteins.Chlre3.fasta.gz and http://merolae.biol.s.u‐tokyo.ac.jp/download/cds.fasta, respectively. The taxonomy database (65) was obtained from ftp://ftp.ncbi.nih.gov/pub/taxonomy. In some cases (where indicated), Selaginella nucleotide or amino acid sequences were obtained directly by querying the following databases available at http://genome.jgi‐psf.org/cgi‐bin/runAlignment?db=Selmo1&advanced=1: Selaginella moellendorffii v1.0 assembly scaffolds (Selmo1_assembly_scaffolds), Selaginella moellendorfii v1.0 non‐redundant filtered model proteins (Selmo1_GeneModels_FilteredModels2_aa) or Selaginella moellendorfii v1.0 non‐redundant filtered model proteins (Selmo1_GeneModels_FilteredModels3_aa). 
Unless noted otherwise, each phylogenetic analysis was done in three steps: automatic collection of homologous sequences and alignment, manual selection of well‐aligned regions and exclusion of sequences with insufficient similarity or mutant alleles, and automatic reconstruction of genetic phylogeny using the neighbor‐joining (NJ) (102) and maximum likelihood (ML) (103) methods. Statistical support was examined with bootstrapping. To prepare the alignments, Web‐based interfaces were established. The blast‐assemble‐both interface allows a similarity search against the data sets for each query. BLASTP and TBLASTN (34) with word size 2 were used for the protein and WGS nucleotide data sets, respectively. The assembly and putative amino acid sequences of the collected WGS data were obtained using CAP3 (33) and GenomeScan (http://genes.mit.edu/genomescan.html), respectively. All amino acid sequences were aligned with MAFFT version 5.743 (104) using the EINSI option (‐‐ep 0 ‐‐genafpair ‐‐maxiterate 1000). When a single Physcomitrella or Selaginella gene was scattered into multiple contigs, a blast‐assemble‐bothext interface was used. In this case, a MEGABLAST search of the original WGS data with the contigs as queries was performed, and the hits were then reassembled with CAP3 using a parameter set (‐b 30 ‐c 20 ‐d 21 ‐s 401 ‐u 1). The resulting alignment was manually edited with MacClade version 4.08 (105) to remove unnecessary sequences and to mark which amino acid sites to include or exclude from the analysis. These marked alignment files are available at http://moss.nibb.ac.jp/treedb. Sequences that lacked a conserved region were removed from the analysis. Unless noted otherwise, phylogenetic trees were made using the makenjtree interface. The NJ tree (102) was reconstructed using PHYLIP ver. 3.65 (106) and the ML tree was searched with local re‐arrangement using MOLPHY‐2.3b3 (107) under the JTT model (108). Each tree was investigated to identify orthologous land plant genes putatively derived from a single gene in the last common ancestor of land plants. Each minimum land plant clade consisted of genes that occur in Physcomitrella, Selaginella, rice and Arabidopsis. Because we cannot exclude the possibility of parallel gene losses, the genes included in the minimum land plant clade were designated as putative orthologs. A maximum land plant clade was defined as a clade that contains neither the minimum land plant clades nor non‐land plant genes and may lack genes from some of the four species. Genes included in the maximum land plant clade were considered putative orthologs. Sequences derived from a single locus in Arabidopsis or rice were counted as one gene. Since the WGS sequences of Selaginella were derived from a diploid plant, pairs of allelic sequences were frequently detected. To avoid overestimation of gene number, Selaginella genes producing a pair in a tree were counted as two alleles of one locus. In all instances the number of genes refers to the number of gene loci, not alleles. Multiple non‐overlapping sequences were considered as fragments of a single gene and counted as one gene. Sequences with in‐frame stop codons in a conserved region or lacking a significant portion of the conserved region were judged to be pseudogenes and excluded from the count.
Contigs and reconstructed amino acid sequences were mapped to the assembled genome and gene models to allow association of genes identified from pre whole genome assembly data to the assembled genome and gene models. The 424 land plant orthologous groups thus identified are listed in Table S6. Alignments and phylogenetic trees can be accessed by placing the “query gene” name in Table S6 in the “Search trees by name of gene” box at the website http://moss.nibb.ac.jp/treedb/. Table S6 also includes the Arabidopsis developmental gene or genes used to query databases, and the number of putative orthologs in Arabidopsis, rice, Selaginella and Physcomitrella. Orthologs involved in the basic molecular machinery of development, such as cell cycle, cytoskeleton, DNA methylation, histone modification, chromatin remodeling, and endogenous small RNA regulation, were also included. The absence of a gene in a taxon can mean that the gene never existed in that taxon or was lost.

Table S6. The number of land plant orthologues in Arabidopsis, rice, Selaginella and Physcomitrella.

Gene function; Gene names; Query gene; # potential orthologs in Arabidopsis, rice, Selaginella (per haplotype), Physcomitrella

Cell cycle regulation CDKA;1 CDKA;1 1 3 1 2

Cell cycle regulation CDKB1;1 and 1;2 2) CDKB1;1 2 1 2 7

Cell cycle regulation CDKB2;1 and 2;2 CDKB1;1 2 1 0 0

Cell cycle regulation CDKC;1 to 2 CDKC;1 2 2 1 3

Cell cycle regulation CDKD;1 to 3 2) CDKD;1 3 1 1 3

Cell cycle regulation CDKE;1/HEN3 CDKE;1 1 1 1 2

Cell cycle regulation CDKF;1 CDKF;1 1 1 1 1

Cell cycle regulation CYCA1;1 to 2, A2;1 to 4, and A3;1 to 4 2) CYCA1;1 10 7 3 8

Cell cycle regulation CYCB1;1 to 5, B2;1 to 5, and B3;1 2) CYCB1;1 11 5 1 2

Cell cycle regulation CYCD1;1, D2;1, D3;1 to 3, D4;1 to 2, D5;1, D6;1, and D7;1 2) CYCD1;1 10 12 3 2

Cell cycle regulation CYCH1;1 CYCH1;1 1 1 1 2

Cell cycle regulation KRP1 to 7 KRP4 7 6 3 1

Cell cycle regulation RBR1 RBR1 1 2 2 3

Cell cycle regulation E2FA, B, and C E2Fa 3 3 2 4

Cell cycle regulation DEL1, 2, and 3 2) DEL1 3 2 1 3

Cell cycle regulation WEE1 2) WEE1 1 1 2 1

Cell cycle regulation Arath;CDC25 ARATH;CDC25 1 2 1 1

Cell cycle regulation HBT HBT 2 1 2 2

Cell cycle regulation ANP1 2) ANP1 3 2 1 2

Cytoskeleton ATFH1 to 11 ATFH8 11 9 1 6

Cytoskeleton ATFH13 to 19, and 21 2) AT2G25050 8 5 2 2

Cytoskeleton ARP2/WRM WRM 1 1 1 1

Cytoskeleton ARP3/DIS1 DIS1 1 1 1 2

Cytoskeleton ARPC1a and b ARPC1b 2 1 1 2

Cytoskeleton ARPC2A/DIS2 and ARPC2B 2) DIS2 2 2 1 2

Cytoskeleton ARPC3 ARPC3 1 1 1 2

Cytoskeleton ARPC4 ARPC4 1 1 1 1

Cytoskeleton ARPC5/CRK CRK 1 1 1 1

Cytoskeleton WAVE1 to 5/SCAR1 to 5 WAVE5 5 3 2 7

Cytoskeleton ABI1L1 ABI1L1 4 6 1 2

Cytoskeleton PIR121/KLK/SRA1 KLK 1 1 1 2

Cytoskeleton NAP125/GRL GRL 1 1 1 2

Cytoskeleton HSPC300/BRK1 BRK1 1 1 1 1

Cytoskeleton ADF1 ADF1 12 10 3 1

Cytoskeleton VLN1 to 4 VLN3 5 5 1 3

Cytoskeleton ATFIM1 ATFIM1 5 3 2 2

Cytoskeleton PRF1 PRF1 5 3 1 3

Cytoskeleton TUBG1 and 2 TUBG1 2 1 1 2

Cytoskeleton ATSPC97 AT5G17410 1 1 1 3

Cytoskeleton ATSPC98 ATSPC98 1 1 1 3

Cytoskeleton ATGCP4 AT3G53760 1 1 1 1

Cytoskeleton ATGCP5/EMB1427 AT1G80260 2 2 1 1

Cytoskeleton ATGCP6 OSJNBa0005N02.6 4) 1 1 1 1

Cytoskeleton ATNEDD1 AT5G05970 1 1 2 4

Cytoskeleton ERH3/FRA2 ERH3 1 2 1 2

Cytoskeleton MOR1/GEM1 MOR1 1 1 1 2

Cytoskeleton ATEB1A ATEB1A 3 2 3 3

Cytoskeleton SPR1/SKU6 3) SPR1 5 4 2 4

Cytoskeleton SPR2/TOR1 3) SPR2 6 4 2 4

Cytoskeleton ATMAP65‐1 to 9 2) ATMAP65‐1 9 11 5 5

Cytoskeleton ROP1 to 11 2) ROP2 11 7 2 4

Cytoskeleton FASS/TON2 FASS 1 1 1 1

Phosphoinositides COB COB 5 7 3 3

Phosphoinositides SOS5 1) SOS5 1 1 0 0

Phosphoinositides SKU5 SKU5 4 4 1 1

Phosphoinositides SETH1 SETH1 1 1 1 1

Phosphoinositides SETH2 SETH2 1 1 1 1

Phosphoinositides PLDP1 and 2 PLDP1 2 2 1 2

Phosphoinositides DGK1 and 2 DGK1 2 3 1 0

Phosphoinositides DGK3, 4, and 7 DGK7 3 2 1 2

Phosphoinositides DGK5 and 6 DGK7 2 3 2 1

Phosphoinositides TAG1 TAG1 1 2 1 2

Phosphoinositides TOR TOR 1 1 1 1

Phosphoinositides ATSAC1 ATSAC1 1 2 1 4

Phosphoinositides IPK2a IPK2a 2 1 1 1

Phosphoinositides PTEN PTEN 1 0 1 0

Epigenetic gene regulation ATX1/SDG27 and ATX2/SDG30 ATX1 2 1 1 3

Epigenetic gene regulation SDG3, 9, 11, 17, 19, 21 to 23, 32, and 33 2) SDG3 12 12 6 6

Epigenetic gene regulation HDT1 to 4 HDT2 4 3 1 3

Epigenetic gene regulation HDA6 and 7 HDA6 2 1 0 0

Epigenetic gene regulation HDA19 HDA19 1 3 1 2

Epigenetic gene regulation HDA9 and 17 HDA9 2 1 1 1

Epigenetic gene regulation HAG1 HAG1 1 1 1 1

Epigenetic gene regulation MET1 to 4 MET1 4 2 1 1

Epigenetic gene regulation CMT1 to 3 CMT3 3 3 1 1

Epigenetic gene regulation DRM1 to 3 DRM2 3 4 3 2

Epigenetic gene regulation DME1 DME1 4 4 1 3

Epigenetic gene regulation MBD11 MBD11 2 1 2 7

Epigenetic gene regulation NFF2/NFB2/FAS1 FAS1 1 2 1 1

Epigenetic gene regulation NFB1/MUB3/FAS2 FAS2 1 1 1 1

Epigenetic gene regulation MSI2 and 3 MSI1 2 1 0 0

Epigenetic gene regulation MSI4 and 5 MSI4 2 1 1 2

Epigenetic gene regulation CHR1/DDM1 CHR1 1 2 1 1

Epigenetic gene regulation CHR2 CHR2 1 1 1 2

Epigenetic gene regulation CHR3/SYD SYD 1 1 1 1



Epigenetic gene regulation CHR6/GYM/PKL and CHR7 CHR6 2 1 2 2




Epigenetic gene regulation CHR11, 17 CHR11 2 2 2 2

Epigenetic gene regulation CHR12, 23 CHR23 2 1 1 1



Epigenetic gene regulation CHR15/MOM CHR15 1 1 1 0




Epigenetic gene regulation CHR37 and 41 CHR37 2 1 1 0

Epigenetic gene regulation ATSWI3A and B ATSWI3B 2 2 1 1

Epigenetic gene regulation ATSWI3C and D ATSWI3C 2 4 1 3

Epigenetic gene regulation HIRA1 HIRA1 1 1 1 2

Epigenetic gene regulation RHL2 RHL2 1 1 1 1

RISC complex DCL1 DCL4 1 1 1 1

RISC complex DCL2 and 4 DCL4 2 3 1 1

RISC complex DCL3 DCL4 1 2 2 1

RISC complex AGO1 AGO1 2 5 3 3

RISC complex AGO2, 3, and 7 AGO7 3 3 1 1

RISC complex AGO4, 6, 8, and 9 AGO6 4 3 1 3

RISC complex HEN1 HEN1 2 1 1 1

RISC complex RDR1 and 2 RDR6 2 1 2 1

RISC complex RDR3 to 5 RDR4 3 2 1 2

RISC complex RDR6 RDR6 1 1 0 1

RISC complex HEN4 2) HEN4 9 6 2 1

RISC complex FLK FLK 2 5 1 2

Light signalling PHYA to E PHYA 5 3 1 8

Light signalling CRY1, 2 CRY1 2 3 5 2

Light signalling CRY3 CRY3 1 1 1 1

Light signalling PHOT1 and 2 PHOT1 2 2 2 6

Light signalling LKP2, FKF1, and ZTL/LKP1 LKP2 3 2 1 0

Light signalling HO1 to 4 HO3 4 2 1 5

Light signalling HY2 HY2 1 1 1 1

Light signalling PAT1 2) PAT1 6 6 1 2

Light signalling EID1 EID1 1 1 1 3

Light signalling LAF3 LAF3 1 1 2 1

Light signalling FAR1, FRS1 to 12, and FHY3 3) FAR1 14 56 0 0

Light signalling FHY1 FHY1 1 2 1 1

Light signalling PIF1/PIL5, PIF3 and 4, PIL1 to 4, and HFR/PIL6 2) PIL6 14 10 3 3

Light signalling NDPK1 NDPK2 1 2 1 2

Light signalling NDPK2 NDPK2 1 1 1 2

Light signalling PP5 PP5 1 1 1 2

Light signalling PP7 2) PP7 3 1 3 3

Light signalling ATFYPP3 ATFYPP3 2 1 1 2

Light signalling HRB1 3) HRB1 7 7 4 4

Light signalling SHL1 2) SHL1 3 4 2 6

Light signalling SUB1 2) SUB1 3 3 2 1

Light signalling NPH3 2) NPH3 2 1 3 15

Light signalling RPT2 2) RPT2 8 5 2 4

Light signalling JAC1 JAC1 7 5 5 4

Light signalling PKS1 PKS1 4 1 0 0

Light signalling CDF1 CDF1 7 6 3 4

Light signalling SHB1 2) SHB1 9 0 3 2

Light signalling COP1 2) SPA1 1 1 1 9

Light signalling COP10/FUS9 COP10 1 1 1 2

Light signalling CIP8 CIP8 6 3 2 0

Light signalling SPA1 to 4 SPA1 4 2 1 2

Light signalling CIP1 6) CIP1 6) 1 0 0 0

Light signalling CIP4 6) CIP4 6) 2 0 0 0

Light signalling CIP7 3) CIP7 6 3 5 2

Light signalling HY5 and HYH 2) HY5 2 3 2 2

Light signalling LAF1 1) LAF1 3 3 0 0

Light signalling DET1/FUS2 DET1 1 1 1 3

Light signalling DDB1A DDB1A 2 1 1 2

Light signalling COP11/FUS6 COP11 1 1 1 2

Light signalling COP12/FUS12 FUS12 1 1 1 3


Light signalling COP8/FUS4/FUS8 COP8 1 1 1 1

Light signalling AJH1 and 2 AJH2 2 1 1 2

Light signalling CSN6A and B CSN6A 2 4 1 2


Light signalling COP9/FUS7 2) COP9 1 2 1 1

Light signalling AMP1 AMP1 2 4 4 2

Light signalling PFT1 3) PFT1 1 1 2 2

Light signalling TED3 TED3 1 1 1 1

Light signalling OBP3 1) OBP3 13 4 0 0

Light signalling HLS1 2) HLS1 4 4 1 2

Light signalling CR88 CR88 2 2 1 2

Light signalling MIF1 MIF1 3 3 0 0

Light signalling ZFP1 3) ZFP1 27 27 6 10

Circadian clock PRR3, 5, 7, and 9 2) PRR7 5 2 3 4

Circadian clock TOC1 TOC1 1 1 1 0

Circadian clock CCA1 and LHY 2) LHY 5 4 2 2

Circadian clock COL3 CO 18 17 3 6

Circadian clock ELF3 3) ELF3 2 2 2 4

Circadian clock ELF4 ELF4 5 4 2 1

Circadian clock SRR1 SRR1 1 2 1 2

Circadian clock LUX/PCL1 PCL1 3 1 1 3

Circadian clock CKB3 CKB3 4 3 1 4

Circadian clock TIC TIC 2 1 1 0

Auxin metabolism YUCCA YUCCA 9 8 3 6

Auxin metabolism CYP79B2 and B3 1) CYP79B2 2 0 0 0

Auxin metabolism NIT1‐3 NIT1 3 0 0 0

Auxin metabolism CYP83B1/SUR2 1) SUR2 2 0 0 0

Auxin metabolism GH3.6/DFL1 2) DFL1 19 11 17 2

Auxin metabolism AAO1 AAO1 4 6 3 2

Auxin metabolism TYDC1, 2 TYDC2 2 1 5 2

Auxin metabolism CYP79F1/SPS1 SUPERSHOOT1 8 3 0 0

Auxin signalling TIR1 TIR1 6 6 2 4

Auxin signalling ABP1 ABP1 1 1 2 1

Auxin signalling IAA1 to 20, 26 to 34 2) IAA31 29 30 4 2

Auxin signalling ARF5/MP, NPH4/BIPOSTO, ARF6, 8, 16, and 19 2) NPH4 5 8 3 6

Auxin signalling ETT/ARF3, ARF1, 2, 4, 9, 11 to 15, 18, 20, and 21 ETT 15 9 2 4

Auxin signalling ARF10, ARF16 ARF16 2 4 2 2

Auxin signalling ARF17 ARF16 1 1 0 0

Auxin signalling AXR1 2) AXR1 3 1 1 2

Auxin signalling SKP1/ASK1 SKP1 17 12 1 3

Auxin signalling AXR6/ATCUL1 ATCUL1 2 3 1 3

Auxin signalling RCE1 RCE1 2 3 3 5

Auxin carriers AUX1 AUX1 4 4 2 4

Auxin carriers PIN1 to 8 2) PIN5 8 11 5 3

Auxin carriers MDR/PGP1 PGP1 1 1 0 0

Auxin carriers PGP19 PGP1 1 2 2 2

Auxin carriers MDR4/PGP4 PGP1 8 8 5 1

Auxin carriers PID PID 3 3 0 0

Auxin carriers BIG1 BIG1 1 1 1 2

Jasmonic acid biosynthesis OPR3 OPR3 1 1 1 2

Jasmonic acid signalling COI1 9) COI1 1 3 4 3

Cytokinin biosynthesis/catabolism CKX1 to 7 CKX7 7 11 2 6

Cytokinin biosynthesis CYP735A1 and CYP735A2 CYP735A1 2 2 0 0

Cytokinin signalling AHK2, 3, and CRE1/WOL/AHK4 CRE1 3 5 2 3

Cytokinin signalling AHP1 to 6 AHP2 5 4 2 2

Cytokinin signalling ARR3 to 9, 15 to 17 2) ARR5 10 15 2 7

Cytokinin signalling ARR1, 2, 10 to 14, and 18 to 21 1) ARR1 11 9 5 5

Gibberellic acid biosynthesis GA1/KSA/CPS and GA2/KSB 3) GA2 11 13 9 2

Gibberellic acid biosynthesis GA4 GA4 4 2 1 1

Gibberellic acid signalling GID1L1 to 3 GID1 4) 3 1 2 0

Gibberellic acid signalling GAI, RGA, and RGL1 to 3 4) GAI 5 1 1 2

Gibberellic acid signalling SPY SPY 1 1 1 2

Abscisic acid biosynthesis ABA1/ZEP ABA1 1 2 1 2

Abscisic acid biosynthesis NCED3 NCED3 5 4 1 2

Abscisic acid biosynthesis ABA2 and ATA1 ABA2 11 20 5 2

Abscisic acid biosynthesis AAO3 AAO3 4 6 3 2

Abscisic acid signalling PYR1 PYR1 7 6 4 4

Abscisic acid signalling GTG1 GTG1 2 1 1 0

Abscisic acid signalling CHLH CHLH 1 1 1 1

Abscisic acid signalling ABH1 ABH1 1 1 1 2

Abscisic acid signalling ABI1 ABI1 9 10 3 2

Ethylene biosynthesis ACS1 to 9, and 11 ACS2 10 5 0 0

Ethylene biosynthesis ACO1 and 2 ACO2 5 7 0 0

Ethylene biosynthesis RCN1 RCN1 3 1 2 4

Ethylene signalling ETR1 and ERS1 ETR1 2 2 3 4

Ethylene signalling ETR2, ERS2, and EIN4 1) ETR1 3 3 0 1

Ethylene signalling CTR1 CTR1 6 8 4 0

Ethylene signalling EIN2 EIN2 1 3 1 1

Ethylene signalling EBF1 and 2 EBF1 2 2 2 2

Brassinosteroid biosynthesis SMT1 SMT1 1 3 1 1

Brassinosteroid biosynthesis FK FK 1 1 1 1

Brassinosteroid biosynthesis HYD1 HYD1 1 1 2 1

Brassinosteroid biosynthesis DWF5 7) DWF5 1 1 1 2

Brassinosteroid biosynthesis DET2 DET2 1 2 2 1

Brassinosteroid biosynthesis CPD, DWF4, ROT3, and BR6ox1 and 2 CPD 19 16 4 3

Brassinosteroid signalling BRI1 and BRL1 BRI1 4 4 0 0

Brassinosteroid signalling BAK1, SERK1, 2, and 9 SERK2 5 3 3

Brassinosteroid signalling BIN2, AT1G06390, and AT2G30980 2) AT1G06390 10 9 2 6

Brassinosteroid signalling BSU1 BSU1 1 0 0 0

Brassinosteroid signalling BIM1 to 3 2) BIM1 3 3 1 3

Brassinosteroid signalling BES1 3) BES1 6 4 5 6

Brassinosteroid signalling TRIP1 TRIP1 2 2 1 3

Seed LEC1 LEC1 2 2 1 0

Seed ABI3, FUS3, and LEC2 2) FUS3 3 5 3 11

Seed HSI2 HSI2 3 2 1 2

Seed ABI4 1) ABI4 9 5 4 7

Seed ABI5 7) ABI5 7 5 4 1

Embryo EMB1417 and 2748 2) AT1G04590 2 3 2 1

Embryo EMB3001 AT1G08910 2 1 2 0

Embryo EMB1273 AT1G49510 1 0 1 2

Embryo EMB2423 AT3G48470 1 1 1 1

Embryo EMB1703 AT3G61780 2 0 1 2

Embryo EMB2739 AT4G14590 1 1 2 1

Embryo EMB1895 AT4G20060 1 2 2 2

Embryo EMB1027 AT4G26300 2 3 1 1

Embryo EMB2735 AT5G06240 1 2 1 1

Embryo EMB2196 AT5G10330 1 1 1 1

Embryo EMB2024 2) AT5G24400 5 4 2 2

Embryo EMB3012 AT5G40480 1 0 2 1

Embryo EMB1875 AT5G40600 1 0 0 0

Embryo EMB1187 AT2G26830 1 1 1 2

Root PLT1 and 2 ANT 8 11 1 3

Root SHR SHR 1 2 3 2

Root SCR SCR 1 2 2 3

Root WOX5 3) WUS 16 10 9 3

Lateral root NAC1 NAC1 3 10 0 0

Lateral root SINAT5 2) SINAT5 6 5 4 4

Lateral root AIR3 AIR3 3 5 10 1

Lateral root PAS2 PAS2 1 5 1 1

Lateral root ATNRT2.1/LIN1 7) ATNRT2.1 7 4 2 8

Lateral root ALF4 ALF4 1 1 1 2

Lateral root RPD1 2) RPD1 15 17 10 4

Shoot meristem WUS and WOX1 to 14 3) WUS 16 10 9 3

Shoot meristem CLV1 CLV1 4 6 3 0

Shoot meristem CLV2 CLV2 2 3 2 0

Shoot meristem STM, BP, KNAT1, and 6 KNAT1 4 9 3 3

Shoot meristem KNAT3 to 5, and 7 KNAT3 4 4 2 2

Shoot meristem AS1 1) AS1 1 1 1 0

Shoot meristem AS2 AS2 1 2 0 0

Shoot meristem BOP1 and 2 BOP1 2 2 2 3

Shoot meristem CUC1, 2, and 3 CUC1 11 12 3 8

Shoot meristem/Gynoecium BEL1, BLR, PNF, and PNY 2) BLR 13 13 2 4

Shoot meristem TPL TPL 5 2 3 2

Shoot meristem HAN 1) HAN 3 2 2 4

Shoot meristem ULT1 ULT1 2 1 0 0

Shoot meristem BBM1 ANT 8 11 1 3

Axillary meristem MAX1 MAX1 1 5 1 0




Axillary meristem RAX1 1) RAX1 6 8 1 0

Axillary meristem LAS LAS 1 2 1 2

Leaf PHB, PHV, REV, COR, and ATHB8 PHV 5 4 3 5

Leaf CRC, FIL, and INO INO 6 7 0 0

Leaf KAN 2) KAN 4 6 3 3

Leaf AN AN 1 2 1 5

Leaf ANT ANT 8 11 1 3

Leaf ARGOS and ARL ARGOS 2 2 0 0

Leaf TCP3 and 4 2) TCP3 8 8 3 2

Leaf TSL TSL 1 1 1 1

Epidermal cell differentiation ANL2, AtML1, GL2, and PDF2 1) GL2 17 10 5 4

Epidermal cell differentiation TTG1 TTG1 1 1 0 0

Epidermal cell differentiation ATAN11 TTG1 2 1 1 4

Epidermal cell differentiation MYB23, GL1, and WER 1) MYB23 3 0 0 0

Epidermal cell differentiation EGL3, GL3, and TT8 GL3 4 7 1 1

Epidermal cell differentiation CPC, ETC1, ETC2, and TRY CPC 6 1 0 0

Epidermal cell differentiation RHD3 RHD3 3 3 1 3

Epidermal cell differentiation IRE 2) IRE 4 2 2 5

Epidermal cell differentiation AKT1 and SPIK AKT1 7 9 0 4

Epidermal cell differentiation RHD2 2) RHD2 10 9 11 4

Epidermal cell differentiation RIC1 to RIC10 2) RIC1 10 10 1 1

Epidermal cell differentiation RIC2 3) RIC2 9 7 1 0

Epidermal cell differentiation SUB SUB 3 4 0 0

Stomata opening GPA1 GPA1 1 1 1 0

Stomata opening AGB1 AGB1 1 1 1 1

Stomata opening AGG1 and 2 AGG1 2 2 0 0

Stomata opening GCR1 GCR1 1 1 1 2

Stomata opening GORK GORK 2 2 1 0

Stomata development SDD1 SDD1 3 13 0 0

Vascular system VAN3 VAN3 2 1 1 3

Vascular system DRP1A DRP1A 2 2 0 0

Vascular system CVP2 CVP2 7 11 2 6

Vascular system VND1 to 7 and NST1 and 2 VND7 13 10 4 8

Vascular system COV1 2) COV1 4 5 1 4

Vascular system APL APL 4 3 2 7

Cell wall CESA4/IRX5, CESA7/IRX3 and CESA8/IRX1 CESA4 10 9 4 8

Cell wall Porphyra CESA1 Porphyra CESA1 0 0 4 1

Lignin synthesis PAL PAL 4 10 2 11

Lignin synthesis C4H C4H 1 2 1 6

Lignin synthesis 4CL1 2) 4CL1 4 5 2 4

Lignin synthesis HCT 7) HCT 2 2 1 1

Lignin synthesis C3H C3H 1 2 1 1

Lignin synthesis CCOAOMT1 2) CCOAOMT1 1 1 3 1

Lignin synthesis CCR1 and 2 CCR2 2 13 3 1

Lignin synthesis CAD1 to 9 CAD9 9 8 5 2

Lignin synthesis FAH1 1) FAH1 2 2 0 0

Lignin synthesis COMT 2) COMT 13 11 16 3

Phase transition CLF and SWN MEA 2 2 1 1

Phase transition MEA MEA 1 0 0 0

Phase transition FIE 2) FIE 1 2 1 1

Phase transition EMF2, FIS2, and VRN2/VEF1 EMF2 4 2 1 3

Phase transition MSI1 MSI1 1 1 1 1

Flowering CO 2) CO 18 17 3 6

Flowering FCA FCA 2 1 1 2

Flowering FY FY 1 1 1 1

Flowering FT and TFL1 FT 5 14 0 0

Flowering FD FD 2 1 0 0

Flowering VRN1 VRN1 1 2 1 6

Flowering VIP2 VIP2 2 3 1 3



Flowering TFL2/LHP1 TFL2 1 1 1 0

Flowering VIN3 VIN3 4 4 2 0

Flowering FLD 2) FLD 3 3 1 2

Flowering FPA 2) FPA 2 2 1 2

Flowering FRI 2) FRI 8 5 3 2

Flowering EMF1 EMF1 1 1 0 0

Flowering ESD4 ESD4 3 2 1 2

Flowering GI GI 1 1 1 0

Flowering LD LD 1 1 1 1

Flowering LFY LFY 1 1 1 2

Floral organ HUA1 HUA1 1 1 0 0

Floral organ HUA2 2) HUA2 4 2 4 2

Floral organ RBE and SUP SUP 5 4 2 8

Floral organ JAG and NUB NUB 2 1 2 2

Floral organ AP2 ANT 2 5 2 0

Floral organ PTL PTL 5 4 2 4

Floral organ SPL8 2) SPL8 1 3 4 6

Floral organ UFO UFO 1 1 1 0

Floral organ AG, PI, AP3, AP1, SHP1 and 2, SEP1‐3 MIKCc‐type MADS‐box genes 5) 38 34 3 6

Floral organ LUG LUG 1 3 3 3

Gynoecium STY2 STY2 10 5 3 2

Gynoecium IND IND 5 5 2 6

Gynoecium ALC ALC 2 3 3 3

Endosperm TTN9 TTN9 1 1 1 1

Pollen AGL30, AGL65, AGL66, AGL94, and AGL104 MIKC*‐type MADS‐box genes 5) 6 3 3 11

Pollen EMS1/EXS EMS1 1 2 2 6

Pollen SPL/NZZ SPL 1 1 2 3

Pollen TPD1 3) TPD1 4 14 5 9

Pollen GPAT1 2) GPAT1 8 15 10 7

Pollen MEI1 MEI1 1 1 1 1

Pollen ASK1, ASK2, and OST1 2) ASK1 10 10 3 4

Pollen CDC45 CDC45 1 2 1 1

Pollen ATK1 ATK1 4 3 2 2

Pollen TES/STD TES 2 1 1 3

Pollen MRE11 MRE11 1 1 1 1

Pollen ATRAD50 ATRAD50 1 1 1 1

Pollen MMD1/DUET 7) MMD1 4 2 1 2

Pollen MS5 3) MS5 5 4 2 7

Pollen SYN1 SYN1 1 1 1 1

Pollen NEF1 NEF1 1 1 1 1

Pollen AMS AMS 1 1 1 2

Pollen YRE/FLP1/WAX2 CER1 1 3 3 2

Pollen DEX1 DEX1 1 1 2 1

Pollen MS2 MS2 2 1 2 2

Pollen ADL1C ADL1C 3 3 3 5

Pollen DUO1 DUO1 1 1 2 2

Pollen BCP1 BCP1 6) 2 0 0 0

Pollen CER1 CER1 3 5 1 2

Pollen MYB4 and 32 MYB4 6 6 2 0

Pollen QRT3 QRT3 2 1 0 2

Pollen AT2G19620 AT2G19620 3 4 1 7

Pollen AT5G56200 2) AT5G56200 9 8 2 4

Pollen SEC8 SEC8 1 1 1 3

Pollen POK POK 2 1 1 1

Pollen NPG1 2) NPG1 3 3 2 2

Pollen ACA9 7) ACA9 6 4 2 3

Pollen ATAPY2 ATAPY2 2 8 1 3

Pollen KIP KIP 2 1 1 1

Pollen GCS1 GCS1 1 2 1 1

Other factors ACC1 ACC1 2 1 1 2

Other factors ACX4 ACX4 1 2 1 2

Other factors EMB1187 EMB1187 1 1 1 2

Other factors RPE RPE 1 1 1 3

Other factors TPC1 TPC1 1 1 0 2

Other factors MEE63/AT1G02140 MEE63 1 1 1 2

Other factors ATNAP6 ATNAP6 1 1 1 2

Other factors CPN60A and EMB3007 7) CPN60A 2 2 2 3

Other factors VAM3 2) VAM3 4 3 2 4

Other factors NIT4 NIT1 1 2 1 1

Other factors ATHB1, 3, 5 to 7, 12, 13, and 16 1) ATHB7 17 12 4 1

Other factors ARR22 ARR22 2 1 2 3

1) For this group, a larger gene family was first divided into subgroups based on domain structure and amino acid similarity throughout most of the coding region; this minimum land plant clade is defined by the subgroup.

2) For this group, membership was defined by domain structure or amino acid similarity throughout most of the coding region, but bootstrap support for the relationships was low; here, all members in the gene family were counted as potential land plant orthologs.

3) For this group, membership was defined by domain structure or amino acid similarity throughout most of the coding region, but a root could not be identified for lack of outgroup taxa; here, all members in the family were counted as potential land plant orthologs.

4) An angiosperm gene other than an Arabidopsis gene was used as a query.

5) Homologs of Arabidopsis MADS‐box genes were collected from the NCBI protein database. Homologs of rice MADS‐box genes were collected from the Database of Rice Transcription Factors (http://drtf.cbi.pku.edu.cn/). S. moellendorffii MADS‐box genes were collected using a S. remotifolia SrMADS1 MADS‐box gene as a query.

6) Homologs are present only in Arabidopsis; no phylogenetic tree was constructed.

7) The number of potential orthologs was determined using the maximum‐likelihood tree.

Metabolic pathways.

P450s

Methods. The raw Selaginella P450s were retrieved from the JGI Selaginella portal site by gathering the gene models tagged with P450 and by BLAST searches. The sequences were curated manually to best match with the EST dataset (if available) and/or the alignment with known related P450s. The non‐redundant Selaginella P450s (i.e., only one allele per locus) were assigned names according to the P450 nomenclature. Arabidopsis and Physcomitrella P450s were obtained from the P450 homepage (http://drnelson.uthsc.edu/cytochromeP450.html). The amino acid sequences in each P450 clan were aligned using ClustalW, and the Bayesian phylogenetic tree for each clan was built using MrBayes 3.1.1. The analyses invoked a comparable model (aamodelpr=mixed, nset=6, rates=invgamma). Phylogenetic trees of the different CYP clans are available at http://wiki.genomics.purdue.edu/index.php/Cytochrome_P450.

The distribution of Selaginella P450s across major plant CYP clans. Cytochrome P450s (CYPs) catalyze NADPH‐ and O2‐dependent hydroxylation reactions. P450s form a vast superfamily of genes that have been found in bacteria, insects, fish, mammals, plants, and fungi (109). They participate in a variety of biochemical pathways, including the biosynthesis of phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides, and glucosinolates, as well as plant growth regulators such as auxin, gibberellins, jasmonic acid and brassinosteroids (109). 390 full‐length P450s were identified in Selaginella; 225 were considered to be the non‐redundant P450s from one haplotype, with the remaining 165 allelic variants. All Selaginella P450s were sorted into the 9 plant CYP clans present in Arabidopsis plus two more that are absent in Arabidopsis (Table S7). Here, we compare the distribution of P450s across the major plant CYP clans in Physcomitrella, Selaginella, and Arabidopsis. Links to individual P450s are provided at http://wiki.genomics.purdue.edu/index.php/Cytochrome_P450.

The CYP51 clan. Selaginella has one homolog of CYP51 (Fig. S6), which encodes obtusifoliol 14α‐demethylase, an enzyme involved in the sterol biosynthetic pathway (110). CYP51 is the only CYP family that is conserved in all eukaryotes.

Table S7. The number of genes per P450 clan in Physcomitrella, Selaginella and Arabidopsis.

Species (total) CYP51 CYP71 CYP72 CYP74 CYP85 CYP86 CYP97 CYP710 CYP711 CYP727 CYP746

Physcomitrella (71 total) 1 41 4 3 5 10 3 2 0 1 1

Selaginella (225 total) 1 132 30 10 34 13 3 1 1 0 0

Arabidopsis (245 total) 1 154 19 2 28 33 3 4 1 0 0

The CYP71 clan. The CYP71 clan is the largest, accounting for 62%, 59% and 58% of the total P450s in the genomes of Arabidopsis, Selaginella and Physcomitrella, respectively. Angiosperm CYP71 clan P450s are necessary for the biosynthesis of a vast array of secondary metabolites and polymers. Previous studies have shown that Chlamydomonas does not contain CYP71 clan P450s, suggesting that the CYP71 clan evolved only in land plants (111).
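
The clan shares quoted above follow directly from the counts in Table S7. The short sketch below recomputes them; the dictionary layout and the function name are illustrative choices, and only the counts themselves come from the table. The computed values agree with the quoted percentages to within about one percentage point, depending on how the figures are rounded.

```python
# Clan order as given in the header of Table S7.
CLANS = ["CYP51", "CYP71", "CYP72", "CYP74", "CYP85", "CYP86",
         "CYP97", "CYP710", "CYP711", "CYP727", "CYP746"]

# Gene counts per clan, copied from Table S7.
TABLE_S7 = {
    "Physcomitrella": [1, 41, 4, 3, 5, 10, 3, 2, 0, 1, 1],
    "Selaginella":    [1, 132, 30, 10, 34, 13, 3, 1, 1, 0, 0],
    "Arabidopsis":    [1, 154, 19, 2, 28, 33, 3, 4, 1, 0, 0],
}


def clan_share(species, clan):
    """Fraction of a species' P450s that belong to the given clan."""
    counts = TABLE_S7[species]
    return counts[CLANS.index(clan)] / sum(counts)


for species, counts in TABLE_S7.items():
    share = clan_share(species, "CYP71")
    print(f"{species}: {sum(counts)} P450s, CYP71 clan share = {share:.1%}")
```

The row sums also reproduce the per-species totals stated in the table (71, 225 and 245), which is a quick consistency check on the transcribed counts.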

Among the CYP71 clan families, CYP73, CYP78, CYP98, CYP701 and CYP703 are the only families that are conserved in all land plants (Fig. S7). CYP73 and CYP98 are hydroxylases in the phenylpropanoid pathway (112, 113) and considered critical for lignin biosynthesis, a hallmark of vascular plants. The CYP701A and CYP88 families encode ent‐kaurene oxidases, which are involved in gibberellin biosynthesis (114). The conservation of CYP703C members may reflect their role in the development of the spore coat in land plants.

Physcomitrella has 41 CYP71 clan P450s; 12 fall into the five families discussed above and 29 are distributed among 11 other families. Physcomitrella CYP754, CYP755, CYP756, CYP759 and CYP760, together with Selaginella CYP782A1 form a clade that lacks Arabidopsis orthologs, indicating this clade of P450s was lost in angiosperms. One family, CYP758, is restricted to Physcomitrella.

Selaginella has 132 CYP71 clan P450s belonging to 22 families; 11 families (79 genes) form a clade that is unique to Selaginella (Fig. S7). This suggests that there have been parallel expansions of the CYP71 clan in the lycophyte and angiosperm lineages. This group of Selaginella P450s is likely to include enzymes that are involved in the synthesis of novel compounds in Selaginella.

The 154 Arabidopsis CYP71 clan P450s fall into 20 families, 15 of which are absent in Selaginella and Physcomitrella. These Arabidopsis‐specific families could mark the differences between flowering plants and plants from other land plant lineages. These proteins are involved in glucosinolate and tryptophan‐dependent auxin biosynthesis (115‐122), terpenoid indole alkaloid, sesquiterpene, and triterpene biosynthesis (123‐125), phytoalexin biosynthesis (126‐128), alkaloid (129) and flavonoid metabolism (130, 131).

The CYP72 clan. Arabidopsis has seven families in the CYP72 clan, two of which (CYP734A1 and CYP72C1) are involved in brassinosteroid catabolism (132‐135). Selaginella has only one P450 (CYP773A1) that clusters with the root of the Arabidopsis clade containing CYP72, CYP721, CYP734, and CYP709 (Fig. S8). Other members of this clade are involved in cytokinin (trans‐zeatin) biosynthesis in Arabidopsis. Although cytokinin signaling has been reported in mosses (136), Physcomitrella and Selaginella lack cytokinin hydroxylases, suggesting that trans‐zeatin biosynthesis is unique to angiosperms or that Selaginella and Physcomitrella employ alternative cytokinin biosynthetic routes. Selaginella has seven families in the CYP72 clan (Fig. S8), only two of which (CYP774 and CYP773) have descendants in Physcomitrella or Arabidopsis.

The CYP74 clan. The CYP74 family is the only family within the CYP74 clan. In angiosperms, these P450s are involved in the biosynthesis of jasmonic acid (137, 138) and a group of plant C6 volatiles (also known as green leaf volatiles) with roles in plant defense signaling (139, 140). The CYP74 family is absent in Chlamydomonas (111) but present in all land plants (Fig. S8). Interestingly, the Selaginella CYP74 family contains many more genes than Physcomitrella or Arabidopsis.

The CYP85 clan. Several families from the CYP85 clan, namely CYP85, CYP90 and CYP724, are involved in the biosynthesis of brassinosteroids in angiosperms (110, 141‐145). Selaginella has only a single family (CYP90) in this clade, which potentially represents brassinosteroid 23α‐hydroxylase orthologs (Fig. S9). Physcomitrella has only four P450s in the CYP85 clan, but none of them cluster with brassinosteroid biosynthetic P450s, suggesting that the brassinosteroid signaling pathway is absent in bryophytes but not Selaginella.

The CYP707 family encodes an ABA 8’‐hydroxylase that catalyzes the conversion of biologically active ABA into 8’‐hydroxy ABA, followed by spontaneous cyclization to form biologically inactive phaseic acid (146, 147). Selaginella has five P450s in the CYP707 family, among which CYP707A20 and CYP70743 are clearly Arabidopsis CYP707A orthologs (Fig. S10). Although ABA signaling is present in mosses (136), Physcomitrella lacks obvious Arabidopsis CYP707 orthologs. If Physcomitrella possesses an angiosperm‐like ABA catabolic pathway, the Physcomitrella CYP762 and CYP763 members are the most likely candidates for ABA hydroxylases, since they branch from the CYP707A clade.

CYP716 is the only family within the CYP85 clan that is conserved across land plants. The functions of members of this family are unknown.

The CYP86 clan. The Arabidopsis CYP86 clan includes four CYP families. The CYP86, CYP94 and CYP704 families are conserved in Physcomitrella, Selaginella and Arabidopsis, whereas CYP96 is unique to angiosperms (Fig. S10). The CYP86 clan P450s are associated with plant cuticle formation. Both CYP86 and CYP94 members have been shown to catalyze ω‐hydroxylation reactions on long chain fatty acids, which are required for cutin biosynthesis (148‐150). The CYP704 family is conserved in land plants, and is indispensable for pollen exine formation (151). CYP96A15 is responsible for the formation of secondary alcohols and ketones in stem cuticular waxes that may be unique to the angiosperms.

The CYP97 clan. Selaginella has three CYP97 P450s that are orthologs of the CYP97A, CYP97B, and CYP97C subfamilies (Fig. S11). The three CYP97 subfamilies are highly conserved throughout the plant kingdom (111). In Arabidopsis, CYP97C1 and CYP97A3 are indispensable for lutein biosynthesis (152). Although its function is unknown, CYP97B is also likely to be involved in carotenoid metabolism. These results suggest that the ability to synthesize carotenoids arose early in plant evolution.

The CYP710 clan. This clan consists of only one family (CYP710), which is closely related to the CYP61 family in fungi. A CYP710 P450 is responsible for the biosynthesis of Δ22‐sterols (153). CYP710 exists as a multiple gene family in Arabidopsis and Physcomitrella, but as a single gene in Selaginella.

The CYP711 clan. The CYP711 family is the only family within the CYP711 clan. CYP711 is a single gene in Arabidopsis and Selaginella, and is absent in Physcomitrella. It has been suggested that the moss lineage might have lost this family independently, since CYP711 is present in Chlamydomonas (111). CYP711A1 (MAX1) has been shown to produce a carotenoid‐derived branch‐inhibiting hormone (154, 155).

Figure S6. Phylogeny of Selaginella CYP51 clan P450s.

At, Arabidopsis thaliana; Cr, Chlamydomonas reinhardtii; Dr, Danio rerio; Hs, Homo sapiens; Mt, Mycobacterium tuberculosis; Os, Oryza sativa; Pp, Physcomitrella patens; Sc, Saccharomyces cerevisiae; Sm, Selaginella moellendorffii. Color code: green algae, brown; mosses, blue; Selaginella, green; angiosperms, red.



At, Arabidopsis (red); Pp, Physcomitrella (blue); Sm, Selaginella (green).

Figure S8. Phylogenies of Selaginella CYP72 clan (left) and CYP74 clan (right) P450s.



BAHD acyltransferases

The BAHD acyltransferases comprise a large family (six phylogenetic clades) of plant‐specific enzymes. Products produced by characterized members include modified anthocyanins (156‐160), green‐leaf volatiles (161), lignin (162) and suberin (163). Some BAHD acyltransferases are essential for the synthesis of high value plant metabolites such as taxol and taxol analogs (164, 165) as well as volatile esters conferring flavor and odor to fruits and flowers (166, 167). However, the great majority of BAHD acyltransferases are uncharacterized and the nature of their substrates and products is unknown. The Selaginella genome encodes 46 BAHD acyltransferases per haplotype, comparable to the number in Arabidopsis (61 genes) and much greater than that found in Physcomitrella (13 genes). No BAHD genes could be found in Chlamydomonas reinhardtii or Volvox carteri, indicating that the BAHD family appeared early in the evolution of land plants. Phylogenetic analyses (performed as described for the P450 genes) indicate that two of the six BAHD clades (clades IV and V) were present in the earliest land plants (Fig. S12). One clade (clade I) is present only in vascular plants, while all others (clades II, III, and VI) are either Selaginella or angiosperm specific. Several of the 14 Arabidopsis Clade III members are characterized acetyltransferases involved in the formation of volatile esters (168), suggesting that the expansion of this clade may have been important in the evolution of olfactory attractants associated with pollination. The independent diversification of the BAHD acyltransferase family in Selaginella suggests that secondary metabolism in this lineage could be substantially different from that seen in angiosperms, and opens up the potential for the discovery of new compounds or even classes of compounds in Selaginella species.

Figure S12. Cladogram of BAHD acyltransferases.

Asterisks indicate genes that cannot be placed into clades. Clade VI is a new clade. Links to individual Selaginella genes are available at http://wiki.genomics.purdue.edu/index.php/BAHD_family_acyltransferases.

Terpene synthases

Terpenoids are the largest class of plant natural products and serve a multitude of functions, including defense against herbivores and pathogens, pollinator attraction (by the emission of scented volatile terpenes) and GA‐mediated regulation of plant development (169). Terpene synthases (TSs) are cyclases that catalyze the formation of terpenoids, of which there are three major types in plants: monoterpenes (C10), sesquiterpenes (C15), and diterpenes (C20) (170, 171). Phylogenetic analysis of plant TS genes (performed as described for the P450 genes) is shown in Figure S13, and suggests that 1) all land plant TSs are derived from the diterpene synthases responsible for ent‐kaurene synthesis, a key step of GA biosynthesis; and 2) monoterpene synthases, and, perhaps, diterpene synthases, arose within the seed plant lineage.

Figure S13. Phylogeny of plant terpene synthases.

Blue: Physcomitrella; green: Selaginella; red: dicots; brown: monocots; purple: gymnosperms; MTS: monoterpene synthase; DTS: diterpene synthase; STS: sesquiterpene synthase. KS and CPS are involved in GA biosynthesis.

The plastome.

The plastome of Selaginella was extracted from unedited sequence reads using the MegaBLAST portal (http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml). Initially, the rbcL sequence of Selaginella (accession AF093256) was used to blast to this database. The resulting sequence reads and their paired‐end reads were downloaded and assembled using Sequencher version (Gene Codes Corporation, Ann Arbor, Michigan, USA). Approximately 300 nucleotides from each end of the resulting contig were then blasted and the resulting sequence reads and their paired‐end reads were assembled with the previous contig. This process was repeated until the entire plastome was assembled. The complete plastome was annotated using DOGMA (172). Codon usage was calculated using ebioinfogen BioTools (http://www.ebioinfogen.com/biotools/codon‐usage.htm). The plastome of Selaginella (GenBank accession number HM173080) is 143,775 bp in length and is comprised of an 83,665 bp LSC, a 35,882 bp SSC, and two 12,114 bp IRs (Fig. S14). Ninety‐eight genes were annotated in total, of which 8 are present in two copies (IR) and 5 are putative pseudogenes. At least 49 of the 77 (63.6%) intact protein‐coding genes appear to be RNA edited: 31 have non‐canonical start codons, 5 have non‐canonical stop codons, an additional 13 have both, and one has two internal stop codons. Four rRNA genes were annotated in their usual position in the IR. Typical land plant plastomes contain genes for about 30 tRNAs. We identified only 12 putatively functional tRNA genes and three tRNA pseudogenes. A similar pattern of tRNA loss has been reported in another Selaginella species, S. uncinata (173). The intact tRNAs alone are insufficient to accommodate translation in the plastid, thus tRNAs must be imported. The plastome gene order of Selaginella is nearly identical to that of another lycophyte, Huperzia lucidula (174), with one major exception. 
A large region (~14 kb) spanning rps4 to psbD has been translocated from the LSC to the SSC, with rps4 now residing in the IR (Fig. S14). A similar translocation has been reported for S. uncinata (173), though the genes included in this translocation differ slightly. In both Selaginella plastomes rps4 marks one end of the translocated segment; in S. uncinata, however, the other end is marked by trnD‐GUC rather than psbD. In S. moellendorffii the translocated region therefore contains three fewer genes: trnE‐UUC, trnY‐GUA, and trnD‐GUC remain in the LSC adjacent to ycf2 (Fig. S14), as is the case in most land plant plastid genomes. Several additional unique features have been described in the S. uncinata plastome. These include a 20‐kb LSC inversion, duplication of the psbK and trnQ‐UUG genes, and the translocation of the petN gene (173). The S. moellendorffii plastome does not share these unusual features (Fig. S14). The plastome of S. moellendorffii has an overall GC content of 51%, a value well outside the 95% range (33.11%‐40.79%), as previously noted (175). This result is similar, however, to the 55% GC content reported for the plastome of S. uncinata (173). The plastomes of S. moellendorffii and S. uncinata have lost all group IIA1 introns as well as some of the genes containing them. These introns are typically spliced by a maturase encoded by matK (176), which is in both Selaginella plastomes.
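The region sizes and gene counts quoted above can be cross‐checked with simple arithmetic. The following is a minimal Python sketch, not part of the original analysis: the variable names and the gc_content helper are illustrative assumptions, and the editing tally assumes that the gene with two internal stop codons is already included among the 49 edited genes.

```python
# Values quoted in the text for the S. moellendorffii plastome (HM173080).
LSC = 83_665      # large single-copy region, bp
SSC = 35_882      # small single-copy region, bp
IR = 12_114       # one inverted-repeat copy, bp
TOTAL = 143_775   # reported plastome length, bp


def quadripartite_length(lsc, ssc, ir):
    """Length of a circular plastome with one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


def gc_content(seq):
    """Percent G+C in a nucleotide string; only A, C, G and T are counted."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# The four regions should add up to the reported total length.
assembled = quadripartite_length(LSC, SSC, IR)

# Edited genes: non-canonical start only, non-canonical stop only, or both.
edited = 31 + 5 + 13
edited_pct = round(100 * edited / 77, 1)  # out of 77 intact protein-coding genes

print(assembled, assembled == TOTAL)
print(edited, edited_pct)
```

The gc_content helper is included only to show how an overall GC percentage such as the 51% quoted above is defined; applying it to the deposited sequence would require downloading the GenBank record separately.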

Figure S14. Map of the Selaginella plastid genome.

PPR gene family.

Selaginella PPR genes were identified with a BLASTP search against the FilteredModel2 dataset, which contains gene models for the two haplotypes. A total of 1718 genes (>800 per haplotype) encoding one or more PPR motifs were identified in the Selaginella genome sequence. Of these, 144 encode PPR proteins with a C‐terminal DYW motif (PPR‐DYW proteins), substantially more than are found in rice (90 genes) or Arabidopsis (87 genes) (177). In Arabidopsis and rice, over 500 RNA editing sites are found in plastids and mitochondria (178, 179), while only 12 editing sites are found in Physcomitrella (180, 181). No RNA editing sites or PPR‐DYW proteins are found in Chlamydomonas. At least 63% of the protein‐coding plastome genes in Selaginella are RNA edited, with an estimated 3000 sites edited (Table S8), similar to what is observed in Selaginella uncinata plastids (173). The high number of PPR‐DYW genes in Selaginella compared to other plants (Table S8) correlates with a high number of edited sites.

Table S8. PPR gene numbers and number of RNA editing sites in organellar genomes.

                                                  Chlamydomonas  Physcomitrella  Selaginella  Arabidopsis  Rice
PPR proteins                                      12             103             >800         450          477
PPR‐DYW proteins                                  0              10              144          87           90
# RNA editing sites in plastids and mitochondria  0              12              >3000        542          512
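The relationship between PPR‐DYW gene number and editing‐site number in Table S8 can be made explicit with a short calculation. The Python sketch below is illustrative only and is not part of the original analysis: the dictionary layout and function name are assumptions, and the ">3000" entry for Selaginella is treated as a lower bound of 3000.

```python
# Table S8 values as (PPR-DYW proteins, RNA editing sites).
# Assumption: ">3000" for Selaginella is entered as the lower bound 3000.
TABLE_S8 = {
    "Chlamydomonas": (0, 0),
    "Physcomitrella": (10, 12),
    "Selaginella": (144, 3000),
    "Arabidopsis": (87, 542),
    "Rice": (90, 512),
}


def sites_per_dyw(dyw, sites):
    """Editing sites per PPR-DYW gene; None when there are no PPR-DYW genes."""
    return None if dyw == 0 else sites / dyw


ratios = {sp: sites_per_dyw(*counts) for sp, counts in TABLE_S8.items()}

# Species with the most PPR-DYW genes and with the most editing sites.
most_dyw = max(TABLE_S8, key=lambda sp: TABLE_S8[sp][0])
most_sites = max(TABLE_S8, key=lambda sp: TABLE_S8[sp][1])

for species, ratio in ratios.items():
    print(species, "n/a" if ratio is None else round(ratio, 1))
print(most_dyw, most_sites)
```

Under these assumptions Selaginella has both the most PPR‐DYW genes and the most editing sites, in line with the correlation described in the text; the per‐gene ratio is a descriptive summary and should not be read as a mechanistic estimate.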

References and Notes

1. P. Kenrick, P. R. Crane, Nature 389, 33 (1997). 2. Details are given in the supporting materials on Science Online. 3. W. Wang et al., Construction of a bacterial artificial chromosome library from the spikemoss Selaginella moellendorffii: A new resource for plant comparative genomics. BMC Plant Biol. 5, 10 (2005). 4. M. J. Axtell, J. A. Snyder, D. P. Bartel, Common functions for diverse small RNAs of land plants. Plant Cell 19, 1750 (2007). 5. M. S. Barker, H. Vogel, M. E. Schranz, Paleopolyploidy in the Brassicales: Analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol. Evol. 1, 391 (2009). 6. H. Tang et al., Synteny and collinearity in plant genomes. Science 320, 486 (2008).

7. H. Tang et al., Unraveling ancient hexaploidy through multiply‐aligned angiosperm gene maps. Genome Res. 18, 1944 (2008). 8. M. Ghildiyal, P. D. Zamore, Small silencing RNAs: An expanding universe. Nat. Rev. Genet. 10, 94 (2009). 9. D. Chen et al., Small RNAs in angiosperms: Sequence characteristics, distribution and generation. Bioinformatics 26, 1391 (2010). 10. K. D. Kasschau et al., Genome‐wide profiling and analysis of Arabidopsis siRNAs. PLoS Biol. 5, e57 (2007). 11. K. Nobuta et al., An expression atlas of rice mRNAs and small RNAs. Nat. Biotechnol. 25, 473 (2007). 12. S. H. Cho et al., Physcomitrella patens DCL3 is required for 22‐24 nt siRNA accumulation, suppression of retrotransposon‐derived transcripts, and normal development. PLoS Genet. 4, e1000314 (2008). 13. D. Bourc’his, O. Voinnet, A small‐RNA perspective on gametogenesis, fertilization, and early zygotic development. Science 330, 617 (2010). 14. D. H. Chitwood et al., Pattern formation via small RNA mobility. Genes Dev. 23, 549 (2009). 15. A. Peragine, M. Yoshikawa, G. Wu, H. L. Albrecht, R. S. Poethig, SGS3 and SGS2/SDE1/RDR6 are required for juvenile development and the production of trans‐ acting siRNAs in Arabidopsis. Genes Dev. 18, 2368 (2004). 16. Z. Xie, E. Allen, A. Wilken, J. C. Carrington, DICER‐LIKE 4 functions in trans‐acting small interfering RNA biogenesis and vegetative phase change in Arabidopsis thaliana. Proc. Natl. Acad. Sci. U.S.A. 102, 12984 (2005). 17. S. K. Floyd, J. L. Bowman, Distinct developmental mechanisms reflect the independent origins of leaves in vascular plants. Curr. Biol. 16, 1911 (2006).

18. C. J. Harrison et al., Independent recruitment of a conserved developmental mechanism during leaf evolution. Nature 434, 509 (2005).

19. S. Tsuji et al., The chloroplast genome from a lycophyte (microphyllophyte), Selaginella uncinata, has a unique inversion, transpositions and many gene losses. J. Plant Res. 120, 281 (2007).

20. F. Grewe et al., A unique transcriptome: 1782 positions of RNA editing alter 1406 codon identities in mitochondrial mRNAs of the lycophyte Isoetes engelmannii. Nucleic Acids Res. 39, 2890 (2011).

21. V. Knoop, When you can’t trust the DNA: RNA editing changes transcript sequences. Cell. Mol. Life Sci. 68, 567 (2011).

22. T. Demura, H. Fukuda, Transcriptional regulation in wood formation. Trends Plant Sci. 12, 64 (2007). 23. M. Bonke, S. Thitamadee, A. P. Mähönen, M. T. Hauser, Y. Helariutta, APL regulates vascular tissue identity in Arabidopsis. Nature 426, 181 (2003). 24. K. Hirano et al., The GID1‐mediated gibberellin perception mechanism is conserved in the Lycophyte Selaginella moellendorffii but not in the Bryophyte Physcomitrella patens. Plant Cell 19, 3058 (2007). 25. Y. Yasumura, M. Crumpton‐Taylor, S. Fuentes, N. P. Harberd, Step‐by‐step acquisition of the gibberellin‐DELLA growth‐regulatory mechanism during land‐plant evolution. Curr. Biol. 17, 1225 (2007). 26. Y. Cao et al., Bioactive flavones and biflavones from Selaginella moellendorffii Hieron. Fitoterapia 81, 253 (2010). 27. M. Luo, R. A. Wing, An improved method for plant BAC library construction. Methods Mol. Biol. 236, 3 (2003). 28. S. A. Rensing et al., The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319, 64 (2008). 29. S. Batzoglou et al., ARACHNE: a whole‐genome shotgun assembler. Genome Res. 12, 177 (2002). 30. J. C. Detter et al., Isothermal strand‐displacement amplification applications for high‐ throughput genomics. Genomics 80, 691 (2002). 31. B. Ewing, P. Green, Base‐calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186 (1998). 32. B. Ewing, L. Hillier, M. C. Wendl, P. Green, Base‐calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175 (1998). 33. X. Huang, A. Madan, CAP3: A DNA sequence assembly program. Genome Res. 9, 868 (1999).

34. S. F. Altschul et al., Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389 (1997). 35. D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster, MEGAN analysis of metagenomic data. Genome Res. 17, 377 (2007). 36. A. A. Salamov, V. V. Solovyev, Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516 (2000). 37. E. Birney, R. Durbin, Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547 (2000). 38. E. M. Zdobnov, R. Apweiler, InterProScan—an integration platform for the signature‐ recognition methods in InterPro. Bioinformatics 17, 847 (2001). 39. M. Ashburner et al., The Gene Ontology Consortium, Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25 (2000). 40. E. V. Koonin et al., A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5, R7 (2004). 41. M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, M. Hattori, The KEGG resource for deciphering the genome. Nucleic Acids Res. 32 (Database issue), D277 (2004). 42. B. J. Haas, A. L. Delcher, J. R. Wortman, S. L. Salzberg, DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 20, 3643 (2004). 43. R. Liu, J. L. Bennetzen, Enchilada redux: how complete is your genome sequence? New Phytol. 179, 249 (2008). 44. B. J. Haas et al., Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 3, 7 (2005). 45. D. Lang, A. D. Zimmer, S. A. Rensing, R. Reski, Exploring plant biodiversity: the Physcomitrella genome and beyond. Trends Plant Sci. 13, 542 (2008). 46. M. Takamiya, Comparative karyomorphology and Interrelationships of Selaginella in Japan. J. Plant Res. 106, 149 (1993). 47. R. Reski, M. Faust, X. H. Wang, M. Wehe, W. O. Abel, Genome analysis of the moss Physcomitrella patens (Hedw.) B.S.G. Mol. Gen. Genet. 244, 352 (1994). 48. M. S. Barker, H. Vogel, M. E. 
Schranz, Paleopolyploidy in the Brassicales: analyses of the
Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol. Evol. 1, 391 (2009). 49. H. Tang et al., Unraveling ancient hexaploidy through multiply‐aligned angiosperm gene maps. Genome Res. 18, 1944 (2008). 50. S. A. Rensing et al., An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC Evol. Biol. 7, 130 (2007). 51. E. M. McCarthy, J. F. McDonald, LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362 (2003). 52. Z. Xu, H. Wang, LTR_FINDER: an efficient tool for the prediction of full‐length LTR retrotransposons. Nucleic Acids Res. 35 (Web Server issue), W265 (2007). 53. T. Wicker et al., A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973 (2007). 54. J. L. Bennetzen et al., Grass genomes. Proc. Natl. Acad. Sci. U.S.A. 95, 1975 (1998). 55. J. Ma, J. L. Bennetzen, Recombination, rearrangement, reshuffling, and divergence in a centromeric region of rice. Proc. Natl. Acad. Sci. U.S.A. 103, 383 (2006). 56. V. V. Kapitonov, J. Jurka, Rolling‐circle transposons in eukaryotes. Proc. Natl. Acad. Sci. U.S.A. 98, 8714 (2001). 57. L. Yang, J. L. Bennetzen, Structure‐based discovery and description of plant and animal Helitrons. Proc. Natl. Acad. Sci. U.S.A. 106, 12832 (2009). 58. Z. Bao, S. R. Eddy, Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269 (2002). 59. S. Karlin, S. F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87, 2264 (1990). 60. J. D. Thompson, D. G. Higgins, T. J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673 (1994). 61. Z. 
Tu, Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae. Proc. Natl. Acad. Sci. U.S.A. 98, 1699 (2001). 62. J. D. DeBarry, R. Liu, J. L. Bennetzen, Discovery and assembly of repeat family
pseudomolecules from sparse genomic sequence data using the Assisted Automated Assembler of Repeat Families (AAARF) algorithm. BMC Bioinformatics 9, 235 (2008). 63. Z. Zhang, S. Schwartz, L. Wagner, W. Miller, A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203 (2000). 64. B. Ma, J. Tromp, M. Li, PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440 (2002). 65. D. L. Wheeler et al., Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36 (Database issue), D13 (2008). 66. E. Birney, J. D. Thompson, T. J. Gibson, PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res. 24, 2730 (1996). 67. R. C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792 (2004). 68. R. Wernersson, A. G. Pedersen, RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 31, 3537 (2003). 69. Z. H. Yang, PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555 (1997). 70. N. Goldman, Z. Yang, A codon‐based model of nucleotide substitution for protein‐coding DNA sequences. Mol. Biol. Evol. 11, 725 (1994). 71. G. Blanc, K. H. Wolfe, Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16, 1667 (2004). 72. L. Y. Cui et al., Widespread genome duplications throughout the history of flowering plants. Genome Res. 16, 738 (2006). 73. P. Chaudhuri, J. S. Marron, SiZer for exploration of structures in curves. J. Am. Stat. Assoc. 94, 807 (1999). 74. G. McLachlan, D. Peel, K. Basford, P. Adams, The EMMIX software for the fitting of mixtures of normal and t‐components. J. Stat. Softw. 4, 1 (1999). 75. J. A. Schlueter et al., Mining EST databases to resolve evolutionary events in major crop species. Genome 47, 868 (2004). 
76. G. Blanc, K. H. Wolfe, Functional divergence of duplicated genes formed by polyploidy
during Arabidopsis evolution. Plant Cell 16, 1679 (2004). 77. P. Korall, P. Kenrick, Phylogenetic relationships in Selaginellaceae based on RBCL sequences. Am. J. Bot. 89, 506 (2002). 78. M. J. Axtell, J. A. Snyder, D. P. Bartel, Common functions for diverse small RNAs of land plants. Plant Cell 19, 1750 (2007). 79. S. Griffiths‐Jones, H. K. Saini, S. van Dongen, A. J. Enright, miRBase: tools for microRNA genomics. Nucleic Acids Res. 36 (Database issue), D154 (2008). 80. K. Tamura, J. Dudley, M. Nei, S. Kumar, MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596 (2007). 81. S. M. Rumble et al., SHRiMP: accurate mapping of short color‐space reads. PLOS Comput. Biol. 5, e1000386 (2009). 82. T. M. Lowe, S. R. Eddy, tRNAscan‐SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955 (1997). 83. G. Chuck, A. M. Cigan, K. Saeteurn, S. Hake, The heterochronic maize mutant Corngrass1 results from overexpression of a tandem microRNA. Nat. Genet. 39, 544 (2007). 84. K. D. Kasschau et al., Genome‐wide profiling and analysis of Arabidopsis siRNAs. PLoS Biol. 5, e57 (2007). 85. R. A. Mosher, F. Schwach, D. Studholme, D. C. Baulcombe, PolIVb influences RNA‐ directed DNA methylation independently of its role in siRNA biogenesis. Proc. Natl. Acad. Sci. U.S.A. 105, 3145 (2008). 86. S. H. Cho et al., Physcomitrella patens DCL3 is required for 22‐24 nt siRNA accumulation, suppression of retrotransposon‐derived transcripts, and normal development. PLoS Genet. 4, e1000314 (2008). 87. Z. Xie et al., Genetic and functional diversification of small RNA pathways in plants. PLoS Biol. 2, e104 (2004). 88. A. Deleris et al., Hierarchical action and inhibition of plant Dicer‐like proteins in antiviral defense. Science 313, 68 (2006). 89. M. D. 
Howell et al., Genome‐wide analysis of the RNA‐DEPENDENT RNA POLYMERASE6/DICER‐LIKE4 pathway in Arabidopsis reveals dependency on miRNA‐ and tasiRNA‐directed targeting. Plant Cell 19, 926 (2007).

90. H. Vaucheret, Plant ARGONAUTES. Trends Plant Sci. 13, 350 (2008). 91. X. B. Wang et al., RNAi‐mediated viral immunity requires amplification of virus‐derived siRNAs in Arabidopsis thaliana. Proc. Natl. Acad. Sci. U.S.A. 107, 484 (2010). 92. M. Talmor‐Neiman et al., Identification of trans‐acting siRNAs in moss and an RNA‐ dependent RNA polymerase required for their biogenesis. Plant J. 48, 511 (2006). 93. A. J. Herr, M. B. Jensen, T. Dalmay, D. C. Baulcombe, RNA polymerase IV directs silencing of endogenous DNA. Science 308, 118 (2005). 94. T. S. Ream et al., Subunit compositions of the RNA‐silencing enzymes Pol IV and Pol V reveal their origins as specialized forms of RNA polymerase II. Mol. Cell 33, 192 (2009). 95. Y. Onodera et al., Plant nuclear RNA polymerase IV mediates siRNA and DNA methylation‐ dependent heterochromatin formation. Cell 120, 613 (2005). 96. J. Luo, B. D. Hall, A multistep process gave rise to RNA polymerase IV of land plants. J. Mol. Evol. 64, 101 (2007). 97. T. A. Montgomery et al., Specificity of ARGONAUTE7‐miR390 interaction and dual functionality in TAS3 trans‐acting siRNA formation. Cell 133, 128 (2008). 98. N. Fahlgren et al., Regulation of AUXIN RESPONSE FACTOR3 by TAS3 ta‐siRNA affects developmental timing and patterning in Arabidopsis. Curr. Biol. 16, 939 (2006). 99. X. Adenot et al., DRB4‐dependent TAS3 trans‐acting siRNAs control leaf morphology through AGO7. Curr. Biol. 16, 927 (2006). 100. M. Talmor‐Neiman et al., Identification of trans‐acting siRNAs in moss and an RNA‐ dependent RNA polymerase required for their biogenesis. Plant J. 48, 511 (2006). 101. U. Hellsten et al., Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis. BMC Biol. 5, 31 (2007). 102. N. Saitou, M. Nei, The neighbor‐joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406 (1987). 103. S. Guindon, O. 
Gascuel, A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696 (2003). 104. K. Katoh, K. Kuma, H. Toh, T. Miyata, MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511 (2005).

105. D. Maddison, W. Maddison, MacClade 4: Analysis of Phylogeny and Character Evolution., (Sinauer Associates, Sunderland, MA, 2000). 106. J. Felsenstein, Distributed by the author, Department of Genetics, University of Washington, Seattle (2005). 107. J. Adachi, M. Hasegawa, MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28, 1 (1996). 108. D. T. Jones, W. R. Taylor, J. M. Thornton, The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275 (1992). 109. C. Chapple, Molecular‐Genetic Analysis of Plant Cytochrome P450‐Dependent Monooxygenases. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 311 (1998). 110. H. B. Kim et al., Arabidopsis cyp51 mutant shows postembryonic seedling lethality associated with lack of membrane integrity. Plant Physiol. 138, 2033 (2005). 111. D. R. Nelson, Plant cytochrome P450s from moss to poplar. Phytochem. Rev. 5, 193 (2006). 112. R. Franke et al., The Arabidopsis REF8 gene encodes the 3‐hydroxylase of phenylpropanoid metabolism. Plant J. 30, 33 (2002). 113. M. Mizutani, D. Ohta, R. Sato, Isolation of a cDNA and a genomic clone encoding cinnamate 4‐hydroxylase from Arabidopsis and its expression manner in planta. Plant Physiol. 113, 755 (1997). 114. C. A. Helliwell, A. Poole, W. J. Peacock, E. S. Dennis, Arabidopsis ent‐kaurene oxidase catalyzes three steps of gibberellin biosynthesis. Plant Physiol. 119, 507 (1999). 115. U. Wittstock, B. A. Halkier, Cytochrome P450 CYP79A2 from Arabidopsis thaliana L. Catalyzes the conversion of L‐phenylalanine to phenylacetaldoxime in the biosynthesis of benzylglucosinolate. J. Biol. Chem. 275, 14659 (2000). 116. S. Chen et al., CYP79F1 and CYP79F2 have distinct functions in the biosynthesis of aliphatic glucosinolates in Arabidopsis. Plant J. 33, 923 (2003). 117. T. Tantikanjana, M. D. Mikkelsen, M. Hussain, B. A. Halkier, V. 
Sundaresan, Functional analysis of the tandem‐duplicated P450 genes SPS/BUS/CYP79F1 and CYP79F2 in glucosinolate biosynthesis and plant development by Ds transposition‐generated double mutants. Plant Physiol. 135, 840 (2004). 118. B. Reintanz et al., Bus, a bushy Arabidopsis CYP79F1 knockout mutant with abolished synthesis of short‐chain aliphatic glucosinolates. Plant Cell 13, 351 (2001).

119. Y. Zhao et al., Trp‐dependent auxin biosynthesis in Arabidopsis: involvement of cytochrome P450s CYP79B2 and CYP79B3. Genes Dev. 16, 3100 (2002). 120. K. Ljung et al., Sites and regulation of auxin biosynthesis in Arabidopsis roots. Plant Cell 17, 1090 (2005). 121. A. K. Hull, R. Vij, J. L. Celenza, Arabidopsis cytochrome P450s that catalyze the first step of tryptophan‐dependent indole‐3‐acetic acid biosynthesis. Proc. Natl. Acad. Sci. U.S.A. 97, 2379 (2000). 122. B. A. Halkier, J. Gershenzon, Biology and biochemistry of glucosinolates. Annu. Rev. Plant Biol. 57, 303 (2006). 123. B. Field, A. E. Osbourn, Metabolic diversification—independent assembly of operon‐like gene clusters in different plants. Science 320, 543 (2008). 124. P. Luo, Y. H. Wang, G. D. Wang, M. Essenberg, X. Y. Chen, Molecular cloning and functional identification of (+)‐delta‐cadinene‐8‐hydroxylase, a cytochrome P450 mono‐ oxygenase (CYP706B1) of cotton sesquiterpene biosynthesis. Plant J. 28, 95 (2001). 125. G. Collu et al., Geraniol 10‐hydroxylase, a cytochrome P450 enzyme involved in terpenoid indole alkaloid biosynthesis. FEBS Lett. 508, 215 (2001). 126. N. Zhou, T. L. Tootle, J. Glazebrook, Arabidopsis PAD3, a gene required for camalexin biosynthesis, encodes a putative cytochrome P450 monooxygenase. Plant Cell 11, 2419 (1999). 127. R. Schuhegger et al., CYP71B15 (PAD3) catalyzes the final step in camalexin biosynthesis. Plant Physiol. 141, 1248 (2006). 128. M. Nafisi et al., Arabidopsis cytochrome P450 monooxygenase 71A13 catalyzes the conversion of indole‐3‐acetaldoxime in camalexin synthesis. Plant Cell 19, 2039 (2007). 129. B. Siminszky, L. Gavilano, S. W. Bowen, R. E. Dewey, Conversion of nicotine to nornicotine in Nicotiana tabacum is mediated by CYP82E4, a cytochrome P450 monooxygenase. Proc. Natl. Acad. Sci. U.S.A. 102, 14919 (2005). 130. W. Jung et al., Identification and expression of isoflavone synthase, the key enzyme for biosynthesis of isoflavones in legumes. Nat. 
Biotechnol. 18, 208 (2000). 131. C. J. Liu, D. Huhman, L. W. Sumner, R. A. Dixon, Regiospecific hydroxylation of isoflavones by cytochrome p450 81E enzymes from Medicago truncatula. Plant J. 36, 471 (2003).

132. M. Nakamura et al., Activation of the cytochrome P450 gene, CYP72C1, reduces the levels of active brassinosteroids in vivo. J. Exp. Bot. 56, 833 (2005). 133. M. M. Neff et al., BAS1: A gene regulating brassinosteroid levels and light responsiveness in Arabidopsis. Proc. Natl. Acad. Sci. U.S.A. 96, 15316 (1999). 134. E. M. Turk et al., BAS1 and SOB7 act redundantly to modulate Arabidopsis photomorphogenesis via unique brassinosteroid inactivation mechanisms. Plant J. 42, 23 (2005). 135. N. Takahashi et al., shk1‐D, a dwarf Arabidopsis mutant caused by activation of the CYP72C1 gene, has altered brassinosteroid levels. Plant J. 42, 13 (2005). 136. D. Cove, M. Bezanilla, P. Harries, R. Quatrano, Mosses as model systems for the study of metabolism and development. Annu. Rev. Plant Biol. 57, 497 (2006). 137. J. H. Park et al., A knock‐out mutation in allene oxide synthase results in male sterility and defective wound signal transduction in Arabidopsis due to a block in jasmonic acid biosynthesis. Plant J. 31, 1 (2002). 138. D. Laudert, U. Pfannschmidt, F. Lottspeich, H. Holländer‐Czytko, E. W. Weiler, Cloning, molecular and functional characterization of Arabidopsis thaliana allene oxide synthase (CYP 74), the first enzyme of the octadecanoid pathway to jasmonates. Plant Mol. Biol. 31, 323 (1996). 139. H. Duan, M. Y. Huang, K. Palacio, M. A. Schuler, Variations in CYP74B2 (hydroperoxide lyase) gene expression differentially affect hexenal signaling in the Columbia and Landsberg erecta ecotypes of Arabidopsis. Plant Physiol. 139, 1529 (2005). 140. N. J. Bate et al., Molecular characterization of an Arabidopsis gene encoding hydroperoxide lyase, a cytochrome P‐450 that is wound inducible. Plant Physiol. 117, 1393 (1998). 141. S. Choe, in Plant Hormones: Physiology, Biochemistry and Molecular Biology., P. J. Davies, Ed. (Kluwer Academic Publishers, Dordrecht, Netherlands, 2004), pp. 156‐178. 142. S. 
Tanabe et al., A novel cytochrome P450 is implicated in brassinosteroid biosynthesis via the characterization of a rice dwarf mutant, dwarf11, with reduced seed length. Plant Cell 17, 776 (2005). 143. T. Nomura et al., The last reaction producing brassinolide is catalyzed by cytochrome P‐450s, CYP85A3 in tomato and CYP85A2 in Arabidopsis. J. Biol. Chem. 280, 17873 (2005). 144. T. Ohnishi et al., C‐23 hydroxylation by Arabidopsis CYP90C1 and CYP90D1 reveals a
novel shortcut in brassinosteroid biosynthesis. Plant Cell 18, 3275 (2006). 145. S. Fujita et al., Arabidopsis CYP90B1 catalyses the early C‐22 hydroxylation of C27, C28 and C29 sterols. Plant J. 45, 765 (2006). 146. S. Saito et al., Arabidopsis CYP707As encode (+)‐abscisic acid 8′‐hydroxylase, a key enzyme in the oxidative catabolism of abscisic acid. Plant Physiol. 134, 1439 (2004). 147. T. Kushiro et al., The Arabidopsis cytochrome P450 CYP707A encodes ABA 8′‐ hydroxylases: key enzymes in ABA catabolism. EMBO J. 23, 1647 (2004). 148. F. Xiao et al., Arabidopsis CYP86A2 represses Pseudomonas syringae type III genes and is required for cuticle development. EMBO J. 23, 2903 (2004). 149. N. Tijet et al., Functional expression in yeast and characterization of a clofibrate‐inducible plant cytochrome P‐450 (CYP94A1) involved in cutin monomers synthesis. Biochem. J. 332, 583 (1998). 150. I. Benveniste et al., CYP86A1 from Arabidopsis thaliana encodes a cytochrome P450‐ dependent fatty acid omega‐hydroxylase. Biochem. Biophys. Res. Commun. 243, 688 (1998). 151. A. A. Dobritsa et al., CYP704B1 is a long‐chain fatty acid omega‐hydroxylase essential for sporopollenin synthesis in pollen of Arabidopsis. Plant Physiol. 151, 574 (2009). 152. L. Tian, V. Musetti, J. Kim, M. Magallanes‐Lundback, D. DellaPenna, The Arabidopsis LUT1 locus encodes a member of the cytochrome p450 family that is required for carotenoid epsilon‐ring hydroxylation activity. Proc. Natl. Acad. Sci. U.S.A. 101, 402 (2004). 153. T. Morikawa et al., Cytochrome P450 CYP710A encodes the sterol C‐22 desaturase in Arabidopsis and tomato. Plant Cell 18, 1008 (2006). 154. J. Booker et al., MAX1 encodes a cytochrome P450 family member that acts downstream of MAX3/4 to produce a carotenoid‐derived branch‐inhibiting hormone. Dev. Cell 8, 443 (2005). 155. S. Crawford et al., Strigolactones enhance competition between shoot branches by dampening auxin transport. Development 137, 2905 (2010). 156. H. 
Suzuki et al., Identification of a cDNA Encoding Malonyl‐Coenzyme A: Anthocyanidin 3‐O‐Glucoside 6 "‐O‐Malonyltransferase from Cineraria (Senecio cruentus) Flowers. Plant Biotechnol. 20, 229 (2003).

157. H. Suzuki et al., Malonyl‐CoA:Anthocyanin 5‐O‐glucoside‐6 ‐O‐malonyltransferase from scarlet sage (Salvia splendens) flowers. Enzyme purification, gene cloning, expression, and characterization. J. Biol. Chem. 276, 49013 (2001). 158. K. Yonekura‐Sakakibara et al., Molecular and biochemical characterization of a novel hydroxycinnamoyl‐CoA: anthocyanin 3‐O‐glucoside‐6″‐O‐acyltransferase from Perilla frutescens. Plant Cell Physiol. 41, 495 (2000). 159. H. Suzuki et al., cDNA cloning and functional characterization of flavonol 3‐O‐glucoside‐ 6''‐O‐malonyltransferases from flowers of Verbena hybrida and Lamium purpureum. J. Mol. Catal., B Enzym. 28, 87 (2004). 160. H. Suzuki, cDNA cloning and characterization of two Dendranthema × morifolium anthocyanin malonyltransferases with different functional activities. Plant Sci. 166, 89 (2004). 161. J. C. D’Auria, Acyltransferases in plants: a good time to be BAHD. Curr. Opin. Plant Biol. 9, 331 (2006). 162. L. Hoffmann, S. Maury, F. Martz, P. Geoffroy, M. Legrand, Purification, cloning, and properties of an acyltransferase controlling shikimate and quinate ester intermediates in phenylpropanoid metabolism. J. Biol. Chem. 278, 95 (2003). 163. I. Molina, Y. Li‐Beisson, F. Beisson, J. B. Ohlrogge, M. Pollard, Identification of an Arabidopsis feruloyl‐coenzyme A transferase required for suberin synthesis. Plant Physiol. 151, 1317 (2009). 164. K. Walker, R. Long, R. Croteau, The final acylation step in taxol biosynthesis: cloning of the taxoid C13‐side‐chain N‐benzoyltransferase from Taxus. Proc. Natl. Acad. Sci. U.S.A. 99, 9166 (2002). 165. K. Walker, R. Croteau, Molecular cloning of a 10‐deacetylbaccatin III‐10‐O‐acetyl transferase cDNA from Taxus and functional expression in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 97, 583 (2000). 166. I. El‐Sharkawy et al., Functional characterization of a melon alcohol acyl‐transferase gene family involved in the biosynthesis of ester volatiles. 
Identification of the crucial role of a threonine residue for enzyme activity*. Plant Mol. Biol. 59, 345 (2005). 167. R. Dexter et al., Characterization of a petunia acetyltransferase involved in the biosynthesis of the floral volatile isoeugenol. Plant J. 49, 265 (2007). 168. V. Negruk, P. Yang, M. Subramanian, J. P. McNevin, B. Lemieux, Molecular cloning and characterization of the CER2 gene of Arabidopsis thaliana. Plant J. 9, 137 (1996).

169. N. Dudareva et al., (E)‐beta‐ocimene and myrcene synthase genes of floral scent biosynthesis in snapdragon: function and expression of three terpene synthase genes of a new terpene synthase subfamily. Plant Cell 15, 1227 (2003). 170. J. Bohlmann, G. Meyer‐Gauen, R. Croteau, Plant terpenoid synthases: molecular biology and phylogenetic analysis. Proc. Natl. Acad. Sci. U.S.A. 95, 4126 (1998). 171. S. C. Trapp, R. B. Croteau, Genomic organization of plant terpene synthases and molecular evolutionary implications. Genetics 158, 811 (2001). 172. S. K. Wyman, R. K. Jansen, J. L. Boore, Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20, 3252 (2004). 173. S. Tsuji et al., The chloroplast genome from a lycophyte (microphyllophyte), Selaginella uncinata, has a unique inversion, transpositions and many gene losses. J. Plant Res. 120, 281 (2007). 174. P. G. Wolf et al., The first complete chloroplast genome sequence of a lycophyte, Huperzia lucidula (Lycopodiaceae). Gene 350, 117 (2005). 175. D. R. Smith, Unparalleled GC content in the plastid DNA of Selaginella. Plant Mol. Biol. 71, 627 (2009). 176. K. Liere, G. Link, RNA‐binding activity of the matK protein encoded by the chloroplast trnK intron from mustard (Sinapis alba L.). Nucleic Acids Res. 23, 917 (1995). 177. N. O’Toole et al., On the expansion of the pentatricopeptide repeat gene family in plants. Mol. Biol. Evol. 25, 1120 (2008). 178. S. Bentolila, L. E. Elliott, M. R. Hanson, Genetic architecture of mitochondrial editing in Arabidopsis thaliana. Genetics 178, 1693 (2008). 179. Y. Notsu et al., The complete sequence of the rice (Oryza sativa L.) mitochondrial genome: frequent DNA sequence acquisition and loss during the evolution of flowering plants. Mol. Genet. Genomics 268, 434 (2002). 180. Y. Miyata, M. Sugita, Tissue‐ and stage‐specific RNA editing of rps 14 transcripts in moss (Physcomitrella patens) chloroplasts. J. Plant Physiol. 161, 113 (2004). 181. M. Rüdinger, H. T. Funk, S. A. 
Rensing, U. G. Maier, V. Knoop, RNA editing: only eleven sites are present in the Physcomitrella patens mitochondrial transcriptome and a universal nomenclature proposal. Mol. Genet. Genomics 281, 473 (2009).
