Nature Genetics: doi:10.1038/ng.3859


Supplementary Figure 1

Coverage, quality and reproducibility of 6mA marks across fungi.

Line graph showing per-strand coverage of 6mA marks in genomes of a) Dikarya (shown in greyscale to distinguish between different lineages), and b) early-diverging fungi. 6mA marks below a minimum coverage cutoff of 15x and above a maximum coverage (determined independently for each genome) were removed from downstream analyses. Coverage ranges for each lineage are shown in the figure legend. c) Modification quality value (mQV) distribution for each lineage and filtering cutoff used (black bar, 25 mQV). d) Box plots showing distribution of methylation ratios for methylated adenines in each genome analyzed (black bar within boxes shows median ratio). Methylation ratio refers to the proportion of molecules mapped to a given site which are methylated.
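The filtering described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Call` record, its field names, and the inclusive treatment of the cutoffs are assumptions, and the per-genome maximum coverage is passed in because the caption says it was determined independently for each genome. The methylation ratio is shown as methylated molecules divided by coverage purely to make the definition concrete.

```python
from dataclasses import dataclass

MIN_COVERAGE = 15  # minimum per-strand coverage cutoff from the caption
MIN_MQV = 25       # modification quality value cutoff from the caption


@dataclass
class Call:
    """One candidate 6mA call (hypothetical record layout)."""
    position: int
    coverage: int             # per-strand coverage at the site
    mqv: int                  # modification quality value
    methylated_molecules: int  # molecules at the site called methylated


def passes_filters(call: Call, max_coverage: int) -> bool:
    """Keep calls within the coverage window and at or above the mQV cutoff.

    Whether the cutoffs are inclusive is an assumption of this sketch.
    """
    in_window = MIN_COVERAGE <= call.coverage <= max_coverage
    return in_window and call.mqv >= MIN_MQV


def methylation_ratio(call: Call) -> float:
    """Proportion of molecules mapped to the site that are methylated."""
    return call.methylated_molecules / call.coverage


calls = [
    Call(position=1, coverage=10, mqv=40, methylated_molecules=5),     # too shallow
    Call(position=2, coverage=30, mqv=20, methylated_molecules=15),    # low mQV
    Call(position=3, coverage=30, mqv=30, methylated_molecules=24),    # kept
    Call(position=4, coverage=500, mqv=60, methylated_molecules=400),  # too deep
]
kept = [c for c in calls if passes_filters(c, max_coverage=100)]
```

Here only the call at position 3 survives, with a methylation ratio of 24/30.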


Supplementary Figure 2

Validation of SMRT-detected 6mA using mass spectrometry and IP sequencing.

a) Percent methylated adenines as detected by SMRT-analysis (SMRT-6mA) and by Mass-Spectrometry (MS-6mA). As a measure of high confidence SMRT-detected sites, percent of methylated adenines at ApT sites (SMRT-6mApT) is also included. b) 6mA overlap between SMRT and IP-seq methods within a 100kb region of H. vesiculosa scaffold 1. Red tiles: MACs detected through SMRT-analysis. Black tiles: methylated regions identified through IP-sequencing. Significant peaks were detected using macs2 (ref. 31), Q-value ≤ 0.01. Read coverage tracks for both the control (inner circle) and pulldown (outer circle) are also shown. c) Comparison of 6mA-IP and SMRT analysis results across all lineages examined.

ᵃ Refers to percent of 6mA bases identified by SMRT analysis prior to filtering.


Supplementary Figure 3

Surrounding nucleotide context and relative genomic occurrence of 6mA.

a) Occurrence of 6mA at 4mers in early-diverged fungi. TAT/ATA trinucleotides are underscored in red. b) Percent of total ApT containing trinucleotides within MACs that are methylated.


Supplementary Figure 4

Expression, thymine and TAT-trinucleotide frequencies flanking and across MACs.

Frequency of TAT trinucleotides (top), thymine bases (middle) and expression (bottom) are plotted upstream, downstream and across MACs. Frequency is calculated as: # occurrences ÷ total # MACs. As MACs vary in length, all MACs ≥ 100 bp were selected and fragmented into 100 sections from start to end, and the average frequency was then calculated within each section. MACs are oriented by gene direction.
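The length normalization described above can be illustrated with a short Python sketch. It is an assumption-laden reading of the caption rather than the authors' code: MACs are taken as half-open `(start, end, strand)` intervals, `occurrences` holds the positions of the feature of interest (for example TAT start positions) for each MAC, and minus-strand MACs are flipped so that section 0 is always the gene-oriented start. The function name `mac_profile` is hypothetical.

```python
N_SECTIONS = 100  # each MAC is split into 100 sections
MIN_LENGTH = 100  # only MACs of at least 100 bp are used


def mac_profile(macs, occurrences):
    """Return the per-section frequency (# occurrences / # MACs used).

    macs        -- list of (start, end, strand) with half-open coordinates
    occurrences -- list of position lists, one list per MAC
    """
    counts = [0] * N_SECTIONS
    n_used = 0
    for (start, end, strand), positions in zip(macs, occurrences):
        length = end - start
        if length < MIN_LENGTH:
            continue  # too short to split into 100 sections
        n_used += 1
        for pos in positions:
            if not start <= pos < end:
                continue  # ignore positions outside the MAC
            section = (pos - start) * N_SECTIONS // length
            if strand == "-":
                section = N_SECTIONS - 1 - section  # orient by gene direction
            counts[section] += 1
    if n_used == 0:
        return [0.0] * N_SECTIONS
    return [c / n_used for c in counts]


macs = [(0, 200, "+"), (1000, 1100, "-"), (50, 90, "+")]
occurrences = [[0, 199], [1000], [60]]
profile = mac_profile(macs, occurrences)
```

In this toy input the third MAC is skipped for being shorter than 100 bp, and the minus-strand occurrence at the first base lands in the last section after reorientation.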


Supplementary Figure 5

6mA is associated with active genes.

a) Expression and methylation level of all methylated genes, sorted by expression level. Genes are sorted by FPKM value (blue to black), with 6mA levels shown immediately below (white to dark green). While methylated genes rarely lack expression (FPKM < 1.0), the level of 6mA has no influence over the magnitude of expression. If the two were related, we would expect the amount of 6mA present to rise as expression level increased, which is not the case. b) FPKM levels of unmethylated genes, sorted by expression level.


Supplementary Figure 6

MAC overlaps with various genomic features.

a) Percent of gene models containing MACs and proximity to their transcriptional start sites. While some MACs directly overlap with the TSS, many are located slightly downstream. b) Fixed window overlaps of MACs with microRNAs. c) Fixed window overlaps of MACs with tRNAs.


Supplementary Figure 7

6mA presence or absence is related to gene function.

a) Methylation presence/absence at all genes containing common pfam (ref. 17) domains (present in at least 8 genes) and their deviation from expected. Pfams showing significant (p ≤ 0.05) departures from the expected were identified using Fisher’s exact test followed by FDR correction (significant = red, non-significant = blue). The overall percentage of significant pfams for each genome is shown in parentheses next to lineage names. b) log2 fold change in methylation presence/absence at genes containing common pfam domains across all lineages (present in at least 8 genes across all genomes, significant in at least one lineage). Lineages showing significant departure from the expected are denoted with a * (adjusted p-value ≤ 0.05), or ** (adjusted p-value ≤ 0.01). Green = enriched in unmethylated gene set, purple = enriched in methylated gene set. Constitutively expressed housekeeping proteins, such as mitochondrial Rho proteins (blue arrow) are very frequently methylated, while some genes, such as Leucine-Rich-Repeat containing proteins (orange arrow) show variability across lineages.
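As an illustration of the test named in this caption, the sketch below runs a two-sided Fisher's exact test on one 2x2 table per pfam and then adjusts the p-values. Several things here are assumptions rather than details from the source: the table layout (genes with the domain versus all other genes, split by methylated and unmethylated), the use of the Benjamini-Hochberg procedure for the FDR correction, and the function names. Only the Python standard library is used so the example stays self-contained.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for one pfam: rows = has domain / lacks domain,
# columns = methylated / unmethylated.
p_single = fisher_exact_two_sided(8, 2, 1, 5)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A pfam would then be flagged as significant when its adjusted p-value is at most 0.05, matching the threshold stated in the caption.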


Supplementary Figure 8

6mA and 5mC enrichment by region.

Overall percent cytosines methylated per genome (a), context (b) and distribution of both epigenomic marks, 5mC and 6mA, across the genome (c and d, respectively).


Supplementary Note 1

Validation of 6mA using MS and IP-Seq

To confirm abundance of 6mA in early-diverging fungi, we performed LC-MS analysis for four early-diverging fungi (H. vesiculosa, S. racemosum, L. transversale, C. anguillulae) and one Dikarya (K. imperatae). Overall, the levels of 6mA detected by LC-MS were lower than what was found by SMRT sequencing (Supplementary Figure 2a). However, even when only ApT SMRT-detected 6mA sites were considered, which are much less likely to be false positives, SMRT-detected 6mA was still higher than MS (this was also true for symmetrically methylated ApT sites in most cases). As the concentrations of the two compounds being measured (6mA and A) are at the limits of the dynamic range (low and high) of the Q Exactive Orbitrap MS, it is possible that this introduces error in accurately predicting the quantity of genomic 6mA using LC-MS.

For two early-diverging fungi, H. vesiculosa and S. racemosum, as well as two members of Dikarya, T. encephala and K. imperatae (Basidiomycota; Agaricomycotina), we performed 6mA IP-seq to validate PacBio SMRT-detected 6mA marks. We found that for early-diverging fungi, there was strong agreement between methods, with IP-seq detecting similar but slightly lower numbers of methylated regions compared with SMRT-detected MACs. However, even for 6mA regions only detected through SMRT-analysis, we frequently observed enrichment in IP signal, although significance was obscured by variance in the input control (for example, see Supplementary Figure 2b). Of significant IP-seq peaks called using macs2 (ref. 31), 99% overlapped MACs in both lineages (Supplementary Figure 2b-c). Additionally, even before applying our further filtering of SMRT-detected 6mA marks (Supplementary Figure 1), over 83% of these were located directly underneath IP-seq peaks (Supplementary Figure 2c).
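The overlap percentages quoted above amount to counting how many intervals from one set intersect at least one interval from another. The following Python sketch shows that calculation under stated assumptions: intervals are half-open `(start, end)` pairs on a single scaffold, any shared base counts as an overlap, and the function names are hypothetical. The source does not say which tool or overlap criterion was used.

```python
def intervals_overlap(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def percent_overlapping(queries, targets):
    """Percent of query intervals that overlap at least one target interval."""
    if not queries:
        return 0.0
    hits = sum(
        1 for q in queries
        if any(intervals_overlap(q, t) for t in targets)
    )
    return 100.0 * hits / len(queries)


# Toy coordinates on one scaffold (illustrative only).
ip_peaks = [(0, 100), (150, 200), (500, 600), (900, 950)]
macs = [(90, 120), (180, 400), (590, 700)]
pct_peaks_in_macs = percent_overlapping(ip_peaks, macs)
```

The pairwise scan is quadratic and is meant only to make the definition explicit; a sorted sweep or an interval tree would be the usual choice for genome-scale inputs.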


However, we did not see much agreement between IP and SMRT results for either member of the Dikarya (T. encephala and K. imperatae). As marks did not fall into dense MACs, it is possible that either: i) sensitivity of IP-sequencing was not high enough to detect single, non-symmetric 6mA marks, or ii) while the ratio of signal to noise in early-diverging fungi is very high, in genomes with extremely low/no 6mA even minimal amounts of background noise may introduce substantial difficulties in effectively discriminating between false and true positives during SMRT-analysis. Very likely, both of these challenges contribute to the difficulty of accurately discerning 6mA presence in these genomes. Consequently, we are not convinced of the presence of 6mA in the Dikarya, but do not exclude the possibility of its existence.

Supplementary Note 2

Identification of putative methyltransferases

Although most early-diverging fungi remain recalcitrant to genetic modification to date (refs. 18,19), we explored the genes potentially involved in fungal 6mA regulation through interrogating pfam domain distribution across 196 fungal taxa. Unfortunately, while present in several early-diverging fungi, none of the methyltransferases reported in animals (refs. 2,3) consistently followed our observed phylogenetic distribution of 6mA. While unlikely to be driven exclusively by a single gene, we found that among others (Supplementary Data 1), abundance of 6mA correlated relatively well with the presence of PF02384, an N-6 DNA methyltransferase domain. This domain was present across almost all early-diverging fungi, but was rare in Dikarya. In bacteria, this protein domain is involved in type I and IC restriction systems and acts as a maintenance methyltransferase, converting hemimethylated DNA to symmetrically methylated ApT sites (ref. 51). However, the function of this domain in eukaryotes remains untested. In addition to PF02384, we found several methyltransferase-related domains which were enriched in early-diverging fungi relative to Dikarya, including UPF0020, TRM13, MTS, and MT-A70 (involved in adenine methylation in C. elegans; ref. 3) (Supplementary Data 1).

References

51. Thorpe, P. H., Ternent, D. & Murray, N. E. The specificity of sty SKI, a type I restriction enzyme, implies a structure with rotational symmetry. Nucleic Acids Res. 25, 1694–1700 (1997).


Supplementary Table 1 | Genome assembly and annotation stats for all lineages sequenced during this study. All genomes were assembled using Falcon and polished with quiver (ref. 22). * indicates assemblies which were further improved using FinisherSC (ref. 21). ᵃ Refers to the number of methylated bases after filtering. ᵇ Includes simple repeats as well as transposable elements detected by RepeatScout and RepeatMasker (ref. 24). L50 refers to scaffold length (in Mb) where 50% of all nucleotides are contained within scaffolds of that size or larger.

| Lineage | GC (%) | Size (Mb) | # Genes | # 6mAᵃ | % repeatsᵇ | # scaffolds (L50) |
|---|---|---|---|---|---|---|
| Hesseltinella vesiculosa | 46 | 27.22* | 11,141 | 362,980 | 6.46 | 120 (0.57) |
| Syncephalastrum racemosum | 47 | 30.75 | 11,124 | 227,445 | 9.06 | 49 (2.37) |
| Absidia repens | 38 | 47.42 | 14,919 | 326,862 | 17.85 | 94 (1.30) |
| Lobosporangium transversale | 42 | 42.77* | 11,822 | 332,008 | 14.49 | 138 (0.67) |
| Linderina pennispora | 54 | 26.20* | 9,351 | 24,815 | 3.97 | 227 (0.91) |
| Basidiobolus meristosporus | 43 | 89.49* | 16,111 | 243,904 | 50.83 | 1,366 (0.11) |
| Piromyces finnis | 21 | 56.46* | 10,992 | 723,021 | 51.39 | 232 (0.75) |
| Anaeromyces robustus | 16 | 71.69* | 12,832 | 980,382 | 56.80 | 1,035 (0.14) |
| Catenaria anguillulae | 55 | 41.34 | 12,804 | 6,790 | 5.65 | 509 (0.22) |
| Rhizoclosmatium globosum | 45 | 57.02* | 16,990 | 11,411 | 2.72 | 437 (0.29) |
| Clohesyomyces aquaticus | 50 | 49.68* | 15,810 | 22,041 | 12.01 | 761 (0.17) |
| Leucosporidiella creatinivora | 59 | 26.33 | 9,854 | 4,619 | 4.20 | 177 (0.30) |
| Protomyces lactucaedebilis | 51 | 12.93 | 6,726 | 13,096 | 3.82 | 44 (0.57) |
| Pseudomassariella vexata | 51 | 44.71* | 12,565 | 14,111 | 9.99 | 37 (2.56) |
| Tremella encephala | 49 | 19.79 | 7,964 | 10,590 | 1.40 | 151 (0.21) |
| Kockovaella imperatae | 52 | 17.47 | 7,393 | 7,342 | 0.83 | 38 (1.1) |
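The L50 value reported in the last column follows directly from the definition in the table legend (this statistic is often called N50 elsewhere). A minimal Python sketch of that definition, with a hypothetical function name and toy scaffold lengths, might look like this:

```python
def l50_length(scaffold_lengths):
    """Scaffold length at which half of all nucleotides are contained in
    scaffolds of that size or larger (L50 as defined in the table legend).
    """
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least 50% of all nucleotides
            return length
    raise ValueError("no scaffolds given")


# Toy scaffold lengths in Mb (illustrative, not from any assembly above).
example = l50_length([5, 4, 3, 2, 1])
```

With these toy lengths the two largest scaffolds (5 and 4) already hold more than half of the 15 Mb total, so the function returns 4.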


Supplementary Table 2 | Percent of genes methylated at various conservation levels. CEGMA (ref. 16) are core eukaryotic genes. Cells are empty if a given organism is: incertae sedis (undefined) at that taxonomic level, or if it is the only sequenced representative of that group included in ortholog clustering. Ortholog clustering included genomes from 196 taxa broadly distributed across the fungal kingdom (See Supplementary Data 1 for a list of lineages included).

| Lineage | CEGMA | kingdom | order | family | species |
|---|---|---|---|---|---|
| H. vesiculosa | 87.80 | 77.44 | 59.17 | 59.56 | 54.31 |
| A. repens | 72.65 | 63.79 | 48.28 | 49.00 | 21.44 |
| S. racemosum | 70.43 | 60.45 | 39.88 | | 31.41 |
| L. transversale | 69.53 | 56.33 | | | 24.43 |
| B. meristosporus | 58.50 | 50.45 | | | 17.24 |
| L. pennispora | 14.87 | 13.72 | | | 1.10 |
| A. robustus | 71.35 | 60.21 | | 44.99 | 13.21 |
| P. finnis | 75.77 | 65.37 | | 49.05 | 13.97 |
