Supplementary methods Next generation sequencing data analysis · 2018-03-07 · Supplementary...

Nature Medicine: doi:10.1038/nm.4125

Supplementary methods Next generation sequencing data analysis

Data were aligned using BWA (version 0.7.4-r385) [1] and then pre-processed using GATK (version 2.7-2-g6bda569) following the best practices recommended by the Broad Institute [2-4]. Somatic SNV analysis was performed using MuTect (version 1.1.4) [5], VarScan (version 2.3.6; samtools version 0.1.19) [6], and SomaticSniper (version 1.0.0.1) [7]. Specifically, the MuTect command is: “java -Xmx2g -jar muTect-1.1.4.jar -rf BadCigar --analysis_type MuTect --reference_sequence hg19.fa --cosmic b37_cosmic_v54_120711_chr.vcf --dbsnp dbsnp_137.hg19.sorted.vcf --input_file:normal normal.bam --input_file:tumor tumor.bam --out output.mutect.out.txt --coverage_file output.coverage.wig.txt -vcf output.vcf input.bam”. The VarScan command is: “java varscan somatic <(samtools mpileup -q 20 -f hg19.fa normal.bam) <(samtools mpileup -q 20 -f hg19.fa tumor.bam) output --output-snp output.snv.vcf --output-indel output.indel.vcf --min-coverage 10 --min-coverage-normal 10 --min-coverage-tumor 10 --min-var-freq 0.05 --min-freq-for-hom 0.75 --normal-purity 1.00 --tumor-purity 1.00 --p-value 0.99 --somatic-p-value 0.05 --strand-filter 1 --output-vcf 1”. The SomaticSniper command is: “bam-somaticsniper -f hg19.fa -q 20 -Q 15 -F vcf tumor.bam normal.bam output.snp.ss.vcf”.

Somatic mutations were determined by comparing the mutations detected in the lymphocyte germline DNA with the mutations detected at each time point. Somatic variant calls required that the variant be observed on reads from both strands, with a minimum tumor variant allele frequency (VAF) of 5%. In VarScan, the significance of the difference in allele frequency was calculated by Fisher's exact test. The p-value threshold for calling variants was < 0.05. VarScan results were further filtered using the following three commands for SNVs and indels: “java -d64 -Xms4g -Xmx4g -jar VarScan.v2.3.6.jar somaticFilter output.snv.vcf --min-coverage 8 --min-strands2 2 --min-var-freq 0.05 --p-value 0.05 --indel-file output.indel.vcf --output-file output1.snv.vcf” and “java -d64 -Xms4g -Xmx4g -jar VarScan.v2.3.6.jar processSomatic output1.snv.vcf --min-tumor-freq 0.05 --max-normal-freq 0.02 --p-value 0.05” and “java -d64 -Xms4g -Xmx4g -jar VarScan.v2.3.6.jar processSomatic output.indel.vcf --min-tumor-freq 0.05 --max-normal-freq 0.02 --p-value 0.05”. For VarScan SNVs and indels, only the high-confidence somatic mutations output by these commands were included. For MuTect calls, all rejected mutations were removed. For SomaticSniper calls, we required a somatic score of at least 40 and a mapping quality of at least 40.
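
As a rough illustration of the Fisher's exact test mentioned above, the sketch below computes a one-sided p-value from tumor and normal read counts using only the Python standard library. The function name, the argument layout, and the choice of a one-sided alternative are assumptions made for illustration; VarScan's internal implementation may differ.

```python
from math import comb


def fisher_exact_greater(tumor_var, tumor_ref, normal_var, normal_ref):
    """One-sided Fisher's exact test on a 2x2 table of read counts.

    Returns the probability, under the hypergeometric null, of seeing at
    least `tumor_var` variant reads in the tumor given the table margins.
    Hypothetical helper; not taken from VarScan.
    """
    n_tumor = tumor_var + tumor_ref
    n_normal = normal_var + normal_ref
    total_var = tumor_var + normal_var
    denom = comb(n_tumor + n_normal, total_var)
    upper = min(n_tumor, total_var)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(
        comb(n_tumor, k) * comb(n_normal, total_var - k)
        for k in range(tumor_var, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # 12 of 100 tumor reads carry the variant versus 0 of 100 normal reads.
    print(fisher_exact_greater(12, 88, 0, 100))
```

A call would then be retained when this p-value falls below the chosen somatic threshold (0.05 in the text).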

For somatic SNVs, we required a minimum average base quality of 20 for variant-supporting reads, a minimum read coverage of 10 reads in both tumor and matched normal samples, a tumor alternative allele frequency of at least 5%, and a normal-sample alternative allele frequency of less than 5%. Only somatic SNVs detected by at least two callers were used. For somatic indels, we required a minimum read coverage of 10 reads in both tumor and matched normal samples, a tumor alternative allele frequency of at least 5%, and a normal-sample alternative allele frequency of less than 2%. Mutations located in the overlapping regions between NimbleGen SeqCap EZ Human Exome Library v3.0 and Agilent Human Exon V3 were kept. We then kept the mutations at somatic mutation loci covered by at least 10 reads in more than 108 of the samples sequenced. Mutations not located in the regions targeted by Agilent SureSelect Human All Exon V4 were filtered out, unless the loci were covered by at least 10 reads in at least 140 of the samples sequenced. For whole genome sequencing (WGS) data, only somatic mutations with an alternative allele frequency greater than 20% were included. The somatic SNVs were used to calculate the somatic mutation burden and allele frequency density for the WGS data of AML_130.
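
The SNV rules above (coverage, allele frequency, and agreement between at least two callers) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the site tuples, function names, and dictionary layout are hypothetical, and the base-quality and capture-region filters are left out.

```python
from collections import Counter


def consensus_snvs(calls_by_caller, min_callers=2):
    """Return the sites reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to an iterable of hashable site
    keys, here (chrom, pos, ref, alt) tuples (an assumed layout).
    """
    counts = Counter()
    for calls in calls_by_caller.values():
        counts.update(set(calls))  # count each caller at most once per site
    return {site for site, n in counts.items() if n >= min_callers}


def passes_depth_and_vaf(tumor_depth, tumor_alt, normal_depth, normal_alt,
                         min_depth=10, min_tumor_vaf=0.05, max_normal_vaf=0.05):
    """Apply the coverage and allele-frequency thresholds stated for SNVs."""
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False
    tumor_vaf = tumor_alt / tumor_depth
    normal_vaf = normal_alt / normal_depth
    return tumor_vaf >= min_tumor_vaf and normal_vaf < max_normal_vaf


if __name__ == "__main__":
    calls = {
        "mutect": [("chr1", 100, "A", "T"), ("chr2", 5, "G", "C")],
        "varscan": [("chr1", 100, "A", "T")],
        "somaticsniper": [("chr3", 7, "C", "G")],
    }
    print(consensus_snvs(calls))
```

For indels, the same depth check would apply with `max_normal_vaf=0.02`, following the text.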

Mutations were annotated using SnpEff (version 3.6) [8]. The SnpEff command is: “snpEff.jar -s output.stat hg19 -v input.vcf -cancer -oicr -sequenceOntology output.eff.vcf”.

For clusters 1 and 3, as determined by eloci proportion, the proportions of somatic mutations at diagnosis and relapse were compared within each cluster using the Wilcoxon signed rank test to measure the significance of dominance. The numbers of somatic mutations in clusters 1 and 3 were compared using the Wilcoxon rank sum test at the diagnosis and relapse stages, respectively. The proportions of somatic mutations whose VAF increased or decreased by at least 10% from diagnosis to relapse were compared between clusters 1 and 3 using the Wilcoxon rank sum test, for those loci that were covered by at least 10 reads in the tumor samples and matched normal samples of each patient.
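
To make the per-patient quantity in the last comparison concrete, the sketch below computes the proportions of mutations whose VAF rises or falls by at least 10% between diagnosis and relapse. It assumes that "10%" means an absolute difference of 0.10 in VAF and that VAFs are given as fractions; the function name and input layout are hypothetical.

```python
def vaf_change_proportions(vaf_pairs, min_change=0.10):
    """Return (proportion increased, proportion decreased) for one patient.

    `vaf_pairs` is a list of (vaf_diagnosis, vaf_relapse) fractions for loci
    that already passed the coverage requirement (assumed to be pre-filtered).
    """
    n = len(vaf_pairs)
    if n == 0:
        return 0.0, 0.0
    increased = sum(1 for dx, rel in vaf_pairs if rel - dx >= min_change)
    decreased = sum(1 for dx, rel in vaf_pairs if dx - rel >= min_change)
    return increased / n, decreased / n


if __name__ == "__main__":
    pairs = [(0.10, 0.40), (0.50, 0.20), (0.30, 0.32), (0.05, 0.05)]
    print(vaf_change_proportions(pairs))
```

The resulting per-patient proportions are what a rank-based test would then compare between clusters.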

Copy number aberrations (CNAs) were obtained from VarScan (version 2.3.6) and XHMM (version 1.0) [9]. The VarScan CNA commands are: “java -Xms4g -Xmx4g -jar VarScan.v2.3.6.jar copynumber <(samtools mpileup -q 1 -f hg19.fa normal.bam tumor.bam | awk 'NF==9 && $4!=0' ) output --mpileup 1 --min-base-qual 20 --min-map-qual 20 --min-coverage 30 --min-segment-size 100 --max-segment-size 10000 --p-value 0.01 --data-ratio 1.0” and “java -Xms4g -Xmx4g -jar VarScan.v2.3.6.jar copyCaller output.copynumber --output-file output.copynumber.called --output-homdel-file output.copynumber.called.homdel”. The results were further refined by circular binary segmentation using the DNAcopy library (version 1.42.0) [10] from Bioconductor. The XHMM parameters are: exome-wide CNA rate = 1e-08, mean number of targets in CNA = 6, mean distance between targets within CNA (in kb) = 70, mean of deletion z-score distribution = -3, standard deviation of deletion z-score distribution = 1, mean of diploid z-score distribution = 0, standard deviation of diploid z-score distribution = 1, mean of duplication z-score distribution = 3, standard deviation of duplication z-score distribution = 1. The CNA results from VarScan and XHMM were merged to achieve a comprehensive CNA prediction.

Genetic clonal evolution analysis was performed using sciClone (version 1.0.7) [11] to infer clonality based on CNAs and SNVs, with a maximum of 6 clusters and a minimum read depth of 50. The sciClone command was executed in R: “sciClone(snv, copyNumberCalls = cnv, sampleNames = 'sample', cnCallsAreLog2 = FALSE, minimumDepth = 50, maximumClusters = 6)”. Only samples with at least 6 somatic SNVs in copy-number-neutral regions were included in the sciClone analysis (n = 40). For loci with somatic mutations called at only one stage, the total read coverage and alternative allele read count at the other stage were calculated using GATK UnifiedGenotyper (version 2.7-2). For locations not called by UnifiedGenotyper, total read coverage was profiled using bedtools coverage, and the alternative allele read count was set to zero. Clonal complexity was measured by the number of subclone clusters predicted by sciClone. A cluster was considered detected at a stage if its mean alternative allele frequency at that stage was at least 5%. A sample has increased clonal complexity if sciClone predicts more subclone clusters at relapse than at diagnosis, and decreased clonal complexity if it predicts fewer subclone clusters at relapse than at diagnosis. Otherwise, the sample's clonal complexity is defined as stable. The proportions of samples with increased, stable, and decreased clonal complexity were compared between epigenetic clusters using the chi-square test. Somatic SNVs in copy-number-neutral regions were used to measure the proportion of somatic mutations whose variant allele frequency increased or decreased by at least 10% at relapse. This proportion was then compared between eloci clusters 1 and 3 using the Wilcoxon rank sum test.
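
The clonal complexity rule described above can be written as a small classifier. This is a sketch under stated assumptions: the inputs are per-cluster mean VAFs (as fractions) at diagnosis and relapse, as might be derived from sciClone output, and the function names are hypothetical rather than part of sciClone.

```python
def detected_clusters(cluster_mean_vafs, min_mean_vaf=0.05):
    """Count clusters whose mean VAF at one stage is at least 5%."""
    return sum(1 for vaf in cluster_mean_vafs if vaf >= min_mean_vaf)


def clonal_complexity_change(dx_mean_vafs, rel_mean_vafs, min_mean_vaf=0.05):
    """Classify a sample as 'increased', 'decreased', or 'stable'.

    Both arguments list the mean VAF of every predicted cluster at the
    given stage (diagnosis or relapse), in the same cluster order.
    """
    n_dx = detected_clusters(dx_mean_vafs, min_mean_vaf)
    n_rel = detected_clusters(rel_mean_vafs, min_mean_vaf)
    if n_rel > n_dx:
        return "increased"
    if n_rel < n_dx:
        return "decreased"
    return "stable"


if __name__ == "__main__":
    # Three predicted clusters; the second is lost and the third gained.
    print(clonal_complexity_change([0.45, 0.20, 0.01], [0.40, 0.02, 0.30]))
```

Note that a sample can be "stable" by this count even when individual clusters are gained and lost, as in the usage example.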


Supplementary table and figure legends

Supplementary Table 1: Summary of patient characteristics. Summary table of clinical characteristics of patients in the study cohort. Plus-minus values are means ± standard deviation. Percentages may not total 100 because of rounding. MRC = Medical Research Council risk stratification [12]; ELN = European LeukemiaNet risk stratification [13]; Ara-C = cytarabine (cytosine arabinoside); AlloSCT = allogeneic stem cell transplant; AutoSCT = autologous stem cell transplant.

Supplementary Table 2: Detailed description of patient characteristics and genomic assays performed. Clinical characteristics are detailed for each patient included in the study cohort. Genomic assays performed on each sample are indicated. WBC = peripheral white blood cell count; BM = bone marrow; WGA (^) = whole genome amplified; * = cytoreduction; Protocol: 1 = NimbleGen SeqCap EZ Human Exome Library v3.0, 2 = Agilent Human Exon V3 (Exon 50 Mb), 3 = Agilent SureSelect Human All Exon V4 (51 Mb); AlloSCT = allogeneic stem cell transplant; AutoSCT = autologous stem cell transplant; ATRA = all-trans retinoic acid; KW-2449 = investigational FLT3 tyrosine kinase inhibitor.

Supplementary Table 3: Sequencing statistics from ERRBS. Sequencing parameters from all ERRBS libraries prepared and sequenced. Sample ID = assigned sample ID (Dx = Diagnosis; Rel = Relapse; AML = Acute Myeloid Leukemia; NBM = normal bone marrow CD34+). SRA biosample_id = assigned biosample identification code in dbGaP accession number phs001027.v1.p1. CpG covered = number of CpGs covered per sample sequenced at a minimum of 10X. BS = Bisulfite.

Supplementary Table 4: Sequencing statistics from Genomic Sequencing. Sequencing parameters from Whole Exome and Whole Genome libraries prepared and sequenced. Sample ID = assigned sample ID (Dx = Diagnosis; Rel = Relapse; AML = Acute Myeloid Leukemia; G = patient-matched germline sample). Protocol = sequencing protocol executed (WES = Whole Exome Sequencing; WGS = Whole Genome Sequencing). SRA biosample_id = assigned biosample identification code in dbGaP accession number phs001027.v1.p1. Mapping_Rate = read mapping rate. Coverage_Rate_x15 = percent of targets covered at a minimum of 15X. Average_Coverage = average coverage per target.

Supplementary Table 5: Patterns of EPM and genetic changes in epigenetic clusters 1 and 3. Median EPM (Eloci Per Million loci), somatic mutations and VAF changes between diagnosis and relapse AML patient specimens annotated into epigenetic clusters 1 and 3 (Wilcoxon rank sum test).

Supplementary Table 6: Somatic mutations in recurrently affected AML genes. Somatic mutations (SNVs and indels) determined from exome capture data analysis in genes recurrently affected in AML [14,15]. Sample ID = assigned sample identification number; Ref = sequence present in the reference genome; Alt = sequence detected in tumor sample; Func.refGene = genic region in which the mutation was detected; Gene.refGene = gene symbol; ExonicFunc.refGene = mutation type; AAChange.refGene = predicted amino acid change in the translated gene; VAF_diagnosis = variant allele frequency in the diagnostic sample; VAF_relapse = variant allele frequency in the relapsed sample; N.D. = not determined due to inability to determine the variant allele frequency accurately; N.A. = not available due to lack of coverage in the data generated.

Supplementary Table 7: Sequencing statistics from RNA-Sequencing. Sequencing parameters from RNA-sequencing libraries prepared and sequenced. Sample ID = assigned sample ID (Dx = Diagnosis; Rel = Relapse; AML = Acute Myeloid Leukemia; scRNAseq = single cell RNA-sequencing). Protocol = sequencing protocol executed. SRA biosample_id = assigned biosample identification code in dbGaP accession number phs001027.v1.p1. Mapping_Rate = read mapping rate. Total_Read_Pairs = number of reads per sample. RefSeq_Gene_Number = number of RefSeq-defined genes covered per sample. Batch_information: 1 = first batch of bulk sample RNA-seq libraries prepared and sequenced; 2 = second batch of bulk sample RNA-seq libraries prepared and sequenced. Epigenetic_Cluster = epigenetic cluster designation (1 versus 3).


Supplementary Table 8: Differentially expressed genes between epigenetic clusters 3 and 1. Table of differentially expressed genes (DEGs) determined using bulk cell RNA-sequencing data from patients in epigenetic cluster 1 compared to patients in epigenetic cluster 3 using DESeq2. DEGs were defined by an absolute log2 fold change greater than 1.2 and adjusted P < 0.05. A positive log2 fold change represents genes up-regulated in cluster 1 patients. The normalized gene expression determined using the DESeq2 variance stabilizing transformation (Mean_VSD) is included for each gene in each cluster.

Supplementary Table 9: GO term enrichment analysis of differentially expressed genes between epigenetic clusters 1 and 3. DEGs determined between epigenetic cluster 1 and 3 patients were analyzed using GO term enrichment analysis.

Supplementary Table 10: Somatic mutations gained in AML_130 at first relapse time point (T2). VAF = Variant Allele Frequency.

Supplementary Table 11: Somatic mutations detected in AML_130 at T1 – T5 time points. VAF = Variant Allele Frequency. T1 = Diagnosis; T2 = relapse; T3 = relapse 2; T4 = relapse 3; T5 = relapse 4.

Supplementary Table 12: Genes associated with eloci linked to clinical outcomes. Eloci were determined between diagnosis and NBM samples in patients with longer (n=68) or shorter (n=69) times to relapse, split at the median value. The frequency of promoter-associated eloci in the patient groups was assessed using an odds ratio, and significance was determined using Fisher's exact test for each gene. Eloci with P less than 0.05 were included in the table. Multiple hypothesis testing correction was performed (Benjamini-Hochberg). Chromosome, Start and End columns designate the eloci genomic coordinates. Symbol = gene symbol. Number_Short_Relapse_Patient = number of eloci detected in the shorter time to relapse group. Total_Number_Short_Relapse_Patient = total number of loci covered by ERRBS in the shorter time to relapse group. Number_Long_Relapse_Patient = number of eloci detected in the longer time to relapse group. Total_Number_Long_Relapse_Patient = total number of loci covered by ERRBS in the longer time to relapse group.
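
The Benjamini-Hochberg procedure named in the legend of Supplementary Table 12 can be sketched as below. This is a generic standard-library implementation offered for illustration; the function name is hypothetical and the authors' actual software is not stated in the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```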


Supplementary Figure legends

Supplementary Figure 1: Schematic diagram of epiallele shift measurement using methyl-cytosine sequencing data. Open circles indicate unmodified cytosines while filled circles represent methylated cytosines. Epialleles are defined as the collection of phased DNA methylation patterns at genomic loci with four adjacent CpG sites. Eloci are loci with high epiallele compositional change (epiallele shift; shown is a representative elocus from a relapse (R) versus diagnosis (D) sample). Overall differences between two methylomes are defined by the number of eloci normalized by the number of loci covered in the respective sequencing library (EPM = Eloci Per Million Loci).

Supplementary Figure 2: Representative results from methyl-capture assays of AML_103 patient specimens. Roche (a-e) and Agilent (f-j) capture assay results were compared to ERRBS results. Scatter plots of delta entropy measured for epiallele shift using ERRBS compared to results from Roche (a) or Agilent (f) capture assays. (b-e) Four loci with high epiallele shift validated by the Roche capture approach. (g-j) Four loci with high epiallele shift validated by the Agilent capture approach.

Supplementary Figure 3: Somatic mutations in low and high EPM patient groups. Violin plot of the number of somatic mutations in high and low EPM patients from Fig. 1 (Wilcoxon rank sum test).

Supplementary Figure 4: EPM patterns observed at diagnosis and relapse are independent of clinical features. (a-b) Box plots of EPM (log10) detected in diagnosis and relapse patient samples versus NBMs grouped by FAB disease classification. (c-d) Scatter plots of EPM (log10) detected at diagnosis and relapse compared to patient ages. (e-f) Scatter plots of blast purity (post lymphocyte depletion of AML_102 through AML_140, assessed by flow cytometry) compared to EPM (log10) detected at diagnosis (e) and relapse (f). (g-h) Scatter plots of EPM (log10) versus number of somatic mutations (log10) at diagnosis (g) or relapse (h). Pearson correlation was used to calculate r, and Hoeffding's D statistic was used for independence tests.


Supplementary Figure 5: Genomic distributions of eloci between diagnosis and relapse AML samples compared to NBM samples. (a) Schematic of CpG islands and flanking regions used for eloci annotation. (b-d) Box plots of the distribution of eloci localized to CpG islands, shores and shelves in diagnostic and relapsed samples compared to NBMs. Wilcoxon signed rank tests: P = 2.9 × 10^[?] (b), P = 0.26 (c), and P = 0.054 (d). (e) Schematic of genic components used for annotation of eloci. (f-i) Box plots of the distribution of promoter, exon, intron and intergenic eloci annotations in diagnostic and relapsed samples compared to NBMs. Wilcoxon signed rank tests: P = 0.027 (f), P = 6.5 × 10^[?] (g), P = 5.3 × 10^[?] (h), and P = 7.6 × 10^[?] (i). (j-k) Box plots of the distribution of eloci in active and poised enhancers. Wilcoxon signed rank tests: P = 1.4 × 10^[?] (j) and P = 7.5 × 10^[?] (k). P value definitions: *** = < 0.001, ** = <= 0.01, * = <= 0.05. All the distributions were significantly different from the background loci distributions covered by reduced representation bisulfite sequencing data (P > 0.01).

Supplementary Figure 6: Associations of K-means clustering of AML patients by eloci abundance at diagnosis and relapse with clinical and genetic parameters. (a) Heatmap of the proportion of eloci that are diagnosis-specific, shared, or relapse-specific (yellow = cluster 1; blue = cluster 2; orange = cluster 3). (b) Age (years) distribution of patients (at diagnosis) in eloci clusters 1-3. (c) Diagnostic sample FAB classification distribution of patients in eloci clusters 1-3. (d) Blast purity distributions of patients in eloci clusters 1-3 (diagnostic sample post lymphocyte depletion of patient samples AML_102 through AML_140). (e) Proportion of patients with higher (High MUT) or lower (Low MUT) number of somatic mutations at diagnosis from each eloci cluster (chi-square test). (f) Heatmap of AML-specific recurrent somatic mutations in the epigenetically defined patient clusters 1-3. Light blue = diagnosis-specific mutations; blue = relapse-specific mutations; dark blue = shared mutations between diagnosis and relapse; white = not detected in assay; grey = not covered in assay.

Supplementary Figure 7: Variant allele frequency (VAF) plots of genetic clonal evolution between diagnosis and relapse disease time points as predicted from sciClone analysis. (a) Cluster 1-assigned patients (patients characterized by predominance of diagnosis-specific eloci). (b) Cluster 2-assigned patients (patients characterized by shared eloci between diagnosis and relapse). (c) Cluster 3-assigned patients (patients characterized by predominance of relapse-specific eloci). Colored symbols are standard sciClone output representing the subclonal cluster predictions.

Supplementary Figure 8: Proportions of evolved genetic subclones as determined from sciClone analysis in the epigenetically defined patient clusters 1-3. Increased (green) = subclones with at least 10% VAF increase from diagnosis to relapse; decreased (red) = subclones with at least 10% VAF decrease from diagnosis to relapse; stable (blue) = subclones with < 10% VAF change between diagnosis and relapse. No significant difference in the proportions of patients with a decreased, increased, or stable number of subclones was observed among clusters (P = 0.915, chi-square test).

Supplementary Figure 9: Frequency of shared and unique alleles during tumor progression in AML_130. Pie charts of the frequency of eloci (a) and somatic mutations (b) in five time points (T1-T5) from patient AML_130.

Supplementary Figure 10: DNA methylation heterogeneity in AML. (a) Intra-tumor DNA methylation heterogeneity (MH, as measured by epi-polymorphism) at diagnosis compared to relapse samples (n=138 samples from each group; Wilcoxon signed rank test: P = 0.005618). (b) Kaplan–Meier plot comparing the relapse-free survival between AML patients with high MH versus low MH (divided by median MH). MH levels at diagnosis do not segregate patients with distinct clinical outcomes (n = 137). The Mantel-Cox log-rank test was used for the survival analysis (P = 0.168; Red: high MH patients; Black: low MH patients).
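
The EPM normalization defined in the legend of Supplementary Figure 1 reduces to a simple ratio, and the epialleles at a locus with four adjacent CpG sites are the 16 possible binary methylation patterns. The sketch below illustrates both; the function name and the example counts are hypothetical, and the detection of eloci themselves is not shown.

```python
def eloci_per_million(n_eloci, n_loci_covered):
    """EPM: number of eloci per million loci covered in the library."""
    if n_loci_covered <= 0:
        raise ValueError("n_loci_covered must be positive")
    return n_eloci * 1_000_000 / n_loci_covered


# All epiallele patterns for four adjacent CpG sites
# (0 = unmethylated cytosine, 1 = methylated cytosine).
EPIALLELE_PATTERNS = [format(i, "04b") for i in range(16)]


if __name__ == "__main__":
    print(eloci_per_million(150, 2_000_000))  # hypothetical counts
    print(EPIALLELE_PATTERNS)
```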

1 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).

2 DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491-498, doi:10.1038/ng.806 (2011).

3 McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 1297-1303, doi:10.1101/gr.107524.110 (2010).

4 Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 11, 111011-111033, doi:10.1002/0471250953.bi1110s43 (2013).

5 Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213-219, doi:10.1038/nbt.2514 (2013).

6 Koboldt, D. C. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research 22, 568-576, doi:10.1101/gr.129684.111 (2012).

7 Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311-317, doi:10.1093/bioinformatics/btr665 (2012).

8 Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92, doi:10.4161/fly.19695 (2012).

9 Fromer, M. et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 91, 597-607, doi:10.1016/j.ajhg.2012.08.005 (2012).

10 Venkatraman, E. S. & Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23, 657-663, doi:10.1093/bioinformatics/btl646 (2007).

11 Miller, C. A. et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput Biol 10, e1003665, doi:10.1371/journal.pcbi.1003665 (2014).

12 Grimwade, D. et al. Refinement of cytogenetic classification in acute myeloid leukemia: determination of prognostic significance of rare recurring chromosomal abnormalities among 5876 younger adult patients treated in the United Kingdom Medical Research Council trials. Blood 116, 354-365, doi:10.1182/blood-2009-11-254441 (2010).

13 Dohner, H. et al. Diagnosis and management of acute myeloid leukemia in adults: recommendations from an international expert panel, on behalf of the European LeukemiaNet. Blood 115, 453-474, doi:10.1182/blood-2009-07-235358 (2010).

14 Patel, J. P. et al. Prognostic relevance of integrated genetic profiling in acute myeloid leukemia. N Engl J Med 366, 1079-1089, doi:10.1056/NEJMoa1112304 (2012).

15 Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-2074, doi:10.1056/NEJMoa1301689 (2013).

Supplementary Fig. 1 [figure: schematic with labels "Bisulfite sequencing reads", "Genomic Locus", "Four adjacent CpG sites", "Methylation patterns", "Epialleles" (the 16 four-CpG patterns 0000 through 1111), "Elocus", and "EPM" ("Determine global level of epiallele shift: total number of Eloci Per Million loci sequenced"); bar plot of Epiallele patterns proportion (0.00 to 1.00) by Stage (D, R)]
Supplementary Fig. 2 [figure: (a, f) scatter plots of ERRBS S versus ROCHE S and ERRBS S versus AGILENT S (axis ticks -120, -60, 0; annotations r = 0.898 and r = 0.868); (b-e) ERRBS versus ROCHE bar plots of Epiallele patterns proportion (0 to 100) by Stage (D, R) at chr1: 166853610-166853619, chr7: 47576885-47576892, chr13: 25506204-25506236 and chr20: 62318596-62318613; (g-j) ERRBS versus AGILENT bar plots of the same kind at chr1: 242688494-242688524, chr5: 87440240-87440266, chr6: 159589687-159589704 and chr19: 719952-719986; legend: the 16 four-CpG epiallele patterns 0000 through 1111]
Page 16: Supplementary methods Next generation sequencing data analysis · 2018-03-07 · Supplementary methods Next generation sequencing data analysis Data were aligned using BWA (version

[Figure: somatic mutations (log10; axis ticks 1-4) for high EPM versus low EPM; P = 0.778.]

Supplementary Fig. 3

[Figure panels a-k. Schematic of genomic annotations: CpG island, CpG shore, CpG shelf (intervals labelled 2000), promoter, exon, intron, intergenic. Plots of Proportion by Stage (diagnosis versus relapse) for: CpG shelf, CpG shore, CpG island, Exon, Intron, Promoter, Active enhancer, Intergenic, Poised enhancer. Significance marks shown in the panels: *, **, ***.]

Supplementary Fig. 5

[Figure panels a-f.
a: heatmap by cluster. Color key: proportion of eloci that are diagnosis-specific, shared, or relapse-specific (0.2-0.8).
Other panels: age by cluster (1, 2, 3; all AML, n = 138); eloci cluster proportion by FAB classification (M1, M2, M4, M5); blast purity post separation by cluster (1, 2, 3); eloci cluster proportion for high MUT versus low MUT.
Statistics as printed: p-value = 0.736, n = 36; p-value = 0.441, n = 138; p-values >= 0.398, n = 56; P = 0.00774, n = 48.
f: heatmap of somatic mutations by patient, grouped into clusters 1, 2 and 3. Color key: somatic mutations that are diagnosis-specific, shared, or relapse-specific; not detected; not covered.
Patients: AML_088, AML_091, AML_079, AML_097, AML_096, AML_098, AML_108, AML_082, AML_135, AML_133, AML_085, AML_086, AML_090, AML_078, AML_080, AML_092, AML_095, AML_099, AML_113, AML_089, AML_084, AML_074, AML_119, AML_075, AML_106, AML_094, AML_104, AML_083, AML_111, AML_110, AML_126, AML_127, AML_121, AML_081, AML_115, AML_123, AML_124, AML_102, AML_103, AML_114, AML_105, AML_139, AML_101, AML_116, AML_109, AML_093, AML_100, AML_128.
Genes: ASXL1, BCOR, BCORL1, CEBPA, CSF3R, DNMT3A, FLT3, IDH1, IDH2, JAK2, KIT, KRAS, NPM1, NRAS, PTPN11, RUNX1, SF3B1, SMC1A, SMC3, SRSF2, STAG1, STAG2, TET2, TP53, U2AF1, WT1.]

Supplementary Fig. 6

[Figure panels a-c, labelled Eloci cluster 1, Eloci cluster 2 and Eloci cluster 3. Per-patient scatter plots of variant allele frequency at diagnosis (x-axis, 0-75) against variant allele frequency at relapse (y-axis, 0-75), with points colored by sciClone cluster (legend 1-5 or 1-6). Patient groups in the order they appear:
AML_093, AML_100, AML_101, AML_102, AML_103, AML_105, AML_109, AML_110, AML_114, AML_115, AML_116, AML_121, AML_123, AML_124, AML_127, AML_128, AML_139.
AML_074, AML_075, AML_089, AML_092, AML_094, AML_095, AML_099, AML_104, AML_111, AML_113, AML_119.
AML_080, AML_085, AML_086, AML_088, AML_090, AML_091, AML_096, AML_097, AML_098, AML_108, AML_133.]

Supplementary Fig. 7

[Figure: proportion of patients (0.00-1.00) by cluster (1, 2, 3); categories: decreasing, increasing, stable; P = 0.915.]

Supplementary Fig. 8

[Figure panels a, b: Eloci and Somatic mutations, by number of time points (1-5).]

Supplementary Fig. 9

[Figure panels a, b. Relapse-free probability (0.0-0.8) over Years (1-4) for Low MH versus High MH; P = 0.168. Intra-tumour methylation heterogeneity (50-65) by Stage (diagnosis versus relapse), marked **.]

Supplementary Fig. 10