mint + annotatr: a pipeline to integrate and annotate DNA ...mint + annotatr: a pipeline to...

1
mint + annotatr: a pipeline to integrate and annotate DNA methylation and hydroxymethylation data Raymond G. Cavalcante 1 , Yanxiao Zhang 1 , Yongseok Park 2 , Snehal Patil 1 , Maureen A. Sartor 1,3,4 1 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA, 2 Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA 3 Department of Biostatistics, University of Michigan, Ann Arbor, USA, and 4 Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA bismark methylation extractor macs2 PePr methylSig Classifications wrt to condition 1 Hyper hmc Hypo hmc No DM No signal Hyper hmc + mc Hyper mc Hyper hmc Hyper mc Hypo hmc Hyper mc Hyper mc Hypo hmc + mc Hypo mc Hyper hmc Hypo mc Hypo hmc Hypo mc Hypo mc No DM Hyper hmc Hypo hmc No DM No DM No signal Hyper hmc Hypo hmc No DM unclassifiable hmc peak No hmc peak No signal High hmc + mc hmc mc hmc or mc Low hmc + mc hmc mc (low) hmc or mc (low) No hmc + mc hmc no methylation no methylation No signal hmc no methylation unclassifiable <1Kb 1-5Kb txStart txEnd 5’UTR 3’UTR CDS Promoter 5’UTR exon 5’UTR intron CDS intron CDS exon 3’UTR intron 3’UTR exon cdsStart cdsEnd CpG island (CGI) 2kb 2kb CpG shore CpG shelf open sea (interCGI) open sea (interCGI) ! CpG shore CpG shelf 2kb 2kb Motivation for integration DNA methylation occurs in a variety of forms. 5-methylcytosine (mc) is the most widely studied, and there is ample evidence for its importance in development and gene regulation. 5-hydroxymethylcytosine (hmc) is a less abundant, but appears to be a good marker for epigenetic changes correlating with changes in gene expression. The problem: Widely-used assays measuring methylation (e.g. BSseq & RRBS) do not distinguish mc and hmc. This may confuse biological interpretation because it is unclear which mark is playing a role relative to a phenotype. Integrating assays measuring mc+hmc (e.g. BSseq & RRBS) with others measuring mc (e.g. oxBSseq & MeDIPseq) or hmc (e.g. TABseq & hMeDIPseq), helps differentiate the mc and hmc marks from one another. C mc hmc T C C BSseq RRBS mc+hmc mc hmc RRBS hMeDIP mint integration The solution: We developed the mint pipeline to be a complete pipeline from raw reads to interpretation do sample methylation quantification do differential methylation analysis between groups differentiate between regions of mc vs hmc by integrating data from methylation assays using a simple classifier automatically generate UCSC genome browser tracks provide summaries of genomic annotations to facilitate biological interpretation. Supported designs Hybrid Bisulfite-based assay and pulldown assay. Pulldown Pulldown assays for mc and hmc. or and Sample-wise Integration on a per-sample basis. Comparison-wise Integration on a per-group basis. and or + + Overall mint workflow QC FastQC + MultiQC Adapter / Quality Trimming Trim Galore Alignment bismark / bowtie2 Sample Meth. Quantification bismark / macs2 Differential Meth. methylSig [1] / PePr [2] Classification UCSC Browser Tracks Annotation and Visualization annotatr [3] Setting up and running mint On a Mac or Linux system: 1. Get mint at github.com/sartorlab/mint 2. Setup dependencies and reference genomes. 3. Annotate the experimental design. 4. Initiate the project with: Rscript init.R —project name —genome g —datapath path 5. Customize the config.mk file with desired parameters. 6. Use make to run the analysis modules as per the design: make bisulfite_align make bisulfite_compare make compare_classification make pulldown_align make pulldown_sample make pulldown_compare make sample_classification References 1.Park Y, et al. (2014) “MethylSig: a whole genome DNA methylation analysis pipeline” Bioinformatics. 2.Zhang Y, et al. (2014) “PePr: A peak-calling prioritization pipeline to identify consistent or differential peaks from replicated ChIP-Seq data” Bioinformatics. 3. Cavalcante RG & Sartor MA (2016) “annotatr: Associating genomic regions with genomic annotations” bioRxiv. This work was funded by NIH grants RO1CA158286S1 and T32GM070499. 1 2 3 4 5 Quality control Classification schemas Sample classification Comparison classification annotatr: simple, fast & flexible annotation of genomic regions Annotation and Visualization We developed a general purpose R package, annotatr [3] that: annotates genomic regions in BED format to pre-built human and mouse CpG, genic (with Entrez Gene IDs and symbols), and enhancers (below), or custom genomic annotations reports all annotations overlapping input genomic regions to highlight multiple-annotations rather than using a prioritization summarizes and plots annotations with associated data (right) is faster than similar R packages, and on par with bedtools. Get annotatr at github.com/rcavalcante/annotatr UCSC Genome Browser Track Hub 0 10000 20000 30000 CpG islands CpG shores CpG shelves interCGI enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic Annotations # CpGs Data Random Regions Regions per annotation CpG islands CpG shores CpG shelves interCGI enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic 0.00 0.05 0.10 0.15 0.00 0.05 0.10 0.15 0.00 0.05 0.10 0.15 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Percent Methylation density All perc_meth perc_meth in annot_type Percent methylation over annotations CpG islands CpG shores CpG shelves interCGI enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic 0.000 0.025 0.050 0.075 0.000 0.025 0.050 0.075 0.000 0.025 0.050 0.075 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 Fold Change density All fold fold in annot_type macs2 fold change over annotations 0.00 0.25 0.50 0.75 1.00 All IDH2mut NBM Random Regions DM status Proportion Annotations CpG islands CpG shores CpG shelves interCGI PePr peaks by annotation CpG islands CpG shores CpG shelves interCGI enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 50 0 50 50 0 50 50 0 50 50 0 50 Methylation Difference (IDH2mut NBM) log10(pvalue) methylSig meth. diff. vs log10(pval) 0.00 0.25 0.50 0.75 1.00 All mc_hmc_high mc_hmc_med mc_hmc_low mc_hmc_none Random Regions Simple classification Proportion Annotations CpG islands CpG shores CpG shelves interCGI Bis. Simple classification by annotation 0.00 0.25 0.50 0.75 1.00 All IDH2mut NBM noDM Random Regions DM status Proportion Annotations enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic methylSig DM status by annotation 0.00 0.25 0.50 0.75 1.00 All hyper_mc_hyper_hmc hypo_mc_hypo_hmc hyper_mc_hypo_hmc hypo_mc_hyper_hmc hyper_mc hyper_hmc hypo_mc hypo_hmc no_DM Random Regions Compare classification Proportion Annotations enhancers 1to5kb promoters 5UTRs exons introns 3UTRs intergenic Compare classification by annotation Scale chr21: hmc_bisulfite_simp_class _1_hmc_pulldown_peaks 1 kb hg19 44,078,500 44,079,000 44,079,500 44,080,000 44,080,500 44,081,000 44,081,500 44,082,000 44,082,500 IDH2mut_v_NBM_compare_classification IDH2mut_v_NBM_hmc_pulldown_DM_PePr_peaks IDH2mut_v_NBM_mc_hmc_bisulfite_DM_methylSig IDH2mut_1_hmc_pulldown_simple_classification IDH2mut_1_mc_hmc_bisulfite_simple_classification IDH2mut_1_mc_hmc_bisulfite_percent_methylation IDH2mut_1_hmc_pulldown_macs2_peaks IDH2mut_1_hmc_pulldown_coverage IDH2mut_1_hmc_input_pulldown_coverage UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics) hypo_hmc no_DM hypo_hmc hyper_mc_hypo_hmc hypo_hmc no_DM hypo_hmc no_DM NBM NBM NBM hmc_med hmc_low PDE9A PDE9A IDH2mut_v_NBM_mc_hm 62.4012 _ 0 _ IDH2mut_1_mc_hmc_bis 100 _ 0 _ IDH2mut_1_hmc_pulldow 89 _ 1 _ IDH2mut_1_hmc_input_p 9 _ 1 _

Transcript of mint + annotatr: a pipeline to integrate and annotate DNA ...mint + annotatr: a pipeline to...

Page 1: mint + annotatr: a pipeline to integrate and annotate DNA ...mint + annotatr: a pipeline to integrate and annotate DNA methylation and hydroxymethylation data Raymond G. Cavalcante1,

mint + annotatr: a pipeline to integrate and annotate DNA methylation and hydroxymethylation dataRaymond G. Cavalcante1, Yanxiao Zhang1, Yongseok Park2, Snehal Patil1, Maureen A. Sartor1,3,4 1 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA, 2 Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA 3 Department of Biostatistics, University of Michigan, Ann Arbor, USA, and 4 Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA

bismark methylation extractor

macs2 PePr

methylSig

Classifications

wrt tocondition 1 Hyper hmc Hypo hmc No DM No signal

Hyperhmc + mc

Hyper mc Hyper hmc

Hyper mc Hypo hmc Hyper mc Hyper mc

Hypohmc + mc

Hypo mc Hyper hmc

Hypo mc Hypo hmc Hypo mc Hypo mc

No DM Hyper hmc Hypo hmc No DM No DM

No signal Hyper hmc Hypo hmc No DM unclassifiable

hmc peak No hmc peak No signal

Highhmc + mc hmc mc hmc or mc

Lowhmc + mc hmc mc (low) hmc or mc (low)

Nohmc + mc hmc no methylation no methylation

No signal hmc no methylation unclassifiable

<1Kb1-5Kb

txStart txEnd

5’UTR 3’UTR

CDS

Promoter

5’UTR exon

5’UTR intron

CDS intron CDS exon

3’UTR intron

3’UTR exon

cdsStart cdsEnd

CpG island (CGI)

2kb 2kb

CpG shore

CpG shelf

open sea (interCGI)open sea (interCGI)

!

CpG shore

CpG shelf

2kb2kb

<1Kb1-5Kb

txStart txEnd

5’UTR 3’UTR

CDS

Promoter

5’UTR exon

5’UTR intron

CDS intron CDS exon

3’UTR intron

3’UTR exon

cdsStart cdsEnd

CpG island (CGI)

2kb 2kb

CpG shore

CpG shelf

open sea (interCGI)open sea (interCGI)

!

CpG shore

CpG shelf

2kb2kb

Motivation for integrationDNA methylation occurs in a variety of forms. 5-methylcytosine (mc) is the most widely studied, and there is ample evidence for its importance in development and gene regulation. 5-hydroxymethylcytosine (hmc) is a less abundant, but appears to be a good marker for epigenetic changes correlating with changes in gene expression.

The problem: Widely-used assays measuring methylation (e.g. BSseq & RRBS) do not distinguish mc and hmc. This may confuse biological interpretation because it is unclear which mark is playing a role relative to a phenotype.

Integrating assays measuring mc+hmc (e.g. BSseq & RRBS) with others measuring mc (e.g. oxBSseq & MeDIPseq) or hmc (e.g. TABseq & hMeDIPseq), helps differentiate the mc and hmc marks from one another.

C mc hmc

T C C

BSseq RRBS

mc+hmc

mc

hmc

RRBS

hMeDIP

mint integration

The solution: We developed the mint pipeline to • be a complete pipeline from raw reads to interpretation • do sample methylation quantification • do differential methylation analysis between groups • differentiate between regions of mc vs hmc by integrating

data from methylation assays using a simple classifier • automatically generate UCSC genome browser tracks • provide summaries of genomic annotations to facilitate

biological interpretation.

Supported designsHybrid

Bisulfite-based assay and pulldown assay.

Pulldown

Pulldown assays for mc and hmc.

or

andSample-wise

Integration on a per-sample basis.

Comparison-wise

Integration on a per-group basis.

and or

+ +

Overall mint workflowQC

FastQC + MultiQC

Adapter / Quality Trimming

Trim Galore

Alignment bismark / bowtie2

Sample Meth.Quantification

bismark / macs2

Differential Meth. methylSig [1] /

PePr [2]

Classification

UCSC Browser Tracks

Annotation and Visualization annotatr [3]

Setting up and running mintOn a Mac or Linux system: 1. Get mint at github.com/sartorlab/mint 2. Setup dependencies and reference genomes. 3. Annotate the experimental design. 4. Initiate the project with:

Rscript init.R —project name —genome g —datapath path 5. Customize the config.mk file with desired parameters. 6. Use make to run the analysis modules as per the design:

make bisulfite_align make bisulfite_compare make compare_classification

make pulldown_align make pulldown_sample make pulldown_compare make sample_classification

References1. Park Y, et al. (2014) “MethylSig: a whole genome DNA methylation analysis

pipeline” Bioinformatics. 2. Zhang Y, et al. (2014) “PePr: A peak-calling prioritization pipeline to identify

consistent or differential peaks from replicated ChIP-Seq data” Bioinformatics. 3. Cavalcante RG & Sartor MA (2016) “annotatr: Associating genomic regions

with genomic annotations” bioRxiv.

This work was funded by NIH grants RO1CA158286S1 and T32GM070499.

1

2

3

4

5

Quality control

Classification schemasSample classification

Comparison classification

annotatr: simple, fast & flexible annotation of genomic regions

Annotation and Visualization

We developed a general purpose R package, annotatr [3] that: • annotates genomic regions in BED format to pre-built human

and mouse CpG, genic (with Entrez Gene IDs and symbols), and enhancers (below), or custom genomic annotations

• reports all annotations overlapping input genomic regions to highlight multiple-annotations rather than using a prioritization

• summarizes and plots annotations with associated data (right) • is faster than similar R packages, and on par with bedtools.

Get annotatr at github.com/rcavalcante/annotatr UCSC Genome Browser Track Hub

0

10000

20000

30000

CpG islands

CpG shores

CpG shelvesinterCGI

enhancers1to5kb

promoters5UTRs

exonsintrons

3UTRs

intergenic

Annotations

# C

pGs

Data Random Regions

Regions per annotationCpG islands CpG shores CpG shelves interCGI

enhancers 1to5kb promoters 5UTRs

exons introns 3UTRs intergenic

0.00

0.05

0.10

0.15

0.00

0.05

0.10

0.15

0.00

0.05

0.10

0.15

0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100

Percent Methylation

dens

ity

All perc_meth perc_meth in annot_type

Percent methylation over annotations

CpG islands CpG shores CpG shelves interCGI

enhancers 1to5kb promoters 5UTRs

exons introns 3UTRs intergenic

0.0000.0250.0500.075

0.0000.0250.0500.075

0.0000.0250.0500.075

0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50

Fold Changede

nsity

All fold fold in annot_type

macs2 fold change over annotations

0.00

0.25

0.50

0.75

1.00

All

IDH2mutNBM

Random Regions

DM status

Prop

ortio

n AnnotationsCpG islandsCpG shoresCpG shelvesinterCGI

PePr peaks by annotation

CpG islands CpG shores CpG shelves interCGI

enhancers 1to5kb promoters 5UTRs

exons introns 3UTRs intergenic

012345

012345

012345

−50 0 50 −50 0 50 −50 0 50 −50 0 50Methylation Difference (IDH2mut − NBM)

−log

10(p

valu

e)

methylSig meth. diff. vs −log10(pval)

0.00

0.25

0.50

0.75

1.00

All

mc_hmc_high

mc_hmc_med

mc_hmc_low

mc_hmc_none

Random Regions

Simple classification

Prop

ortio

n AnnotationsCpG islandsCpG shoresCpG shelvesinterCGI

Bis. Simple classification by annotation

0.00

0.25

0.50

0.75

1.00

All

IDH2mutNBM

noDM

Random Regions

DM status

Prop

ortio

n

Annotationsenhancers1to5kbpromoters5UTRsexonsintrons3UTRsintergenic

methylSig DM status by annotation

0.00

0.25

0.50

0.75

1.00

All

hype

r_mc_

hype

r_hmc

hypo

_mc_

hypo

_hmc

hype

r_mc_

hypo

_hmc

hypo

_mc_

hype

r_hmc

hype

r_mc

hype

r_hmc

hypo

_mc

hypo

_hmc

no_D

M

Rando

m Region

s

Compare classification

Prop

ortio

n

Annotationsenhancers1to5kbpromoters5UTRsexonsintrons3UTRsintergenic

Compare classification by annotation

Scalechr21:

IDH2mut_1_mc_hmc_bisulfite_simp_class

IDH2mut_1_hmc_pulldown_peaks

RefSeq Genes

SequencesSNPs

Human mRNAs

DNase Clusters

Txn Factor ChIP

RhesusMouse

DogElephantChicken

X_tropicalisZebrafishLamprey

Common SNPs(146)

RepeatMasker

1 kb hg1944,078,500 44,079,000 44,079,500 44,080,000 44,080,500 44,081,000 44,081,500 44,082,000 44,082,500

IDH2mut_v_NBM_compare_classification

IDH2mut_v_NBM_hmc_pulldown_DM_PePr_peaks

IDH2mut_v_NBM_mc_hmc_bisulfite_DM_methylSig

IDH2mut_1_hmc_pulldown_simple_classification

IDH2mut_1_mc_hmc_bisulfite_simple_classification

IDH2mut_1_mc_hmc_bisulfite_percent_methylation

IDH2mut_1_hmc_pulldown_macs2_peaks

IDH2mut_1_hmc_pulldown_coverage

IDH2mut_1_hmc_input_pulldown_coverage

UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics)

RefSeq Genes

Publications: Sequences in Scientific Articles

Human mRNAs from GenBank

H3K27Ac Mark (Often Found Near Active Regulatory Elements) on 7 cell lines from ENCODE

DNaseI Hypersensitivity Clusters in 125 cell types from ENCODE (V3)

Transcription Factor ChIP-seq (161 factors) from ENCODE with Factorbook Motifs

100 vertebrates Basewise Conservation by PhyloP

Multiz Alignments of 100 Vertebrates

Simple Nucleotide Polymorphisms (dbSNP 146) Found in >= 1% of Samples

Repeating Elements by RepeatMasker

hypo_hmcno_DM

hypo_hmc

hyper_mc_hypo_hmchypo_hmc

no_DM

hypo_hmc

no_DM

NBM NBMNBM

hmc_med hmc_low

PDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9APDE9A

IDH2mut_v_NBM_mc_hmc_bisulfite_DM

62.4012 _

0 _

IDH2mut_1_mc_hmc_bisulfite_pct_meth

100 _

0 _

IDH2mut_1_hmc_pulldown_cov

89 _

1 _

IDH2mut_1_hmc_input_pulldown_cov

9 _

1 _

Layered H3K27Ac

100 Vert. Cons