MEDIPS Tutorial - ICGEBnet.icgeb.org/.../Documentation/MEDIPS.pdfMEDIPS Tutorial Lukas Chavez...

MEDIPS Tutorial

Lukas Chavez

October 18, 2010

Contents

1 Introduction 1

2 Preparations 2

3 MEDIPS Workflow 43.1 Create a MEDIPS SET from the input file . . . . . . . . . . . . . 43.2 Creating the Genome Vector and Export of RPM Signals . . . . 63.3 Saturation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 73.4 Receiving DNA Sequence Pattern Positions . . . . . . . . . . . . 113.5 Sequence Pattern Coverage Analysis . . . . . . . . . . . . . . . . 113.6 CpG Enrichment . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.7 Creating the Coupling Vector . . . . . . . . . . . . . . . . . . . . 153.8 Calibration Curve and Linear Regression . . . . . . . . . . . . . . 163.9 Relative Methylation Score and Export of Normalized Data . . . 203.10 Methylation Profiles and Absolute Methylation Score . . . . . . . 213.11 Additional MEDIPS SET Objects . . . . . . . . . . . . . . . . . 283.12 Methylation Profiles of Two MEDIPS SET Objects . . . . . . . . 293.13 Selecting Candidate DMRs and Annotation . . . . . . . . . . . . 31

4 Concluding Remarks 35

1 Introduction

MEDIPS was developed for analyzing data derived from methylated DNA im-munoprecipitation (MeDIP) experiments [Weber et al., 2005] followed by se-quencing (MeDIP-seq). Nevertheless, MEDIPS may be applied to other im-munoprecipitation based methylation analyses (e.g. MBD-seq), and selectedfunctionalities like the saturation analysis may be applied to other types of se-quencing data (e.g. ChIP-seq). MEDIPS adresses several aspects in the contextof MeDIP-seq data analysis. These are:

� estimating the reproducibility for obtaining full genome methylation pro-files with respect to the total number of given short reads and to the sizeof the reference genome,

� analyzing the coverage of genome wide DNA sequence patterns (e.g. CpGs)by the given reads,

1

� calculating an CpG enrichment factor as a quality control for the immuno-precipitation,

� calculating genome wide MeDIP-seq signal densities at a user specifiedresolution,

� calculating genome wide sequence pattern densities (e.g. CpGs) at a userspecified resolution,

� plotting of calibration plots as a data quality check and for a visual in-spection of the dependency between local sequence pattern (e.g. CpG)densities and MeDIP signals,

� normalization of MeDIP-seq data with respect to local sequence pattern(e.g. CpG) densities,

� summarized methylation values for genome wide windows of a specifiedlength or for user supplied regions of interest (ROIs),

� calculating differentially methylated regions on raw or normalized datacomparing two sets of MeDIP-seq data and with respect to Input data (ifavailable), and

� export of raw and normalized data for visualization in common genomebrowsers (e.g. the UCSC genome browser).

MEDIPS starts directly where the mapping tools stop and can be used forany genome of interest, limited only by the available genomes within the Bio-conductor BSgenome package.

2 Preparations

In order to execute MEDIPS, you need to have some other packages installed inyour R library. These are BSgenome and gtools, as well as the packages theydepend on. You can check this, by starting R and typing

> packageDescription("BSgenome")

> packageDescription("gtools")

R will inform you in case you do not have these packages installed. In caseyou do not have these necessary packages installed, start R and try typing

> source("http://bioconductor.org/biocLite.R")

> biocLite("BSgenome")

> biocLite("gtools")

Please be aware of having the latest BSgenome version installed (>1.14.2). Theversion number is returned by the packageDescription() command. Pleasenote, there is a bug in R distributions <=2.10.1 that sometimes causes errorswhen interacting with the BSgenome package. This bug should be fixed in Rversions >2.11.0.

Next, it is necessary to install the MEDIPS package into your R environment.Start R and enter:

2


> biocLite("MEDIPS")

Please note, MEDIPS became available on BioC 2.7 that is designed to workwith R.2.12. Therefore, installation of the MEDIPS package by biocLite()

only works on R >=2.12.0.In order to reproduce the presented MEDIPS workflow, the package includes

the example data sets MeDIP_hESCs_chr22.txt (17M), MeDIP_DE_chr22.txt(22M), and Input_StemCells_chr22.txt (8.8M) in the extdata subdirectoryof the MEDIPS package.

The files contain genomic regions from chromosome 22 only, as covered byshort reads obtained from a MeDIP experiment of human embryonic stem cells(hESCs), a MeDIP experiment of differentiated hESCs (definitive endoderm,DE), and of Input experiments from hESCs and DE [Chavez et al., 2010].

As input, MEDIPS requires tab-separated files without headers containingfour columns:

� the first coulumn is of type character and contains the chromosome of theregion (e.g. chr1)

� the second column is of type numeric and contains the start position ofthe mapped read

� the third column is of type numeric and contains the stop position of themapped read

� the fourth column is of type character and contains the strand informationof the mapped read

Each row represents a mapped read. These informations can be extractedfrom the output file(s) of common mapping tools. MEDIPS counts chromosomesequence positions starting at 1. Some alignment tools output the mappedregions by interpreting the first base of a chromosome as 0. MEDIPS requiresone input file for each condition that has to be analyzed. Region informationsfrom several mapping results have to be pooled into one file. Furthermore, itmight be worthwhile to filter out some mapped reads of low quality or to excludeartifical short read pile-ups from the results of the mapping procedure. Pleasenote, any such data pooling, further filtering or correction for the start and stoppositions has to be done by yourself before using MEDIPS.

Next, you need to have your genome of interest available. As soon as youhave the BSgenome package installed and the library loaded using

> library("BSgenome")

you can list all available genomes by typing

> available.genomes()

[1] "BSgenome.Amellifera.BeeBase.assembly4"

[2] "BSgenome.Amellifera.UCSC.apiMel2"

[3] "BSgenome.Athaliana.TAIR.01222004"

[4] "BSgenome.Athaliana.TAIR.04232008"

[5] "BSgenome.Btaurus.UCSC.bosTau3"

3

[6] "BSgenome.Btaurus.UCSC.bosTau4"

[7] "BSgenome.Celegans.UCSC.ce2"

[8] "BSgenome.Celegans.UCSC.ce6"

[9] "BSgenome.Cfamiliaris.UCSC.canFam2"

[10] "BSgenome.Dmelanogaster.UCSC.dm2"

[11] "BSgenome.Dmelanogaster.UCSC.dm3"

[12] "BSgenome.Drerio.UCSC.danRer5"

[13] "BSgenome.Drerio.UCSC.danRer6"

[14] "BSgenome.Ecoli.NCBI.20080805"

[15] "BSgenome.Ggallus.UCSC.galGal3"

[16] "BSgenome.Hsapiens.UCSC.hg17"



[19] "BSgenome.Mmusculus.UCSC.mm8"

[20] "BSgenome.Mmusculus.UCSC.mm9"

[21] "BSgenome.Ptroglodytes.UCSC.panTro2"

[22] "BSgenome.Rnorvegicus.UCSC.rn4"

[23] "BSgenome.Scerevisiae.UCSC.sacCer1"

[24] "BSgenome.Scerevisiae.UCSC.sacCer2"

In the given example, we mapped the short reads against the human genomebuild hg19. Therefore, we download and install this genome build:


> biocLite("BSgenome.Hsapiens.UCSC.hg19")

This takes some time, but has to be done only once for each necessary referencegenome.

3 MEDIPS Workflow

3.1 Create a MEDIPS SET from the input file

First, you have to load MEDIPS into your R environment. Herewith, the de-pendent libraries BSgenome and gtools as well as the packages they depend onwill be loaded.

> library(MEDIPS)

In the given example, we mapped the short reads against the human genomebuild hg19. Therefore, we load the pre-installed (see chapter 2) hg19 library:

> library(BSgenome.Hsapiens.UCSC.hg19)

Next, a MEDIPS SET is created by reading the input file. This is performedby the MEDIPS.readAlignedSequences() function. When calling this function,it is necessary to state the reference genome build and your input file. In caseyou know the number of rows within your input file, you can also specify thenumrows parameter. This will accelerate the reading of your file. Otherwise justskip the numrows parameter.

In this manual, we are using example data located in the extdata subdirec-tory of the MEDIPS package. Therefore, we have to point to the input examplefiles like:

4

> file = system.file("extdata", "MeDIP_hESCs_chr22.txt", package = "MEDIPS")

In the example MeDIP_hESCs_chr22.txt file, there are 672866 regions (here,only chromosome 22 is included). Therefore, we can now read the example fileby typing:

> CONTROL.SET = MEDIPS.readAlignedSequences(BSgenome = "BSgenome.Hsapiens.UCSC.hg19",

+ file = file, numrows = 672866)

Typically, you will execute the function by pointing directly to your inputfile using the file parameter. Here, you can also state the full directory pathto your input file together with the file name.

MEDIPS now created a MEDIPS SET from the input. The current contentof a MEDIPS SET can be viewed at any time by typing the name of yourMEDIPS SET object.

> CONTROL.SET

S4 Object of class MEDIPSset

=======================================

Regions information

=======================================

Regions file: MeDIP_hESCs_chr22.txt

Organism: BSgenome.Hsapiens.UCSC.hg19

Chromosomes: chr22

Chromosome lengths: 51304566

Number of regions: 672866

Regions chromosomes: chr22 chr22 chr22...

Regions start positions: 16053184 16054710 16080510...

Regions stop positions: 16053220 16054746 16080546...

Regions strand: - - +...

=======================================

Genome vector signals information

=======================================

Genome wide bin size:

Reads extended by:

Genome vector chromosomes: NA NA NA...

Genome vector positions: NA NA NA...

Genome vector signals: NA NA NA...

=======================================

Pattern information

=======================================

Pattern:

Number of patterns:

Pattern chromosomes: NA NA NA...

Pattern positions: NA NA NA...

=======================================

Genome vector coupling factor information

=======================================

Distance function:

Distance file:

5

Fragment length:

Genome vector coupling factors: NA...

=======================================

Calibration information

=======================================

Calibration curve mean signals: NA...

Calibration curve mean coupling factors: NA...

Calibration curve variance: NA...

Intercept:

Slope:

Calibration chromosome:

=======================================

Genome vector normalized signal information

=======================================

Normalization output interval: [0:1000]

Genome vector normalized signals: NA...

After reading the input file, the MEDIPS SET contains only informationabout the input regions, like the input file name, the dependent organism, thechromosomes included in the input file, the length of the included chromosomes(automatically loaded), the number of regions, and the start, stop and strandinformations of the regions. All further slots, for example for the weightingparameters and normalized data are still empty and will be filled during theworkflow.

3.2 Creating the Genome Vector and Export of RPM Sig-nals

Based on the given regions, a genome-wide coverage has to be calculated. Inorder to calculate the genome wide short read coverage, the user has to specifya targeted data resolution using the parameter bin_size (default: 50bp). Inprinciple, a bin_size=1 can be specified. Because the resolution of MeDIP-seqdata is restricted by the size of the sonicated DNA fragments after amplificationand size selection (that often is between 0.2-1kb), it might not be necessary tospecify very small bin sizes. Moreover, the smaller the bin size, the higher theneed for memory of your computer and the higher the runtime will be. Weconsider a bin size of 50bp as a reasonable compromise on data resolution andcomputational costs.

Each chromosome inside the MEDIPS SET will then be divided into bins ofsize 50bp and the short read coverage will be calculated on this resolution. Inthe following, we call the bin representation of the genome the genome vector.

Moreover, short reads generated by modern-day sequencers do not representthe full DNA fragments but are of shorter length (e.g. 36bp). Therefore, asmoothing of the data is recommended by extending the reads. This can beachieved by setting the parameter extend (default: 400bp). With this, eachregion is extended to a length of 400bp either along the + or along the - directionas specified by the dependent strand information. The extend value will not beadded to the given length of the short reads but the final length of the extendedreads will be the length as specified by the extend parameter.

You can create the genome vector by typing

6

> CONTROL.SET = MEDIPS.genomeVector(data = CONTROL.SET, bin_size = 50,

+ extend = 400)

For each pre-defined genomic bin, the genome vector stores the number ofprovided overlapping extended short reads and these are interpreted as the rawMeDIP-seq signals. After having called the MEDIPS.genomeVector function, theslots of the MEDIPS SET associated to the genome vector are occupied. Forexample, there is a slot that contains the raw signals for each genomic bin.

Based on the total number of provided short reads (n), the raw MeDIP-seq signals can be transformed into a reads per million (rpm) format in orderto assure that coverage profiles derived from different biological samples arecomparable, although generated from differing amounts of short reads. Letxbini be the raw MeDIP-seq signal of the genomic bin i, where i=1,...,m andm is the total number of genomic bins, then the rpm value of the genomic bin issimply defined as:

rpmbini=xbini

·106

n

It is already possible to export the raw signals in a reads per million (rpm)format as a wiggle (WIG) file by typing:

> MEDIPS.exportWIG(file = "output.rpm.control.WIG", data = CONTROL.SET,

+ raw = T, descr = "hESCs.rpm")

At this point, it is necessary to set the parameter raw=T. Otherwise, theexport function tries to write out the normalized data that does not exist yet.The descr parameter contains an arbitrary description for the wiggle file andwill be visualized by a suitable genome browser. It is recommended to gzip theWIG file before you upload it to e.g. the UCSC browser.

3.3 Saturation Analysis

The saturation analysis addresses the question, whether the number of inputregions is sufficient to generate a saturated and reproducible methylation pro-file of the reference genome. The main idea is that an insufficent number ofshort reads will not result in a saturated methylation profile. Only if there is asufficient number of short reads, the resulting genome wide methylation profilewill be reproducible by another independent set of a similar number of shortreads.

You can start the saturation analysis by typing

> sr.control = MEDIPS.saturationAnalysis(data = CONTROL.SET, bin_size = 50,

+ extend = 400, no_iterations = 10, no_random_iterations = 1)

For the saturation analysis, the total set of available regions is divided intotwo distinct random sets (A and B) of equal size. In our example, both sets A

and B will contain 336433 randomly selected regions. Both sets A and B areagain divided into random subsets of equal size where the number of subsetsis determined by the parameter no_iterations (default=10). In our example,each subset of A (A1,A2, ..,A10) and of B (B1,B2, ..,B10) will contain approx.

7

33643 randomly selected reads. For each set, A and B, the saturation anal-ysis iteratively selects an increasing number of subsets and creates accordinggenome vectors as described in section 3.2. Here, the parameters bin_size andextend fulfill the same tasks as described in section 3.2. In case, these param-eters remain un-specified, MEDIPS accesses the parameter settings as specifiedpreviously for generating the genome vector.

In each iteration step, the resulting genome vectors for the subsets of A andB are compared using pearson correlation. As the number of considered regionsincreases during each iteration step, it is assumed that the resulting genomevectors become more similar, a dependency that is expressed by an increasedcorrelation.

Because such a saturation analysis can be performed on two independentsets of short reads only, a true saturation can only be calculated for half of theavailable short reads. As it is of interest to examin the reproducibility of theMeDIP-seq experiment for the total set of available short reads, the saturationanalysis is always followed by an estimated saturation analysis. For the esti-mated saturation analysis, the full set of given regions is doubled by consideringeach region twice and then the described saturation analysis is performed on thisartificially doubled set of regions (see also Supplementary Methods in [Chavezet al., 2010]).

The saturation analysis does not modify the MEDIPS SET object but theresults are stored at the specified saturation results object (here sr.control).The results of the saturation and of the estimated saturation analysis can beviewed by typing

> sr.control

$distinctSets

[,1] [,2]

[1,] 0 0.0000000

[2,] 33643 0.4438016

[3,] 67286 0.6155172

[4,] 100929 0.7053598

[5,] 134572 0.7617214

[6,] 168215 0.7990678

[7,] 201858 0.8263384

[8,] 235501 0.8471478

[9,] 269144 0.8636274

[10,] 302787 0.8770265

[11,] 336433 0.8882301

$estimation

[,1] [,2]

[1,] 0 0.0000000

[2,] 33643 0.4533942

[3,] 67286 0.6230684

[4,] 100929 0.7134407

[5,] 134572 0.7683929

[6,] 168215 0.8048608

[7,] 201858 0.8317658

8

[8,] 235501 0.8526189

[9,] 269144 0.8684733

[10,] 302787 0.8822030

[11,] 336430 0.8922989

[12,] 370073 0.9011554

[13,] 403716 0.9086256

[14,] 437359 0.9152219

[15,] 471002 0.9206394

[16,] 504645 0.9256510

[17,] 538288 0.9300173

[18,] 571931 0.9340288

[19,] 605574 0.9372441

[20,] 639217 0.9403780

[21,] 672866 0.9431703

$numberReads

[1] 672866

$maxEstCor

[1] 6.728660e+05 9.431703e-01

$maxTruCor

[1] 3.36433e+05 8.88230e-01

The maximal obtained correlation of the saturation analysis is stored at themaxTruCor slot and the maximal obtained correlation of the estimated satura-tion analysis is stored at the maxEstCor slot of the saturation results object(first column: total number of considered reads, second column: obtained cor-relation). The results of each iteration step are stored in the distinctSets andestimation slots for the saturation and estimated saturation analysis, respec-tively (first column: total number of considered reads, second column: obtainedcorrelation).

These results can be visualized by typing

> MEDIPS.plotSaturation(sr.control)

9

0e+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

0.0

0.2

0.4

0.6

0.8

Saturation analysis

Saturation cor: 0.89; Estimated cor: 0.94Number of reads

Pea

rson

cor

rela

tion

coef

ficie

nt

SaturationEstimated saturation

Because the artificially doubled set of short reads, as utilized for the esti-mated saturation analysis, does not represent a true outcome of a MeDIP-seqexperiment, the calculated correlations will overestimate the true reproducibil-ity. It is assumed that the true correlation for the full set of available shortreads will be between the results of the true (0.89) and of the estimated (0.94)saturation analyis.

The saturation analysis assists for rating whether the costs of additionalsequencing runs are in proportion to the gain in quality of the reconstructedmethylome. Please note, the results of the saturation analysis are dependenton the size of the examined reference genome (here, only the chromosome 22 isconsidered).

Further parameters that can be specified for the saturation analysis are:

� no_random_iterations: methods that randomly select data entries maybe processed several times in order to obtain more stable results. By spec-ifying the no_random_iterations parameter (default=1) it is possible torun the saturation analysis several times. The final results returned tothe saturation results object are the averaged results of all the randomiteration steps.

� empty_bins: can be either TRUE or FALSE (default TRUE). This pa-rameter affects the way of calculating correlations between the resultinggenome vectors. If there occur genomic bins which contain no overlappingregions, neither from the subsets of A nor from the subsets of B, these binswill be neglected when the paramter is set to FALSE.

� rank: can be either TRUE or FALSE (default FALSE). This parame-ter also effects the way of calculating correlations between the resulting

10

genome vectors. If rank is set to TRUE, the correlation will be calculatedfor the ranks of the bins instead of considering the counts. Setting thisparameter to TRUE is a more robust approach that reduces the effectof possible occuring outliers (these are bins with a very high number ofoverlapping regions) to the correlation.

3.4 Receiving DNA Sequence Pattern Positions

The idea of a MeDIP experiment is to identify methylated cytosins. For this,an antibody is used that recognizes methylated cytosines. However, it has beenshown [Down et al., 2008], [Pelizzola et al., 2008] that MeDIP signals scale withlocal densities of CpGs and are not necessarily influenced by only methylatedcytosines. In order to integrate the information about CpG densities into thefollowing analysis, it is necessary to identify the genomic positions of all CpGs.This can be achieved by typing

> CONTROL.SET = MEDIPS.getPositions(data = CONTROL.SET, pattern = "CG")

MEDIPS returns all start positions of CpGs on the plus strand of the refer-ence genome. As the CG pattern is reverse complementary, it is only necessaryto scan the plus strand. The pattern dependent slots of the MEDIPS SET objectnow store the necessary informations. (Have a look by just typing the name ofyour MEDIPS SET object). In principle, it is possible to identify the positionsof any other sequence pattern. For example, Lister and colleagues [Lister et al.,2009] have shown that in human embryonic stem cells, methylation occurs atcytosines outside of the CpG context. Therefore, it might be worthwhile to scanfor all cytosines within the reference genome. Please note, for sequence patternsthat are not reverse complementary, the function returns all start positions ofthe pattern within the plus and the minus strand. But for now, we will continueusing the CpG pattern.

3.5 Sequence Pattern Coverage Analysis

The main idea of the coverage analysis is to test the number of CpGs (or anyother predefined sequence pattern, see section 3.4) covered by the given shortreads and to have a look at the depth of coverage. Before the coverage analysiscan be executed, it is necessary to previously execute the MEDIPS.getPositionsfunction (see section 3.4). Afterwards, the coverage analysis can be started bytyping

> cr.control = MEDIPS.coverageAnalysis(data = CONTROL.SET, extend = 400,

+ no_iterations = 10)

For the coverage analysis, the total set of available regions is divided into dis-tinct random subsets of equal size, where the number of subsets is determinedby the parameter no_iterations (default=10). The coverage analysis itera-tively selects an increasing number of subsets and and tests how many CpGsare covered by the available regions. Moreover, it is tested how many CpGs arecovered at least 1x, 2x, 3x, 4x, 5x, and 10x. These levels of coverage depths canbe adjusted by setting the coverages parameter (see below). As the regions

11

are typically of short length (e.g. 36bp), it is recommended to extend the regionlength by an extend value (see section 3.2).

The coverage analysis does not modify the MEDIPS SET object but theresults are stored at the specified coverage results object (here cr.control).The results of the coverage analysis can be viewed by typing

> cr.control

$matrix

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 0 1 2 3 4 5 10

[2,] 0 0 0 0 0 0 0

[3,] 67286 290870 154070 82263 45467 24878 1602

[4,] 134572 379912 262623 180968 126273 88608 16743

[5,] 201858 423084 326396 251105 192828 148980 44099

[6,] 269144 448132 367569 300259 244053 199114 75598

[7,] 336430 464660 395837 336892 284558 240549 106892

[8,] 403716 476574 416570 363602 316302 274119 136050

[9,] 471002 485216 431674 384350 341293 302171 163168

[10,] 538288 491979 443668 401253 361360 324901 188099

[11,] 605574 497745 453188 414599 378254 344293 211097

[12,] 672866 502169 460963 425535 392066 360240 232422

$maxPos

[1] 578097

$pattern

[1] "CG"

$coveredPos

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 1.00 2.0 3.00 4.00 5.00 10.0

[2,] 502169.00 460963.0 425535.00 392066.00 360240.00 232422.0

[3,] 0.87 0.8 0.74 0.68 0.62 0.4

For example, the maxPos slot shows the total number of CpGs within chro-mosome 22 of the reference genome. Moreover, the matrix slot shows the resultsof the coverage analysis for each iteration and for each of the tested coveragedepths. Here, the first row shows the tested levels of coverage. The first columncontains the number of considered short reads in each iteration. The follow-ing columns show the number of sequence patterns covered (at least) by theaccording coverage depths.

The results of the coverage analysis can be visualized by typing

> MEDIPS.plotCoverage(cr.control)

12

0e+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

0e+

001e

+05

2e+

053e

+05

4e+

055e

+05

Coverage analysis

Total number of CGs: 578097Number of reads

Num

ber

of c

over

ed C

Gs

min. 1−foldmin. 2−foldmin. 3−foldmin. 4−foldmin. 5−foldmin. 10−fold

Further parameters that can be specified for the coverage analysis are:

� no_random_iterations: methods that randomly select data entries maybe processed several times in order to obtain more stable results. By spec-ifying the no_random_iterations parameter (default=1) it is possible torun the coverage analysis several times. The final results returned to thecoverage results object are the averaged results of each random iterationstep.

� coverages: default is c(1, 2, 3, 4, 5, 10). The coverages define the depthlevels for testing how often a sequence pattern was covered by the givenregions. Just specify any other vector of coverage depths you would liketo test.

Although an increase of short reads will always improve the sequence pat-tern coverage, the coverage analysis together with the saturation analysis allowfor gaining an impression on the overall sequence pattern coverage and repro-ducibility of reconstructing a methylome based on the available short reads.These data quality controls assist in deciding whether the costs of additionalexperimental runs are in due proportion to the expected improvement on cov-erage and reproducibility.

3.6 CpG Enrichment

As a quality check for the enrichment of CpG rich DNA fragments obtainedby the immunoprecipitation step of a MeDIP experiment, MEDIPS providesthe functionality to calculate CpG enrichment values. The main idea is to

13

check, how strong the regions are enriched for CpGs compared to the refer-ence genome. For this, MEDIPS counts the number of Cs, the number of Gs,the number CpGs, and the total number of bases within the specified referencegenome. Subsequently, MEDIPS calculates the relative frequency of CpGs andthe observed/expected ratio of CpGs present in the reference genome. Addition-ally, MEDIPS calculates the same for the DNA sequences underlying the givenregions. The final enrichment values result by dividing the relative frequencyof CpGs (or the observed/expected value, respectively) of the regions by therelative frequency of CpGs (or the observed/expected value, respectively) of thereference genome. (See also Supplementary Material in [Chavez et al., 2010].)

You can start the CpG enrichment analysis by typing

> er.control = MEDIPS.CpGenrich(data = CONTROL.SET)

The CpG enrichment analysis does not modify the MEDIPS SET object butthe results are stored at the specified enrichment results object (here er.control).You can access the results by typing

> er.control

$regions.CG

[1] 628173

$regions.C

[1] 6551995

$regions.G

[1] 6575462

$regions.relH

[1] 2.523184

$regions.GoGe

[1] 0.3630026

$genome.C

[1] 592445731

$genome.G

[1] 592804204

$genome.CG

[1] 28670425

$genome.relH

[1] 0.9903856

$genome.GoGe

[1] 0.2363220

$enrichment.score.relH

14

[1] 2.547679

$enrichment.score.GoGe

[1] 1.536051

The enrichment results object contains several slots that show the numberof Cs, Gs, and CpGs within the reference genome and within the given regions.Additionally, there are slots that show the relative frequency as well as the ob-served/expected CpG ratio within the reference genome and within the given re-gions. Finally, the slots enrichment.score.relH and enrichment.score.GoGe

indicate the enrichment of CpGs within the given regions compared to the ref-erence genome. For short reads derived from Input experiments (that is se-quencing of none-enriched DNA fragments), the enrichment values should beclose to 1 (see an example in section 3.11). In contrast, a MeDIP-seq exper-iment should return CpG rich sequences what will be indicated by increasedenrichment scores. In our example, the enrichment score for the relative CpGenrichment is 2.547679, indicating an enrichment of CpG rich regions.

In case you would like to examine not only the regions defined by the shortreads, but also the DNA sequences of the putative longer DNA fragments fromwhere the short reads were derived, it is possible to increase the length of theregions by specifying the extend (default NULL) parameter. By setting theextend to any positive value, the regions will be extended to the plus or tothe minus dircetion (dependent on the strand information of the reads) andafterwards the CpG enrichment will be calculated for the extended regions.

3.7 Creating the Coupling Vector

The need for MeDIP-seq data correction occurs by a varying efficiency of an-tibody binding with respect to local CpG densities. Similar to other MeDIPnormalization methods [Down et al., 2008], [Pelizzola et al., 2008], MEDIPStries to correct for this effect by incorporating local CpG densities into theMeDIP-seq signals. In order to correct for local CpG densities, first a couplingvector has to be calculated. The coupling vector is of the same size as the pre-defined genome vector (see also section 3.2) but contains local CpG densities(also called coupling factors) instead of the raw signals for each genomic bin.Before the coupling vector can be created, it is necessary to previously excecutethe MEDIPS.getPositions function (see section 3.4).

The coupling vector is created and attached to the MEDIPS SET object bytyping e.g.

> CONTROL.SET = MEDIPS.couplingVector(data = CONTROL.SET, fragmentLength = 700,

+ func = "count")

For each pre-defined genomic bin, the density of surrounding CpGs (or ofanother pre-defined sequence pattern, respectively, see section 3.4) is calcu-lated. For this, first a maximal distance has to be defined by specifying theparameter fragmentLength. Only CpGs within the range of [(bin_position-fragmentLength), bin_position+fragmentLength] will contribute to the fi-nal local coupling factor. The optimized value for the fragmentLength param-eter will reflect the estimated size of the sonicated DNA fragments. There areseveral ways for calculating a coupling factor for a genomic bin. The simplest

15

way is to count the number of CpGs within the maximal defined distance arounda genomic bin. Another approach is to weight each CpG by its distance to thecurrent genomic bin. CpGs further away from the current genomic bin will re-ceive smaller weights, whereas CpGs close to the genomic bin will receive higherweights. Again, there are several possible ways for such a weighting function.MEDIPS supports the following weighting functions (specified by the parameterfunc):

� count: simply count the number of CpGs within the predefined maximaldistance to the current bin

� linear: the weights for CpGs decrease in a linear way and end at 0 atthe predefined maximal distance to the current bin

� exp: the weights for CpGs decrease in an exponential way [Pelizzola et al.,2008]

� log: the weights for CpGs decrease in a logarithmic way [Pelizzola et al.,2008]

� custom: by setting the parameter to custom, it is required to specify acustom distance weights file using the parameter distFile. For exam-ple, [Down et al., 2008] have generated distance weights in an empiricalway. Based on their results, we have created the file flat_400_700.tab

that can be downloaded from http://medips.molgen.mpg.de. You cancreate such a distance file by your own and specify it here. Here, thefragmentLength parameter will be neglected and the maximal distancewithin your provided distance file will be the limit.

We have systematically calculated coupling factors with varying fragmentLength

and func parameters and compared the resulting coupling vectors to DNA-methylation values derived from bisulphite experiments performed by the humanepigenome project (HEP) [Eckhardt et al., 2006]. The best negative correlation(that is the higher the CpG density, the lower the bisulfite derived methylationvalues) was achieved by setting the parameters to fragmentLength=700 andfunc=count (see Supplementary Material in [Chavez et al., 2010]).

The coupling vector can be exported into a wiggle file by typing

> MEDIPS.exportWIG(file = "PatternDensity.WIG", data = CONTROL.SET,

+ pattern.density = TRUE, descr = "Pattern.density")

The exported Wiggle file can be uploaded into common genome browsers andallows for visualizing the density of the specified sequence pattern (e.g. CpGs)along the chromosomes. We recommend to gzip the file before uploading it tothe genome browser.

3.8 Calibration Curve and Linear Regression

As we have created a genome vector containing the raw signals at each genomicbin as well as an according coupling vector containing coupling factors at eachgenomic bin (both stored within the MEDIPS SET object), we can now examinethe dependency of local MeDIP-seq signal intensities and local CpG densities.This dependency can be made tangible by calculating the calibration curve:

16

> CONTROL.SET = MEDIPS.calibrationCurve(data = CONTROL.SET)

Calculation of the calibration curve is achieved by first dividing the totalrange of coupling factors into regular levels. Second, all genomic bins are parti-tioned into these levels by considering their associated coupling factors. Finally,for each level of coupling factors, MEDIPS calculates the mean raw signal andmean coupling factor of all genomic bins that fall into this level. (For a detaileddescription see Supplementary Material in [Chavez et al., 2010].)

The calibration curve represents averaged signals and coupling factors overthe full range of coupling factors. It indicates the experiment specific depen-dency between local signal intensities and CpG densities. The results of thecalibration curve calculation can be visualized by typing e.g.

> MEDIPS.plotCalibrationPlot(data = CONTROL.SET, linearFit = T,

+ plot_chr = "chr22")

Because the amount of data to be plotted can become very huge when plot-ting full genome data, it is strongly recommended to direct the output to acompressed file. This can be achieved by calling a e.g. png("plot.png") func-tion before calling the plot command. The plot will be available in the specifiedfile (do not forget the dev.off() command after the plotting command). Oth-erwise, R might not be able to visualize the plot in reasonable time. Each datavalue within the calibration plot represents a genomic bin. The X-axis showsthe raw signals and the Y-axis shows the coupling factors for the genomic bins.The red curve represents the calibration curve.

17

The calibration curve reveals that, in average, an increase of MeDIP-seqsignals is caused by an increasing CpG density. This approximately linear de-pendency is visible for the low range of coupling factors, only. For higher levelsof CpG densities, the mean MeDIP-seq signals decrease. It is assumed thatthis decrease is caused by the fact that in mammalian cells, regions of higherCpG densities are mainly unmethylated. In agreement with this assumption,Pelizzola and colleagues [Pelizzola et al., 2008] have shown that the dependencyof MeDIP derived signals and CpG density continues for higher levels of CpGdensities, by analyzing artificially fully methylated samples using MeDIP-Chip.In detail, they have identified a sigmoidal dependency between CpG density andMeDIP-Chip data. In agreement with Pelizzola et al. [Pelizzola et al., 2008], itis assumed that their signal plateau in the lower range of chip signals is causedby background noise. However, it is assumed that their signal plateau in theupper range of chip signals occurs by a saturation of hybridization events andis therefore an array specific artefact. Motivated by the observations made byPelizzola et al. [Pelizzola et al., 2008] and by visual inspection of the MeDIP-seqderived calibration curve, a continuing linear dependency of MeDIP-seq signalsfor higher levels of CpG densities is assumed. Analogous to Down et al. [Downet al., 2008], the local maximum of mean MeDIP-seq signals of the calibrationcurve in the lower part of coupling factors is identified. Let

y = y1,...,yl

be the mean coupling factors, and let

x = x1,...,xl

be the according mean MeDIP-seq signals of the calibration curve, where l isthe number of tested coupling factor levels and i = 1,...,l, then the smallestlevel i is identified, where

xi−3 ≤ xi−2 ≤ xi−1 ≤ xi ≥ xi+1 ≥ xi+2 ≥ xi+3

Let imax be the according identified level of i, then

ymax = y1, ..., yimax

xmax = x1, ..., ximax

are the parts of the calibration curve in the low range of coupling factors, wherean approximately linear dependency between MeDIP-seq signals and couplingfactors is observed. Here, xmax can be explained by a function of ymax as

xmax = f(ymax) + ε

where ε is an error variable (i.e. measurement errors) that is expected to spreadby chance and therefore, its expectation value is E(ε) = 0. Because a lineardependency between xmax and ymax is assumed, xmax can be described as

xmax = α+ β · ymax + ε

where the parameter α is the theoretical y-intercept, and the parameter β is thetheoretical slope. Based on the pre-calculated xmax and ymax vectors, linearregression is performed, in order to identify a suitable linear model. Linearregression estimates regression coefficients a and b for the parameters α and βso that it is valid:

18

xmaxi= a+ b · ymaxi

+ ei

where i = 1, ..., imax. Here, the residuum ei reflects the difference between theregression curve a+ b · ymaxi and the measurements of xmaxi . Moreover, xmaxi

can be replaced by an estimate x̂maxi , where

xmaxi− x̂maxi

= ei

and therefore, it is valid:

x̂maxi = a+ b · ymaxi

After having calculated the calibration curve, the MEDIPS.calibrationCurve()function performs the described linear regression and stores concrete values forthe parameters a (intercept) and b (slope) within the according slots of theMEDIPS SET. By accessing the received parameters a and b, concrete valuesfor the parameter x̂maxi can be calculated by the latter formula. For the lowrange of coupling factors, these estimates model the observed progression of thecalibration curve. As discussed above, a continuing linear dependency betweenMeDIP-seq signals and CpG density is expected for the higher range of couplingfactors. Based on the obtained linear model parameters, concrete x̂maxi

valuescan be calculated for the full range of coupling factors. Therefore,

x̂ = x̂1, ..., x̂maxi , ..., x̂l

are the estimated mean MeDIP-seq signals over the full range of coupling factorlevels l, calculated with respect to the obtained linear model parameters.

When the parameter linearFit of the MEDIPS.plotCalibrationPlot()

function was set to TRUE, the calibration plot contains a linear curve (greencurve) that visualizes the results of the performed linear regression. This curverepresents the calculated linear dependency between signals and CpG densitiesas estimated from the low range of coupling factors.

Further parameters that can be specified when plotting the calibration curveare:

� plot_chr: default="all". Please don’t forget to call a e.g. png("file.png")function before calling the plot command using all (see above). Alter-natively, you can specify a selected chromosome (e.g. chr1). Here, theplot_chr parameter only affects the plot and does not affect the MEDIPSSET object.

� xrange: The mean signal range of the calibration curve typically falls intoa low signal range. By setting the xrange parameter to e.g. 50 (suitablefor raw data), the calibration plot will only plot genomic bins associatedwith signals <=50. Therefore, the effect of an increased CpG density toan increased signal can be better visualized, especially if the data containsgenomic bins with high signals.

� rpm: can be either TRUE or FALSE. If set to TRUE, the signals willbe transformed into reads per million (rpm) before plotted. Additionally,the mean signal values of the calibration curve and of the estimated lin-ear curve will be transformed to rpm scale. The coupling values remainuntouched.

19

The calibration plot is very characteristic for MeDIP-seq experiments. Thequality of the enrichment step of the MeDIP experiment can be estimated byvisual inspection of the progression of the calibration curve. Calibration curvesfor data derived from Input experiments look different (please see an examplein section 3.11).

3.9 Relative Methylation Score and Export of NormalizedData

As soon as the normalization parameters are calculated (see previous section),the raw signals will be normalized and the normalized data will be stored withinthe MEDIPS SET object.

> CONTROL.SET = MEDIPS.normalize(data = CONTROL.SET)

For MeDIP-seq data normalization, x̂ (see section 3.8) is utilized in orderto weight the observed MeDIP-seq signals of the genomic bins with respect totheir associated coupling factors. Let (xbini

, ybini) be the raw MeDIP-seq signal

of the genomic bin i (i.e. the number of overlapping extended short reads), andthe pre-calculated coupling factor at the genomic bin i, where i = 1, ...,m andm is the total number of genomic bins, then the normalized relative methylationscore is defined as

rmsbini=

xbini·106

(a+b·ybini)·n =

xbini·106

x̂bini·n

where x̂bini= a + b · ybini

is the estimated weighting parameter obtained byconsidering the coupling factor ybini

of the genomic bin i, and a and b arethe pre-calculated regression parameters. Based on the total number of shortreads (n), the raw MeDIP-seq signals are, in parallel, transformed into a readsper million format in order to assure that rms values are comparable betweenmethylomes generated from differing amounts of short reads.

The rms values can be exported into a wiggle file by typing

> MEDIPS.exportWIG(file = "output.rms.control.WIG", data = CONTROL.SET,

+ raw = F, descr = "hESCs.rms")

All slots of the MEDIPS SET are now occupied. You can again have a lookat the MEDIPS SET by typing

> CONTROL.SET

S4 Object of class MEDIPSset

=======================================

Regions information

=======================================

Regions file: MeDIP_hESCs_chr22.txt

Organism: BSgenome.Hsapiens.UCSC.hg19

Chromosomes: chr22

Chromosome lengths: 51304566

Number of regions: 672866

Regions chromosomes: chr22 chr22 chr22...

20

Regions start positions: 16053184 16054710 16080510...

Regions stop positions: 16053220 16054746 16080546...

Regions strand: - - +...

=======================================

Genome vector signals information

=======================================

Genome wide bin size: 50

Reads extended by: 400

Genome vector chromosomes: chr22 chr22 chr22...

Genome vector positions: 1 51 101...

Genome vector signals: 0 0 0...

=======================================

Pattern information

=======================================

Pattern: CG

Number of patterns: 578097

Pattern chromosomes: chr22 chr22 chr22...

Pattern positions: 16050097 16050114 16050174...

=======================================

Genome vector coupling factor information

=======================================

Distance function: count

Distance file: empty

Fragment length: 700

Genome vector coupling factors: 0 0 0...

=======================================

Calibration information

=======================================

Calibration curve mean signals: 0.001075285 1.609183 1.049421...

Calibration curve mean coupling factors: 0 1 2...

Calibration curve variance: NA NA NA...

Intercept: 1.1525742073778

Slope: 2.51962253452531

Calibration chromosome: all

=======================================

Genome vector normalized signal information

=======================================

Normalization output interval: [0:1000]

Genome vector normalized signals: 0 0 0...

3.10 Methylation Profiles and Absolute Methylation Score

A typical question of MeDIP based DNA-methylation experiments is to exam-ine the methylation state of specific genomic regions, like e.g. CpG islands,promoters and other regions of interest (ROI).

MEDIPS provides the functionalities to calculate averaged methylation val-ues for either pre-defined ROIs or for genome wide windows. In order to receivethe methylation state of targeted genomic regions, first a ROI file has to becreated. The required structure of any file containing regions of interest is:

21

� the first column contains the chromosome of the ROI (e.g. chr1)

� the second column contains the start position of the ROI

� the third column contains the stop position of the ROI

� the fourth column contains an identifier of the ROI

As an example, you can access the file hg19.chr22.txt in the extdata

subdirectory of the MEDIPS package.The file contains hg19 promoter regions of Ensembl transcripts located on

the chromosome 22 as received from www.biomart.org. Here, the genomiccoordinates start at -1kb of the transcript start sites (TSSs) and stop at +0.5kbdownstream of the TSSs. Mean methylation values for these regions can besummarized based on the generated MEDIPS SET object by typing:

> file = system.file("extdata", "hg19.chr22.txt", package = "MEDIPS")

> promoter = MEDIPS.methylProfiling(data1 = CONTROL.SET, ROI_file = file,

+ math = mean, select = 2)

Further parameters that can be specified are:

� chr: only the specified chromosome will be evaluated (e.g. chr1).

� select: can be either 1 or 2. If set to 1, variances will be calculated basedon the rpm values; if set to 2, variances will be calculated based on therms values.

� transf: default=TRUE; If set to TRUE, MEDIPS transforms the mean rmsand ams values into log2 scale and subsequently transforms their resultingdata range into the consistent interval [0, 1000] before finally stored.

� math: default=mean; Here, you can specify other functions available in Rfor summarizing values like median or sum.

� data2: default=NULL; Here, a second MEDIPS SET can be provided, incase two MEDIPS SETs have to be compared (see also section 3.12).

� input: default=NULL; Here, an INPUT SET can be provided. An INPUTSET is a MEDIPS SET but generated from data derived from an Inputsample. An Input sample is generated by sonication of the DNA butwithout a subsequent methylated cytosine specific immunoprecipitationstep. Therefore, Input data reflects the genomic background. MEDIPSallows for accessing Input data (if available) in order to calculate a genomicbackground signal distribution. Such genomic background signals can beutilized for identifying hyper-methylated regions when two MEDIPS SETsare compared (see also section 3.12).

� frame_size: default=NULL; In spite of calculating averaged methylationvalues for given regions of interest, MEDIPS allows for calculating methy-lation levels for genome wide windows. The frame_size parameter definesthe size of the genomic windows to be tested. It is obvious that either theframe_size or the ROI_file parameter has to be specified.

22

� step: default=NULL; In case the frame_size parameter is specified, thestep parameter can be set to any arbitrary positive value. The step pa-rameter defines the number of bases by which the genomic windows areshifted along the chromosomes. If step remains NULL, non overlappingadjacent genomic windows will be examined. By setting the step param-eter to e.g. 250 bp and by setting the frame_size parameter to e.g. 500bp, overlapping genomic windows will be examined, where the overlap ofneighbouring genomic windows is 250 bp (see also the example below).

The results are stored as a list at the specified object (here promoter). Alllist objects are vectors of the same length, where the length is defined by thenumber of tested ROIs. Please note, as we have generated and specified onlyone MEDIPS SET object so far, the results object only contains results (e.g.mean rpm and rms values) from one MEDIPS SET (provided at data1). Theremaining vectors are empty and will be occupied when two MEDIPS SETobjects will be compared (see section 3.12). Each row refers to a ROI, the rownames contain the IDs of the provided ROIs, and the vectors of the list are:

1. chr: the chromosome of the ROI

2. start: the start position of the ROI

3. stop: the stop position of the ROI

4. length: the number of genomic bins included in the ROI

5. coupling: the mean coupling factor of the ROI

6. input: the mean rpm value of the INPUT SET at input (if provided)

7. rpm_A: the mean rpm value for the MEDIPS SET at data1

8. rpm_B: the mean rpm value for the MEDIPS SET at data2 (if a secondMEDIPS SET is provided)

9. rms_A: the (transformed, see below) mean rms value for the MEDIPS SETat data1

10. rms_B: the (transformed, see below) mean rms value for the MEDIPS SETat data2 (if a second MEDIPS SET is provided)

11. ams_A: the (transformed, see below) mean absolute methylation score (seebelow) for the MEDIPS SET at data1

12. ams_B: the (transformed, see below) mean absolute methylation score (seebelow) for the MEDIPS SET at data2 (if a second MEDIPS SET is pro-vided)

13. var_A: the variance of the rpm (or rms, respectively, please see the pa-rameter select) values of the MEDIPS SET at data1

14. var_B: the variance of the rpm (or rms, respectively, please see the param-eter select) of the MEDIPS SET at data2 (if a second MEDIPS SET isprovided)

23

15. var_co_A: the coefficient of variation of the rpm (or rms, respectively,please see the parameter select) of the MEDIPS SET at data1

16. var_co_B: the coefficient of variation of the rpm (or rms, respectively,please see the parameter select) of the MEDIPS SET at data2 (if asecond MEDIPS SET is provided)

17. ratio: rpm A/rpm B (or rms A/rms B, respectively, please see the pa-rameter select) (if a second MEDIPS SET is provided)

18. pvalue.wilcox: the p.value returned by R’s wilcox.test function forcomparing the rpm values (or rms values, respectively, please see the pa-rameter select) of the MEDIPS SET on data1 and of the MEDIPS SETat data2 (if a second MEDIPS SET is provided)

19. pvalue.ttest: the p.value returned by R’s t.test function for compar-ing the rpm values (or rms values, respectively, please see the parameterselect) of the MEDIPS SET on data1 and of the MEDIPS SET at data2(if a second MEDIPS SET is provided)

Mean rms values (rms_A and rms_B) are calculated based on the pre-calculatedrmsbini

values of the genomic bins enclosed by the ROI. In case, the parametertransf was set to TRUE, MEDIPS transforms the mean rms values into log2scale and subsequently transforms the resulting data range into the consistentinterval [0, 1000] before finally stored. Therefore, the minimal transformed meanrms_A (or mean rms_B, respectively) value will always be 0 and the maximaltransformed mean rms_A (or mean rms_B, respectively) will always be 1000.This transformation assures that the resulting mean rms values of the testedROIs will spread over a consistent interval. Therefore, highly methylated ROIsmay be subsequently identified by selecting ROIs associated to e.g. rms_A>600.The other way round, lowly methylated ROIs may be subsequently identifiedby selecting ROIs associated to e.g. rms_A<400. However, it has to be kept inmind that the resulting data range ([0, 1000]) reflects the relative methylationlevels of the tested ROIs.

We consider the rms values as experiment specific normalized MeDIP-seqsignals corrected for the varying efficiency of antibody binding and immunopre-cipitation in genomic regions with different CpG densities. In order to identifyan absolute methylation estimate for any specified region of interest, i.e. ei-ther for any functional genomic regions like promoters or CpG islands or forgenome wide windows of arbitrary length, the raw MeDIP-seq values can benormalized into absolute methylation scores (ams). The absolute methylationscores additionally correct for the general CpG density of the region of inter-est and therefore, allow for comparing methylation profiles of genomic regionswith different CpG densities. This is especially needful, when local methylationlevels are associated to further functional and regulatory mechanism like e.g.gene expression alterations. As an example, it is supposed that methylationlevels of proximal promoters influence the transcription rate of the accordinggenes. However, promoters are known to show a wide spread spectrum of CpGdensities. Therefore, a fully methylated high CpG density promoter will showmuch higher MeDIP signals than a fully methylated low CpG density promoter,although in both cases the promoter methylation level may influence the tran-scription rate in a comparable way. Therefore, it remains inaccurate to conclude

24

an absolute measure of promoter methylation by comparing MeDIP-seq derivedrpm or rms signals from promoters with different CpG densities.

Let

ROI = ((xbin1, ybin1

), ..., (xbins, ybins

))

be the raw MeDIP-seq signals and coupling factors of adjacent genomic bins ithat define a region of interest (ROI), where i = 1, ..., s and s is the total numberof genomic bins comprised by the ROI, then the absolute methylation score forthe ROI is defined as:

amsROI =

1s

s∑i=1

xbini·106

(a+b·ybini)·n

1s

s∑i=1

ybini

After having exectuted the MEDIPS.methylProfiling() function for thespecified ROI file containing promoter regions, one can have a look at e.g. thehistogram of CpG densities or methylation levels. Simply call Rs hist() func-tion and specify the desired column of the matrix. For the mean coupling factorstype e.g.

> hist(promoter$coupling, breaks = 100, main = "Promoter CpG densities",

+ xlab = "CpG coupling factors")

Promoter CpG densities

CpG coupling factors

Fre

quen

cy

0 50 100 150 200

050

100

150

The histogram shows a bimodal distribution of promoter CpG densities. Fora histogram of mean rpm values type

> hist(promoter$rpm_A[promoter$rpm_A != 0], breaks = 100, main = "RPM signals",

+ xlab = "reads/bin")

25

RPM signals

reads/bin

Fre

quen

cy

0 20 40 60 80 100

010

020

030

040

050

0

Because there are promoter regions without any signals, we do not considerthem for the histogram. The plot shows that a bimodal methylation distributionof the promoters is not visible for the rpm signals. The rms or ams valuesindicate a bimodal distribution of promoter methylation:

> hist(promoter$ams_A[promoter$ams_A != 0], breaks = 100, main = "AMS signals",

+ xlab = "absolute methylation score (ams)")

26

AMS signals

absolute methylation score (ams)

Fre

quen

cy

0 200 400 600 800 1000

020

4060

8010

012

014

0

We have shown [Chavez et al., 2010] that such summarized methylationvalues for ROIs, especially the ams values, are most suitable for comparingMeDIP data to e.g. bisulphite sequencing or whole genome shotgun bisulphitesequencing data.

Besides summarizing methylation values for pre-defined ROIs, MEDIPS al-lows for calculating mean methylation values along the chromosomes. For this,you have to specify a desired frame size using the parameter frame_size. Ad-ditionally, you can specify the step parameter. The step parameter defines thenumber of bases by which the frames are shifted along the chromosome. If youe.g. set the frame_size parameter to 500 and the step parameter to 250, thenMEDIPS calculates mean methylation values for overlapping 500bp windows,where the size of the overlap will be 250bp for all neighbouring windows. With-out specifying the step parameter, MEDIPS will calculate mean methylationvalues for all none-overlapping windows of size frame_size.

> frames.frame500.step250 = MEDIPS.methylProfiling(data1 = CONTROL.SET,

+ frame_size = 500, step = 250, math = mean, select = 2)

Please note, the MEDIPS.methylProfiling() function takes a comparablelong processing time when called for genome wide short windows. For example,the processing of the full human genome using overlapping 500bp windows takesapprox. 10h on our hardware. Therefore, you may want to store the receivedmatrix afterwards by using R’s write.table() function like:

> write.table(frames.frame500.step250, file = "frames.chr22.meth.txt",

+ sep = "\t", quote = F, col.names = T, row.names = F)

27

Here, you do not need to store the row.names as genome wide frames willnot have identifiers. For ROIs provided within a ROI file, you might want toset row.names=T in order to keep the identifiers.

You can upload the results table at any later time into R by typing

> frames.frame500.step250 = read.table(file = "frames.chr22.meth.txt",

+ header = T)

3.11 Additional MEDIPS SET Objects

In order to identify differentially methylated regions (DMRs) between two dif-ferent conditions, a second MEDIPS SET has to be created and processed. Inour example, the file MeDIP_DE_chr22.txt contains MeDIP-seq data of chromo-some 22 derived from human embryonic stem cells after differentiation along theendodermal lineage into definitive endoderm (DE) ([Chavez et al., 2010]). Wenow process the data using the same parameter settings as for the previouslycreated CONTROL.SET:

> file = system.file("extdata", "MeDIP_DE_chr22.txt", package = "MEDIPS")

> TREAT.SET = MEDIPS.readAlignedSequences(BSgenome = "BSgenome.Hsapiens.UCSC.hg19",


> TREAT.SET = MEDIPS.genomeVector(data = TREAT.SET, bin_size = 50,

+ extend = 400)

> TREAT.SET = MEDIPS.getPositions(data = TREAT.SET, pattern = "CG")

> TREAT.SET = MEDIPS.couplingVector(data = TREAT.SET, fragmentLength = 700,

+ func = "count")

> TREAT.SET = MEDIPS.calibrationCurve(data = TREAT.SET)

> TREAT.SET = MEDIPS.normalize(data = TREAT.SET)

So far, we have the second MEDIPS SET sufficiently processed for the subse-quent comparison of two MEDIPS SETs. Here, we will not perform the qualitycontrols or further diagnostic analyses for the TREAT.SET. However, it is rec-ommended to calculate these quality metrics for each data set.

For identification of differentially methylated regions, it is recommended (butnot necessary) to provide background data from an Input experiment (that issequencing of non-enriched DNA fragments). By providing an Input data set,MEDIPS will calculate a background data distribution used for the identificationof DMRs. Therefore, we now process the available Input data given in theexample file Input_StemCells_chr22.txt:

> file = system.file("extdata", "Input_StemCells_chr22.txt", package = "MEDIPS")

> INPUT.SET = MEDIPS.readAlignedSequences(BSgenome = "BSgenome.Hsapiens.UCSC.hg19",


> INPUT.SET = MEDIPS.genomeVector(data = INPUT.SET, bin_size = 50,

+ extend = 400)

28

In case of Input data, we do not correct for CpG denstities because noenrichment for methylated CpG’s was performed. Only the rpm values willbe considered. Therefore, the INPUT SET is already sufficiently processed forbeeing integrated into the DMR identification process. In the context of thismanual, we do not perform the quality controls and further diagnostic analysesfor the INPUT.SET. However, we strongly recommend to calculate the qualitycontrol metrics as described for the CONTROL.SET, in order to work out thedifferences between MeDIP and Input data (e.g. saturation-, coverage-, CpGenrichment analyses, and creation of the calibration plot). These analyses canbe performed for the INPUT.SET by typing e.g.:

> MEDIPS.exportWIG(file = "output.rpm.input.WIG", data = INPUT.SET,

+ raw = T, descr = "INPUT.rpm")

> sr.input = MEDIPS.saturationAnalysis(data = INPUT.SET, bin_size = 50,

+ extend = 400, no_iterations = 10, no_random_iterations = 1)

> MEDIPS.plotSaturation(sr.input)

The saturation analysis will show that the reproducibility increases moreslowly for Input data than for MeDIP data. This is due to an increased com-plexity of available genomic DNA that has to be sequenced when no specificenrichment was performed.

> INPUT.SET = MEDIPS.getPositions(data = INPUT.SET, pattern = "CG")

> cr.input = MEDIPS.coverageAnalysis(data = INPUT.SET, extend = 400,

+ no_iterations = 10)

> MEDIPS.plotCoverage(cr.input)

> er.input = MEDIPS.CpGenrich(data = INPUT.SET)

The CpG enrichment values indicate that Input data is not as strong enrichedfor CpGs as MeDIP data.

> INPUT.SET = MEDIPS.couplingVector(data = INPUT.SET, fragmentLength = 700,

+ func = "count")

> INPUT.SET = MEDIPS.calibrationCurve(data = INPUT.SET)

> png("CalibrationPlotINPUT.png")

> MEDIPS.plotCalibrationPlot(data = INPUT.SET, linearFit = F, plot_chr = "chr22")

> dev.off()

The calibration curve of the calibration plot allows for discriminating MeDIPfrom Input data.

3.12 Methylation Profiles of Two MEDIPS SET Objects

As we have generated two MEDIPS SETs and one INPUT SET, we will nowcalculate mean methylation values for all three sets in parallel and comparethe methylation profiles of the two MEDIPS SETs with respect to the INPUTSET. Methylation profiles and comparisons can be performed for given regionsof interest (see section 3.10) or for genome wide (sliding) frames. This is alldone by the MEDIPS.methylProfiling() function.

Here, we process genome wide adjacent 500bp windows:

29

> diff.meth = MEDIPS.methylProfiling(data1 = CONTROL.SET, data2 = TREAT.SET,

+ input = INPUT.SET, frame_size = 500, select = 2)

Please note, the MEDIPS.methylProfiling() function takes a comparativelylong processing time when called for genome wide short windows. In fact, thisis the most time-consuming bottleneck of the whole procedure for identifyingDMRs.

The results are stored as a list at the specified object (here diff.meth). Alllist objects are vectors of the same length, where the length is defined by thenumber of tested frames. Each row refers to a ROI and the individual vectorsare as described in section 3.10. Here, all vectors are occupied, including meanmethylation values for the two MEDIPS SETs and for the INPUT SET, as wellas ratios and p-values obtained from comparing the two MEDIPS SETs.

Let C, T , and I be the genome vectors generated based on the sequencingdata from the CONTROL.SET, TREAT.SET, and INPUT.SET objects usingan arbitrary bin size b and let ROI be a set of ROIs (e.g. genome wide windows),whereROI = ROI1, ..., ROIi, ...ROIn, and n is the number of ROIs to be tested.Let the ROIs be of length m1, ...,mn. In the following, the identification ofDMRs is only supported for any ROIi of length mi ≥ 5·b. Therefore, each ROIiincludes at least five genomic bins bini,j , where bini,1, ..., bini,j , ..., bini,kiεROIiand ki = floor(mi

b ). For each ROIi, mean rpm and mean rms values arecalculated based on C and T as:

C.RPMROIi = 1ki

ki∑j=1

rpm(C.bi,j)

C.RMSROIi = 1ki

ki∑j=1

rms(C.bi,j)

T.RPMROIi = 1ki

ki∑j=1

rpm(T.bi,j)

T.RMSROIi = 1ki

ki∑j=1

rms(T.bi,j)

where rpm(C.bi,j), rms(C.bi,j), rpm(T.bi,j), and rms(T.bi,j) are the pre-calculated rpm and rms (see sections 3.2 and 3.9) values of the genomic binsfrom the Control and Treatment samples. In addition, for each ROIi, meanrpm values are calculated based on I as:

I.RPMROIi = 1ki

ki∑j=1

rpm(I.bi,j)

where rpm(I.bi,j) are the pre-calculated rpm values of the genomic binsfrom the Input sample. Based on the mean rms values of the Control and ofthe Treatment sample, for each ROIi the following ratios are calculated:

r.rmsROIi =C.RMSROIi

T.RMSROIi

In addition, by considering the mean rpm values of the Control or of theTreatment sample, respectively, the following ratios are calculated with respectto mean rpm values of the Input sample:

30

r.rpm.CROIi =C.RPMROIi

I.RPMROIi

r.rpm.TROIi =T.RPMROIi

I.RPMROIi

Because local background sequencing signals are variable along the chro-mosomes due to differing DNA availability, a global background rpm signalthreshold is estimated based on the distribution of all calculated I.RPMROIi

values. This is done by defining a targeted quantile qt (e.g. qt = 0.95) andby identifying the I.RPMROIi value (t), where qt% of all I.RPMROIi valuesare < t. This estimated global minimal mean rpm threshold t will serve as anadditional parameter for selecting genomic regions that show a mean MeDIP-seq derived rpm signal of at least t in the Control or the Treatment sample,respectively (see next section).

In addition, statistical testing is utilized in order to rate whether the ob-tained rms data series of the genomic bins within any ROIi significantly differin the Control sample compared to the Treatment sample. For each ROIi it istested, whether the rpm (or rms, respectively, see parameter select) values ofthe genomic bins bini,1, ..., bini,j , ..., bini,ki

εROIi of the Control sample signifi-cantly differ from the rpm (or rms, respectively, see parameter select) valuesof the according genomic bins of the Treatment sample. For this, the MEDIPSpackage utilizes the t.test() and wilcox.test() functions of the R environ-ment with default parameter settings (two-sided tests in both cases). Therefore,for each tested ROIi, two p-values (ROI.p.value.ti and ROI.p.value.wi) will becalculated and serve as a further level for discriminating between local methy-lation profiles (see next section).

Please note, in case the trasnf parameter was set to TRUE, the returnedmean rms and ams values within the diff.meth object are in log2 scale andwere transformed into the consistent interval [0:1000]. However, ratios andp.values were calculated before the data was log2 scaled and shifted into theinterval [0:1000]. Ratios and p.values can be calculated based on the rpm or rmsvalues by specifying the parameter select of the MEDIPS.methylProfiling()

function.You may want to store the received result matrix by using R’s write.table()

function like:

> write.table(diff.meth, file = "diff.meth.txt", sep = "\t", quote = F,

+ col.names = T, row.names = F)

You can upload the result matrix at any later time into R by typing

> diff.meth = read.table(file = "diff.meth.txt", header = T)

3.13 Selecting Candidate DMRs and Annotation

Based on the results returned from the MEDIPS.methylProfiling() function insection 3.12 (here diff.meth), we now select candidate ROIs that show signifi-cant differential methylation between the CONTROL.SET and the TREAT.SETin consideration of the background data included in the INPUT.SET. For this,MEDIPS offers the possibility to specify the following parameters in order toapply several filters to the full set of ROIs:

31

� frames: specifies the result table

� input: default=T; Setting the parameter to TRUE requires that the re-sults table includes a column for summarized rpm values of an INPUTSET. In case there is no Input data available, the input parameter has tobe set to a rpm value that will be used as a lower threshold during the sub-sequent analysis. How to estimate such a threshold without backgrounddata is not yet solved by MEDIPS.

� quant: default=0.9; from the distribution of all Input rpm values, MEDIPScalculates the rpm value that represents the quant quantile of the wholeInput distribution.

� control: can be either TRUE or FALSE. MEDIPS allows for selectingROIs that are higher methylated in the CONTROL SET compared tothe TREAT SET and vice versa but both approaches have to be per-fomed in two independent runs. By setting control=TRUE, ROIs will beselected that are higher methylated in the CONTROL SET. By settingcontrol=FALSE, ROIs will be selected that are higher methylated in theTREAT SET.

� up: default=1.333333; defines the lower threshold for the ratio CON-TROL/TREATMENT as well as the lower ratio for CONTROL/INPUT(if control=T) or TREATMENT/INPUT (if control=F), respectively.

� down: default=0.75; defines the upper threshold for the ratio: CON-TROL/TREATMENT (only if control=F).

� p.value: default=0.01; defines the threshold for the p-values. One of thetwo p-values derived from the wilcox.test and t.test functions has tobe <= p.value.

The following command filters for candidate frames that are higher methy-lated in human embryonic stem cells than in differentiated hESCs:

> diff.meth.control = MEDIPS.selectSignificants(frames = diff.meth,

+ control = T, input = T, up = 2, p.value = 1e-04, quant = 0.99)

[1] Processing input distribution...

[1] Done.

[1] Total number of frames: 102610

[1] Number of frames where control or treatment !=0: 66656

[1] Remaining number of frames with p.value<=1e-04: 7567

[1] Remaining number of frames where control/treatment ratio >= 2: 4108

[1] Estimated rpm threshold for input quantile 0.99 is: 27.4977604916713

[1] Remaining number of frames with control rpm >=27.4977604916713: 433

[1] Remaining number of frames with control/input ratio>=2: 294

[1] [Note: There are 0 frames associated to a p-value==0.]

[1] [Note: There are 0 frames, where control/treatment ratio = Inf (i.e. treatment==0).]

For identifying ROIi’s that show differential methylation between the Con-trol and the Treatment sample with respect to the Input sample, based on the

32

pre-calculated parameters (see previous section), a filtering procedure is per-formed. The following filtering procedure also discriminates between increasedmethylation in the Control sample compared to the Treatment sample (Con-trol>Treatment, a) and vice versa (Treatment>Control, b):

1. ROIi’s where C.RPMROIi = T.RPMROIi = 0 are neglected,

2. ROIi’s where ROI.p.value.ti > p.value and ROI.p.value.wi > p are ne-glected, where p.value is any targeted level of significance (e.g. p.value =0.01),

3. Filtering for the ratio:

� a) ROIi’s where r.rmsROIi < up are neglected, where up is an upperratio threshold (e.g. up = 1.33),

� b) ROIi’s where r.rmsROIi > down are neglected, where down is alower ratio threshold (e.g. down = 0.75),

4. Filtering for global Input derived background signals:

� a) ROIi’s where C.RPMROIi < t are neglected,

� b) ROIi’s where T.RPMROIi < t are neglected,

5. Filtering for local Input derived background signals:

� a) ROIi’s where r.rpm.CROIi < up are neglected,

� b) ROIi’s where r.rpm.TROIi < up are neglected.

In our example, there are 294 frames of length 500bp that remain after theMEDIPS.selectSignificants() step. In case, the MEDIPS.methylProfiling()function was executed in order to test overlapping frames (i.e. specifyingthe step parameter), overlapping significant frames may be returned to thediff.meth.control results table. For these cases it is worthwhile to mergeoverlapping regions by typing:

> diff.meth.control.merged = MEDIPS.mergeFrames(frames = diff.meth.control)

Please note, merged frames are represented only by their genomic coordinateswithin the diff.meth.control.merged table (these are the chromosome names,new start, and new stop positions). The result table does not contain any mergedrpm, rms, variance, p.value, etc. values any more.

Moreover, it is important to keep in mind, that there are three main reasonswhy an analysis of only a subset of the full genome (here only chromosome22) will probably end up in a different number of significant DMRs: First, thecalibration parameters were calculated for only one chromosome. Second, thetotal number of given regions differs and therefore, the reads per million valueswill differ. Third, the rpm threshold derived from the background Input datadistribution was calculated based on data from only one chromosome.

You can write out the obtained regions by typing some suitable R code like:

> write.table(diff.meth.control, file = "DiffMethyl.Up.hESCs.bed",

+ sep = "\t", quote = F, row.names = F, col.names = F)

33

You can upload the resulting file into a genome browser and the DMRs willbe visualized as black blocks.

Finally, it might be of interest to annotate the DMRs. We fall back on theexample ROI file hg19.chr22.txt that contains pre-defined promoter regions(-1kb to +0.5kb around the TSSs). We annotate the identified DMRs by thetranscript promoters included in the ROI file by typing


> diff.meth.control.annotated = MEDIPS.annotate(diff.meth.control,

+ anno = file)

The resulting table is of the following format:

1. chr: the chromosome name of the DMR

2. start: the start position of the DMR

3. end: the stop position of the DMR

4. feature: the name of the matched annotation

For each provided region (DMR), the function returns all overlapping anno-tations included in the provided annotation file. Note, in case there are severaloverlapping annotations, the region (DMR) is returned several times in sepa-rated rows, each entry associated to one annotation. In order to receive e.g. aunique list of ensembl transcript names whose promoter regions overlap with aDMR, you can now easily select for the appropriate entries using standard Rcommands, e.g.

> length(unique(diff.meth.control.annotated[, 4]))

[1] 25

In our example, the 294 identified DMRs on chromosome 22 can be asso-ciated to promoter regions of 25 unique transcripts (including genes as wellas pseudogenes and small RNAs). These DMRs can be interpreted as regionswhere de-methylation events occur during the differentiation of hESCs alongthe endodermal lineage.

The other way round, we now select for frames higher methylated in differ-entiated hESCs (DE) than in pluripotent hESCs:

> diff.meth.treat = MEDIPS.selectSignificants(frames = diff.meth,

+ control = F, input = T, up = 2, down = 0.5, p.value = 1e-04,

+ quant = 0.99)

[1] Processing input distribution...

[1] Done.

[1] Total number of frames: 102610

[1] Number of frames where control or treatment !=0: 66656

[1] Remaining number of frames with p.value<=1e-04: 7567

[1] Remaining number of frames where treatment/control ratio <= 0.5: 1605

[1] Estimated rpm threshold for input quantile 0.99 is: 27.4977604916713

[1] Remaining number of frames with treatment rpm >=27.4977604916713: 64

34

[1] Remaining number of frames with treatment vs. input ratio>=2: 47

[1] [Note: There are 0 frames associated to a p-value==0.]

[1] [Note: There are 0 frames, where control/treatment ratio = 0 (i.e. control=0).]

> write.table(diff.meth.treat, file = "DiffMethyl.Up.DE.bed", sep = "\t",

+ quote = F, row.names = F, col.names = F)


> diff.meth.treat.annotated = MEDIPS.annotate(diff.meth.treat,

+ anno = file)

> length(unique(diff.meth.treat.annotated[, 4]))

[1] 28

In our example, there are 47 non-overlapping DMRs on chromosome 22 as-sociated to promoter regions of 28 unique transcripts. These DMRs can beinterpreted as regions where de-novo methylation events occur during the dif-ferentiation of hESCs along the endodermal lineage.

4 Concluding Remarks

In our opinion, MEDIPS provides several helpful functionalities for analysingMeDIP-seq data in reasonable time compared to other available approaches.Nevertheless, there are some limitations that have to be adressed in the future.Main issues are:

� Because MEDIPS processes full genome data at once, MEDIPS needs alot of memory. Especially, when two MEDIPS SETs as well as an INPUTSET is uploaded and utilized for the identification of DMRs, the need formemory is very huge.

� The runtime, especially calculation of mean methylation values or identi-fication of DMRs at genome wide short windows remains a bottleneck.

� The MEDIPS.methylProfiling() function is a novel approach for theidentification of DMRs, and is especially suitable for MeDIP-seq data be-cause DNA methylation is expected to occur on longer DNA stretchescompared to smaller enrichments derived from e.g. ChIP-seq data. How-ever, identification of DMRs is very static. The definition of fixed windows,although when overlapping windows are allowed, is not very flexible. Adynamic method for the identification of optimized DMRs might be ofinterest.

However, to the best of our knowledge, MEDIPS is currently the most com-prehensive software for processing MeDIP-seq data. It starts where the mappingtools stop, touches several aspects of data quality checks, allows for exportingraw and normalized methylation profiles, calculates mean methylation valuesfor any specified ROI and identifies genome wide DMRs when hunting for dif-ferential DNA-methylation comparing two conditions.

35

References

Lukas Chavez, Justyna Jozefczuk, Christina Grimm, Joern Dietrich, Bernd Tim-mermann, Hans Lehrach, Ralf Herwig, and James Adjaye. Computationalanalysis of genome-wide dna methylation during the differentiation of humanembryonic stem cells along the endodermal lineage. Genome Res, Aug 2010.

Thomas A Down, Vardhman K Rakyan, Daniel J Turner, Paul Flicek, HengLi, Eugene Kulesha, Stefan Graef, Nathan Johnson, Javier Herrero, Eleni MTomazou, Natalie P Thorne, Liselotte Baeckdahl, Marlis Herberth, Kevin LHowe, David K Jackson, Marcos M Miretti, John C Marioni, Ewan Birney,Tim J P Hubbard, Richard Durbin, Simon Tavare, and Stephan Beck. Abayesian deconvolution strategy for immunoprecipitation-based dna methy-lome analysis. Nat Biotechnol, 26(7):779–785, Jul 2008.

Florian Eckhardt, Joern Lewin, Rene Cortese, Vardhman K Rakyan, JohnAttwood, Matthias Burger, John Burton, Tony V Cox, Rob Davies, Thomas ADown, Carolina Haefliger, Roger Horton, Kevin Howe, David K Jackson,Jan Kunde, Christoph Koenig, Jennifer Liddle, David Niblett, Thomas Otto,Roger Pettett, Stefanie Seemann, Christian Thompson, Tony West, JaneRogers, Alex Olek, Kurt Berlin, and Stephan Beck. Dna methylation pro-filing of human chromosomes 6, 20 and 22. Nat Genet, 38(12):1378–1385, Dec2006.

Ryan Lister, Mattia Pelizzola, Robert H Dowen, R. David Hawkins, Gary Hon,Julian Tonti-Filippini, Joseph R Nery, Leonard Lee, Zhen Ye, Que-Minh Ngo,Lee Edsall, Jessica Antosiewicz-Bourget, Ron Stewart, Victor Ruotti, A. Har-vey Millar, James A Thomson, Bing Ren, and Joseph R Ecker. Human dnamethylomes at base resolution show widespread epigenomic differences. Na-ture, 462(7271):315–322, Nov 2009.

Mattia Pelizzola, Yasuo Koga, Alexander Eckehart Urban, Michael Krautham-mer, Sherman Weissman, Ruth Halaban, and Annette M Molinaro. Medme:an experimental and analytical methodology for the estimation of dna methy-lation levels based on microarray derived medip-enrichment. Genome Res, 18(10):1652–1659, Oct 2008.

Michael Weber, Jonathan J Davies, David Wittig, Edward J Oakeley, MichaelHaase, Wan L Lam, and Dirk Schuebeler. Chromosome-wide and promoter-specific analyses identify sites of differential dna methylation in normal andtransformed human cells. Nat Genet, 37(8):853–862, Aug 2005.

36

MEDIPS Tutorial - ICGEBnet.icgeb.org/.../Documentation/MEDIPS.pdfMEDIPS Tutorial Lukas Chavez...

Documents

Transcript of MEDIPS Tutorial - ICGEBnet.icgeb.org/.../Documentation/MEDIPS.pdfMEDIPS Tutorial Lukas Chavez...