Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

12
Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures Kelly Patrick Stanton 1 , Fabio Parisi 1 , Francesco Strino 1 , Neta Rabin 2 , Patrik Asp 3 and Yuval Kluger 1,4, * 1 Department of Pathology, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06520, USA, 2 Department of Exact Sciences, Afeka - Tel-Aviv Academic College of Engineering, Tel-Aviv 69107, Israel, 3 Department Of Liver Transplant, Montefiore Medical Center, Albert Einstein College of Medicine, Bronx, NY 10467, USA and 4 NYU Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY 10016, USA Received November 11, 2011; Revised June 21, 2013; Accepted June 24, 2013 ABSTRACT Researchers generating new genome-wide data in an exploratory sequencing study can gain biological insights by comparing their data with well- annotated data sets possessing similar genomic patterns. Data compression techniques are needed for efficient comparisons of a new genomic experi- ment with large repositories of publicly available profiles. Furthermore, data representations that allow comparisons of genomic signals from differ- ent platforms and across species enhance our ability to leverage these large repositories. Here, we present a signal processing approach that char- acterizes protein–chromatin interaction patterns at length scales of several kilobases. This allows us to efficiently compare numerous chromatin-immuno- precipitation sequencing (ChIP-seq) data sets con- sisting of many types of DNA-binding proteins collected from a variety of cells, conditions and or- ganisms. Importantly, these interaction patterns broadly reflect the biological properties of the binding events. To generate these profiles, termed Arpeggio profiles, we applied harmonic deconvolu- tion techniques to the autocorrelation profiles of the ChIP-seq signals. We used 806 publicly available ChIP-seq experiments and showed that Arpeggio profiles with similar spectral densities shared biolo- gical properties. Arpeggio profiles of ChIP-seq data sets revealed characteristics that are not easily de- tected by standard peak finders. They also allowed us to relate sequencing data sets from different genomes, experimental platforms and protocols. Arpeggio is freely available at http://sourceforge. net/p/arpeggio/wiki/Home/. INTRODUCTION The advent of automation and use of high-throughput sequencing techniques has brought a remarkable increase in the rate at which biological data sets are accumulated. Public repositories, such as the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra), and The Cancer Genome Atlas (http://cancergenome. nih.gov/) already store thousands of genome-wide data sets from a variety of cells and biological conditions. It is plausible that similarities of genomic profiles from dif- ferent experiments (e.g. binding profiles of transcription factors or transcriptomes of unique samples) are due to similar biological mechanisms, and theoretically it is there- fore possible to use existing repositories to explore un- charted relationships between a variety of genomic signals in various biological systems and conditions. For instance, it would be possible to gain biological insights relevant to a new genome-wide study by using exploratory unsupervised learning approaches that link newly generated data with well-annotated data sets possessing similar genomic patterns. Integration of new data with existing repositories in standard pipelines for sequence analysis is computationally challenging owing to the high dimensionality of each genomic profile and the massive storage size of these databases. Data com- pression techniques are needed for addressing these issues and can be incorporated into these pipelines to provide efficient comparisons of new genomic profiles to the large volume of publicly available profiles. Furthermore, data compression techniques that allow comparison of genomic signals from different platforms and across *To whom correspondence should be addressed. Tel: +1 203 737 6262; Fax:+1 203 785 6486; Email: [email protected] The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors. Published online 19 July 2013 Nucleic Acids Research, 2013, Vol. 41, No. 16 e161 doi:10.1093/nar/gkt627 ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Downloaded from https://academic.oup.com/nar/article-abstract/41/16/e161/2411628 by guest on 11 April 2018

Transcript of Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

Page 1: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

Arpeggio harmonic compression of ChIP-seq datareveals protein-chromatin interaction signaturesKelly Patrick Stanton1 Fabio Parisi1 Francesco Strino1 Neta Rabin2 Patrik Asp3 and

Yuval Kluger14

1Department of Pathology Yale University School of Medicine 333 Cedar Street New Haven CT 06520USA 2Department of Exact Sciences Afeka - Tel-Aviv Academic College of Engineering Tel-Aviv 69107 Israel3Department Of Liver Transplant Montefiore Medical Center Albert Einstein College of Medicine Bronx NY10467 USA and 4NYU Center for Health Informatics and Bioinformatics New York University Langone MedicalCenter 227 East 30th Street New York NY 10016 USA

Received November 11 2011 Revised June 21 2013 Accepted June 24 2013

ABSTRACT

Researchers generating new genome-wide data inan exploratory sequencing study can gain biologicalinsights by comparing their data with well-annotated data sets possessing similar genomicpatterns Data compression techniques are neededfor efficient comparisons of a new genomic experi-ment with large repositories of publicly availableprofiles Furthermore data representations thatallow comparisons of genomic signals from differ-ent platforms and across species enhance ourability to leverage these large repositories Herewe present a signal processing approach that char-acterizes proteinndashchromatin interaction patterns atlength scales of several kilobases This allows us toefficiently compare numerous chromatin-immuno-precipitation sequencing (ChIP-seq) data sets con-sisting of many types of DNA-binding proteinscollected from a variety of cells conditions and or-ganisms Importantly these interaction patternsbroadly reflect the biological properties of thebinding events To generate these profiles termedArpeggio profiles we applied harmonic deconvolu-tion techniques to the autocorrelation profiles of theChIP-seq signals We used 806 publicly availableChIP-seq experiments and showed that Arpeggioprofiles with similar spectral densities shared biolo-gical properties Arpeggio profiles of ChIP-seq datasets revealed characteristics that are not easily de-tected by standard peak finders They also allowedus to relate sequencing data sets from differentgenomes experimental platforms and protocols

Arpeggio is freely available at httpsourceforgenetparpeggiowikiHome

INTRODUCTION

The advent of automation and use of high-throughputsequencing techniques has brought a remarkableincrease in the rate at which biological data sets areaccumulated Public repositories such as the SequenceRead Archive (SRA httpwwwncbinlmnihgovsra)and The Cancer Genome Atlas (httpcancergenomenihgov) already store thousands of genome-wide datasets from a variety of cells and biological conditions Itis plausible that similarities of genomic profiles from dif-ferent experiments (eg binding profiles of transcriptionfactors or transcriptomes of unique samples) are due tosimilar biological mechanisms and theoretically it is there-fore possible to use existing repositories to explore un-charted relationships between a variety of genomicsignals in various biological systems and conditionsFor instance it would be possible to gain biological

insights relevant to a new genome-wide study by usingexploratory unsupervised learning approaches thatlink newly generated data with well-annotated data setspossessing similar genomic patterns Integration ofnew data with existing repositories in standard pipelinesfor sequence analysis is computationally challengingowing to the high dimensionality of each genomic profileand the massive storage size of these databases Data com-pression techniques are needed for addressing these issuesand can be incorporated into these pipelines to provideefficient comparisons of new genomic profiles to thelarge volume of publicly available profiles Furthermoredata compression techniques that allow comparison ofgenomic signals from different platforms and across

To whom correspondence should be addressed Tel +1 203 737 6262 Fax +1 203 785 6486 Email yuvalklugeryaleedu

The authors wish it to be known that in their opinion the first three authors should be regarded as joint First Authors

Published online 19 July 2013 Nucleic Acids Research 2013 Vol 41 No 16 e161doi101093nargkt627

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby30) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

species would enhance our ability to use large existingrepositoriesCompressing and organizing large sequencing archives

involves characterization of each experiment in terms ofits genomic features such as a list of peaks representingevents along the genome application of dissimilaritymeasures to determine pairwise affinities between thefeature vectors of each pair of experiments (eg overlapbetween two lists of peaks) clustering dimensional reduc-tion annotation and visualization of the collection ofthese feature vectors We usually assume that theunderlying cellular mechanisms captured by a pair of dis-similar feature vectors such as lists of binding sites of twodifferent DNA-binding proteins are different Howeverdata analysis of each type of sequencing experiment can bedone in numerous ways that affect the affinity betweenpairs of unique data sets The most obvious factors thatinfluence affinity include the preferred choice of featurespace and dissimilarity measures used In most sequencinganalyses practitioners tend to use standard feature spacesand common dissimilarity measures Specifically in RNA-seq analysis the feature space of an experiment comprisescounts of reads or sequenced fragments (eg RPKMs orFPKMs) of all genes and the similarity between two tran-scriptomes is characterized by standard correlationmeasures (12) in DNA-seq analysis of cancer samplesthe standard feature space includes point mutationsindels copy number alterations and translocations andsimilarities are evaluated using basic associationmeasures to determine prevalence (3) in chromatin-immunoprecipitation sequencing (ChIP-seq) experimentsbinding events are determined by peak detectors andsimilarities between two experiments are typicallyevaluated simply by the number or fraction of overlappingpeaks (4ndash6)ChIP-seq in particular has been widely used to unravel

transcriptional and epigenetic regulatory programs thatultimately determine the biological phenotype Thousandsof ChIP-seq experiments have already been collected bylarge community-wide efforts such as the ENCODEproject (78) pilot initiatives (4ndash69) and smaller projects(10ndash45) Application of computational approachesfor interrogating the genome-wide interactions betweenchromatin and proteins by high-throughput shortread sequencing of genomic DNA from ChIP-seq experi-ments can reveal certain aspects of the underlying biology(46ndash48)In the present study we use deconvolution to extract

the biological component that is indicative of distinctiveproteinndashchromatin interaction configurations from theautocorrelation profiles of ChIP-seq signals Weexplored the space of the Fourier transform of the auto-correlation profiles (spectral densities of the read coveragedistributions) using machine-learning approaches to char-acterize proteinndashchromatin interaction patterns at inter-mediate length scales of several kilobases and showed itsutility in the organization of large repositories of ChIP-seqdata These characteristic spectral density profiles allowedus to efficiently compare a large number of ChIP-seq datasets consisting of transcription factors epigenetic marksand other types of chromatin interacting proteins collected

from a variety of cell types conditions and organismsMoreover the deconvolved autocorrelation functionswhich we term Arpeggio profiles reflect the biologicalnature of proteinndashchromatin interactions such as eventsthat are locally isolated coordinated events or dynamic-ally flexible events

We used 806 publicly available ChIP-seq experimentsfrom several unrelated studies (458ndash45) and showedthat Arpeggio profiles with similar spectral densitiesshared biological properties Arpeggio profiles can bemodeled using a small number of parameters and thusare mappable to a low-dimensional space that capturesbiological aspects of the interaction between proteinsand chromatin This representation facilitates efficientindexing of databases and application of supervised un-supervised and inference methods to large repositories ofsequencing data comprising different genomes experimen-tal platforms and protocols We also show that ourapproach can be used to derive experimental and biologic-ally meaningful quantities such as fragment length distri-butions as well as the expected nucleosome spacing Ourresults suggest that harmonic analysis of ChIP-seq dataunravels signatures that are not easily captured bystandard computational means Analogously to catalog-ing cerebral activity with Electroencephalography (EEG)Arpeggio analysis efficiently locates a new sample in themap of differing proteinndashchromatin interaction states

MATERIALS AND METHODS

Data sets and preprocessing

DataWe analyzed 806 public ChIP-seq experiments from datasets obtained from the SRA (httpwwwncbinlmnihgovsra Supplementary Table S1) The analyzedproteins included transcription factors or histone modifi-cations from human mouse or fruit fly (SupplementaryFigure S1) In all experiments chromatin was cross-linkedand fragmented using either sonication or MNase diges-tion followed by immunoprecipitation using specificantibodies (Supplementary Table S2) The ChIP-seq proto-cols used in all these studies are transcribed verbatimand provided in Supplementary Table S3 Controlsconsist of high-throughput sequencing of immunoglobulinG immunoprecipitation (IP) or total DNA input

PreprocessingWith the exception of sequenced reads from the study byBarski et al (5) whose genome-wide alignment is providedby the authors all sequenced reads were mapped to theircorresponding reference genomes using the Bowtie aligner(49) with parameters lsquo-n2 -k1 -m1 ndashbest ndashstratarsquo corres-ponding to reporting unique alignments for each read withat most two mismatches

Autocorrelation and cross-correlation of ChIP-seq profiles

AutocorrelationGiven a short read sequencing sample consisting of a set Rof N aligned reads on the same strand r 2 R where r in-dicates the 50 end of an aligned read we defined the count

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 2 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

of all pairs of reads separated by a fixed distance of tnucleotides as

RethTHORN frac14XNifrac141

XNjfrac141

1 if ri rj frac14 0 otherwise

eth1THORN

RethTHORN is equivalent to the autocorrelation of the empiricalread coverage depth D(t) at each position t along thegenome

RethTHORN frac14P1

tfrac141DethtTHORNDetht+THORN

frac14 DethtTHORN DethtTHORNfrac12 ethTHORN

eth2THORN

where is the convolution operatorWe note that proper normalization of the read count

data is ambiguous (5051) therefore rather than the stat-istical definition of autocorrelation which is meancentered and normalized by variance we use the digitalsignal processing (DSP) definition of autocorrelation Thelatter does not require prior knowledge of the distributionshape To minimize the impact of PCR artifactsduplicated reads were only considered once

Cross-correlationGiven a short read sequencing sample consisting of a set Rof aligned reads r 2 R where r+ indicates the startingposition of an aligned read on the positive strand andr indicates the starting position of an aligned read onthe negative strand we defined the aggregated crossdistance between reads on opposing strands as

RethTHORN frac14XN+

ifrac141

XNjfrac141

1 if r+i rj frac14 0 otherwise

eth3THORN

where t is the offset in nucleotides and N+ and N are thenumber of reads on the positive and negative strand re-spectively This is equivalent to the cross-correlation of theempirical read coverage depth on the positive strand D+ethtTHORNwith the empirical read coverage depth on the negativestrand DethtTHORN at each position t along the genome

RethTHORN frac14P1

tfrac141D+ethtTHORNDetht+THORN

frac14 D+ethtTHORN DethtTHORNfrac12 ethTHORNeth4THORN

As in the case of autocorrelation duplicated reads wereonly considered once

Principal component analysisPrincipal Component Analysis (PCA) was applied to thecollection of Fourier transforms of the Arpeggio profilesOnly the real part of the transform should exist and toavoid numerical errors the negligible imaginary part wasdiscarded To avoid bias due to noise affecting length-scales below 40 bp we applied a low-pass filter to theFourier transform

DaviesndashBouldin indexThe data in our data set were annotated by several classvariables (eg Antibody target cell line cellular mechan-ism organism study ID) each consisting of multiple class

labels (eg for Antibody target histone 3 Lysine 27 tri-methylation (H3K27me3) H3K27me2 E2F4 etc) Foreach class variable a sample was assigned one and onlyone class label We assigned samples to clusters basedon class label and computed the DaviesndashBouldin index(52) as

DB frac141

n

Xnifrac141

maxji6frac14j

i+jkCi Cjk2

eth5THORN

where kCi Cjk2 was the Euclidean distance betweenthe two centroids Ci and Cj computed using the sixleading principal components n was the number ofclusters and i and j were the mean distances of thepoints in the i-th and j-th clusters from the cluster cen-troids Ci and Cj P-values were computed viabootstrapping (n=1000)

Class label aggregation for different data representationsFor each sample we tested whether proximity in a givendata representation is indicative of similar class label as-signments We selected a class variable (eg Antibodytarget orgnanism cellular mechanism) and for a givensample we constructed a binary class that designates aspositives other samples with the same class label and des-ignates all other samples as negatives Using this binaryclass label we computed the Area Under the ReceiverOperator Characteristic Curve (AUC) for the pairwise dis-tances of the given sample with all other samples (53) Asthe fraction of positive samples in close proximity to thegiven sample increases so does the AUCTo avoid sampling bias we considered only one repli-

cate at random for each group of replicated samples Oncea sample that determines the positive class label is chosenits replicates are excluded This AUC calculation wasrepeated 100 times (different replicate combinationschosen at random) for each sample and we computedthe sample-specific expected value for the AUC as wellas its standard deviation For all samples with the sameclass label we computed the average over these expectedAUCs and the average standard deviation for each classlabel similarly to standard approaches (54) Finally foreach class variable we reported the medians across allclass labels (eg human mouse and fly for the organismclass variable) for both the average expected AUC and itsaverage standard deviation We report the median ratherthan the mean because it is more robust to outliers and is abetter estimator of expected value for arbitrarydistributions

Supervised analyses

K-nearest neighbors classifiers (k=1) were trained toidentify the class labels in the class variables of interestTo avoid sampling bias we considered only one replicateat random for each group of replicated samples For eachclass variable we kept one sample for testing and per-formance was computed using the balanced accuracy(accuracy for each class label averaged over all labels)(55) This procedure was repeated for all samples andP-values were computed via bootstrapping (n=100)

PAGE 3 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Statistical analyses and software

All analyses were performed using our Java-based softwarepackage and the R statistical software (56) Our Arpeggiosoftware can be used to download data from the SRAmap reads to reference genomes compute autocorrelationand Arpeggio profiles An R script is also available togenerate and plot Arpeggio profiles The Arpeggiosoftware suite is freely available at httpsourceforgenetparpeggiowikiHome together with a detailedtutorial

RESULTS

Spectral density of ChIP-seq signals

To enhance our understanding of a new ChIP-seq sampleit is often beneficial to relate it to relevant ChIP-seq ex-periments in public repositories Here we relate experi-ments based on their signal proximity and therefore anappropriate distance metric is neededA naıve comparison of ChIP-seq experiments can be

done by measuring the distance between their coveragedepth graphs The genome consists of a large number ofgenomic positions (3 billion for the human genome)from which one can theoretically sample reads Thispairwise comparison is not efficient for large data setsand the dimension may be too large to identify neighbors(57) In addition at single nucleotide resolution meaning-ful patterns can be obscured by noise Standard compari-sons between pairs of ChIP-seq experiments arecommonly done by evaluating the overlap betweenpeaks detected in these experiments (458) Intuitivelythe union of peaks from a large data sets which isneeded to generate pairwise distances produces manygenomic intervals and is thus high dimensional Toquickly relate a given ChIP-seq sample to relevant experi-ments we sought to design a data representation thatcaptures the underlying biology is easy to compute canbe expressed by a relatively small number of dimensionsand finally is robust to suboptimal read coverageWe leveraged two important characteristics of ChIP-seq

data to create this low-dimensional representation Firstreads are localized in islands surrounding the interactingproteins (eg factors histones or polymerases) that weretargeted by the antibody (58) We therefore examined thesystem at intermediate genomic length scales Specificallywe consider the autocorrelation function of the readcoverage depth This function captures recurrent eventsalong the genome and aggregates this information to sig-natures of read co-occurrence at specific length scales orlags (Supplementary Figures S2ndashS4) As a consequence ofthe localized nature of ChIP-seq the autocorrelationfunction exhibits regular non-random interactions withina relatively small offset 4095 bp 4096 bp afterwhich it is uninformative Second read count data arestochastic and typically undersampled resulting inspikey noise that obscures signals occurring at nucleotideresolution (Supplementary Figure S5) At long lengthscales far beyond the length of proteinndashchromatin inter-action islands and also at short length scales on the order

of nucleotides the signal is dominated by noise and there-fore we set out to capture the signal at intermediate lengthscales where the signal-to-noise ratio is the highest To thisaim we applied a Fast Fourier Transform to the autocor-relation function for all lags 4095 bp 4096 bp (seelsquoMaterials and Methodsrsquo section) resulting in the spectraldensity of the empirical read coverage distribution AethTHORNas a function of the resolution o

The autocorrelation is restricted to a fixed range(4095 bp 4096 bp) and thus the associatedspectral density is fast and easy to compute and it isconstructed as the histogram of pairwise distancesbetween all N reads (see lsquoMaterials and Methodsrsquosection) This approach enables a large coverage foreach lag (t) in the autocorrelation function

Extraction of the IP signal using deconvolution

The observed autocorrelation signal consists of a biolo-gical component relevant to the ChIP-seq experimentmodulated on a component capturing irrelevant pro-perties such as DNA accessibility and experimental biasWe therefore designed an approach to deconvolve thecomponent of the signal that captures the biologicalaspects of the experiment

We formulated the problem using a DSP approach TheChIP-independent properties of the signal are denotedtechnical variability X(t) and arise from technical biasesand the stochastic nature of read count data We denoteby Z(t) the true IP signal which reflects effects associatedwith experiment-specific components eg antibody pre-cipitation cross-linking of chromatin to other proteinsin the same complex We therefore model the measuredChIP signal Y(t) as the convolution of the true IP signalwith the technical variability frac12ZethtTHORN XethtTHORNethTHORN For brevitywe will omit ethTHORN in the equations when clear from thecontext

In the DSP framework X(t) is the input Z(t) is the finiteimpulse response function associated with the specific bio-logical signal and Y(t) is the observed output To recoverthe specific finite impulse response and remove ChIP-independent components we used harmonic analysis tech-niques that are commonly used in engineering disciplines(59) In harmonic analysis signals are represented as thesum of characteristic harmonic components (ie sinus-oidal functions each with a specific period and phase)In this formulation if the technical variability X(t) isknown then applying the convolution theorem it ispossible to recover the true IP signal Z(t) from themeasured signal Y(t) consequently if the autocorrelationXethtTHORN XethtTHORN is known then it is possible to recover theautocorrelation of the true IP signal ZethtTHORN ZethtTHORN

frac12ZethtTHORN ZethtTHORNethTHORN frac14 F1 FethYethtTHORNYethtTHORNTHORNF ethXethtTHORNXethtTHORNTHORN

frac14 F1FethYethTHORNTHORNF ethXethTHORNTHORN

eth6THORN

where F is the Fourier transform operator F1 is itsinverse YethtTHORN frac14 XethtTHORN ZethtTHORN and YethTHORN frac14 YethtTHORN YethtTHORN andXethTHORN frac14 XethtTHORN XethtTHORN are the autocorrelations of theChIP-seq signal Y and of the control X respectively

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 4 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

We named the recovered autocorrelation of the true IPsignal ZethtTHORN ZethtTHORN the Arpeggio profile (Figure 1)

In studies where multiple controls are available or nocontrols are available we matched to each ChIP-seq ex-periment the control that is closest in terms of its spectralproperties We applied the same control matching proced-ure to the rest of the samples and found that most samplesmatched controls done in the same study or cell line Thiscontrol matching procedure is a conservative approach inwhich we try to identify the parts of the Y spectra that aresignificantly distinct from the X spectra and thus increasesspecificity In the extreme case where the autocorrelationof the measured ChIP signal YethTHORN and of the technicalvariability XethTHORN have proportional spectral densities iethey are linearly correlated their ratio will be constant andtheir deconvolution is an impulse indicating that there isno true IP signal (Figure 2)

We matched controls to experiments from the pool ofcontrols in our data set by first matching the organism andDNA-shearing technique and selected the control with thehighest correlation (Pearsonrsquos ) to the ChIP experimentin the base resolution domain (frequency domain ieafter applying the Fourier transform) We also requirethat gt 085 We note that this leads to the highest speci-ficity in the context of our spectral analysis We recall thatcorrelation in the resolution domain does not imply cor-relation in the genomic co-ordinates domain The value ofthe Arpeggio profile at frac14 0 reflects differences in readcount between experiment and control From the values of frac14 0 recorded in our data set we concluded that noexperiment control pair had a perfectly matching readcount In general this did not significantly affect ourability to recover the autocorrelation of the true IP signal

However for 31 samples of 806 in our data set wecould not identify any matching control Of the remaining775 there was also a small fraction of experiments forwhich the resulting Arpeggio profile exhibited several arti-facts suggesting poorly matched controls In particularwhen the autocorrelation of the control had a differentdecay rate than that of the experiment we observed dipsaround frac14 0 or gradual rises or falls as opposed toleveling out at greater values of jj (SupplementaryFigure S6) This difference in decay rate is likely due totechnical variability dependent on read counts Howeverthere may be biological components to it as well eg if apolymerase binds and then moves along the genome it isreasonable to believe that the reads will be more spreadthan a transcription factor that binds to a single locationIt would have been desirable to extract the true IP signal

Z(t) from ZethtTHORN ZethtTHORN directly by square root in the reso-lution domain However in general FethZethtTHORNTHORN is not entirelypositive or real thus there is no unique solution for Z(t)because the phase information is lost in the autocorrel-ation operation In practice the Arpeggio profilesZethtTHORN ZethtTHORN are sufficient for the purpose of comparingChIP-seq experiments and can also be used to examine thespectral density as a function of base resolution

Recovering the length distribution of ChIP-seq fragments

The fragment length distribution is an important param-eter for algorithms that seeks to identify binding eventlocations from short read experiments such as ChIP-seq

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

HNF4APolIIDNA Input

Figure 2 Examples of the deconvolved Arpeggio profiles for a tran-scription factor PolII and a total DNA input sample The transcriptionfactor HNF4A shows a strong spike followed by a steep decayindicating isolated binding and PolII shows an isolated pulsefollowed by gradual decay The isolated pulse for total DNA inputindicates that sequenced reads provide no additional informationbeyond technical variability ethXethtTHORNTHORN The DNA input Arpeggio was con-structed using the samplersquos best matched control As in the previousfigure the read count difference between the sample and the matchedcontrol at frac14 0 was removed and smoothing was applied to aid invisualization

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

H3K36me3H3K27me3H3K27AcpRB

Figure 1 Examples of the deconvolved Arpeggio profiles for histonemarks H3K27me3 Histone 3 lysine 36 tri-methylation (H3K36me3)Histone 3 lysine 27 acetylation (H3K27Ac) and for retinoblastoma(pRB) The H3K27me3 and H3K36me3 profiles show periodicity of190 bp consistent with previous reports of nucleosome frequency(60) The value representing the total read count difference betweenChIP and control at frac14 0 was removed and smoothing was appliedto aid in visualization

PAGE 5 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 2: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

species would enhance our ability to use large existingrepositoriesCompressing and organizing large sequencing archives

involves characterization of each experiment in terms ofits genomic features such as a list of peaks representingevents along the genome application of dissimilaritymeasures to determine pairwise affinities between thefeature vectors of each pair of experiments (eg overlapbetween two lists of peaks) clustering dimensional reduc-tion annotation and visualization of the collection ofthese feature vectors We usually assume that theunderlying cellular mechanisms captured by a pair of dis-similar feature vectors such as lists of binding sites of twodifferent DNA-binding proteins are different Howeverdata analysis of each type of sequencing experiment can bedone in numerous ways that affect the affinity betweenpairs of unique data sets The most obvious factors thatinfluence affinity include the preferred choice of featurespace and dissimilarity measures used In most sequencinganalyses practitioners tend to use standard feature spacesand common dissimilarity measures Specifically in RNA-seq analysis the feature space of an experiment comprisescounts of reads or sequenced fragments (eg RPKMs orFPKMs) of all genes and the similarity between two tran-scriptomes is characterized by standard correlationmeasures (12) in DNA-seq analysis of cancer samplesthe standard feature space includes point mutationsindels copy number alterations and translocations andsimilarities are evaluated using basic associationmeasures to determine prevalence (3) in chromatin-immunoprecipitation sequencing (ChIP-seq) experimentsbinding events are determined by peak detectors andsimilarities between two experiments are typicallyevaluated simply by the number or fraction of overlappingpeaks (4ndash6)ChIP-seq in particular has been widely used to unravel

transcriptional and epigenetic regulatory programs thatultimately determine the biological phenotype Thousandsof ChIP-seq experiments have already been collected bylarge community-wide efforts such as the ENCODEproject (78) pilot initiatives (4ndash69) and smaller projects(10ndash45) Application of computational approachesfor interrogating the genome-wide interactions betweenchromatin and proteins by high-throughput shortread sequencing of genomic DNA from ChIP-seq experi-ments can reveal certain aspects of the underlying biology(46ndash48)In the present study we use deconvolution to extract

the biological component that is indicative of distinctiveproteinndashchromatin interaction configurations from theautocorrelation profiles of ChIP-seq signals Weexplored the space of the Fourier transform of the auto-correlation profiles (spectral densities of the read coveragedistributions) using machine-learning approaches to char-acterize proteinndashchromatin interaction patterns at inter-mediate length scales of several kilobases and showed itsutility in the organization of large repositories of ChIP-seqdata These characteristic spectral density profiles allowedus to efficiently compare a large number of ChIP-seq datasets consisting of transcription factors epigenetic marksand other types of chromatin interacting proteins collected

from a variety of cell types conditions and organismsMoreover the deconvolved autocorrelation functionswhich we term Arpeggio profiles reflect the biologicalnature of proteinndashchromatin interactions such as eventsthat are locally isolated coordinated events or dynamic-ally flexible events

We used 806 publicly available ChIP-seq experimentsfrom several unrelated studies (458ndash45) and showedthat Arpeggio profiles with similar spectral densitiesshared biological properties Arpeggio profiles can bemodeled using a small number of parameters and thusare mappable to a low-dimensional space that capturesbiological aspects of the interaction between proteinsand chromatin This representation facilitates efficientindexing of databases and application of supervised un-supervised and inference methods to large repositories ofsequencing data comprising different genomes experimen-tal platforms and protocols We also show that ourapproach can be used to derive experimental and biologic-ally meaningful quantities such as fragment length distri-butions as well as the expected nucleosome spacing Ourresults suggest that harmonic analysis of ChIP-seq dataunravels signatures that are not easily captured bystandard computational means Analogously to catalog-ing cerebral activity with Electroencephalography (EEG)Arpeggio analysis efficiently locates a new sample in themap of differing proteinndashchromatin interaction states

MATERIALS AND METHODS

Data sets and preprocessing

DataWe analyzed 806 public ChIP-seq experiments from datasets obtained from the SRA (httpwwwncbinlmnihgovsra Supplementary Table S1) The analyzedproteins included transcription factors or histone modifi-cations from human mouse or fruit fly (SupplementaryFigure S1) In all experiments chromatin was cross-linkedand fragmented using either sonication or MNase diges-tion followed by immunoprecipitation using specificantibodies (Supplementary Table S2) The ChIP-seq proto-cols used in all these studies are transcribed verbatimand provided in Supplementary Table S3 Controlsconsist of high-throughput sequencing of immunoglobulinG immunoprecipitation (IP) or total DNA input

PreprocessingWith the exception of sequenced reads from the study byBarski et al (5) whose genome-wide alignment is providedby the authors all sequenced reads were mapped to theircorresponding reference genomes using the Bowtie aligner(49) with parameters lsquo-n2 -k1 -m1 ndashbest ndashstratarsquo corres-ponding to reporting unique alignments for each read withat most two mismatches

Autocorrelation and cross-correlation of ChIP-seq profiles

AutocorrelationGiven a short read sequencing sample consisting of a set Rof N aligned reads on the same strand r 2 R where r in-dicates the 50 end of an aligned read we defined the count

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 2 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

of all pairs of reads separated by a fixed distance of tnucleotides as

RethTHORN frac14XNifrac141

XNjfrac141

1 if ri rj frac14 0 otherwise

eth1THORN

RethTHORN is equivalent to the autocorrelation of the empiricalread coverage depth D(t) at each position t along thegenome

RethTHORN frac14P1

tfrac141DethtTHORNDetht+THORN

frac14 DethtTHORN DethtTHORNfrac12 ethTHORN

eth2THORN

where is the convolution operatorWe note that proper normalization of the read count

data is ambiguous (5051) therefore rather than the stat-istical definition of autocorrelation which is meancentered and normalized by variance we use the digitalsignal processing (DSP) definition of autocorrelation Thelatter does not require prior knowledge of the distributionshape To minimize the impact of PCR artifactsduplicated reads were only considered once

Cross-correlationGiven a short read sequencing sample consisting of a set Rof aligned reads r 2 R where r+ indicates the startingposition of an aligned read on the positive strand andr indicates the starting position of an aligned read onthe negative strand we defined the aggregated crossdistance between reads on opposing strands as

RethTHORN frac14XN+

ifrac141

XNjfrac141

1 if r+i rj frac14 0 otherwise

eth3THORN

where t is the offset in nucleotides and N+ and N are thenumber of reads on the positive and negative strand re-spectively This is equivalent to the cross-correlation of theempirical read coverage depth on the positive strand D+ethtTHORNwith the empirical read coverage depth on the negativestrand DethtTHORN at each position t along the genome

RethTHORN frac14P1

tfrac141D+ethtTHORNDetht+THORN

frac14 D+ethtTHORN DethtTHORNfrac12 ethTHORNeth4THORN

As in the case of autocorrelation duplicated reads wereonly considered once

Principal component analysisPrincipal Component Analysis (PCA) was applied to thecollection of Fourier transforms of the Arpeggio profilesOnly the real part of the transform should exist and toavoid numerical errors the negligible imaginary part wasdiscarded To avoid bias due to noise affecting length-scales below 40 bp we applied a low-pass filter to theFourier transform

DaviesndashBouldin indexThe data in our data set were annotated by several classvariables (eg Antibody target cell line cellular mechan-ism organism study ID) each consisting of multiple class

labels (eg for Antibody target histone 3 Lysine 27 tri-methylation (H3K27me3) H3K27me2 E2F4 etc) Foreach class variable a sample was assigned one and onlyone class label We assigned samples to clusters basedon class label and computed the DaviesndashBouldin index(52) as

DB frac141

n

Xnifrac141

maxji6frac14j

i+jkCi Cjk2

eth5THORN

where kCi Cjk2 was the Euclidean distance betweenthe two centroids Ci and Cj computed using the sixleading principal components n was the number ofclusters and i and j were the mean distances of thepoints in the i-th and j-th clusters from the cluster cen-troids Ci and Cj P-values were computed viabootstrapping (n=1000)

Class label aggregation for different data representationsFor each sample we tested whether proximity in a givendata representation is indicative of similar class label as-signments We selected a class variable (eg Antibodytarget orgnanism cellular mechanism) and for a givensample we constructed a binary class that designates aspositives other samples with the same class label and des-ignates all other samples as negatives Using this binaryclass label we computed the Area Under the ReceiverOperator Characteristic Curve (AUC) for the pairwise dis-tances of the given sample with all other samples (53) Asthe fraction of positive samples in close proximity to thegiven sample increases so does the AUCTo avoid sampling bias we considered only one repli-

cate at random for each group of replicated samples Oncea sample that determines the positive class label is chosenits replicates are excluded This AUC calculation wasrepeated 100 times (different replicate combinationschosen at random) for each sample and we computedthe sample-specific expected value for the AUC as wellas its standard deviation For all samples with the sameclass label we computed the average over these expectedAUCs and the average standard deviation for each classlabel similarly to standard approaches (54) Finally foreach class variable we reported the medians across allclass labels (eg human mouse and fly for the organismclass variable) for both the average expected AUC and itsaverage standard deviation We report the median ratherthan the mean because it is more robust to outliers and is abetter estimator of expected value for arbitrarydistributions

Supervised analyses

K-nearest neighbors classifiers (k=1) were trained toidentify the class labels in the class variables of interestTo avoid sampling bias we considered only one replicateat random for each group of replicated samples For eachclass variable we kept one sample for testing and per-formance was computed using the balanced accuracy(accuracy for each class label averaged over all labels)(55) This procedure was repeated for all samples andP-values were computed via bootstrapping (n=100)

PAGE 3 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Statistical analyses and software

All analyses were performed using our Java-based softwarepackage and the R statistical software (56) Our Arpeggiosoftware can be used to download data from the SRAmap reads to reference genomes compute autocorrelationand Arpeggio profiles An R script is also available togenerate and plot Arpeggio profiles The Arpeggiosoftware suite is freely available at httpsourceforgenetparpeggiowikiHome together with a detailedtutorial

RESULTS

Spectral density of ChIP-seq signals

To enhance our understanding of a new ChIP-seq sampleit is often beneficial to relate it to relevant ChIP-seq ex-periments in public repositories Here we relate experi-ments based on their signal proximity and therefore anappropriate distance metric is neededA naıve comparison of ChIP-seq experiments can be

done by measuring the distance between their coveragedepth graphs The genome consists of a large number ofgenomic positions (3 billion for the human genome)from which one can theoretically sample reads Thispairwise comparison is not efficient for large data setsand the dimension may be too large to identify neighbors(57) In addition at single nucleotide resolution meaning-ful patterns can be obscured by noise Standard compari-sons between pairs of ChIP-seq experiments arecommonly done by evaluating the overlap betweenpeaks detected in these experiments (458) Intuitivelythe union of peaks from a large data sets which isneeded to generate pairwise distances produces manygenomic intervals and is thus high dimensional Toquickly relate a given ChIP-seq sample to relevant experi-ments we sought to design a data representation thatcaptures the underlying biology is easy to compute canbe expressed by a relatively small number of dimensionsand finally is robust to suboptimal read coverageWe leveraged two important characteristics of ChIP-seq

data to create this low-dimensional representation Firstreads are localized in islands surrounding the interactingproteins (eg factors histones or polymerases) that weretargeted by the antibody (58) We therefore examined thesystem at intermediate genomic length scales Specificallywe consider the autocorrelation function of the readcoverage depth This function captures recurrent eventsalong the genome and aggregates this information to sig-natures of read co-occurrence at specific length scales orlags (Supplementary Figures S2ndashS4) As a consequence ofthe localized nature of ChIP-seq the autocorrelationfunction exhibits regular non-random interactions withina relatively small offset 4095 bp 4096 bp afterwhich it is uninformative Second read count data arestochastic and typically undersampled resulting inspikey noise that obscures signals occurring at nucleotideresolution (Supplementary Figure S5) At long lengthscales far beyond the length of proteinndashchromatin inter-action islands and also at short length scales on the order

of nucleotides the signal is dominated by noise and there-fore we set out to capture the signal at intermediate lengthscales where the signal-to-noise ratio is the highest To thisaim we applied a Fast Fourier Transform to the autocor-relation function for all lags 4095 bp 4096 bp (seelsquoMaterials and Methodsrsquo section) resulting in the spectraldensity of the empirical read coverage distribution AethTHORNas a function of the resolution o

The autocorrelation is restricted to a fixed range(4095 bp 4096 bp) and thus the associatedspectral density is fast and easy to compute and it isconstructed as the histogram of pairwise distancesbetween all N reads (see lsquoMaterials and Methodsrsquosection) This approach enables a large coverage foreach lag (t) in the autocorrelation function

Extraction of the IP signal using deconvolution

The observed autocorrelation signal consists of a biolo-gical component relevant to the ChIP-seq experimentmodulated on a component capturing irrelevant pro-perties such as DNA accessibility and experimental biasWe therefore designed an approach to deconvolve thecomponent of the signal that captures the biologicalaspects of the experiment

We formulated the problem using a DSP approach TheChIP-independent properties of the signal are denotedtechnical variability X(t) and arise from technical biasesand the stochastic nature of read count data We denoteby Z(t) the true IP signal which reflects effects associatedwith experiment-specific components eg antibody pre-cipitation cross-linking of chromatin to other proteinsin the same complex We therefore model the measuredChIP signal Y(t) as the convolution of the true IP signalwith the technical variability frac12ZethtTHORN XethtTHORNethTHORN For brevitywe will omit ethTHORN in the equations when clear from thecontext

In the DSP framework X(t) is the input Z(t) is the finiteimpulse response function associated with the specific bio-logical signal and Y(t) is the observed output To recoverthe specific finite impulse response and remove ChIP-independent components we used harmonic analysis tech-niques that are commonly used in engineering disciplines(59) In harmonic analysis signals are represented as thesum of characteristic harmonic components (ie sinus-oidal functions each with a specific period and phase)In this formulation if the technical variability X(t) isknown then applying the convolution theorem it ispossible to recover the true IP signal Z(t) from themeasured signal Y(t) consequently if the autocorrelationXethtTHORN XethtTHORN is known then it is possible to recover theautocorrelation of the true IP signal ZethtTHORN ZethtTHORN

frac12ZethtTHORN ZethtTHORNethTHORN frac14 F1 FethYethtTHORNYethtTHORNTHORNF ethXethtTHORNXethtTHORNTHORN

frac14 F1FethYethTHORNTHORNF ethXethTHORNTHORN

eth6THORN

where F is the Fourier transform operator F1 is itsinverse YethtTHORN frac14 XethtTHORN ZethtTHORN and YethTHORN frac14 YethtTHORN YethtTHORN andXethTHORN frac14 XethtTHORN XethtTHORN are the autocorrelations of theChIP-seq signal Y and of the control X respectively

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 4 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

We named the recovered autocorrelation of the true IPsignal ZethtTHORN ZethtTHORN the Arpeggio profile (Figure 1)

In studies where multiple controls are available or nocontrols are available we matched to each ChIP-seq ex-periment the control that is closest in terms of its spectralproperties We applied the same control matching proced-ure to the rest of the samples and found that most samplesmatched controls done in the same study or cell line Thiscontrol matching procedure is a conservative approach inwhich we try to identify the parts of the Y spectra that aresignificantly distinct from the X spectra and thus increasesspecificity In the extreme case where the autocorrelationof the measured ChIP signal YethTHORN and of the technicalvariability XethTHORN have proportional spectral densities iethey are linearly correlated their ratio will be constant andtheir deconvolution is an impulse indicating that there isno true IP signal (Figure 2)

We matched controls to experiments from the pool ofcontrols in our data set by first matching the organism andDNA-shearing technique and selected the control with thehighest correlation (Pearsonrsquos ) to the ChIP experimentin the base resolution domain (frequency domain ieafter applying the Fourier transform) We also requirethat gt 085 We note that this leads to the highest speci-ficity in the context of our spectral analysis We recall thatcorrelation in the resolution domain does not imply cor-relation in the genomic co-ordinates domain The value ofthe Arpeggio profile at frac14 0 reflects differences in readcount between experiment and control From the values of frac14 0 recorded in our data set we concluded that noexperiment control pair had a perfectly matching readcount In general this did not significantly affect ourability to recover the autocorrelation of the true IP signal

However for 31 samples of 806 in our data set wecould not identify any matching control Of the remaining775 there was also a small fraction of experiments forwhich the resulting Arpeggio profile exhibited several arti-facts suggesting poorly matched controls In particularwhen the autocorrelation of the control had a differentdecay rate than that of the experiment we observed dipsaround frac14 0 or gradual rises or falls as opposed toleveling out at greater values of jj (SupplementaryFigure S6) This difference in decay rate is likely due totechnical variability dependent on read counts Howeverthere may be biological components to it as well eg if apolymerase binds and then moves along the genome it isreasonable to believe that the reads will be more spreadthan a transcription factor that binds to a single locationIt would have been desirable to extract the true IP signal

Z(t) from ZethtTHORN ZethtTHORN directly by square root in the reso-lution domain However in general FethZethtTHORNTHORN is not entirelypositive or real thus there is no unique solution for Z(t)because the phase information is lost in the autocorrel-ation operation In practice the Arpeggio profilesZethtTHORN ZethtTHORN are sufficient for the purpose of comparingChIP-seq experiments and can also be used to examine thespectral density as a function of base resolution

Recovering the length distribution of ChIP-seq fragments

The fragment length distribution is an important param-eter for algorithms that seeks to identify binding eventlocations from short read experiments such as ChIP-seq

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

HNF4APolIIDNA Input

Figure 2 Examples of the deconvolved Arpeggio profiles for a tran-scription factor PolII and a total DNA input sample The transcriptionfactor HNF4A shows a strong spike followed by a steep decayindicating isolated binding and PolII shows an isolated pulsefollowed by gradual decay The isolated pulse for total DNA inputindicates that sequenced reads provide no additional informationbeyond technical variability ethXethtTHORNTHORN The DNA input Arpeggio was con-structed using the samplersquos best matched control As in the previousfigure the read count difference between the sample and the matchedcontrol at frac14 0 was removed and smoothing was applied to aid invisualization

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

H3K36me3H3K27me3H3K27AcpRB

Figure 1 Examples of the deconvolved Arpeggio profiles for histonemarks H3K27me3 Histone 3 lysine 36 tri-methylation (H3K36me3)Histone 3 lysine 27 acetylation (H3K27Ac) and for retinoblastoma(pRB) The H3K27me3 and H3K36me3 profiles show periodicity of190 bp consistent with previous reports of nucleosome frequency(60) The value representing the total read count difference betweenChIP and control at frac14 0 was removed and smoothing was appliedto aid in visualization

PAGE 5 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 3: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

of all pairs of reads separated by a fixed distance of tnucleotides as

RethTHORN frac14XNifrac141

XNjfrac141

1 if ri rj frac14 0 otherwise

eth1THORN

RethTHORN is equivalent to the autocorrelation of the empiricalread coverage depth D(t) at each position t along thegenome

RethTHORN frac14P1

tfrac141DethtTHORNDetht+THORN

frac14 DethtTHORN DethtTHORNfrac12 ethTHORN

eth2THORN

where is the convolution operatorWe note that proper normalization of the read count

data is ambiguous (5051) therefore rather than the stat-istical definition of autocorrelation which is meancentered and normalized by variance we use the digitalsignal processing (DSP) definition of autocorrelation Thelatter does not require prior knowledge of the distributionshape To minimize the impact of PCR artifactsduplicated reads were only considered once

Cross-correlationGiven a short read sequencing sample consisting of a set Rof aligned reads r 2 R where r+ indicates the startingposition of an aligned read on the positive strand andr indicates the starting position of an aligned read onthe negative strand we defined the aggregated crossdistance between reads on opposing strands as

RethTHORN frac14XN+

ifrac141

XNjfrac141

1 if r+i rj frac14 0 otherwise

eth3THORN

where t is the offset in nucleotides and N+ and N are thenumber of reads on the positive and negative strand re-spectively This is equivalent to the cross-correlation of theempirical read coverage depth on the positive strand D+ethtTHORNwith the empirical read coverage depth on the negativestrand DethtTHORN at each position t along the genome

RethTHORN frac14P1

tfrac141D+ethtTHORNDetht+THORN

frac14 D+ethtTHORN DethtTHORNfrac12 ethTHORNeth4THORN

As in the case of autocorrelation duplicated reads wereonly considered once

Principal component analysisPrincipal Component Analysis (PCA) was applied to thecollection of Fourier transforms of the Arpeggio profilesOnly the real part of the transform should exist and toavoid numerical errors the negligible imaginary part wasdiscarded To avoid bias due to noise affecting length-scales below 40 bp we applied a low-pass filter to theFourier transform

DaviesndashBouldin indexThe data in our data set were annotated by several classvariables (eg Antibody target cell line cellular mechan-ism organism study ID) each consisting of multiple class

labels (eg for Antibody target histone 3 Lysine 27 tri-methylation (H3K27me3) H3K27me2 E2F4 etc) Foreach class variable a sample was assigned one and onlyone class label We assigned samples to clusters basedon class label and computed the DaviesndashBouldin index(52) as

DB frac141

n

Xnifrac141

maxji6frac14j

i+jkCi Cjk2

eth5THORN

where kCi Cjk2 was the Euclidean distance betweenthe two centroids Ci and Cj computed using the sixleading principal components n was the number ofclusters and i and j were the mean distances of thepoints in the i-th and j-th clusters from the cluster cen-troids Ci and Cj P-values were computed viabootstrapping (n=1000)

Class label aggregation for different data representationsFor each sample we tested whether proximity in a givendata representation is indicative of similar class label as-signments We selected a class variable (eg Antibodytarget orgnanism cellular mechanism) and for a givensample we constructed a binary class that designates aspositives other samples with the same class label and des-ignates all other samples as negatives Using this binaryclass label we computed the Area Under the ReceiverOperator Characteristic Curve (AUC) for the pairwise dis-tances of the given sample with all other samples (53) Asthe fraction of positive samples in close proximity to thegiven sample increases so does the AUCTo avoid sampling bias we considered only one repli-

cate at random for each group of replicated samples Oncea sample that determines the positive class label is chosenits replicates are excluded This AUC calculation wasrepeated 100 times (different replicate combinationschosen at random) for each sample and we computedthe sample-specific expected value for the AUC as wellas its standard deviation For all samples with the sameclass label we computed the average over these expectedAUCs and the average standard deviation for each classlabel similarly to standard approaches (54) Finally foreach class variable we reported the medians across allclass labels (eg human mouse and fly for the organismclass variable) for both the average expected AUC and itsaverage standard deviation We report the median ratherthan the mean because it is more robust to outliers and is abetter estimator of expected value for arbitrarydistributions

Supervised analyses

K-nearest neighbors classifiers (k=1) were trained toidentify the class labels in the class variables of interestTo avoid sampling bias we considered only one replicateat random for each group of replicated samples For eachclass variable we kept one sample for testing and per-formance was computed using the balanced accuracy(accuracy for each class label averaged over all labels)(55) This procedure was repeated for all samples andP-values were computed via bootstrapping (n=100)

PAGE 3 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Statistical analyses and software

All analyses were performed using our Java-based softwarepackage and the R statistical software (56) Our Arpeggiosoftware can be used to download data from the SRAmap reads to reference genomes compute autocorrelationand Arpeggio profiles An R script is also available togenerate and plot Arpeggio profiles The Arpeggiosoftware suite is freely available at httpsourceforgenetparpeggiowikiHome together with a detailedtutorial

RESULTS

Spectral density of ChIP-seq signals

To enhance our understanding of a new ChIP-seq sampleit is often beneficial to relate it to relevant ChIP-seq ex-periments in public repositories Here we relate experi-ments based on their signal proximity and therefore anappropriate distance metric is neededA naıve comparison of ChIP-seq experiments can be

done by measuring the distance between their coveragedepth graphs The genome consists of a large number ofgenomic positions (3 billion for the human genome)from which one can theoretically sample reads Thispairwise comparison is not efficient for large data setsand the dimension may be too large to identify neighbors(57) In addition at single nucleotide resolution meaning-ful patterns can be obscured by noise Standard compari-sons between pairs of ChIP-seq experiments arecommonly done by evaluating the overlap betweenpeaks detected in these experiments (458) Intuitivelythe union of peaks from a large data sets which isneeded to generate pairwise distances produces manygenomic intervals and is thus high dimensional Toquickly relate a given ChIP-seq sample to relevant experi-ments we sought to design a data representation thatcaptures the underlying biology is easy to compute canbe expressed by a relatively small number of dimensionsand finally is robust to suboptimal read coverageWe leveraged two important characteristics of ChIP-seq

data to create this low-dimensional representation Firstreads are localized in islands surrounding the interactingproteins (eg factors histones or polymerases) that weretargeted by the antibody (58) We therefore examined thesystem at intermediate genomic length scales Specificallywe consider the autocorrelation function of the readcoverage depth This function captures recurrent eventsalong the genome and aggregates this information to sig-natures of read co-occurrence at specific length scales orlags (Supplementary Figures S2ndashS4) As a consequence ofthe localized nature of ChIP-seq the autocorrelationfunction exhibits regular non-random interactions withina relatively small offset 4095 bp 4096 bp afterwhich it is uninformative Second read count data arestochastic and typically undersampled resulting inspikey noise that obscures signals occurring at nucleotideresolution (Supplementary Figure S5) At long lengthscales far beyond the length of proteinndashchromatin inter-action islands and also at short length scales on the order

of nucleotides the signal is dominated by noise and there-fore we set out to capture the signal at intermediate lengthscales where the signal-to-noise ratio is the highest To thisaim we applied a Fast Fourier Transform to the autocor-relation function for all lags 4095 bp 4096 bp (seelsquoMaterials and Methodsrsquo section) resulting in the spectraldensity of the empirical read coverage distribution AethTHORNas a function of the resolution o

The autocorrelation is restricted to a fixed range(4095 bp 4096 bp) and thus the associatedspectral density is fast and easy to compute and it isconstructed as the histogram of pairwise distancesbetween all N reads (see lsquoMaterials and Methodsrsquosection) This approach enables a large coverage foreach lag (t) in the autocorrelation function

Extraction of the IP signal using deconvolution

The observed autocorrelation signal consists of a biolo-gical component relevant to the ChIP-seq experimentmodulated on a component capturing irrelevant pro-perties such as DNA accessibility and experimental biasWe therefore designed an approach to deconvolve thecomponent of the signal that captures the biologicalaspects of the experiment

We formulated the problem using a DSP approach TheChIP-independent properties of the signal are denotedtechnical variability X(t) and arise from technical biasesand the stochastic nature of read count data We denoteby Z(t) the true IP signal which reflects effects associatedwith experiment-specific components eg antibody pre-cipitation cross-linking of chromatin to other proteinsin the same complex We therefore model the measuredChIP signal Y(t) as the convolution of the true IP signalwith the technical variability frac12ZethtTHORN XethtTHORNethTHORN For brevitywe will omit ethTHORN in the equations when clear from thecontext

In the DSP framework X(t) is the input Z(t) is the finiteimpulse response function associated with the specific bio-logical signal and Y(t) is the observed output To recoverthe specific finite impulse response and remove ChIP-independent components we used harmonic analysis tech-niques that are commonly used in engineering disciplines(59) In harmonic analysis signals are represented as thesum of characteristic harmonic components (ie sinus-oidal functions each with a specific period and phase)In this formulation if the technical variability X(t) isknown then applying the convolution theorem it ispossible to recover the true IP signal Z(t) from themeasured signal Y(t) consequently if the autocorrelationXethtTHORN XethtTHORN is known then it is possible to recover theautocorrelation of the true IP signal ZethtTHORN ZethtTHORN

frac12ZethtTHORN ZethtTHORNethTHORN frac14 F1 FethYethtTHORNYethtTHORNTHORNF ethXethtTHORNXethtTHORNTHORN

frac14 F1FethYethTHORNTHORNF ethXethTHORNTHORN

eth6THORN

where F is the Fourier transform operator F1 is itsinverse YethtTHORN frac14 XethtTHORN ZethtTHORN and YethTHORN frac14 YethtTHORN YethtTHORN andXethTHORN frac14 XethtTHORN XethtTHORN are the autocorrelations of theChIP-seq signal Y and of the control X respectively

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 4 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

We named the recovered autocorrelation of the true IPsignal ZethtTHORN ZethtTHORN the Arpeggio profile (Figure 1)

In studies where multiple controls are available or nocontrols are available we matched to each ChIP-seq ex-periment the control that is closest in terms of its spectralproperties We applied the same control matching proced-ure to the rest of the samples and found that most samplesmatched controls done in the same study or cell line Thiscontrol matching procedure is a conservative approach inwhich we try to identify the parts of the Y spectra that aresignificantly distinct from the X spectra and thus increasesspecificity In the extreme case where the autocorrelationof the measured ChIP signal YethTHORN and of the technicalvariability XethTHORN have proportional spectral densities iethey are linearly correlated their ratio will be constant andtheir deconvolution is an impulse indicating that there isno true IP signal (Figure 2)

We matched controls to experiments from the pool ofcontrols in our data set by first matching the organism andDNA-shearing technique and selected the control with thehighest correlation (Pearsonrsquos ) to the ChIP experimentin the base resolution domain (frequency domain ieafter applying the Fourier transform) We also requirethat gt 085 We note that this leads to the highest speci-ficity in the context of our spectral analysis We recall thatcorrelation in the resolution domain does not imply cor-relation in the genomic co-ordinates domain The value ofthe Arpeggio profile at frac14 0 reflects differences in readcount between experiment and control From the values of frac14 0 recorded in our data set we concluded that noexperiment control pair had a perfectly matching readcount In general this did not significantly affect ourability to recover the autocorrelation of the true IP signal

However for 31 samples of 806 in our data set wecould not identify any matching control Of the remaining775 there was also a small fraction of experiments forwhich the resulting Arpeggio profile exhibited several arti-facts suggesting poorly matched controls In particularwhen the autocorrelation of the control had a differentdecay rate than that of the experiment we observed dipsaround frac14 0 or gradual rises or falls as opposed toleveling out at greater values of jj (SupplementaryFigure S6) This difference in decay rate is likely due totechnical variability dependent on read counts Howeverthere may be biological components to it as well eg if apolymerase binds and then moves along the genome it isreasonable to believe that the reads will be more spreadthan a transcription factor that binds to a single locationIt would have been desirable to extract the true IP signal

Z(t) from ZethtTHORN ZethtTHORN directly by square root in the reso-lution domain However in general FethZethtTHORNTHORN is not entirelypositive or real thus there is no unique solution for Z(t)because the phase information is lost in the autocorrel-ation operation In practice the Arpeggio profilesZethtTHORN ZethtTHORN are sufficient for the purpose of comparingChIP-seq experiments and can also be used to examine thespectral density as a function of base resolution

Recovering the length distribution of ChIP-seq fragments

The fragment length distribution is an important param-eter for algorithms that seeks to identify binding eventlocations from short read experiments such as ChIP-seq

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

HNF4APolIIDNA Input

Figure 2 Examples of the deconvolved Arpeggio profiles for a tran-scription factor PolII and a total DNA input sample The transcriptionfactor HNF4A shows a strong spike followed by a steep decayindicating isolated binding and PolII shows an isolated pulsefollowed by gradual decay The isolated pulse for total DNA inputindicates that sequenced reads provide no additional informationbeyond technical variability ethXethtTHORNTHORN The DNA input Arpeggio was con-structed using the samplersquos best matched control As in the previousfigure the read count difference between the sample and the matchedcontrol at frac14 0 was removed and smoothing was applied to aid invisualization

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

H3K36me3H3K27me3H3K27AcpRB

Figure 1 Examples of the deconvolved Arpeggio profiles for histonemarks H3K27me3 Histone 3 lysine 36 tri-methylation (H3K36me3)Histone 3 lysine 27 acetylation (H3K27Ac) and for retinoblastoma(pRB) The H3K27me3 and H3K36me3 profiles show periodicity of190 bp consistent with previous reports of nucleosome frequency(60) The value representing the total read count difference betweenChIP and control at frac14 0 was removed and smoothing was appliedto aid in visualization

PAGE 5 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 4: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

Statistical analyses and software

All analyses were performed using our Java-based softwarepackage and the R statistical software (56) Our Arpeggiosoftware can be used to download data from the SRAmap reads to reference genomes compute autocorrelationand Arpeggio profiles An R script is also available togenerate and plot Arpeggio profiles The Arpeggiosoftware suite is freely available at httpsourceforgenetparpeggiowikiHome together with a detailedtutorial

RESULTS

Spectral density of ChIP-seq signals

To enhance our understanding of a new ChIP-seq sampleit is often beneficial to relate it to relevant ChIP-seq ex-periments in public repositories Here we relate experi-ments based on their signal proximity and therefore anappropriate distance metric is neededA naıve comparison of ChIP-seq experiments can be

done by measuring the distance between their coveragedepth graphs The genome consists of a large number ofgenomic positions (3 billion for the human genome)from which one can theoretically sample reads Thispairwise comparison is not efficient for large data setsand the dimension may be too large to identify neighbors(57) In addition at single nucleotide resolution meaning-ful patterns can be obscured by noise Standard compari-sons between pairs of ChIP-seq experiments arecommonly done by evaluating the overlap betweenpeaks detected in these experiments (458) Intuitivelythe union of peaks from a large data sets which isneeded to generate pairwise distances produces manygenomic intervals and is thus high dimensional Toquickly relate a given ChIP-seq sample to relevant experi-ments we sought to design a data representation thatcaptures the underlying biology is easy to compute canbe expressed by a relatively small number of dimensionsand finally is robust to suboptimal read coverageWe leveraged two important characteristics of ChIP-seq

data to create this low-dimensional representation Firstreads are localized in islands surrounding the interactingproteins (eg factors histones or polymerases) that weretargeted by the antibody (58) We therefore examined thesystem at intermediate genomic length scales Specificallywe consider the autocorrelation function of the readcoverage depth This function captures recurrent eventsalong the genome and aggregates this information to sig-natures of read co-occurrence at specific length scales orlags (Supplementary Figures S2ndashS4) As a consequence ofthe localized nature of ChIP-seq the autocorrelationfunction exhibits regular non-random interactions withina relatively small offset 4095 bp 4096 bp afterwhich it is uninformative Second read count data arestochastic and typically undersampled resulting inspikey noise that obscures signals occurring at nucleotideresolution (Supplementary Figure S5) At long lengthscales far beyond the length of proteinndashchromatin inter-action islands and also at short length scales on the order

of nucleotides the signal is dominated by noise and there-fore we set out to capture the signal at intermediate lengthscales where the signal-to-noise ratio is the highest To thisaim we applied a Fast Fourier Transform to the autocor-relation function for all lags 4095 bp 4096 bp (seelsquoMaterials and Methodsrsquo section) resulting in the spectraldensity of the empirical read coverage distribution AethTHORNas a function of the resolution o

The autocorrelation is restricted to a fixed range(4095 bp 4096 bp) and thus the associatedspectral density is fast and easy to compute and it isconstructed as the histogram of pairwise distancesbetween all N reads (see lsquoMaterials and Methodsrsquosection) This approach enables a large coverage foreach lag (t) in the autocorrelation function

Extraction of the IP signal using deconvolution

The observed autocorrelation signal consists of a biolo-gical component relevant to the ChIP-seq experimentmodulated on a component capturing irrelevant pro-perties such as DNA accessibility and experimental biasWe therefore designed an approach to deconvolve thecomponent of the signal that captures the biologicalaspects of the experiment

We formulated the problem using a DSP approach TheChIP-independent properties of the signal are denotedtechnical variability X(t) and arise from technical biasesand the stochastic nature of read count data We denoteby Z(t) the true IP signal which reflects effects associatedwith experiment-specific components eg antibody pre-cipitation cross-linking of chromatin to other proteinsin the same complex We therefore model the measuredChIP signal Y(t) as the convolution of the true IP signalwith the technical variability frac12ZethtTHORN XethtTHORNethTHORN For brevitywe will omit ethTHORN in the equations when clear from thecontext

In the DSP framework X(t) is the input Z(t) is the finiteimpulse response function associated with the specific bio-logical signal and Y(t) is the observed output To recoverthe specific finite impulse response and remove ChIP-independent components we used harmonic analysis tech-niques that are commonly used in engineering disciplines(59) In harmonic analysis signals are represented as thesum of characteristic harmonic components (ie sinus-oidal functions each with a specific period and phase)In this formulation if the technical variability X(t) isknown then applying the convolution theorem it ispossible to recover the true IP signal Z(t) from themeasured signal Y(t) consequently if the autocorrelationXethtTHORN XethtTHORN is known then it is possible to recover theautocorrelation of the true IP signal ZethtTHORN ZethtTHORN

frac12ZethtTHORN ZethtTHORNethTHORN frac14 F1 FethYethtTHORNYethtTHORNTHORNF ethXethtTHORNXethtTHORNTHORN

frac14 F1FethYethTHORNTHORNF ethXethTHORNTHORN

eth6THORN

where F is the Fourier transform operator F1 is itsinverse YethtTHORN frac14 XethtTHORN ZethtTHORN and YethTHORN frac14 YethtTHORN YethtTHORN andXethTHORN frac14 XethtTHORN XethtTHORN are the autocorrelations of theChIP-seq signal Y and of the control X respectively

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 4 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

We named the recovered autocorrelation of the true IPsignal ZethtTHORN ZethtTHORN the Arpeggio profile (Figure 1)

In studies where multiple controls are available or nocontrols are available we matched to each ChIP-seq ex-periment the control that is closest in terms of its spectralproperties We applied the same control matching proced-ure to the rest of the samples and found that most samplesmatched controls done in the same study or cell line Thiscontrol matching procedure is a conservative approach inwhich we try to identify the parts of the Y spectra that aresignificantly distinct from the X spectra and thus increasesspecificity In the extreme case where the autocorrelationof the measured ChIP signal YethTHORN and of the technicalvariability XethTHORN have proportional spectral densities iethey are linearly correlated their ratio will be constant andtheir deconvolution is an impulse indicating that there isno true IP signal (Figure 2)

We matched controls to experiments from the pool ofcontrols in our data set by first matching the organism andDNA-shearing technique and selected the control with thehighest correlation (Pearsonrsquos ) to the ChIP experimentin the base resolution domain (frequency domain ieafter applying the Fourier transform) We also requirethat gt 085 We note that this leads to the highest speci-ficity in the context of our spectral analysis We recall thatcorrelation in the resolution domain does not imply cor-relation in the genomic co-ordinates domain The value ofthe Arpeggio profile at frac14 0 reflects differences in readcount between experiment and control From the values of frac14 0 recorded in our data set we concluded that noexperiment control pair had a perfectly matching readcount In general this did not significantly affect ourability to recover the autocorrelation of the true IP signal

However for 31 samples of 806 in our data set wecould not identify any matching control Of the remaining775 there was also a small fraction of experiments forwhich the resulting Arpeggio profile exhibited several arti-facts suggesting poorly matched controls In particularwhen the autocorrelation of the control had a differentdecay rate than that of the experiment we observed dipsaround frac14 0 or gradual rises or falls as opposed toleveling out at greater values of jj (SupplementaryFigure S6) This difference in decay rate is likely due totechnical variability dependent on read counts Howeverthere may be biological components to it as well eg if apolymerase binds and then moves along the genome it isreasonable to believe that the reads will be more spreadthan a transcription factor that binds to a single locationIt would have been desirable to extract the true IP signal

Z(t) from ZethtTHORN ZethtTHORN directly by square root in the reso-lution domain However in general FethZethtTHORNTHORN is not entirelypositive or real thus there is no unique solution for Z(t)because the phase information is lost in the autocorrel-ation operation In practice the Arpeggio profilesZethtTHORN ZethtTHORN are sufficient for the purpose of comparingChIP-seq experiments and can also be used to examine thespectral density as a function of base resolution

Recovering the length distribution of ChIP-seq fragments

The fragment length distribution is an important param-eter for algorithms that seeks to identify binding eventlocations from short read experiments such as ChIP-seq

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

HNF4APolIIDNA Input

Figure 2 Examples of the deconvolved Arpeggio profiles for a tran-scription factor PolII and a total DNA input sample The transcriptionfactor HNF4A shows a strong spike followed by a steep decayindicating isolated binding and PolII shows an isolated pulsefollowed by gradual decay The isolated pulse for total DNA inputindicates that sequenced reads provide no additional informationbeyond technical variability ethXethtTHORNTHORN The DNA input Arpeggio was con-structed using the samplersquos best matched control As in the previousfigure the read count difference between the sample and the matchedcontrol at frac14 0 was removed and smoothing was applied to aid invisualization

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

H3K36me3H3K27me3H3K27AcpRB

Figure 1 Examples of the deconvolved Arpeggio profiles for histonemarks H3K27me3 Histone 3 lysine 36 tri-methylation (H3K36me3)Histone 3 lysine 27 acetylation (H3K27Ac) and for retinoblastoma(pRB) The H3K27me3 and H3K36me3 profiles show periodicity of190 bp consistent with previous reports of nucleosome frequency(60) The value representing the total read count difference betweenChIP and control at frac14 0 was removed and smoothing was appliedto aid in visualization

PAGE 5 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 5: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

We named the recovered autocorrelation of the true IPsignal ZethtTHORN ZethtTHORN the Arpeggio profile (Figure 1)

In studies where multiple controls are available or nocontrols are available we matched to each ChIP-seq ex-periment the control that is closest in terms of its spectralproperties We applied the same control matching proced-ure to the rest of the samples and found that most samplesmatched controls done in the same study or cell line Thiscontrol matching procedure is a conservative approach inwhich we try to identify the parts of the Y spectra that aresignificantly distinct from the X spectra and thus increasesspecificity In the extreme case where the autocorrelationof the measured ChIP signal YethTHORN and of the technicalvariability XethTHORN have proportional spectral densities iethey are linearly correlated their ratio will be constant andtheir deconvolution is an impulse indicating that there isno true IP signal (Figure 2)

We matched controls to experiments from the pool ofcontrols in our data set by first matching the organism andDNA-shearing technique and selected the control with thehighest correlation (Pearsonrsquos ) to the ChIP experimentin the base resolution domain (frequency domain ieafter applying the Fourier transform) We also requirethat gt 085 We note that this leads to the highest speci-ficity in the context of our spectral analysis We recall thatcorrelation in the resolution domain does not imply cor-relation in the genomic co-ordinates domain The value ofthe Arpeggio profile at frac14 0 reflects differences in readcount between experiment and control From the values of frac14 0 recorded in our data set we concluded that noexperiment control pair had a perfectly matching readcount In general this did not significantly affect ourability to recover the autocorrelation of the true IP signal

However for 31 samples of 806 in our data set wecould not identify any matching control Of the remaining775 there was also a small fraction of experiments forwhich the resulting Arpeggio profile exhibited several arti-facts suggesting poorly matched controls In particularwhen the autocorrelation of the control had a differentdecay rate than that of the experiment we observed dipsaround frac14 0 or gradual rises or falls as opposed toleveling out at greater values of jj (SupplementaryFigure S6) This difference in decay rate is likely due totechnical variability dependent on read counts Howeverthere may be biological components to it as well eg if apolymerase binds and then moves along the genome it isreasonable to believe that the reads will be more spreadthan a transcription factor that binds to a single locationIt would have been desirable to extract the true IP signal

Z(t) from ZethtTHORN ZethtTHORN directly by square root in the reso-lution domain However in general FethZethtTHORNTHORN is not entirelypositive or real thus there is no unique solution for Z(t)because the phase information is lost in the autocorrel-ation operation In practice the Arpeggio profilesZethtTHORN ZethtTHORN are sufficient for the purpose of comparingChIP-seq experiments and can also be used to examine thespectral density as a function of base resolution

Recovering the length distribution of ChIP-seq fragments

The fragment length distribution is an important param-eter for algorithms that seeks to identify binding eventlocations from short read experiments such as ChIP-seq

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

HNF4APolIIDNA Input

Figure 2 Examples of the deconvolved Arpeggio profiles for a tran-scription factor PolII and a total DNA input sample The transcriptionfactor HNF4A shows a strong spike followed by a steep decayindicating isolated binding and PolII shows an isolated pulsefollowed by gradual decay The isolated pulse for total DNA inputindicates that sequenced reads provide no additional informationbeyond technical variability ethXethtTHORNTHORN The DNA input Arpeggio was con-structed using the samplersquos best matched control As in the previousfigure the read count difference between the sample and the matchedcontrol at frac14 0 was removed and smoothing was applied to aid invisualization

minus2000 minus1000 0 1000 2000

000

000

0005

000

100

0015

000

20

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

H3K36me3H3K27me3H3K27AcpRB

Figure 1 Examples of the deconvolved Arpeggio profiles for histonemarks H3K27me3 Histone 3 lysine 36 tri-methylation (H3K36me3)Histone 3 lysine 27 acetylation (H3K27Ac) and for retinoblastoma(pRB) The H3K27me3 and H3K36me3 profiles show periodicity of190 bp consistent with previous reports of nucleosome frequency(60) The value representing the total read count difference betweenChIP and control at frac14 0 was removed and smoothing was appliedto aid in visualization

PAGE 5 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 6: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

peak callers (61) If paired-end reads are available theycan be used to recover the fragment length distributionempirically however many experiments in currentrepositories were not done using paired-end reads Wenote that for a given number of nucleotides sequencedthe single end approach represents twice as many frag-ments and thus is more sensitive than the paired-endapproach The latter however has an advantage inmapability in repetitive regionsPrevious studies have used cross-correlation to infer the

average fragment length of singled-end short read ChIP-seq data (62ndash64) As known when considering reads fromopposing strands some of the measured distances reflectfragment lengths (62ndash66)We modeled the probability associated with sampling a

fragment starting at any given position on the positivestrand of the genome as Prfrac12W frac14 t If the sequencedread from this fragment happens to map to the positivestrand it will start at the same genomic position t givingthe same probability distribution for the readsPrfrac12R frac14 Prfrac12W If however the sequenced read maps tothe negative strand then its starting position is F= f thefragment length nucleotides downstream from thestarting position of the fragment W= t Thus given theprobability of sampling a read on the positive strandPrfrac12R there is an equivalent probability of sampling aread on the negative strand Prfrac12R+F where F is arandom variable representing the fragment lengthWe show that harmonic deconvolution can be used to

determine not only the average fragment length but alsothe full empirical fragment length distribution This isdone by deconvolving the cross-correlation (seelsquoMaterials and Methodsrsquo section) between the start ofreads aligning to opposite strands which we denoted asethTHORN from the autocorrelation of reads aligning to thesame strand for the same experiment ethTHORN We recall thatthe probability distribution of the sum of two randomvariables is equivalent to the convolution of their individ-ual probability distributions and thus

c frac14 Prfrac12R+F Prfrac12R

frac14 Prfrac12R Prfrac12R Prfrac12Feth7THORN

where c is a normalization constant such thatc DethtTHORN frac14 Prfrac12R Similarly to Equation (6) we deconvolvethe fragment length distribution

Prfrac12F frac14 F1 FethPrfrac12RPrfrac12RPrfrac12FTHORNFethPrfrac12RPrfrac12RTHORN

frac14 F1

FethcYethTHORNTHORN

FethcYethTHORNTHORN

eth8THORN

We found that the reported fragment size from the differ-ent studies included in our data set matched the fragmentsize inferred using our deconvolution approach(Supplementary Figure S7 and Supplementary Table S2)Moreover our deconvolution approach using only oneread from each read pair of a paired end experimentproduced an estimate of the fragment length distributionthat closely matched to the length distribution of thepaired-end fragments (Figure 3 see SupplementaryNote A)

Arpeggio captures the biology of proteinndashchromatininteraction

The space of Arpeggio spectral densities is lowdimensionalTo facilitate organization of large sequencing archives inparticular during search operations for complex queriesor computationally intensive machine-learning tasks it isdesirable to represent the samples using a small number offeatures Compared with the size of the genome spectraldensities of Arpeggio profiles described by only 8192elements are already relatively small We investigatedwhether the space comprising all spectral densities in ourdatabase could be further reduced using Multi-Dimensional Scaling (MDS) such as PCA We foundthat six principal components were sufficient to capture85 of the variability present in our collection ofspectral densities

Although more advanced techniques (67) may result ina better compression ie fewer dimensions we decided touse PCA which is readily available and familiar for manypractitioners We used the leading six spectral densityprincipal components to organize hundreds of ChIP-seqexperiments and aid annotation of novel samples Wetermed this representation Arpeggio MDS

Use of Arpeggio MDS coordinates for classification andclusteringClassification and clustering are affected by choice of datarepresentation and dissimilarity measures Here we usethe DaviesndashBouldin index (52) to assess discernability

0 200 400 600 800 1000

000

00

001

000

20

003

000

40

005

000

6

fragment size (bp)de

nsity

PairedminusEnd estimate (212bp)SingleminusEnd Arpeggio (214bp)Experimental protocol (200bp)Read length (75bp)

Figure 3 Comparison between paired-end fragment length distributionand Arpeggio fragment length distribution The Arpeggio fragmentlength distribution was estimated from only one read of each readpair The reported fragment length from the experimental protocoltogether with the sequenced read length is shown as dashed linesThe arpeggio fragment length distribution shows an additional spikeat the read length which has been previously observed in opposingstrand cross-correlation (63)

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 6 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 7: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

between clusters and inferred class labels for each classvariable using a k-nearest neighbor classification

First for each class variables eg antibody targetcellular mechanism or organism we considered classlabels as cluster indices and computed the DaviesndashBouldin index (see lsquoMaterials and Methodsrsquo section)Our analysis showed significantly small DaviesndashBouldinindices for all class variables (Supplementary Table S4)This indicates that the organization of the data based onArpeggio MDS coordinates has a structure amenable for avariety of machine-learning approaches The DaviesndashBouldin index might have been affected by samplingbias of proteins analyzed within each organism groupskewing the index for the organism class variable

Second for each class variable we trained a k-nearestneighbor classifier (k=1) and using a leave-one-outapproach we computed the balanced accuracy of classlabel assignments (see lsquoMaterials and Methodsrsquo section)For most class variables the performance of the trainedclassifier was above the performance of a classifier assign-ing labels at random (Supplementary Table S4)Importantly misclassification was rare but was morecommon between total DNA input and immunoglobulinG controls (Figure 4)

Similarity between Arpeggio profiles is indicative ofbiological functions

We showed that the Arpeggio profiles could be used totrain classifiers that suggest annotation for new experi-ments We sought to extend this paradigm and addresswhether given a new set of experiments Arpeggioprofiles could be used to rapidly select available ChIP-seq experiments that would enable a more complete de-scription of the biological mechanism of interestTypically this analysis involves identification of peaksfrom the ChIP-seq signals and the study of the overlapbetween events in two or more different experiments (8)For this reason we compared the proximity mappingdetermined by the Arpeggio profiles with the proximityscore determined using the Jaccard distance of peakoverlaps

For each ChIP-seq experiment peaks were identifiedusing the Qeseq program (61) assigning to each experi-ment the best matching control as described in theprevious sections We set the fragment size to 150 bp forexperiments using MNase digestion and to 250 bp for ex-periments using sonication (see Supplementary Table S2)We considered two peaks to be overlapping if they sharedat least 1 bp For any pair of experiments the totalnumber of peaks was computed as the sum of thenumber of peaks in each experiment minus the numberof overlapping peaks The pairwise peak overlap score wascomputed using the Jaccard distance namely one minusthe ratio between the number of overlapping peaks andthe total number of peaks In contrast to our Arpeggioapproach peak overlap does not directly allow cross-species comparison To ensure a fair comparisonbetween data representations we split our data set intoa set of human samples (n=541) and a set of murinesamples (n=237) and analyzed them separately

Applying PCA to the peak overlap Jaccard distancematrix revealed that for the human set 365 principalcomponents were needed to capture 85 of the variabilityin the data In contrast 85 of the variability of thepairwise distance matrix of the Fourier transforms of theArpeggio profiles was captured by the six leading principalcomponents Thus the Arpeggio MDS provides a morecompact representation of the dataNext for each class variable we studied the classification

performance using three distance measures peak overlapJaccard distance inverse correlation measure betweenspectral density of Arpeggio profiles and Euclideandistance between Arpeggio MDS coordinates Weanalyzed human and mouse samples separately For eachclass variable we quantified the performance using theArea Under the receiver operator characteristic Curve(AUC) as described in the lsquoMaterials and Methodsrsquo sectionApplication of a k-nearest neighbor classifier with k=1

to these three distance measures resulted in comparableperformances in most class variables However proximitybetween Arpeggio MDS profiles as compared with peakoverlap Jaccard distance reflected higher association forcellular mechanisms This was more evident in thehuman set where the larger number of experiments corres-ponded to smaller error bars (Figure 5)Arpeggio retains information related to mode of

binding and performs best in clustering cellular mechan-ism in contrast peak overlap contains information aboutbinding locations and best clusters more variables such asbatch effects associated with the Study ID and cell line

Fact

or

His

tone

Mod

ifica

tion

RN

A P

olym

eras

e

Gen

omic

Con

trol

Uns

peci

fic C

ontr

ol

Factor

Histone Modification

RNA Polymerase

Genomic Control

Unspecific Control

Pre

dict

ed la

bel

True label

516

107

96

103

178

253

479

82

8

106

267

97

485

66

85

16

8

39

206

515

169

28

35

304

464

Figure 4 Confusion matrix of predicting functional annotations fromthe Arpeggio profiles using a k-nearest neighbor classifier with k=1The values along each column of the matrix represent how the instancesin the actual class were assigned to the predicted class The diagonalelements indicate sensitivity Darker colors indicate higher classificationfrequencies Most misclassifications occurred between controls

PAGE 7 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 8: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

(Supplementary Table S5 Figure 5 and SupplementaryFigure S8)

Reading biological features from Arpeggio profiles

In the previous sections we have provided evidence thatArpeggio profiles and their spectral densities can be usedto rapidly compare a large number of experiments Inthis section we show that Arpeggio profiles can also beused to derive biologically and technically meaningfulinformationFor instance H3K27me3 Arpeggio profiles exhibited

distinct periodicity with high amplitudes of oscillation(Figure 1) This suggested a highly ordered array of nu-cleosomes consistent with a static chromatin structurewhere nucleosomes are precisely positioned This agreedwell with the known role of Polycomb and H3K27 tri-methylation in transcriptional repression and heterochro-matin formation We recall that the profile represents anaggregate of binding events across the whole genome theclear and distinct periodicity for H3K27me3 suggestedthat nucleosomes with the tri-methylated H3K27 markhad remarkably constant periodicities throughout thegenome (4) Further we show that this periodicity andthe width of the signal (number of clear oscillations) issimilar across species (Figure 6)Another oscillatory pattern can be observed for Histone

3 Lysine 36 tri-methylation (H3K36me3) This histonemodification mark is deposited along with activelytranscribing PolII complexes and is by far the mostreliable histone mark for actively transcribed genes Incontrast to H3K27me3 profiles where the oscillation isonly slowly dampening due to the static chromatin

structure imposed by this mark H3K36me3 profilesshow a central peak surrounded by slightly smallerpeaks with a rapidly degrading oscillation indicative ofa fluidic chromatin state where nucleosomes are not pre-cisely positioned This is in agreement with this markbeing found in actively transcribed genes as the nucleo-somes in actively transcribed chromatin are perturbed bythe passing polymerase complexes (Figure 1)

The difference in chromatin state is particularly evidentin the harmonic analysis of Arpeggio profiles Inspired byprevious work on nucleosome spacing (60) we computedthe ratio

frac14Aeth 1190THORN

Aeth 1180THORN+Aeth 1200THORN

2

eth9THORN

where Aeth1=tTHORN is the magnitude of the t bp length scale inthe spectral density H3K27me3 exhibited a stronger acompared with H3K36me3 (Figure 7) which suggestsflexibility in the spacing of nucleosomes carryingH3K36me3 mark Interestingly the ratio a was alsolarge in Histone 3 Lysine 4 tri-methylation (H3K4me3)and in the Retinoblastoma protein (pRb) Although thereasons for such a precise placement of H3K4me3 areunclear pRb is an important factor in establishingheterochromatin

The effects of open fluidic chromatin states are strongerin the Arpeggio profiles of histone acetylations Theseprofiles are characterized by a high central peak sur-rounded by unorganized oscillations These profiles areclearly indicative of an open fluid chromatin configur-ation which is consistent with actively transcribedregions (Figure 1)

However the reference ChIP-seq signal for openactively transcribed chromatin is the Arpeggio profile ofRNA Polymerase II The PolII complexes move alongactively transcribed genes thus the ChIP-seq information

Hsapiens

AU

C

00

02

04

06

08

10

Pro

tein

AB

targ

et

Inte

ract

ion

dom

ain

Fun

ctio

nal a

nnot

atio

n

Act

ivity

ann

otat

ion

Cel

lula

r M

echa

nism

She

arin

g

Cel

l Lin

e

Stu

dy ID

RandomPeak overlapArpeggioArpeggio MDS

Figure 5 Classification of biological and experimental factors in termsof peak-based and Arpeggio-based ChIP-seq signatures for humansamples The comparison was performed based on proximitycomputed using peak overlap between pairs of experiments inversecorrelation of Arpeggio spectral densities or Euclidean distances ofArpeggio MDS coordinates The AUC of a classifier assigningrandom labels is shown as a black dashed line Compared with peakoverlap proximity based on Arpeggio MDS had a higher median per-formance of predicting cellular mechanism

500 1000 1500 2000minus4e

minus04

minus2e

minus04

0e+

002e

minus04

4eminus

046e

minus04

8eminus

04

Offset(bp)

IP s

igna

l (R

eads

ChI

P

Tot

al R

eads

ChI

P)

(R

eads

Con

trol

Tot

al R

eads

Con

trol

)

Fly (187bp period)Mouse (188bp period)Human (184bp period)

Figure 6 H3K27me3 shows similar periodicity across fly mouse andhuman Periodicity was calculated using the local maxima of thespectral density for the range of the Arpeggio profiles shown

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 8 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 9: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

is spread over a relatively large spatial region For thisreason PolII ChIP experiments need significantsequencing depth to obtain a good picture of PolIIactivity The Arpeggio profiles for PolII show a strongcentral signal surrounded by two shoulders each flankedby disorganized nucleosomes (Figure 2) The central peakis where PolII is located most closely to the DNA strandand thus efficient cross-linking can occur However PolIIis part of a large lsquoholoenzymersquo transcription machineryand the surrounding smaller humps are likely where thiscomplex is close enough to the DNA strand to be cross-linked and enriched in a ChIP-seq experiment The un-organized pattern surrounding these peaks is likely aresult of the actively transcribing PolII complex As itmoves through the chromatin it perturbs andor disas-sembles nucleosomes in front and re-deposits thembehind giving highly disorganized chromatin structure(Figure 2)

Other proteins such as Androgen Receptor SPDEF(SAM pointed domain-containing Ets transcriptionfactor) ERG (Ets-Related Gene) FL1 (Follicularlymphoma susceptibility to 1) display typical site-specific DNA-binding profiles of transcription factors inwhich a strong signal occurs at the binding siteaccompanied by disorganized surrounding patterns indi-cative of active fluidic chromatin (Figure 2)

DISCUSSION

The contribution of this study is the design of a compactChIP-seq data representation based on the denoised auto-correlation We show that use of this compact data repre-sentation has several advantages that facilitates efficientcomputation and data storage linking the cellular mech-anisms of protein targets in novel ChIP-seq experiments todata in current repositories low-dimensional organizationof large repositories of ChIP-seq data that facilitates ex-ploratory data analysis cross-species and cross-cell linecomparisons extraction of technical features such asfragment length distribution and biological relevantinterpretation

Large collections of ChIP-seq data have been leveragedto gain new biological insights (8) The volume of ChIP-seq data in public repositories has noticeably increased inrecent years Typically users retrieve samples based ontheir prior knowledge and expectations of the biologicalsystem Currently available data retrieval systems arebased on matching qualitative annotations such asorganism cell-type condition and specific immunopre-cipitated protein We suggest the use of our novel com-pression technique Arpeggio to enable searching forsamples similar to the query experiment in terms of quan-titative patterns present in theirs signals thus facilitatingnovel biological discoveries We note that non-linear datacompression approaches applied to the autocorrelationfunctions can organize the data and reveal new insights(see diffusion map analysis Supplementary Note B) Wepresent Arpeggio profiles and their spectral densities Thislow-dimensional harmonic data representation can beused for selecting publicly available experiments that arebiologically related to an experiment of interest TheArpeggio profiles are computed from the autocorrelationof ChIP-seq signals which have been previously exploredin the context of data quality assessment (7)In contrast to previous approaches we applied signal

processing techniques to derive a profile of the IP auto-correlation that is diminished in technical variability andrequires little pre-filtering We found that Arpeggioprofiles were remarkably organized in four maincategories corresponding to intuitive classes of structuralinteractions factors showed peaks with sharply decayingtails polymerases showed peaks as well but with slowlydecaying tails histone modifications showed damped os-cillations corresponding to trains of peaks at fixed dis-tances from one another lastly controls showed a singlepulse sharper than the peak of factors indicating that asexpected sequenced reads from total DNA inputs have norecurrent properties Typical binding patterns of a particu-lar proteinndashchromatin interaction are obscured by noiseArpeggio profiles overcome this problem by aggregatingthe recurrent patterns of proteinndashchromatin interactionThe quality of autocorrelation also improves quickerthan the read coverage density as the number of readsincreases In peak finding the average coverage isexpected to increase linearly on the order of OethN L=GTHORNwhere N is the number of sequenced reads L is the read-length and G is the size of the genome in contrast thenumber of distances between reads contributing to thecomputation of the autocorrelation scales quadraticallyin the worst case as OethN2 W=GTHORN where W is themaximum lag at which the autocorrelation is evaluatedwhere W L We note that Arpeggio profiles do notprovide the location of such binding events however weplan to further develop these characteristic spectralbinding patterns for locating peaksIn this work we used an unsupervised approach to

organize a large volume of ChIP-seq experiments Weshow that close proximity between the denoised spectraldensities of two different proteins is often associated withsimilar cellular mechanism In this work we manuallyannotated 806 of the 14 306 currently marked as ChIP-seq samples in the SRA Databases are expected to

DN

AIn

put

ER

GG

RH

2BK

12A

cH

2BK

15A

cH

2BK

20A

cH

3H

3K18

Ac

H3K

27A

cH

3K27

me3

H3K

36m

e3H

3K4m

e1H

3K4m

e2H

3K4m

e3H

3K79

me1

H3K

79m

e2H

3K79

me3

H3K

9Ac

H3K

9me3

H4K

12A

cH

4K20

me1

H4K

5Ac

IgG

P30

0P

olII

RA

RS

PD

EF

VitD

rhE

Ra

pRb

099

100

101

102

103

104

105

106

α

Figure 7 Flexibility of nucleosome spacing across different experi-ments Flexible spacing or isolated events (SupplementaryFigure S2) is represented by ratios close to one Ratios clearly aboveone indicate strict spacing of 190 bp between events (SupplementaryFigure S4) Histone modifications associated with heterochromatinexhibit higher ratios indicating less flexibility

PAGE 9 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 10: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

evolve to be more structured and enable automatic re-trieval eliminating the need for data entry tasks andallowing us to organize thousands of samples at a timeMoreover high-throughput sequencing is becoming cheapfacilitating the mass survey of novel ChIP targets forwhich function is yet to be determined Applying theproposed spectral representation to thousands of existingannotated ChIP-seq experiments will allow us to screenthese new ChIP targets reducing the resources requiredto elucidate their functions Interpretation of spectralpatterns in many fields of science and engineering (ieradiology control systems analysis imaging) is often theproduct of years of study In this study we focused onproperties that discriminate coarse categories of proteinndashchromatin interaction There is a wealth of knowledgehidden in Arpeggio representation and we anticipatethat with increasing database size and quality it willprovide information on a much finer scale

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Onlineincluding [68ndash122]

ACKNOWLEDGEMENTS

The authors thank Ronen Talmon Sherman WeissmanRonald Coifman and Paolo Barbano for insightfuldiscussions

FUNDING

National Institute of Health [T15 LM07056 to KS] and[CA-158167 to YK] Yale Cancer Center translationalresearch pilot funds (to FP and FS) American CancerSociety Award [M130572 to FS] the American-ItalianCancer Foundation [Post-Doctoral Research Fellowshipto FS] the Peter T Rowley Breast Cancer ResearchProjects funded by the New York State Department ofHealth [FAU 0812160900 YK] Funding for openaccess charge The Peter T Rowley Breast CancerResearch Projects funded by the New York StateDepartment of Health [FAU 0812160900 YK]

Conflict of interest statement None declared

REFERENCES

1 MortazaviA WilliamsBA MccueK SchaefferL and WoldB(2008) Mapping and quantifying mammalian transcriptomes byRNA-Seq Nat Methods 5 621ndash628

2 TrapnellC RobertsA GoffL PerteaG KimD KelleyDRPimentelH SalzbergSL RinnJL and PachterL (2012)Differential gene and transcript expression analysis of RNA-seqexperiments with TopHat and Cufflinks Nat Protoc 7 562ndash578

3 GreenmanC StephensP SmithR DalglieshGL HunterCBignellG DaviesH TeagueJ ButlerA StevensC et al(2007) Patterns of somatic mutation in human cancer genomesNature 446 153ndash158

4 AspP BlumR VethanthamV ParisiF MicsinaiM ChengJBowmanC KlugerY and DynlachtBD (2011) Genome-wide

remodeling of the epigenetic landscape during myogenicdifferentiation Proc Natl Acad Sci USA 108 E149ndashE158

5 BarskiA and ZhaoK (2009) Genomic location analysis byChIP-Seq J Cell Biochem 107 11ndash18

6 MikkelsenTS KuMC JaffeDB IssacB LiebermanEGiannoukosG AlvarezP BrockmanW KimTK KocheRPet al (2007) Genome-wide maps of chromatin state in pluripotentand lineage-committed cells Nature 448 553ndash560

7 ENCODE Project Consortium (2004) The ENCODE(ENCyclopedia of DNA elements) project Science 306 636ndash640

8 ENCODE Project Consortium (2012) An integrated encyclopediaof DNA elements in the human genome Nature 489 57ndash74

9 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

10 BellO SchwaigerM OakeleyEJ LienertF BeiselCStadlerMB and SchubelerD (2010) Accessibility of theDrosophila genome discriminates PcG repression H4K16acetylation and replication timing Nat Struct Mol Biol 17894ndash900

11 BlowM McCulleyDJ LiZ ZhangT AkiyamaJA HoltAPlajzer-FrickI ShoukryM WrightC ChenF et al (2010)ChIP-Seq identification of weakly conserved heart enhancers NatGenet 42 806ndash810

12 BottomlyD KylerSL McWeeneySK and YochumGS(2010) Identification of b-catenin binding regions in colon cancercells using ChIP-Seq Nucleic Acids Res 38 5735ndash5745

13 ChicasA WangX ZhangC McCurrachM ZhaoZ MertODickinsRA NaritaM ZhangM and LoweSW (2010)Dissecting the unique role of the retinoblastoma tumor suppressorduring cellular senescence Cancer Cell 17 376ndash387

14 CreyghtonMP ChengAW WelsteadGG KooistraTCareyBW SteineEJ HannaJ LodatoMA FramptonGMSharpPA et al (2010) From the cover histone H3K27acseparates active from poised enhancers and predictsdevelopmental state Proc Natl Acad Sci USA 10721931ndash21936

15 DurantL WatfordWT RamosHL LaurenceA VahediGWeiL TakahashiH SunH-W KannoY PowrieF et al(2010) Diverse targets of the transcription factor STAT3contribute to T cell pathogenicity and homeostasis Immunity 32605ndash615

16 EnderleD BeiselC StadlerMB GerstungM AthriP andParoR (2010) Polycomb preferentially targets stalled promotersof coding and noncoding transcripts Genome Res 21 216ndash226

17 FullwoodMJ LiuMH PanYF LiuJ XuHMohamedYB OrlovYL VelkovS HoA MeiPH et al(2009) An oestrogen-receptor-a-bound human chromatininteractome Nature 462 58ndash64

18 GanQ SchonesDE Ho EunS WeiG CuiK ZhaoK andChenX (2010) Monovalent and unpoised status of most genes inundifferentiated cell-enriched Drosophila testis Genome Biol 11R42

19 GorchakovAA AlekseyenkoAA KharchenkoP ParkPJand KurodaMI (2009) Long-range spreading of dosagecompensation in Drosophila captures transcribed autosomal genesinserted on X Genes Dev 23 2266ndash2271

20 GuentherMG FramptonGM SoldnerF HockemeyerDMitalipovaM JaenischR and YoungRA (2010) Chromatinstructure and gene expression programs of human embryonic andinduced pluripotent stem cells Cell Stem Cell 7 249ndash257

21 HoffmanBG RobertsonG ZavagliaB BeachM CullumRLeeS SoukhatchevaG LiL WederellED ThiessenN et al(2010) Locus co-occupancy nucleosome positioning andH3K4me1 regulate the functionality of FOXA2- HNF4A- andPDX1-bound loci in islets and liver Genome Res 20 1037ndash1051

22 HurtadoA HolmesKA Ross-InnesCS SchmidtD andCarrollJS (2010) FOXA1 is a key determinant of estrogenreceptor function and endocrine response Nat Genet 43 27ndash33

23 IpJY SchmidtD PanQ RamaniAK FraserAGOdomDT and BlencoweBJ (2011) Global impact of RNApolymerase II elongation inhibition on alternative splicingregulation Genome Res 21 390ndash401

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 10 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 11: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

24 JohnS SaboPJ ThurmanRE SungM-H BiddieSCJohnsonTA HagerGL and StamatoyannopoulosJA (2011)Chromatin accessibility pre-determines glucocorticoid receptorbinding patterns Nat Genet 43 264ndash268

25 JolmaA KiviojaT ToivonenJ ChengL WeiG EngeMTaipaleM VaquerizasJM YanJ SillanpaaMJ et al (2010)Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities Genome Res 20861ndash873

26 JungH LacombeJ MazzoniEO LiemKF Jr GrinsteinJMahonyS MukhopadhyayD GiffordDK YoungRAAndersonKV et al (2010) Global control of motor neurontopography mediated by the repressive actions of a single hoxgene Neuron 67 781ndash796

27 KageyMH NewmanJJ BilodeauS ZhanY OrlandoDAvan BerkumNL EbmeierCC GoossensJ RahlPBLevineSS et al (2010) Mediator and cohesin connect geneexpression and chromatin architecture Nature 467 430ndash435

28 KasowskiM GrubertF HeffelfingerC HariharanMAsabereA WaszakSM HabeggerL RozowskyJ ShiMUrbanAE et al (2010) Variation in transcription factor bindingamong humans Science 328 232ndash235

29 KocheRP SmithZD AdliM GuH KuM GnirkeABernsteinBE and MeissnerA (2011) Reprogramming factorexpression initiates widespread targeted chromatin remodelingCell Stem Cell 8 96ndash105

30 KramerJM KochinkeK OortveldMAW MarksHKramerD de JongEK AsztalosZ WestwoodJTStunnenbergHG SokolowskiMB et al (2011) Epigeneticregulation of learning and memory by Drosophila EHMTG9aPLoS Biol 9 e1000569

31 LeeBK BhingeAA and IyerVR (2011) Wide-rangingfunctions of E2F4 in transcriptional activation and repressionrevealed by genome-wide analysis Nucleic Acids Res 393558ndash3573

32 LiL JothiR CuiK LeeJY CohenT GorivodskyMTzchoriI ZhaoY HayesSM BresnickEH et al (2010)Nuclear adaptor Ldb1 regulates a transcriptional programessential for the maintenance of hematopoietic stem cells NatImmunol 12 129ndash136

33 MahonyS MazzoniEO McCuineS YoungRA WichterleHand GiffordDK (2011) Ligand-dependent dynamics of retinoicacid receptor binding during early neurogenesis Genome Biol12 R2

34 MarsonA LevineSS ColeMF FramptonGMBrambrinkT JohnstoneS GuentherMG JohnstonWKWernigM and NewmanJ (2008) Connecting microRNA genesto the core transcriptional regulatory circuitry of embryonic stemcells Cell 134 521ndash533

35 PaulerFM SloaneMA HuangR ReghaK KoernerMVTamirI SommerA AszodiA JenuweinT and BarlowDP(2008) H3K27me3 forms BLOCs over silent genes and intergenicregions and specifies a histone banding pattern on a mouseautosomal chromosome Genome Res 19 221ndash233

36 Rada-IglesiasA BajpaiR SwigutT BrugmannSAFlynnRA and WysockaJ (2010) A unique chromatin signatureuncovers early developmental enhancers in humans Nature 470279ndash283

37 RamagopalanSV HegerA BerlangaAJ MaugeriNJLincolnMR BurrellA HandunnetthiL HandelAEDisantoG OrtonS-M et al (2010) A ChIP-seq definedgenome-wide map of vitamin D receptor bindingassociations with disease and evolution Genome Res 201352ndash1360

38 Rugg-GunnPJ CoxBJ RalstonA and RossantJ (2010)Distinct histone modifications in stem cell lines and tissue lineagesfrom the early mouse embryo Proc Natl Acad Sci USA 10710783ndash10790

39 SchmidtD SchwaliePC Ross-InnesCS HurtadoABrownGD CarrollJS FlicekP and OdomDT (2010) ACTCF-independent role for cohesin in tissue-specific transcriptionGenome Res 20 578ndash588

40 VermeulenM EberlHC MatareseF MarksH DenissovSButterF LeeKK OlsenJV HymanAA and

StunnenbergHG (2010) Quantitative interaction proteomics andgenome-wide profiling of epigenetic histone marks and theirreaders Cell 142 967ndash980

41 VerziMP ShinH HeHH SulahianR MeyerCAMontgomeryRK FleetJC BrownM LiuXS andShivdasaniRA (2010) Differentiation-specific histonemodifications reveal dynamic chromatin interactions and partnersfor the intestinal transcription factor CDX2 Dev Cell 19713ndash726

42 WeiGH BadisG BergerMF KiviojaT PalinK EngeMBonkeM JolmaA VarjosaloM GehrkeAR et al (2010)Genome-wide analysis of ETS-family DNA-binding in vitro andin vivo EMBO J 29 2147ndash2160

43 WeiL VahediG SunH-W WatfordWT TakatoriHRamosHL TakahashiH LiangJ Gutierrez-CruzG ZangCet al (2010) Discrete roles of STAT4 and STAT6 transcriptionfactors in tuning epigenetic modifications and transcription duringT helper cell differentiation Immunity 32 840ndash851

44 YangY LuY EspejoA WuJ XuW LiangS andBedfordMT (2010) TDRD3 is an effector molecule for arginine-methylated histone marks Mol Cell 40 1016ndash1023

45 YuS CuiK JothiR ZhaoD-M JingX ZhaoK andXueH-H (2010) GABP controls a critical transcriptionregulatory module that is essential for maintenance anddifferentiation of hematopoietic stemprogenitor cells Blood 1172166ndash2178

46 HoffmanMM ErnstJ WilderSP KundajeA HarrisRSLibbrechtM GiardineB EllenbogenPM BilmesJABirneyE et al (2013) Integrative annotation of chromatinelements from ENCODE data Nucleic Acids Res 41 827ndash841

47 LimSJ TanTW and TongJC (2010) Computationalepigenetics the new scientific paradigm Bioinformation 4331ndash337

48 ThurmanRE DayN NobleWS and StamatoyannopoulosJA(2007) Identification of higher-order functional domains in thehuman ENCODE regions Genome Res 17 917ndash927

49 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

50 DiazA ParkK LimDA and SongJS (2012) Normalizationbias correction and peak calling for ChIP-seq Stat Appl GenetMol Biol 11 9

51 LiangK and KelesS (2012) Normalization of ChIP-seq datawith control BMC Bioinformatics 13 199

52 R Development Core Team (2012) R A Language andEnvironment for Statistical Computing R Foundation forStatistical Computing Vienna Austria

53 DaviesDL and BouldinDW (1979) A cluster separationmeasure IEEE Trans Pattern Anal Mach Intell 1 224ndash227

54 HanleyJA and McNeilBJ (1982) The meaning and use of thearea under a receiver operating characteristic (ROC) curveRadiology 143 29ndash36

55 GarnerSR (1995) Weka the waikato environment forknowledge analysis In Procceding of the New Zealand ComputerScience Research Students Conference Hamilton New Zealandpp 57ndash64

56 Shi-YunS Kai-QuanS Chong-JinO Xiao-PingL andWilder-SmithE (2008) Automatic identification and removal ofartifacts in EEG using a probabilistic multi-class SVM approachwith error correction In IEEE International Conference onSystems Man and Cybernetics Singapore Singaporepp 1134ndash1139

57 BellmanRE (2003) Dynamic Programming Dover PublicationsNew York NY

58 BickmoreWA and van SteenselB (2013) Genome architecturedomain organization of interphase chromosomes Cell 1521270ndash1284

59 OppenheimAV SchaferRW and BuckJR (1999) Discrete-Time Signal Processing Prentice-Hall Inc Upper SaddleRiver NJ

60 BlankTA and BeckerPB (1995) Electrostatic mechanism ofnucleosome spacing J Mol Biol 252 305ndash313

61 MicsinaiM ParisiF StrinoF AspP DynlachtBD andKlugerY (2012) Picking ChIP-seq peak detectors for

PAGE 11 OF 12 Nucleic Acids Research 2013 Vol 41 No 16 e161

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018

Page 12: Arpeggio: harmonic compression of ChIP-seq data reveals protein ...

analyzing chromatin modification experiments Nucleic Acids Res40 e70

62 KharchenkoPV TolstorukovMY and ParkPJ (2008) Designand analysis of ChIP-seq experiments for DNA-binding proteinsNat Biotechnol 26 1351ndash1359

63 LandtSG MarinovGK KundajeA KheradpourP PauliFBatzoglouS BernsteinBE BickelP BrownJB CaytingPet al (2012) ChIP-seq guidelines and practices of theENCODE and modENCODE consortia Genome Res 221813ndash1831

64 RamachandranP PalidworGA PorterCJ and PerkinsTJ(2013) MaSC mappability-sensitive cross-correlation forestimating mean fragment length of single-end short-readsequencing data Bioinformatics 29 444ndash450

65 NarlikarL and JothiR (2012) ChIP-Seq data analysisidentification of ProteinndashDNA binding sites with SISSRs peak-finder In WangJ TanAC and TianT (eds) Next GenerationMicroarray Bioinformatics Vol 802 Humana Press New YorkUSA pp 305ndash322

66 ZhangY LiuT MeyerCA EeckhouteJ JohnsonDSBernsteinBE NussbaumC MyersRM BrownM LiWet al (2008) Model-based analysis of ChIP-Seq (MACS) GenomeBiol 9 R137

67 LittleAV LeeJ Yoon-MoJ and MaggioniM (2009) IEEESP15th Workshop on Statistical Signal Processing (SSPrsquo09) CardiffUnited Kingdom pp 85ndash88

e161 Nucleic Acids Research 2013 Vol 41 No 16 PAGE 12 OF 12

Downloaded from httpsacademicoupcomnararticle-abstract4116e1612411628by gueston 11 April 2018