An algorithmic perspective of de novo cis-regulatory motif ... us...sequencing (ChIP-seq) technology...
Transcript of An algorithmic perspective of de novo cis-regulatory motif ... us...sequencing (ChIP-seq) technology...
An algorithmic perspective of de novo cis-regulatory
motif finding based on ChIP-seq dataBingqiang Liu Jinyu Yang Yang Li Adam McDermaid and Qin MaCorresponding author Qin Ma Department of Agronomy Horticulture and Plant Science South Dakota State University Brookings SD 57007 USATel 1-706-254-4293 E-mail qinmasdstateedu
Abstract
Transcription factors are proteins that bind to specific DNA sequences and play important roles in controlling the expres-sion levels of their target genes Hence prediction of transcription factor binding sites (TFBSs) provides a solid foundationfor inferring gene regulatory mechanisms and building regulatory networks for a genome Chromatin immunoprecipitationsequencing (ChIP-seq) technology can generate large-scale experimental data for such proteinndashDNA interactions providingan unprecedented opportunity to identify TFBSs (aka cis-regulatory motifs) The bottleneck however is the lack of robustmathematical models as well as efficient computational methods for TFBS prediction to make effective use of massiveChIP-seq data sets in the public domain The purpose of this study is to review existing motif-finding methods for ChIP-seqdata from an algorithmic perspective and provide new computational insight into this field The state-of-the-art methodswere shown through summarizing eight representative motif-finding algorithms along with corresponding challenges andintroducing some important relative functions according to specific biological demands including discriminative motiffinding and cofactor motifs analysis Finally potential directions and plans for ChIP-seq-based motif-finding tools wereshowcased in support of future algorithm development
Key words cis-regulatory elements ChIP-seq motif finding algorithm
Introduction
Cis-regulatory motifs are usually conserved short DNA se-quences which tend to be 8ndash20 bp long [1] Typically they aretranscription factor binding sites (TFBSs) and play significantroles in regulating transcription rates of nearby genes and fur-ther control their expression levels Hence de novo motif predic-tion and related analyses (eg motif scan and comparison)provide a solid foundation for the inference of gene transcrip-tional regulatory mechanisms in both prokaryotic and eukary-otic organisms [2 3] Moreover these techniques alsosubstantially contribute to some system-level studies such asregulon modeling and regulatory network construction [2 4 5]With the rapidly growing availability of sequenced genomesand advanced biotechnologies substantial computational tech-niques have been carried out to identify motifs from query DNA
sequences Nevertheless the variations among motifs and theirshort length make their discovery a challenging problem
Substantial efforts have been devoted in seeking a reliableand efficient way for motif identification over the past few dec-ades Since the 1980s identifying motifs in provided promotershas been one of the most prevalent approaches and numeroustools have been developed [6ndash13] such as AlignACEBioProspector CONSENSUS MDscan MEME CUBIC MDscanand BOBRO [10 11 13ndash24] Some of these tools have been suc-cessfully applied to various organisms for regulatory networkconstruction [2 5] The underlying mechanism is that the co-regulated genes should exhibit overrepresented common motifsin their promoter regions Although considerable efforts havebeen made one non-negligible limitation is the high false-posi-tive rates in predictions [9 25ndash27] Under the assumption that
Bingqiang Liu is an associate professor in School of Mathematics at Shandong University Jinan Shandong P R ChinaJinyu Yang is a PhD student in Department of Mathematics and Statistics at South Dakota State University Brookings SD USAYang Li is a PhD student in School of Mathematics at Shandong University Jinan Shandong P R ChinaAdam McDermaid is a PhD student in Department of Mathematics and Statistics at South Dakota State University Brookings SD USAQin Ma is the director of Bioinformatics and Mathematical Biosciences Laboratory and an assistant professor in Department of Agronomy Horticultureand Plant Science at South Dakota State University SD USA He is also an adjunct assistant professor in BioSNTRSubmitted 20 December 2016 Received (in revised form) 22 February 2017
VC The Author 2017 Published by Oxford University Press All rights reserved For Permissions please email journalspermissionsoupcom
1
Briefings in Bioinformatics 2017 1ndash13
doi 101093bibbbx026Software Review
the motifs in promoters tend to evolve at a lower rate and there-fore be more conserved than nonfunctional surrounding se-quences some phylogenetic footprinting-based algorithmshave been developed to reduce the false-positive rate such asPhyloGibbs Footprinter PhyloCon and MicroFootprinter [14 28ndash32] A phylogenetic footprinting strategy was firstly proposed in1988 [33 34] and has significantly improved the state-of-the-artperformance in this field However the majority of programsbased on phylogenetic footprinting did not make full use of thephylogenetic relationship of query promoter sequences fromvarious genomes [21] Owing to this limitation some promotersfrom highly divergent species could be included and the motifinstances are not conserved enough to carry out motif predic-tion [35ndash37] Most recently Liu et al [4 38] developed two com-putational pipelines aiming to break this bottleneckSpecifically they extracted phylogenetic relationships fromregulatory sequences using a combinatorial framework basedon 216 selected representative genomes to refine the ortholo-gous promoter set It is noteworthy that all the methods men-tioned above could be potentially improved by integratingadditional experimental data
The rapid development of high-throughput biotechnologies[39ndash48] has provided new insight and powerful support for regu-latory mechanism analyses and genome-scale regulatory net-work elucidation In particular chromatin immunoprecipitationsequencing(ie ChIP-seq followed by high-throughput DNAsequencing) provides massive proteinndashDNA interactive infor-mation and has been successfully applied to genome-wide ana-lyses of transcription factor (TF) binding histone modificationmarkers and polymerase binding [39 49] This technology usesone chosen protein (ie TF histone or DNA polymerase of inter-est) to bind the whole-genome sequences [50 51] then somephysical techniques such as ultrasonic are used to shear thecross-linked DNA sequences into segments after proteinndashDNAinteraction [39] finally these segments will be sequenced intoshort reads [52 53] These reads could be mapped onto their ref-erence genome if available using Bowtie [54] BWA [55] etcBased on the mapping results the motif-enriched genomic re-gions could be identified by peak-calling tools [56] such as SPP[57] MACS [58] CisGenome [59] FindPeaks [60] QuEST [61] andPeakRanger [62] These regions will be the potential bindingsites of the chosen protein
Recent studies suggest that ChIP-seq can be effectively inte-grated into and benefit TFBS discovery tools [57 61 63ndash73] Itprovides high-throughput motif signals and allows genome-scale discovery in a cell More accurate binding regions (peaks)can be derived from ChIP-seq experiments thus leading tomore reliable prediction performance [9] However the peaksdetected from ChIP-seq data can be up to a few hundred basepairs while the documented cis-regulatory motifs are usuallyonly as long as 8ndash20 bp [74] Therefore an ab initio motif discov-ery method is still indispensable to (i) identify the accuratebinding sites from these ChIP-seq peaks and (ii) build con-served motif profiles for further study in transcriptional regula-tion Unfortunately some widely used motif discovery toolseg MEME and WEEDER [75] cannot be directly used on ChIP-seq peaks as they are designed for co-regulated promoter se-quences with limited size Recently some efforts have beenmade to rectify this problem by modifying traditional motif-finding tools to adapt to the ChIP-seq data [68 74 76] or design-ing specific strategies for ChIP-seq-based motif finding [73 77]The computational challenges of these tools include but notlimited to (i) huge amounts of sequenced ChIP-seq reads canmake motif finding a computationally infeasible problem [9] (ii)
failure to identify the motifs associated with cofactors of theChIP-ed TF [73] or cis-regulatory modules [78] (iii) lack of insightin integration of ChIP-seq data sets from multiple TFs [79] (iv)the traditional false-positive issue in motif prediction causedby the noise in ChIP-seq technology [74] (v) lack of an efficientway to determine the correct lengths of motifs except exhaust-ively enumerating each length within an interval [18 80 81]and (vi) weak support in elucidation of the mutual interactionsamong multiple motifs from larger ChIP-ed data sets [82ndash84]which is important in disease diagnosis through gene regula-tory network construction
So far there have been a few reviews on motif findingfocused on the analysis of ChIP-seq data [77 85 86] Tran et al[77] introduced nine Web tools with numerous details in theirusages and applications For each of them the authors show-cased the requirement of the input format default parametersoutput format basic characteristics of the tool and a brief intro-duction of the algorithm Rather than focusing on individual al-gorithms or tools Lihu et al [85] were dedicated to reviewingseven ensemble methods with input and output requirementsand a brief introduction of the included algorithms Kulakovskiyet al provided a systematic history of tool development in motiffinding before the ChIP-seq technology and advantages as wellas computational challenges of using ChIP-seq in motif findingSeveral ChIP-seq-based methods and their applications were re-viewed focusing on their overall stories or workflows [86]However it is still under investigation in a detailed algorithmicaspect as the performance of existing methods is far from satis-factory based on both efficiency and performance From the al-gorithmic point of view this article presents the mainchallenges and characteristics of de novo motif finding in ChIP-seq peaks through systematically reviewing eight representa-tive tools (Table 1) Specifically the review mainly focuses onthe motif-finding techniques adopted by these methods as wellas such additional specific functions as discriminative motiffinding and cofactor motif identification Through summarizingthe existing limitations and revealing algorithmic potentialsseveral promising directions for further improvements of ChIP-seq-based motif finding have been proposed both in methodo-logical aspects and concrete applications
Motif representation and identification
A motif represents a set of DNA segments with the same lengthwhich are binding sites for the same TF Each segment of themotif is called an instance and different instances of the samemotif tend to be similar with each other on sequence level(Figure 1A) A representation model of a motif to demonstratethe similarity of its instances is expected to accurately capturethe characteristics of proteinndashDNA binding activity of its corres-ponding TF [90]
The most straightforward model to denote the binding pref-erence of a TF on each position along a motif is the lsquoconsensusrsquosequence (eg AGTCA or AGTCG for the motif in Figure 1A) whichis composed of the concatenation of the most frequent nucleo-tide on each position It can be seen as the ancestor of the bind-ing sites of the same TF with an assumption that these sitesevolved from it Although the consensus presents the character-istics of a motif in each position in a simple and clear way thevariations in this motif are absent in this model The lsquodegener-ate consensusrsquo was proposed to fill this gap using IUPAC wildcards to replace the exact nucleotides (A G C and T) For ex-ample W means both A and T in this position could be recog-nized by the TF of this motif (Figure 1B) [9]
2 | Liu et al
A more accurate and most commonly used model is thelsquomotif profilersquo A profile is built by aligning the available in-stances of a motif M and counting the frequency of each nucleo-tide at each position (fij) These frequencies give rise to a typicalmatrix representation of a motif profile (Mffrac14 fij4 l Figure 1A)called a lsquoposition weight matrixrsquo (PWM) An alternative way ofconstructing the PWM is using the probability distribution to re-place frequencies (Mpfrac14 pij4 l) Specifically these frequencieswill be divided by the number of binding sites of this motif andsuch a representation of the PWM in Figure 1A is shown inEquation (1)
Mp frac14 fpijg frac14
06 0 0 02 04
0 06 0 0 04
0 04 0 08 02
04 0 1 0 0
2666664
3777775
(1)
Taking the background frequencies of each nucleotides intoconsideration the PWM in Equation (1) can be further modifiedas Mgfrac14 gij where gijfrac14 log(fijbi) and biis the probability of theith nucleotide of (A G C and T) appearing in the background se-quences Based on this version of PWM we can calculate thelsquoinformation contentrsquo (IC) to evaluate how conserved this motifis ie Equation (2)
I Meth THORN frac14X1
jfrac141
X4
ifrac141
pij logpij
bi(2)
In addition the matrix can also be used to evaluate how agiven DNA segment s with length l is consistent with motif Mby calculating the score
score seth THORN frac14Xl
jfrac141
X4
ifrac141
pij logpij
bi dij (3)
where dijfrac14 1 if the jth nucleotide of s is the ith nucleotide of (AG C and T) dijfrac14 0 otherwise A problem of this model is that theprobability fij could be zero for a small set of binding sites giv-ing rise to negative infinity in Equations (2) and (3) A commonmethod to avoid this bias is adding a certain valueT
able
1Fe
atu
res
of
the
too
lsfo
rC
hIP
-seq
mo
tif
fin
din
g
Pro
gram
sM
ICSA
Ch
IPM
un
kD
REM
ER
SAT
pea
k-m
oti
fsFM
oti
fSI
OM
ICS
Dis
cro
ver
RPM
CM
C
Wo
rd-b
ased
Yes
Yes
Yes
Yes
Yes
Pro
file
-bas
edY
esY
esY
esT
ech
niq
ues
and
crit
eria
MEM
EthornPS
SMsc
ore
IC-l
ike
fun
ctio
ns
thornse
qu
ence
wei
ght
Wo
rdco
un
tthorn
Fish
erte
stR
SAT
oli
go-a
nal
ysis
d
yad
-an
alys
isl
oca
lwo
rdan
alys
isM
EME
and
Ch
IPM
un
k
Wo
rden
um
erat
ion
thornsu
ffix
treethorn
IC-b
ased
z-sc
ore
IC-l
ike
fun
ctio
ns
thorntr
eem
eth
od
sH
MM thornm
ult
iple
test
ing
corr
ecte
dP-
valu
e
MC
MC
Dis
crim
inat
ive
mo
tif
fin
din
gY
esY
esY
es
Co
fact
ors
Mu
ltip
lem
oti
fsM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fm
od
ule
sM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fco
mp
aris
on
RSA
Tco
mp
are-
mo
tifs
STA
MP
TO
MT
OM
Yea
ro
fre
leas
ean
dre
fere
nce
s20
10[7
9]20
10[6
9]20
11[7
3]20
12[8
7]20
14[7
4]20
14[7
8]20
14[8
8]20
15[8
9]
A B
C
Figure 1 Representation of a motif (A) an example of motif consensus degener-
ate consensus and profile (B) a full list of wild cards in the degenerate consen-
sus (C) a different motif but has same profile with motif in (A)
Motif Prediction based on ChIP-seq data | 3
(pseudocount) for each position of the motif [91] Through sys-tem simulation and analysis Nishida et al found that the opti-mal pseudocount value is correlated with the entropy of a motifprofile Specifically the less conserved motif profiles preferlarger pseudocount value and 08 is suggested in general
As shown above multiple approaches for modeling TFrsquosDNA-binding specificity have been developed A systematiccomparison of these approaches can provide substantially valu-able information for further motif-finding algorithm designDREAM5 Consortium organized a competition on motif repre-sentation models by applying 26 approaches to in vitro protein-binding microarray data [92] These approaches adopt variousstrategies including but not limited to k-mers model PWMhidden Markov models (HMMs) and dinucleotides The k-mersand PWM are the two main strategies which show similar aver-age performance on multiple data sets However they have sub-stantial differences in terms of individual performance on a fewdata sets indicating that motif finding is sensitive for model se-lection Another interesting observation is that the IC of a motifmay not fully represent its accuracy It is obviously contrarywith the basic principle of general motif-finding tools thusdeeper investigation into this area is still needed to improve themotif representation models
These motif representation models are still not perfect witha common disadvantage that they ignore the correlation amongdifferent positions in a motif For example the motif in Figure1C has the same PWM as the motif in Figure 1a but apparentlythe first two positions in this motif are correlated and depend-ent to each other Hence a high order Markov model is sug-gested to be integrated into PWM matrix [24] Meanwhile it isunsure whether a known PWM can fairly represent the wholepopulation scenario as the frequencies of nucleotides in eachposition are calculated only from the known binding sites of aTF A motif profile built on partially identified bindings sites of aTF may induce bias when it is used to interpret the global bind-ing preference especially when this profile is used to model theorthologous binding sites from various species A more funda-mental debate is Do the nucleotides with lower frequenciesimply lower binding ability At the time of this writing thereare still no clear answers to this question and deeper thoughtabout above concerns will bring potential ways to improveexisting representation models of cis-regulatory motifs
Motif signal detection techniques andperformance evaluation
The basic computational assumption of motif identification isthat they are overrepresented as conserved patterns in given se-quences The scattered instances of a motif are not perfectlyidentical but similar with each other Once identified and wellaligned they will show significance in conservation comparedwith background sequences Therefore finding and aligningthese instances are the primary issues and remain a challengein motif finding ChIP-seq data supply enriched resource formotif finding yet make this problem more challenging as thefull-size peak sequences from a ChIP-seq experiment are muchlarger than co-regulated promoters in traditional motif finding
Motif-finding methods mainly fall into two categories word-based methods (ie consensus-based) and profile-based meth-ods [9 13] Word-based methods usually enumerate and com-pare nucleotides starting from a consensus sequence with afixed length and a tolerance of mutations Theoretically thisstrategy is able to identify global optimal solutions but suffers
from high computational complexity when it is applied to large-scale input data or when to-be-identified motifs are relativelylong and with a large number of mutations [13] The profile-based methods usually start with some aligned patterns eitherrandomly chosen [93] or enumerated in a limited subset of inputdata [93] and refine them based on some criteria on the wholedata These criteria are designed to evaluate the overrepre-sented significance of aligned profiles from the input se-quences Improvements are mostly conducted in a heuristicway eg neighboring improvement (add or delete patterns tosee if the profile goes better similar to a hill-climbing method)or iterative statistical methods [Gibbs sampling or expectationmaximization (EM)] The profile-based methods usually runfaster than word-based methods and have better performancein predicting motifs with complex mutations However thiskind of methods tend to fail in detection of multiple motifs es-pecially when the data size is large as the iterative procedurethey adopted often falls into local optimizations which is diffi-cult to escape [89]
Word-based methods
The dominant model of these methods is the (l d)-motifs where lis the width of a motif and d is the maximum number of muta-tions between a motif instance and the consensus sequence [7374 76 78 88 94] Intuitively the goal is to find potential (l d)-motifs that are significantly presented in the input sequencesThese methods can be further divided into two categoriespattern-driven and sample-driven [74] The former enumerates all4l l-mers on character set (A G C and T) as consensus sequences tobuild candidate motifs The latter only uses real l-mers from inputdata to generate all possible (l d)-motifs Five algorithms specific-ally designed for motif finding in ChIP-seq peaks FMotif [74]DREME [73] RSAT peak-motifs [76] SIOMICS [78 94] and Discrover[88] are reviewed with more details as follows
DREME [73] is a sample-driven method which exhaustivelysearches for exact words and heuristically expands them towords with wild cards DREME requires a positive data set iethe peaks derived from a ChIP-seq data set and a negative dataset ie different peaks from other ChIP-seq experiments or ran-domly shuffled version of the positive data It takes all l-mers inthe positive data set and counts the number of their occurrencein the two data sets The top 100 words that are significantlyoverrepresented as determined by Fisherrsquos exact test are takenas seed motifs DREME then expands the seed motifs by allow-ing exactly one additional wild card The newly generatedwords are evaluated and only the top 100 most significantwords are kept To save time in the word generation stepDREME estimates the number of sequences matching generatedmotifs rather than searching explicitly for them in the data setHowever this program prefers short motifs (with length 4ndash10 bp) hence is more suitable for monomeric eukaryotic TFs
The RSAT package [87] provides a pipeline lsquopeak-motifrsquo todiscover motifs in ChIP-seq peaks [76] It integrates a set of try-and-test algorithms using complementary criteria to detect ex-ceptional words ie lsquooligo-analysisrsquo lsquodyad-analysisrsquo lsquoposition-analysisrsquo and lsquolocal-word-analysisrsquo [95ndash97] It is a sample-drivenmethod starting by counting all 6ndash7 bp long nucleotide occur-rences within the input regulatory sequence set and estimatingtheir statistical significances The expected word frequencieswere estimated with a 1st-order or 2nd-order Markov model bydefault A lower order can be chosen to improve the sensitivityof frequency estimation in a smaller data set Finally the sig-nificant words are assembled and converted to PWMs for
4 | Liu et al
further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides
FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information
The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)
log ethnTHORNl
Xl
jfrac141
X4
i
pij log pij 1n
Xallsegments
log p0 seth THORNeth THORN
24
35 (4)
where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank
similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors
The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding
Profile-based methods
Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods
Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]
ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)
Motif Prediction based on ChIP-seq data | 5
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
the motifs in promoters tend to evolve at a lower rate and there-fore be more conserved than nonfunctional surrounding se-quences some phylogenetic footprinting-based algorithmshave been developed to reduce the false-positive rate such asPhyloGibbs Footprinter PhyloCon and MicroFootprinter [14 28ndash32] A phylogenetic footprinting strategy was firstly proposed in1988 [33 34] and has significantly improved the state-of-the-artperformance in this field However the majority of programsbased on phylogenetic footprinting did not make full use of thephylogenetic relationship of query promoter sequences fromvarious genomes [21] Owing to this limitation some promotersfrom highly divergent species could be included and the motifinstances are not conserved enough to carry out motif predic-tion [35ndash37] Most recently Liu et al [4 38] developed two com-putational pipelines aiming to break this bottleneckSpecifically they extracted phylogenetic relationships fromregulatory sequences using a combinatorial framework basedon 216 selected representative genomes to refine the ortholo-gous promoter set It is noteworthy that all the methods men-tioned above could be potentially improved by integratingadditional experimental data
The rapid development of high-throughput biotechnologies[39ndash48] has provided new insight and powerful support for regu-latory mechanism analyses and genome-scale regulatory net-work elucidation In particular chromatin immunoprecipitationsequencing(ie ChIP-seq followed by high-throughput DNAsequencing) provides massive proteinndashDNA interactive infor-mation and has been successfully applied to genome-wide ana-lyses of transcription factor (TF) binding histone modificationmarkers and polymerase binding [39 49] This technology usesone chosen protein (ie TF histone or DNA polymerase of inter-est) to bind the whole-genome sequences [50 51] then somephysical techniques such as ultrasonic are used to shear thecross-linked DNA sequences into segments after proteinndashDNAinteraction [39] finally these segments will be sequenced intoshort reads [52 53] These reads could be mapped onto their ref-erence genome if available using Bowtie [54] BWA [55] etcBased on the mapping results the motif-enriched genomic re-gions could be identified by peak-calling tools [56] such as SPP[57] MACS [58] CisGenome [59] FindPeaks [60] QuEST [61] andPeakRanger [62] These regions will be the potential bindingsites of the chosen protein
Recent studies suggest that ChIP-seq can be effectively inte-grated into and benefit TFBS discovery tools [57 61 63ndash73] Itprovides high-throughput motif signals and allows genome-scale discovery in a cell More accurate binding regions (peaks)can be derived from ChIP-seq experiments thus leading tomore reliable prediction performance [9] However the peaksdetected from ChIP-seq data can be up to a few hundred basepairs while the documented cis-regulatory motifs are usuallyonly as long as 8ndash20 bp [74] Therefore an ab initio motif discov-ery method is still indispensable to (i) identify the accuratebinding sites from these ChIP-seq peaks and (ii) build con-served motif profiles for further study in transcriptional regula-tion Unfortunately some widely used motif discovery toolseg MEME and WEEDER [75] cannot be directly used on ChIP-seq peaks as they are designed for co-regulated promoter se-quences with limited size Recently some efforts have beenmade to rectify this problem by modifying traditional motif-finding tools to adapt to the ChIP-seq data [68 74 76] or design-ing specific strategies for ChIP-seq-based motif finding [73 77]The computational challenges of these tools include but notlimited to (i) huge amounts of sequenced ChIP-seq reads canmake motif finding a computationally infeasible problem [9] (ii)
failure to identify the motifs associated with cofactors of theChIP-ed TF [73] or cis-regulatory modules [78] (iii) lack of insightin integration of ChIP-seq data sets from multiple TFs [79] (iv)the traditional false-positive issue in motif prediction causedby the noise in ChIP-seq technology [74] (v) lack of an efficientway to determine the correct lengths of motifs except exhaust-ively enumerating each length within an interval [18 80 81]and (vi) weak support in elucidation of the mutual interactionsamong multiple motifs from larger ChIP-ed data sets [82ndash84]which is important in disease diagnosis through gene regula-tory network construction
So far there have been a few reviews on motif findingfocused on the analysis of ChIP-seq data [77 85 86] Tran et al[77] introduced nine Web tools with numerous details in theirusages and applications For each of them the authors show-cased the requirement of the input format default parametersoutput format basic characteristics of the tool and a brief intro-duction of the algorithm Rather than focusing on individual al-gorithms or tools Lihu et al [85] were dedicated to reviewingseven ensemble methods with input and output requirementsand a brief introduction of the included algorithms Kulakovskiyet al provided a systematic history of tool development in motiffinding before the ChIP-seq technology and advantages as wellas computational challenges of using ChIP-seq in motif findingSeveral ChIP-seq-based methods and their applications were re-viewed focusing on their overall stories or workflows [86]However it is still under investigation in a detailed algorithmicaspect as the performance of existing methods is far from satis-factory based on both efficiency and performance From the al-gorithmic point of view this article presents the mainchallenges and characteristics of de novo motif finding in ChIP-seq peaks through systematically reviewing eight representa-tive tools (Table 1) Specifically the review mainly focuses onthe motif-finding techniques adopted by these methods as wellas such additional specific functions as discriminative motiffinding and cofactor motif identification Through summarizingthe existing limitations and revealing algorithmic potentialsseveral promising directions for further improvements of ChIP-seq-based motif finding have been proposed both in methodo-logical aspects and concrete applications
Motif representation and identification
A motif represents a set of DNA segments with the same lengthwhich are binding sites for the same TF Each segment of themotif is called an instance and different instances of the samemotif tend to be similar with each other on sequence level(Figure 1A) A representation model of a motif to demonstratethe similarity of its instances is expected to accurately capturethe characteristics of proteinndashDNA binding activity of its corres-ponding TF [90]
The most straightforward model to denote the binding pref-erence of a TF on each position along a motif is the lsquoconsensusrsquosequence (eg AGTCA or AGTCG for the motif in Figure 1A) whichis composed of the concatenation of the most frequent nucleo-tide on each position It can be seen as the ancestor of the bind-ing sites of the same TF with an assumption that these sitesevolved from it Although the consensus presents the character-istics of a motif in each position in a simple and clear way thevariations in this motif are absent in this model The lsquodegener-ate consensusrsquo was proposed to fill this gap using IUPAC wildcards to replace the exact nucleotides (A G C and T) For ex-ample W means both A and T in this position could be recog-nized by the TF of this motif (Figure 1B) [9]
2 | Liu et al
A more accurate and most commonly used model is thelsquomotif profilersquo A profile is built by aligning the available in-stances of a motif M and counting the frequency of each nucleo-tide at each position (fij) These frequencies give rise to a typicalmatrix representation of a motif profile (Mffrac14 fij4 l Figure 1A)called a lsquoposition weight matrixrsquo (PWM) An alternative way ofconstructing the PWM is using the probability distribution to re-place frequencies (Mpfrac14 pij4 l) Specifically these frequencieswill be divided by the number of binding sites of this motif andsuch a representation of the PWM in Figure 1A is shown inEquation (1)
Mp frac14 fpijg frac14
06 0 0 02 04
0 06 0 0 04
0 04 0 08 02
04 0 1 0 0
2666664
3777775
(1)
Taking the background frequencies of each nucleotides intoconsideration the PWM in Equation (1) can be further modifiedas Mgfrac14 gij where gijfrac14 log(fijbi) and biis the probability of theith nucleotide of (A G C and T) appearing in the background se-quences Based on this version of PWM we can calculate thelsquoinformation contentrsquo (IC) to evaluate how conserved this motifis ie Equation (2)
I Meth THORN frac14X1
jfrac141
X4
ifrac141
pij logpij
bi(2)
In addition the matrix can also be used to evaluate how agiven DNA segment s with length l is consistent with motif Mby calculating the score
score seth THORN frac14Xl
jfrac141
X4
ifrac141
pij logpij
bi dij (3)
where dijfrac14 1 if the jth nucleotide of s is the ith nucleotide of (AG C and T) dijfrac14 0 otherwise A problem of this model is that theprobability fij could be zero for a small set of binding sites giv-ing rise to negative infinity in Equations (2) and (3) A commonmethod to avoid this bias is adding a certain valueT
able
1Fe
atu
res
of
the
too
lsfo
rC
hIP
-seq
mo
tif
fin
din
g
Pro
gram
sM
ICSA
Ch
IPM
un
kD
REM
ER
SAT
pea
k-m
oti
fsFM
oti
fSI
OM
ICS
Dis
cro
ver
RPM
CM
C
Wo
rd-b
ased
Yes
Yes
Yes
Yes
Yes
Pro
file
-bas
edY
esY
esY
esT
ech
niq
ues
and
crit
eria
MEM
EthornPS
SMsc
ore
IC-l
ike
fun
ctio
ns
thornse
qu
ence
wei
ght
Wo
rdco
un
tthorn
Fish
erte
stR
SAT
oli
go-a
nal
ysis
d
yad
-an
alys
isl
oca
lwo
rdan
alys
isM
EME
and
Ch
IPM
un
k
Wo
rden
um
erat
ion
thornsu
ffix
treethorn
IC-b
ased
z-sc
ore
IC-l
ike
fun
ctio
ns
thorntr
eem
eth
od
sH
MM thornm
ult
iple
test
ing
corr
ecte
dP-
valu
e
MC
MC
Dis
crim
inat
ive
mo
tif
fin
din
gY
esY
esY
es
Co
fact
ors
Mu
ltip
lem
oti
fsM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fm
od
ule
sM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fco
mp
aris
on
RSA
Tco
mp
are-
mo
tifs
STA
MP
TO
MT
OM
Yea
ro
fre
leas
ean
dre
fere
nce
s20
10[7
9]20
10[6
9]20
11[7
3]20
12[8
7]20
14[7
4]20
14[7
8]20
14[8
8]20
15[8
9]
A B
C
Figure 1 Representation of a motif (A) an example of motif consensus degener-
ate consensus and profile (B) a full list of wild cards in the degenerate consen-
sus (C) a different motif but has same profile with motif in (A)
Motif Prediction based on ChIP-seq data | 3
(pseudocount) for each position of the motif [91] Through sys-tem simulation and analysis Nishida et al found that the opti-mal pseudocount value is correlated with the entropy of a motifprofile Specifically the less conserved motif profiles preferlarger pseudocount value and 08 is suggested in general
As shown above multiple approaches for modeling TFrsquosDNA-binding specificity have been developed A systematiccomparison of these approaches can provide substantially valu-able information for further motif-finding algorithm designDREAM5 Consortium organized a competition on motif repre-sentation models by applying 26 approaches to in vitro protein-binding microarray data [92] These approaches adopt variousstrategies including but not limited to k-mers model PWMhidden Markov models (HMMs) and dinucleotides The k-mersand PWM are the two main strategies which show similar aver-age performance on multiple data sets However they have sub-stantial differences in terms of individual performance on a fewdata sets indicating that motif finding is sensitive for model se-lection Another interesting observation is that the IC of a motifmay not fully represent its accuracy It is obviously contrarywith the basic principle of general motif-finding tools thusdeeper investigation into this area is still needed to improve themotif representation models
These motif representation models are still not perfect witha common disadvantage that they ignore the correlation amongdifferent positions in a motif For example the motif in Figure1C has the same PWM as the motif in Figure 1a but apparentlythe first two positions in this motif are correlated and depend-ent to each other Hence a high order Markov model is sug-gested to be integrated into PWM matrix [24] Meanwhile it isunsure whether a known PWM can fairly represent the wholepopulation scenario as the frequencies of nucleotides in eachposition are calculated only from the known binding sites of aTF A motif profile built on partially identified bindings sites of aTF may induce bias when it is used to interpret the global bind-ing preference especially when this profile is used to model theorthologous binding sites from various species A more funda-mental debate is Do the nucleotides with lower frequenciesimply lower binding ability At the time of this writing thereare still no clear answers to this question and deeper thoughtabout above concerns will bring potential ways to improveexisting representation models of cis-regulatory motifs
Motif signal detection techniques andperformance evaluation
The basic computational assumption of motif identification isthat they are overrepresented as conserved patterns in given se-quences The scattered instances of a motif are not perfectlyidentical but similar with each other Once identified and wellaligned they will show significance in conservation comparedwith background sequences Therefore finding and aligningthese instances are the primary issues and remain a challengein motif finding ChIP-seq data supply enriched resource formotif finding yet make this problem more challenging as thefull-size peak sequences from a ChIP-seq experiment are muchlarger than co-regulated promoters in traditional motif finding
Motif-finding methods mainly fall into two categories word-based methods (ie consensus-based) and profile-based meth-ods [9 13] Word-based methods usually enumerate and com-pare nucleotides starting from a consensus sequence with afixed length and a tolerance of mutations Theoretically thisstrategy is able to identify global optimal solutions but suffers
from high computational complexity when it is applied to large-scale input data or when to-be-identified motifs are relativelylong and with a large number of mutations [13] The profile-based methods usually start with some aligned patterns eitherrandomly chosen [93] or enumerated in a limited subset of inputdata [93] and refine them based on some criteria on the wholedata These criteria are designed to evaluate the overrepre-sented significance of aligned profiles from the input se-quences Improvements are mostly conducted in a heuristicway eg neighboring improvement (add or delete patterns tosee if the profile goes better similar to a hill-climbing method)or iterative statistical methods [Gibbs sampling or expectationmaximization (EM)] The profile-based methods usually runfaster than word-based methods and have better performancein predicting motifs with complex mutations However thiskind of methods tend to fail in detection of multiple motifs es-pecially when the data size is large as the iterative procedurethey adopted often falls into local optimizations which is diffi-cult to escape [89]
Word-based methods
The dominant model of these methods is the (l d)-motifs where lis the width of a motif and d is the maximum number of muta-tions between a motif instance and the consensus sequence [7374 76 78 88 94] Intuitively the goal is to find potential (l d)-motifs that are significantly presented in the input sequencesThese methods can be further divided into two categoriespattern-driven and sample-driven [74] The former enumerates all4l l-mers on character set (A G C and T) as consensus sequences tobuild candidate motifs The latter only uses real l-mers from inputdata to generate all possible (l d)-motifs Five algorithms specific-ally designed for motif finding in ChIP-seq peaks FMotif [74]DREME [73] RSAT peak-motifs [76] SIOMICS [78 94] and Discrover[88] are reviewed with more details as follows
DREME [73] is a sample-driven method which exhaustivelysearches for exact words and heuristically expands them towords with wild cards DREME requires a positive data set iethe peaks derived from a ChIP-seq data set and a negative dataset ie different peaks from other ChIP-seq experiments or ran-domly shuffled version of the positive data It takes all l-mers inthe positive data set and counts the number of their occurrencein the two data sets The top 100 words that are significantlyoverrepresented as determined by Fisherrsquos exact test are takenas seed motifs DREME then expands the seed motifs by allow-ing exactly one additional wild card The newly generatedwords are evaluated and only the top 100 most significantwords are kept To save time in the word generation stepDREME estimates the number of sequences matching generatedmotifs rather than searching explicitly for them in the data setHowever this program prefers short motifs (with length 4ndash10 bp) hence is more suitable for monomeric eukaryotic TFs
The RSAT package [87] provides a pipeline lsquopeak-motifrsquo todiscover motifs in ChIP-seq peaks [76] It integrates a set of try-and-test algorithms using complementary criteria to detect ex-ceptional words ie lsquooligo-analysisrsquo lsquodyad-analysisrsquo lsquoposition-analysisrsquo and lsquolocal-word-analysisrsquo [95ndash97] It is a sample-drivenmethod starting by counting all 6ndash7 bp long nucleotide occur-rences within the input regulatory sequence set and estimatingtheir statistical significances The expected word frequencieswere estimated with a 1st-order or 2nd-order Markov model bydefault A lower order can be chosen to improve the sensitivityof frequency estimation in a smaller data set Finally the sig-nificant words are assembled and converted to PWMs for
4 | Liu et al
further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides
FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information
The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)
log ethnTHORNl
Xl
jfrac141
X4
i
pij log pij 1n
Xallsegments
log p0 seth THORNeth THORN
24
35 (4)
where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank
similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors
The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding
Profile-based methods
Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods
Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]
ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)
Motif Prediction based on ChIP-seq data | 5
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
A more accurate and most commonly used model is thelsquomotif profilersquo A profile is built by aligning the available in-stances of a motif M and counting the frequency of each nucleo-tide at each position (fij) These frequencies give rise to a typicalmatrix representation of a motif profile (Mffrac14 fij4 l Figure 1A)called a lsquoposition weight matrixrsquo (PWM) An alternative way ofconstructing the PWM is using the probability distribution to re-place frequencies (Mpfrac14 pij4 l) Specifically these frequencieswill be divided by the number of binding sites of this motif andsuch a representation of the PWM in Figure 1A is shown inEquation (1)
Mp frac14 fpijg frac14
06 0 0 02 04
0 06 0 0 04
0 04 0 08 02
04 0 1 0 0
2666664
3777775
(1)
Taking the background frequencies of each nucleotides intoconsideration the PWM in Equation (1) can be further modifiedas Mgfrac14 gij where gijfrac14 log(fijbi) and biis the probability of theith nucleotide of (A G C and T) appearing in the background se-quences Based on this version of PWM we can calculate thelsquoinformation contentrsquo (IC) to evaluate how conserved this motifis ie Equation (2)
I Meth THORN frac14X1
jfrac141
X4
ifrac141
pij logpij
bi(2)
In addition the matrix can also be used to evaluate how agiven DNA segment s with length l is consistent with motif Mby calculating the score
score seth THORN frac14Xl
jfrac141
X4
ifrac141
pij logpij
bi dij (3)
where dijfrac14 1 if the jth nucleotide of s is the ith nucleotide of (AG C and T) dijfrac14 0 otherwise A problem of this model is that theprobability fij could be zero for a small set of binding sites giv-ing rise to negative infinity in Equations (2) and (3) A commonmethod to avoid this bias is adding a certain valueT
able
1Fe
atu
res
of
the
too
lsfo
rC
hIP
-seq
mo
tif
fin
din
g
Pro
gram
sM
ICSA
Ch
IPM
un
kD
REM
ER
SAT
pea
k-m
oti
fsFM
oti
fSI
OM
ICS
Dis
cro
ver
RPM
CM
C
Wo
rd-b
ased
Yes
Yes
Yes
Yes
Yes
Pro
file
-bas
edY
esY
esY
esT
ech
niq
ues
and
crit
eria
MEM
EthornPS
SMsc
ore
IC-l
ike
fun
ctio
ns
thornse
qu
ence
wei
ght
Wo
rdco
un
tthorn
Fish
erte
stR
SAT
oli
go-a
nal
ysis
d
yad
-an
alys
isl
oca
lwo
rdan
alys
isM
EME
and
Ch
IPM
un
k
Wo
rden
um
erat
ion
thornsu
ffix
treethorn
IC-b
ased
z-sc
ore
IC-l
ike
fun
ctio
ns
thorntr
eem
eth
od
sH
MM thornm
ult
iple
test
ing
corr
ecte
dP-
valu
e
MC
MC
Dis
crim
inat
ive
mo
tif
fin
din
gY
esY
esY
es
Co
fact
ors
Mu
ltip
lem
oti
fsM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fm
od
ule
sM
ult
iple
div
erse
mo
tifs
Mu
ltip
led
iver
sem
oti
fsM
oti
fco
mp
aris
on
RSA
Tco
mp
are-
mo
tifs
STA
MP
TO
MT
OM
Yea
ro
fre
leas
ean
dre
fere
nce
s20
10[7
9]20
10[6
9]20
11[7
3]20
12[8
7]20
14[7
4]20
14[7
8]20
14[8
8]20
15[8
9]
A B
C
Figure 1 Representation of a motif (A) an example of motif consensus degener-
ate consensus and profile (B) a full list of wild cards in the degenerate consen-
sus (C) a different motif but has same profile with motif in (A)
Motif Prediction based on ChIP-seq data | 3
(pseudocount) for each position of the motif [91] Through sys-tem simulation and analysis Nishida et al found that the opti-mal pseudocount value is correlated with the entropy of a motifprofile Specifically the less conserved motif profiles preferlarger pseudocount value and 08 is suggested in general
As shown above multiple approaches for modeling TFrsquosDNA-binding specificity have been developed A systematiccomparison of these approaches can provide substantially valu-able information for further motif-finding algorithm designDREAM5 Consortium organized a competition on motif repre-sentation models by applying 26 approaches to in vitro protein-binding microarray data [92] These approaches adopt variousstrategies including but not limited to k-mers model PWMhidden Markov models (HMMs) and dinucleotides The k-mersand PWM are the two main strategies which show similar aver-age performance on multiple data sets However they have sub-stantial differences in terms of individual performance on a fewdata sets indicating that motif finding is sensitive for model se-lection Another interesting observation is that the IC of a motifmay not fully represent its accuracy It is obviously contrarywith the basic principle of general motif-finding tools thusdeeper investigation into this area is still needed to improve themotif representation models
These motif representation models are still not perfect witha common disadvantage that they ignore the correlation amongdifferent positions in a motif For example the motif in Figure1C has the same PWM as the motif in Figure 1a but apparentlythe first two positions in this motif are correlated and depend-ent to each other Hence a high order Markov model is sug-gested to be integrated into PWM matrix [24] Meanwhile it isunsure whether a known PWM can fairly represent the wholepopulation scenario as the frequencies of nucleotides in eachposition are calculated only from the known binding sites of aTF A motif profile built on partially identified bindings sites of aTF may induce bias when it is used to interpret the global bind-ing preference especially when this profile is used to model theorthologous binding sites from various species A more funda-mental debate is Do the nucleotides with lower frequenciesimply lower binding ability At the time of this writing thereare still no clear answers to this question and deeper thoughtabout above concerns will bring potential ways to improveexisting representation models of cis-regulatory motifs
Motif signal detection techniques andperformance evaluation
The basic computational assumption of motif identification isthat they are overrepresented as conserved patterns in given se-quences The scattered instances of a motif are not perfectlyidentical but similar with each other Once identified and wellaligned they will show significance in conservation comparedwith background sequences Therefore finding and aligningthese instances are the primary issues and remain a challengein motif finding ChIP-seq data supply enriched resource formotif finding yet make this problem more challenging as thefull-size peak sequences from a ChIP-seq experiment are muchlarger than co-regulated promoters in traditional motif finding
Motif-finding methods mainly fall into two categories word-based methods (ie consensus-based) and profile-based meth-ods [9 13] Word-based methods usually enumerate and com-pare nucleotides starting from a consensus sequence with afixed length and a tolerance of mutations Theoretically thisstrategy is able to identify global optimal solutions but suffers
from high computational complexity when it is applied to large-scale input data or when to-be-identified motifs are relativelylong and with a large number of mutations [13] The profile-based methods usually start with some aligned patterns eitherrandomly chosen [93] or enumerated in a limited subset of inputdata [93] and refine them based on some criteria on the wholedata These criteria are designed to evaluate the overrepre-sented significance of aligned profiles from the input se-quences Improvements are mostly conducted in a heuristicway eg neighboring improvement (add or delete patterns tosee if the profile goes better similar to a hill-climbing method)or iterative statistical methods [Gibbs sampling or expectationmaximization (EM)] The profile-based methods usually runfaster than word-based methods and have better performancein predicting motifs with complex mutations However thiskind of methods tend to fail in detection of multiple motifs es-pecially when the data size is large as the iterative procedurethey adopted often falls into local optimizations which is diffi-cult to escape [89]
Word-based methods
The dominant model of these methods is the (l d)-motifs where lis the width of a motif and d is the maximum number of muta-tions between a motif instance and the consensus sequence [7374 76 78 88 94] Intuitively the goal is to find potential (l d)-motifs that are significantly presented in the input sequencesThese methods can be further divided into two categoriespattern-driven and sample-driven [74] The former enumerates all4l l-mers on character set (A G C and T) as consensus sequences tobuild candidate motifs The latter only uses real l-mers from inputdata to generate all possible (l d)-motifs Five algorithms specific-ally designed for motif finding in ChIP-seq peaks FMotif [74]DREME [73] RSAT peak-motifs [76] SIOMICS [78 94] and Discrover[88] are reviewed with more details as follows
DREME [73] is a sample-driven method which exhaustivelysearches for exact words and heuristically expands them towords with wild cards DREME requires a positive data set iethe peaks derived from a ChIP-seq data set and a negative dataset ie different peaks from other ChIP-seq experiments or ran-domly shuffled version of the positive data It takes all l-mers inthe positive data set and counts the number of their occurrencein the two data sets The top 100 words that are significantlyoverrepresented as determined by Fisherrsquos exact test are takenas seed motifs DREME then expands the seed motifs by allow-ing exactly one additional wild card The newly generatedwords are evaluated and only the top 100 most significantwords are kept To save time in the word generation stepDREME estimates the number of sequences matching generatedmotifs rather than searching explicitly for them in the data setHowever this program prefers short motifs (with length 4ndash10 bp) hence is more suitable for monomeric eukaryotic TFs
The RSAT package [87] provides a pipeline lsquopeak-motifrsquo todiscover motifs in ChIP-seq peaks [76] It integrates a set of try-and-test algorithms using complementary criteria to detect ex-ceptional words ie lsquooligo-analysisrsquo lsquodyad-analysisrsquo lsquoposition-analysisrsquo and lsquolocal-word-analysisrsquo [95ndash97] It is a sample-drivenmethod starting by counting all 6ndash7 bp long nucleotide occur-rences within the input regulatory sequence set and estimatingtheir statistical significances The expected word frequencieswere estimated with a 1st-order or 2nd-order Markov model bydefault A lower order can be chosen to improve the sensitivityof frequency estimation in a smaller data set Finally the sig-nificant words are assembled and converted to PWMs for
4 | Liu et al
further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides
FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information
The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)
log ethnTHORNl
Xl
jfrac141
X4
i
pij log pij 1n
Xallsegments
log p0 seth THORNeth THORN
24
35 (4)
where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank
similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors
The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding
Profile-based methods
Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods
Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]
ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)
Motif Prediction based on ChIP-seq data | 5
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
(pseudocount) for each position of the motif [91] Through sys-tem simulation and analysis Nishida et al found that the opti-mal pseudocount value is correlated with the entropy of a motifprofile Specifically the less conserved motif profiles preferlarger pseudocount value and 08 is suggested in general
As shown above multiple approaches for modeling TFrsquosDNA-binding specificity have been developed A systematiccomparison of these approaches can provide substantially valu-able information for further motif-finding algorithm designDREAM5 Consortium organized a competition on motif repre-sentation models by applying 26 approaches to in vitro protein-binding microarray data [92] These approaches adopt variousstrategies including but not limited to k-mers model PWMhidden Markov models (HMMs) and dinucleotides The k-mersand PWM are the two main strategies which show similar aver-age performance on multiple data sets However they have sub-stantial differences in terms of individual performance on a fewdata sets indicating that motif finding is sensitive for model se-lection Another interesting observation is that the IC of a motifmay not fully represent its accuracy It is obviously contrarywith the basic principle of general motif-finding tools thusdeeper investigation into this area is still needed to improve themotif representation models
These motif representation models are still not perfect witha common disadvantage that they ignore the correlation amongdifferent positions in a motif For example the motif in Figure1C has the same PWM as the motif in Figure 1a but apparentlythe first two positions in this motif are correlated and depend-ent to each other Hence a high order Markov model is sug-gested to be integrated into PWM matrix [24] Meanwhile it isunsure whether a known PWM can fairly represent the wholepopulation scenario as the frequencies of nucleotides in eachposition are calculated only from the known binding sites of aTF A motif profile built on partially identified bindings sites of aTF may induce bias when it is used to interpret the global bind-ing preference especially when this profile is used to model theorthologous binding sites from various species A more funda-mental debate is Do the nucleotides with lower frequenciesimply lower binding ability At the time of this writing thereare still no clear answers to this question and deeper thoughtabout above concerns will bring potential ways to improveexisting representation models of cis-regulatory motifs
Motif signal detection techniques andperformance evaluation
The basic computational assumption of motif identification isthat they are overrepresented as conserved patterns in given se-quences The scattered instances of a motif are not perfectlyidentical but similar with each other Once identified and wellaligned they will show significance in conservation comparedwith background sequences Therefore finding and aligningthese instances are the primary issues and remain a challengein motif finding ChIP-seq data supply enriched resource formotif finding yet make this problem more challenging as thefull-size peak sequences from a ChIP-seq experiment are muchlarger than co-regulated promoters in traditional motif finding
Motif-finding methods mainly fall into two categories word-based methods (ie consensus-based) and profile-based meth-ods [9 13] Word-based methods usually enumerate and com-pare nucleotides starting from a consensus sequence with afixed length and a tolerance of mutations Theoretically thisstrategy is able to identify global optimal solutions but suffers
from high computational complexity when it is applied to large-scale input data or when to-be-identified motifs are relativelylong and with a large number of mutations [13] The profile-based methods usually start with some aligned patterns eitherrandomly chosen [93] or enumerated in a limited subset of inputdata [93] and refine them based on some criteria on the wholedata These criteria are designed to evaluate the overrepre-sented significance of aligned profiles from the input se-quences Improvements are mostly conducted in a heuristicway eg neighboring improvement (add or delete patterns tosee if the profile goes better similar to a hill-climbing method)or iterative statistical methods [Gibbs sampling or expectationmaximization (EM)] The profile-based methods usually runfaster than word-based methods and have better performancein predicting motifs with complex mutations However thiskind of methods tend to fail in detection of multiple motifs es-pecially when the data size is large as the iterative procedurethey adopted often falls into local optimizations which is diffi-cult to escape [89]
Word-based methods
The dominant model of these methods is the (l d)-motifs where lis the width of a motif and d is the maximum number of muta-tions between a motif instance and the consensus sequence [7374 76 78 88 94] Intuitively the goal is to find potential (l d)-motifs that are significantly presented in the input sequencesThese methods can be further divided into two categoriespattern-driven and sample-driven [74] The former enumerates all4l l-mers on character set (A G C and T) as consensus sequences tobuild candidate motifs The latter only uses real l-mers from inputdata to generate all possible (l d)-motifs Five algorithms specific-ally designed for motif finding in ChIP-seq peaks FMotif [74]DREME [73] RSAT peak-motifs [76] SIOMICS [78 94] and Discrover[88] are reviewed with more details as follows
DREME [73] is a sample-driven method which exhaustivelysearches for exact words and heuristically expands them towords with wild cards DREME requires a positive data set iethe peaks derived from a ChIP-seq data set and a negative dataset ie different peaks from other ChIP-seq experiments or ran-domly shuffled version of the positive data It takes all l-mers inthe positive data set and counts the number of their occurrencein the two data sets The top 100 words that are significantlyoverrepresented as determined by Fisherrsquos exact test are takenas seed motifs DREME then expands the seed motifs by allow-ing exactly one additional wild card The newly generatedwords are evaluated and only the top 100 most significantwords are kept To save time in the word generation stepDREME estimates the number of sequences matching generatedmotifs rather than searching explicitly for them in the data setHowever this program prefers short motifs (with length 4ndash10 bp) hence is more suitable for monomeric eukaryotic TFs
The RSAT package [87] provides a pipeline lsquopeak-motifrsquo todiscover motifs in ChIP-seq peaks [76] It integrates a set of try-and-test algorithms using complementary criteria to detect ex-ceptional words ie lsquooligo-analysisrsquo lsquodyad-analysisrsquo lsquoposition-analysisrsquo and lsquolocal-word-analysisrsquo [95ndash97] It is a sample-drivenmethod starting by counting all 6ndash7 bp long nucleotide occur-rences within the input regulatory sequence set and estimatingtheir statistical significances The expected word frequencieswere estimated with a 1st-order or 2nd-order Markov model bydefault A lower order can be chosen to improve the sensitivityof frequency estimation in a smaller data set Finally the sig-nificant words are assembled and converted to PWMs for
4 | Liu et al
further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides
FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information
The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)
log ethnTHORNl
Xl
jfrac141
X4
i
pij log pij 1n
Xallsegments
log p0 seth THORNeth THORN
24
35 (4)
where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank
similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors
The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding
Profile-based methods
Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods
Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]
ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)
Motif Prediction based on ChIP-seq data | 5
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides
FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information
The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)
log ethnTHORNl
Xl
jfrac141
X4
i
pij log pij 1n
Xallsegments
log p0 seth THORNeth THORN
24
35 (4)
where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank
similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors
The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding
Profile-based methods
Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods
Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]
ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)
Motif Prediction based on ChIP-seq data | 5
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
KDIC frac14Xl
jfrac141
X4
ifrac141
log fij log N
NXl
jfrac141
X4
ifrac141
fij log bi (5)
where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]
RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]
The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding
Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding
Assessment criteria and performance evaluation
The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales
Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications
Development of advanced functions withbiological insightsDiscriminative motif finding
Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated
6 | Liu et al
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]
Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for
motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]
Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative
Table 2 Performance summary of motif-finding tools on ChIP-seq data
Programs Criteria Preferences Speed Performance and special characteristics
MICSA [79] Speed motifcoverage
Prefers small datasets
Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks
ChIPMunk [69] Speed motifcoverage
NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)
More accurate than MEME
DREME [73] Speed motifcoverage
Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs
Much faster than WEEDERand MEME linear timecomplexity
Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites
RSAT peak-motifs[87]
Speed motifcoverage
Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)
Faster than DREME MEMEand ChIPMunk
High prediction accuracy
FMotif [74] Speed motif cover-age SN and SP onmotif sites
Handles large ChIP-seq data handleslong motifs
Fast but requires morememory
High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences
SIOMICS [78] Speed motif cover-age and specificityon motifs
NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets
Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction
Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel
NA Comparable with DREME inspeed
Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding
RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs
Motif Prediction based on ChIP-seq data | 7
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section
In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent
Find cofactors of TFs from ChIP-seq data
In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs
Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the
number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]
Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency
The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors
8 | Liu et al
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
Citation analysis of ChIP-seq motif-findingtools
To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background
DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is
integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself
Discussion
The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results
0
10
20
30
40
50
60
70
80
90
100
2011 2012 2013 2014 2015 2016
Rela
ve
annu
al ci
tao
n (
)
Year of citaon
RPMCMC 2015[91]
Discrover 2014[93]
SIOMICS 2014[78]
Fmof 2014[74]
RSAT peak-mofs 2012[94]
DREME 2011[73]
ChIPMunk 2010[69]
MICSA 2010[79]
Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-
ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)
0
5
10
15
20
25
30
35
40
45
50
Cita
ons
pery
ears
ince
publ
ished
MICSA 2010[79]
ChIPMunk 2010[69]
DREME 2011[73]
RSAT peak-mofs 2012[94]
Fmof 2014[74]
SIOMICS 2014[78]
Discrover 2014[93]
RPMCMC 2015[91]
MEME-ChIP 2011[68]a
Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs
Motif Prediction based on ChIP-seq data | 9
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs
Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques
As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms
Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis
Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences
Key Points
bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and
clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency
bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms
bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools
bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications
bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy
Funding
The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)
References1 DrsquoHaeseleer P What are DNA sequence motifs Nat
Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks
of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58
3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935
4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030
5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84
6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14
7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78
8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48
10 | Liu et al
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37
10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42
11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8
12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18
13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21
14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80
15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38
16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77
17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9
18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8
19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40
20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6
21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48
22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2
23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12
24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8
25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44
26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32
27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836
28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67
29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23
30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8
31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721
32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72
33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7
34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55
35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9
36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2
37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6
38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578
39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80
40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384
41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63
42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9
43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620
44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9
45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034
46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13
47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85
48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64
49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43
50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016
51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2
52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41
Motif Prediction based on ChIP-seq data | 11
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40
54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25
55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95
56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628
57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9
58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137
59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2
60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30
61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34
62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139
63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903
64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214
65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7
66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18
67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1
68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7
69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3
70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31
71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432
72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67
73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9
74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044
75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of
sequences from co-regulated genes Nucleic Acids Res200432W199ndash203
76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31
77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494
78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35
79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126
80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73
81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94
82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014
83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97
84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90
85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731
86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83
87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6
88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011
89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8
90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23
91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44
92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34
93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36
94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51
95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42
12 | Liu et al
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13
96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18
97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10
98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004
99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580
100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63
101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73
102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92
103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913
104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126
105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17
106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5
107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90
108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385
109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17
110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7
111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62
112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32
113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24
114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63
115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31
116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5
117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9
118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7
119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134
120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27
121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318
122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8
Motif Prediction based on ChIP-seq data | 13