An algorithmic perspective of de novo cis-regulatory motif ... us...sequencing (ChIP-seq) technology...

An algorithmic perspective of de novo cis-regulatory

motif finding based on ChIP-seq dataBingqiang Liu Jinyu Yang Yang Li Adam McDermaid and Qin MaCorresponding author Qin Ma Department of Agronomy Horticulture and Plant Science South Dakota State University Brookings SD 57007 USATel 1-706-254-4293 E-mail qinmasdstateedu

Abstract

Transcription factors are proteins that bind to specific DNA sequences and play important roles in controlling the expres-sion levels of their target genes Hence prediction of transcription factor binding sites (TFBSs) provides a solid foundationfor inferring gene regulatory mechanisms and building regulatory networks for a genome Chromatin immunoprecipitationsequencing (ChIP-seq) technology can generate large-scale experimental data for such proteinndashDNA interactions providingan unprecedented opportunity to identify TFBSs (aka cis-regulatory motifs) The bottleneck however is the lack of robustmathematical models as well as efficient computational methods for TFBS prediction to make effective use of massiveChIP-seq data sets in the public domain The purpose of this study is to review existing motif-finding methods for ChIP-seqdata from an algorithmic perspective and provide new computational insight into this field The state-of-the-art methodswere shown through summarizing eight representative motif-finding algorithms along with corresponding challenges andintroducing some important relative functions according to specific biological demands including discriminative motiffinding and cofactor motifs analysis Finally potential directions and plans for ChIP-seq-based motif-finding tools wereshowcased in support of future algorithm development

Key words cis-regulatory elements ChIP-seq motif finding algorithm

Introduction

Cis-regulatory motifs are usually conserved short DNA se-quences which tend to be 8ndash20 bp long [1] Typically they aretranscription factor binding sites (TFBSs) and play significantroles in regulating transcription rates of nearby genes and fur-ther control their expression levels Hence de novo motif predic-tion and related analyses (eg motif scan and comparison)provide a solid foundation for the inference of gene transcrip-tional regulatory mechanisms in both prokaryotic and eukary-otic organisms [2 3] Moreover these techniques alsosubstantially contribute to some system-level studies such asregulon modeling and regulatory network construction [2 4 5]With the rapidly growing availability of sequenced genomesand advanced biotechnologies substantial computational tech-niques have been carried out to identify motifs from query DNA

sequences Nevertheless the variations among motifs and theirshort length make their discovery a challenging problem

Substantial efforts have been devoted in seeking a reliableand efficient way for motif identification over the past few dec-ades Since the 1980s identifying motifs in provided promotershas been one of the most prevalent approaches and numeroustools have been developed [6ndash13] such as AlignACEBioProspector CONSENSUS MDscan MEME CUBIC MDscanand BOBRO [10 11 13ndash24] Some of these tools have been suc-cessfully applied to various organisms for regulatory networkconstruction [2 5] The underlying mechanism is that the co-regulated genes should exhibit overrepresented common motifsin their promoter regions Although considerable efforts havebeen made one non-negligible limitation is the high false-posi-tive rates in predictions [9 25ndash27] Under the assumption that

Bingqiang Liu is an associate professor in School of Mathematics at Shandong University Jinan Shandong P R ChinaJinyu Yang is a PhD student in Department of Mathematics and Statistics at South Dakota State University Brookings SD USAYang Li is a PhD student in School of Mathematics at Shandong University Jinan Shandong P R ChinaAdam McDermaid is a PhD student in Department of Mathematics and Statistics at South Dakota State University Brookings SD USAQin Ma is the director of Bioinformatics and Mathematical Biosciences Laboratory and an assistant professor in Department of Agronomy Horticultureand Plant Science at South Dakota State University SD USA He is also an adjunct assistant professor in BioSNTRSubmitted 20 December 2016 Received (in revised form) 22 February 2017

VC The Author 2017 Published by Oxford University Press All rights reserved For Permissions please email journalspermissionsoupcom

1

Briefings in Bioinformatics 2017 1ndash13

doi 101093bibbbx026Software Review

the motifs in promoters tend to evolve at a lower rate and there-fore be more conserved than nonfunctional surrounding se-quences some phylogenetic footprinting-based algorithmshave been developed to reduce the false-positive rate such asPhyloGibbs Footprinter PhyloCon and MicroFootprinter [14 28ndash32] A phylogenetic footprinting strategy was firstly proposed in1988 [33 34] and has significantly improved the state-of-the-artperformance in this field However the majority of programsbased on phylogenetic footprinting did not make full use of thephylogenetic relationship of query promoter sequences fromvarious genomes [21] Owing to this limitation some promotersfrom highly divergent species could be included and the motifinstances are not conserved enough to carry out motif predic-tion [35ndash37] Most recently Liu et al [4 38] developed two com-putational pipelines aiming to break this bottleneckSpecifically they extracted phylogenetic relationships fromregulatory sequences using a combinatorial framework basedon 216 selected representative genomes to refine the ortholo-gous promoter set It is noteworthy that all the methods men-tioned above could be potentially improved by integratingadditional experimental data

The rapid development of high-throughput biotechnologies[39ndash48] has provided new insight and powerful support for regu-latory mechanism analyses and genome-scale regulatory net-work elucidation In particular chromatin immunoprecipitationsequencing(ie ChIP-seq followed by high-throughput DNAsequencing) provides massive proteinndashDNA interactive infor-mation and has been successfully applied to genome-wide ana-lyses of transcription factor (TF) binding histone modificationmarkers and polymerase binding [39 49] This technology usesone chosen protein (ie TF histone or DNA polymerase of inter-est) to bind the whole-genome sequences [50 51] then somephysical techniques such as ultrasonic are used to shear thecross-linked DNA sequences into segments after proteinndashDNAinteraction [39] finally these segments will be sequenced intoshort reads [52 53] These reads could be mapped onto their ref-erence genome if available using Bowtie [54] BWA [55] etcBased on the mapping results the motif-enriched genomic re-gions could be identified by peak-calling tools [56] such as SPP[57] MACS [58] CisGenome [59] FindPeaks [60] QuEST [61] andPeakRanger [62] These regions will be the potential bindingsites of the chosen protein

Recent studies suggest that ChIP-seq can be effectively inte-grated into and benefit TFBS discovery tools [57 61 63ndash73] Itprovides high-throughput motif signals and allows genome-scale discovery in a cell More accurate binding regions (peaks)can be derived from ChIP-seq experiments thus leading tomore reliable prediction performance [9] However the peaksdetected from ChIP-seq data can be up to a few hundred basepairs while the documented cis-regulatory motifs are usuallyonly as long as 8ndash20 bp [74] Therefore an ab initio motif discov-ery method is still indispensable to (i) identify the accuratebinding sites from these ChIP-seq peaks and (ii) build con-served motif profiles for further study in transcriptional regula-tion Unfortunately some widely used motif discovery toolseg MEME and WEEDER [75] cannot be directly used on ChIP-seq peaks as they are designed for co-regulated promoter se-quences with limited size Recently some efforts have beenmade to rectify this problem by modifying traditional motif-finding tools to adapt to the ChIP-seq data [68 74 76] or design-ing specific strategies for ChIP-seq-based motif finding [73 77]The computational challenges of these tools include but notlimited to (i) huge amounts of sequenced ChIP-seq reads canmake motif finding a computationally infeasible problem [9] (ii)

failure to identify the motifs associated with cofactors of theChIP-ed TF [73] or cis-regulatory modules [78] (iii) lack of insightin integration of ChIP-seq data sets from multiple TFs [79] (iv)the traditional false-positive issue in motif prediction causedby the noise in ChIP-seq technology [74] (v) lack of an efficientway to determine the correct lengths of motifs except exhaust-ively enumerating each length within an interval [18 80 81]and (vi) weak support in elucidation of the mutual interactionsamong multiple motifs from larger ChIP-ed data sets [82ndash84]which is important in disease diagnosis through gene regula-tory network construction

So far there have been a few reviews on motif findingfocused on the analysis of ChIP-seq data [77 85 86] Tran et al[77] introduced nine Web tools with numerous details in theirusages and applications For each of them the authors show-cased the requirement of the input format default parametersoutput format basic characteristics of the tool and a brief intro-duction of the algorithm Rather than focusing on individual al-gorithms or tools Lihu et al [85] were dedicated to reviewingseven ensemble methods with input and output requirementsand a brief introduction of the included algorithms Kulakovskiyet al provided a systematic history of tool development in motiffinding before the ChIP-seq technology and advantages as wellas computational challenges of using ChIP-seq in motif findingSeveral ChIP-seq-based methods and their applications were re-viewed focusing on their overall stories or workflows [86]However it is still under investigation in a detailed algorithmicaspect as the performance of existing methods is far from satis-factory based on both efficiency and performance From the al-gorithmic point of view this article presents the mainchallenges and characteristics of de novo motif finding in ChIP-seq peaks through systematically reviewing eight representa-tive tools (Table 1) Specifically the review mainly focuses onthe motif-finding techniques adopted by these methods as wellas such additional specific functions as discriminative motiffinding and cofactor motif identification Through summarizingthe existing limitations and revealing algorithmic potentialsseveral promising directions for further improvements of ChIP-seq-based motif finding have been proposed both in methodo-logical aspects and concrete applications

Motif representation and identification

A motif represents a set of DNA segments with the same lengthwhich are binding sites for the same TF Each segment of themotif is called an instance and different instances of the samemotif tend to be similar with each other on sequence level(Figure 1A) A representation model of a motif to demonstratethe similarity of its instances is expected to accurately capturethe characteristics of proteinndashDNA binding activity of its corres-ponding TF [90]

The most straightforward model to denote the binding pref-erence of a TF on each position along a motif is the lsquoconsensusrsquosequence (eg AGTCA or AGTCG for the motif in Figure 1A) whichis composed of the concatenation of the most frequent nucleo-tide on each position It can be seen as the ancestor of the bind-ing sites of the same TF with an assumption that these sitesevolved from it Although the consensus presents the character-istics of a motif in each position in a simple and clear way thevariations in this motif are absent in this model The lsquodegener-ate consensusrsquo was proposed to fill this gap using IUPAC wildcards to replace the exact nucleotides (A G C and T) For ex-ample W means both A and T in this position could be recog-nized by the TF of this motif (Figure 1B) [9]

2 | Liu et al

A more accurate and most commonly used model is thelsquomotif profilersquo A profile is built by aligning the available in-stances of a motif M and counting the frequency of each nucleo-tide at each position (fij) These frequencies give rise to a typicalmatrix representation of a motif profile (Mffrac14 fij4 l Figure 1A)called a lsquoposition weight matrixrsquo (PWM) An alternative way ofconstructing the PWM is using the probability distribution to re-place frequencies (Mpfrac14 pij4 l) Specifically these frequencieswill be divided by the number of binding sites of this motif andsuch a representation of the PWM in Figure 1A is shown inEquation (1)

Mp frac14 fpijg frac14

06 0 0 02 04

0 06 0 0 04

0 04 0 08 02

04 0 1 0 0

2666664

3777775

(1)

Taking the background frequencies of each nucleotides intoconsideration the PWM in Equation (1) can be further modifiedas Mgfrac14 gij where gijfrac14 log(fijbi) and biis the probability of theith nucleotide of (A G C and T) appearing in the background se-quences Based on this version of PWM we can calculate thelsquoinformation contentrsquo (IC) to evaluate how conserved this motifis ie Equation (2)

I Meth THORN frac14X1

jfrac141

X4

ifrac141

pij logpij

bi(2)

In addition the matrix can also be used to evaluate how agiven DNA segment s with length l is consistent with motif Mby calculating the score

score seth THORN frac14Xl

jfrac141

X4

ifrac141

pij logpij

bi dij (3)

where dijfrac14 1 if the jth nucleotide of s is the ith nucleotide of (AG C and T) dijfrac14 0 otherwise A problem of this model is that theprobability fij could be zero for a small set of binding sites giv-ing rise to negative infinity in Equations (2) and (3) A commonmethod to avoid this bias is adding a certain valueT

able

1Fe

atu

res

of

the

too

lsfo

rC

hIP

-seq

mo

tif

fin

din

g

Pro

gram

sM

ICSA

Ch

IPM

un

kD

REM

ER

SAT

pea

k-m

oti

fsFM

oti

fSI

OM

ICS

Dis

cro

ver

RPM

CM

C

Wo

rd-b

ased

Yes

Yes

Yes

Yes

Yes

Pro

file

-bas

edY

esY

esY

esT

ech

niq

ues

and

crit

eria

MEM

EthornPS

SMsc

ore

IC-l

ike

fun

ctio

ns

thornse

qu

ence

wei

ght

Wo

rdco

un

tthorn

Fish

erte

stR

SAT

oli

go-a

nal

ysis

d

yad

-an

alys

isl

oca

lwo

rdan

alys

isM

EME

and

Ch

IPM

un

k

Wo

rden

um

erat

ion

thornsu

ffix

treethorn

IC-b

ased

z-sc

ore

IC-l

ike

fun

ctio

ns

thorntr

eem

eth

od

sH

MM thornm

ult

iple

test

ing

corr

ecte

dP-

valu

e

MC

MC

Dis

crim

inat

ive

mo

tif

fin

din

gY

esY

esY

es

Co

fact

ors

Mu

ltip

lem

oti

fsM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fm

od

ule

sM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fco

mp

aris

on

RSA

Tco

mp

are-

mo

tifs

STA

MP

TO

MT

OM

Yea

ro

fre

leas

ean

dre

fere

nce

s20

10[7

9]20

10[6

9]20

11[7

3]20

12[8

7]20

14[7

4]20

14[7

8]20

14[8

8]20

15[8

9]

A B

C

Figure 1 Representation of a motif (A) an example of motif consensus degener-

ate consensus and profile (B) a full list of wild cards in the degenerate consen-

sus (C) a different motif but has same profile with motif in (A)

Motif Prediction based on ChIP-seq data | 3

(pseudocount) for each position of the motif [91] Through sys-tem simulation and analysis Nishida et al found that the opti-mal pseudocount value is correlated with the entropy of a motifprofile Specifically the less conserved motif profiles preferlarger pseudocount value and 08 is suggested in general

As shown above multiple approaches for modeling TFrsquosDNA-binding specificity have been developed A systematiccomparison of these approaches can provide substantially valu-able information for further motif-finding algorithm designDREAM5 Consortium organized a competition on motif repre-sentation models by applying 26 approaches to in vitro protein-binding microarray data [92] These approaches adopt variousstrategies including but not limited to k-mers model PWMhidden Markov models (HMMs) and dinucleotides The k-mersand PWM are the two main strategies which show similar aver-age performance on multiple data sets However they have sub-stantial differences in terms of individual performance on a fewdata sets indicating that motif finding is sensitive for model se-lection Another interesting observation is that the IC of a motifmay not fully represent its accuracy It is obviously contrarywith the basic principle of general motif-finding tools thusdeeper investigation into this area is still needed to improve themotif representation models

These motif representation models are still not perfect witha common disadvantage that they ignore the correlation amongdifferent positions in a motif For example the motif in Figure1C has the same PWM as the motif in Figure 1a but apparentlythe first two positions in this motif are correlated and depend-ent to each other Hence a high order Markov model is sug-gested to be integrated into PWM matrix [24] Meanwhile it isunsure whether a known PWM can fairly represent the wholepopulation scenario as the frequencies of nucleotides in eachposition are calculated only from the known binding sites of aTF A motif profile built on partially identified bindings sites of aTF may induce bias when it is used to interpret the global bind-ing preference especially when this profile is used to model theorthologous binding sites from various species A more funda-mental debate is Do the nucleotides with lower frequenciesimply lower binding ability At the time of this writing thereare still no clear answers to this question and deeper thoughtabout above concerns will bring potential ways to improveexisting representation models of cis-regulatory motifs

Motif signal detection techniques andperformance evaluation

The basic computational assumption of motif identification isthat they are overrepresented as conserved patterns in given se-quences The scattered instances of a motif are not perfectlyidentical but similar with each other Once identified and wellaligned they will show significance in conservation comparedwith background sequences Therefore finding and aligningthese instances are the primary issues and remain a challengein motif finding ChIP-seq data supply enriched resource formotif finding yet make this problem more challenging as thefull-size peak sequences from a ChIP-seq experiment are muchlarger than co-regulated promoters in traditional motif finding

Motif-finding methods mainly fall into two categories word-based methods (ie consensus-based) and profile-based meth-ods [9 13] Word-based methods usually enumerate and com-pare nucleotides starting from a consensus sequence with afixed length and a tolerance of mutations Theoretically thisstrategy is able to identify global optimal solutions but suffers

from high computational complexity when it is applied to large-scale input data or when to-be-identified motifs are relativelylong and with a large number of mutations [13] The profile-based methods usually start with some aligned patterns eitherrandomly chosen [93] or enumerated in a limited subset of inputdata [93] and refine them based on some criteria on the wholedata These criteria are designed to evaluate the overrepre-sented significance of aligned profiles from the input se-quences Improvements are mostly conducted in a heuristicway eg neighboring improvement (add or delete patterns tosee if the profile goes better similar to a hill-climbing method)or iterative statistical methods [Gibbs sampling or expectationmaximization (EM)] The profile-based methods usually runfaster than word-based methods and have better performancein predicting motifs with complex mutations However thiskind of methods tend to fail in detection of multiple motifs es-pecially when the data size is large as the iterative procedurethey adopted often falls into local optimizations which is diffi-cult to escape [89]

Word-based methods

The dominant model of these methods is the (l d)-motifs where lis the width of a motif and d is the maximum number of muta-tions between a motif instance and the consensus sequence [7374 76 78 88 94] Intuitively the goal is to find potential (l d)-motifs that are significantly presented in the input sequencesThese methods can be further divided into two categoriespattern-driven and sample-driven [74] The former enumerates all4l l-mers on character set (A G C and T) as consensus sequences tobuild candidate motifs The latter only uses real l-mers from inputdata to generate all possible (l d)-motifs Five algorithms specific-ally designed for motif finding in ChIP-seq peaks FMotif [74]DREME [73] RSAT peak-motifs [76] SIOMICS [78 94] and Discrover[88] are reviewed with more details as follows

DREME [73] is a sample-driven method which exhaustivelysearches for exact words and heuristically expands them towords with wild cards DREME requires a positive data set iethe peaks derived from a ChIP-seq data set and a negative dataset ie different peaks from other ChIP-seq experiments or ran-domly shuffled version of the positive data It takes all l-mers inthe positive data set and counts the number of their occurrencein the two data sets The top 100 words that are significantlyoverrepresented as determined by Fisherrsquos exact test are takenas seed motifs DREME then expands the seed motifs by allow-ing exactly one additional wild card The newly generatedwords are evaluated and only the top 100 most significantwords are kept To save time in the word generation stepDREME estimates the number of sequences matching generatedmotifs rather than searching explicitly for them in the data setHowever this program prefers short motifs (with length 4ndash10 bp) hence is more suitable for monomeric eukaryotic TFs

The RSAT package [87] provides a pipeline lsquopeak-motifrsquo todiscover motifs in ChIP-seq peaks [76] It integrates a set of try-and-test algorithms using complementary criteria to detect ex-ceptional words ie lsquooligo-analysisrsquo lsquodyad-analysisrsquo lsquoposition-analysisrsquo and lsquolocal-word-analysisrsquo [95ndash97] It is a sample-drivenmethod starting by counting all 6ndash7 bp long nucleotide occur-rences within the input regulatory sequence set and estimatingtheir statistical significances The expected word frequencieswere estimated with a 1st-order or 2nd-order Markov model bydefault A lower order can be chosen to improve the sensitivityof frequency estimation in a smaller data set Finally the sig-nificant words are assembled and converted to PWMs for

4 | Liu et al

further usage It is worth noting that this algorithm uses cali-brated tables which only takes the uneven nucleotide represen-tation in the noncoding sequences into consideration Thisfeature combined with the simplicity of its binomial-basedstatistical analysis results in the high efficiency of RSAT lsquopeak-motifsrsquo Experimental results also showcased that it is signifi-cantly faster than other tools like DREME [73] ChIPMunk [98 99]and MEME-ChIP [68] However the methodological simplicitylimits this method to the detection of short and relatively con-served motifs Longer motifs must be reconstituted by the com-bination of overlapping short nucleotides

FMotif [74] is an exhaustive pattern-driven strategy for find-ing motifs in ChIP-seq peaks A similar idea has been used inseveral traditional motif-finding tools for co-regulated data egWEEDER [75] and MITRA [100] One unique advantage ofpattern-driven methods is being able to identify planted (l d)-motifs without prior knowledge of width l FMotif basically fol-lows the strategy of WEEDER to enumerate all the possible (l d)-motifs in a depth-first manner and scan all motif occurrences ina suffix tree Although this strategy has achieved high accuracyin searching for short eukaryotic motifs [25] it still has severaldrawbacks For example it sets a constraint on mutation toler-ance dl which is not satisfied by many motifs and the runtime will greatly increase in identifying long motifs or motifswith low occurrence frequency The effect of these drawbacks isgetting worse when faced with large-scale ChIP-seq data Tosolve these problems FMotif designed a new suffix tree struc-ture for motif instance searching It keeps the mismatching in-formation generated when scanning i-mers in suffix tree anduses them in the searching of (ithorn 1)-mers The new structuredramatically decreases the time complexity of motif instancesearching and achieves higher accuracy compared withWEEDER thus it makes this enumeration strategy applicableon ChIP-seq data For example the run time of FMotif is about32 min and 475 h when identifying planted (16 5)-motif in thedata set with 10 000 and 80 000 nucleotides correspondinglyHowever WEEDER spends gt10 h when identifying the samemotif in the data set with having 1200 nucleotides [74] The effi-ciency can be even improved through a clustering strategy incombining l-mer enumeration (eg cisFinder) [101] One short-coming of FMotif is higher space complexity because morememory is required to store mismatched information

The above three methods have made substantial efforts onthe motif signal detection procedure SIOMICS [78 92] adopts arelatively simpler strategy to build motifs but pays more atten-tion to reassess these motifs based on the significance of agroup of motifs (called motif modules) Specifically SIOMICStakes all l-mers (lfrac14 8 by default) in given sequences as consen-suses and builds a motif for each of them by searching all seg-ments with at most one mismatch as its motif instances Themotifs are evaluated and ranked by the following Equation (4)

log ethnTHORNl

Xl

jfrac141

X4

i

pij log pij 1n

Xallsegments

log p0 seth THORNeth THORN

24

35 (4)

where n is the number of instances in this motif pij is the sameas that in Equation (2) and p0(s) is the probability of generatingTFBSs based on background nucleotide frequencies This scorenormalizes the effects of motif length and number of instancesthus has no bias in motif interpretation and comparison It hasbeen successfully used in MDscan another motif-finding toolfor ChIP-chip data [17] SIOMICS eliminates redundancy in pre-diction through removing the motif which has a higher-rank

similar motif The predicted motifs were reassessed as motifmodules by a tree strategy which will be discussed in the lsquoFindco-factors of TFs from ChIP-seq datarsquo section The basic as-sumption is that an individually insignificant motif should betaken as a valuable motif if it is overrepresented when con-sidered as a member of motif module together with the motifsof some other factors

The word-based methods can be roughly considered as aglobal optimization strategy as they enumerate all possible (ld)-motifs However they usually only have a good performanceon simulation data but not on real biological data because the (ld)-motif model is too strict for representing real TFBSs A hightime complexity is another bottleneck of the enumeration strat-egy Even though many techniques have been developed to ac-celerate the searching speed it is still a tough problem in ChIP-seq motif finding

Profile-based methods

Profile-based motif-finding methods are usually based on an opti-mization strategy with a motif profile score eg IC for PWMSpecifically they try to find a collection of DNA segments givingrise to a motif profile with the highest score among all the com-binations of candidates However it is impractical to identify thebest profile by exhaustive enumeration as going through all pos-sible combinations will lead to extremely high time complexityTherefore heuristic strategies are usually adopted to narrowdown the searching space First some primary profiles will bebuilt by randomly selecting one or several DNA segments or byan enumerative search in partially selected input sequences(based on some preliminary knowledge) Then they will be im-proved aiming to gain higher profile scores during the iterationprocess The improvement process is usually conducted itera-tively in a heuristic way [102] eg EM [80] and Gibbs sampling[15] It is noteworthy that high-quality motif profiles can be builtthrough combinatorial optimization strategies eg integrate lin-ear programming [103] and graph-based techniques [7 10 23]With ChIP-seq data a critical challenge to efficiently handle theexponentially increased number of sequences has been posed tothe conventional motif discovery methods

Several traditional motif-finding tools were updated to adaptto ChIP-seq data and most of them focused on how to improvethe time efficiency faced with the larger data size For exampleMEME-ChIP [68] runs MEME as one of its primary motif detectionengines however it limits the input data as no gt600 sequencesHybrid Motif Sampler (HMS) combines deterministic identifica-tion of motif instances and random sampling strategy togetherto accelerate its iteration process [72] STEME [104] which canbe considered as an updated version of MEME introduced a suf-fix tree structure in the EM process to index the sequences setwhich decreases the running time to a different magnitudeMICSA uses MEME on a subset of the highest ChIP-seq peaks toidentify motifs [79] and then MEME is applied for severalrounds on the sequences in which no motifs have been identi-fied ChIPMotifs is originally designed for ChIP-chip data andintegrates MEME and WEEDER to predict primary motif candi-dates followed by a bootstrap resampling approach to deter-mine optimal cutoff and filter against the false-positivecandidates [105]

ChIPMunk [69] is a new iterative algorithm for ChIP-seq datathat combines greedy optimization with bootstrapping Theevaluation of motif profiles in ChIPMunk is based on theKullback Discrete Information Content (KDIC) which is calcu-lated by Equation (5)


KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)

where the variables l N fij and bi are the same as those definedin Equations (3) and (4) Like CONSENSUS ChIPMunk uses agreedy strategy to find motif profiles that have high KDIC val-ues The identified motif profiles are ranked based on PWMscores and then improved by an EM-like iterative process It isworth noting that the rebuilding of motif profiles integrated theweight scores calculated from the ChIP-seq coverage informa-tion Specifically ChIPMunk set a score to each sequence as itsnormalized maximal peak value and these weights are as-signed to each position forming the motif profiles Theexperiment on several ChIP-seq data set indicated thatChIPMunk obtained a better performance both in running speedand prediction quality compared with MEME HMS [72] andSeSiMCMC [106]

RPMCMC [89] is a profile-based method which combinesGibbs motif samplers and a reversible-jump MCMC method Asa parallel version of Gibbs RPMCMC fully uses parallel mechan-isms to accelerate motif finding This breakthrough means a lotto current motif finding studies especially in this big data eradriven by high-throughput sequencing technology eg ChIP-seq Although various motif-finding methods have been pro-posed before such as DREME HEGMA WEEDER and Gibbs motifsampler [15] they have limited power in properly controllingthe trade-off between computation time and motif detection ac-curacy Fortunately RPMCMC makes multiple interacting motifsamplers in parallel to ensure an acceptable run time comparedwith other methods Most importantly a repulsive force isapplied in RPMCMC to separate different motif samplers closeto each other making further contribution in getting rid of localoptima In this way RPMCMC has achieved extremely promis-ing performances on synthetic promoter sequence and ENCODEChIP-seq data sets Meanwhile it can also be applied to diversemotif identification ie discriminative motifs and more detailsregarding this will be discussed in the next section [89]

The profile-based methods stand on an assumption thateach nucleotide in a motif instance participates independentlyin the TF-binding activity However more studies have indi-cated that the neighboring positions have strong dependent ef-fect in some motifs [107] Taking this issue into considerationMathelier and Wasserman [64] designed a new transcriptionfactor flexible model (TFFM) and developed a prediction systembased on this model It uses a 1st-order HMM to elucidate the di-nucleotide dependencies Specifically the distribution probabil-ities of nucleotides in position i are dependent on thenucleotide found at position i-1 The performance evaluation on96 ChIP-seq data sets indicates that the models considering pos-ition dependence outperform the other models on 90 data setsof 96 ie 94 in terms of the area under the curves for the cor-responding receiver operating characteristic curves [64]Although this algorithm is not designed to build motif profilesthe TFFM model can be easily modified to integrate the correl-ation between positions in de novo motif finding

Although several motif representation models have been de-veloped and new methods have been designed for motif findingon ChIP-seq data the problem is still far from a sound solutionbecause of the complexity of gene regulation the large-scale ofChIP-seq data and the low specificity of some binding sites Inthe future new models for motif presentation may be createdto replace the consensus and profiles model and improve theaccuracy of motif finding

Assessment criteria and performance evaluation

The main difference between motif finding before and after thenext-generation sequencing technology is the dramaticallyincreased data size [9] Substantial experiments showcased thatthe most popular traditional motif-finding tools eg MEME andWEEDER cannot handle the genome-scale ChIP-seq data con-taining thousands of peaks Therefore almost all ChIP-seq-based motif-finding tools focus on improvement in efficiency inaddition to accuracy and conduct comprehensive performanceevaluations on synthetic data or real biological data sets Theperformance evaluation was mainly carried out based on thecomparison with other tools like MEME WEEDER and the most-cited ChIP-seq-based tool DREME The criteria include coverageon known binding sites (eg sensitivity or SN) similarity be-tween predicted motif profiles and the documented ones speci-ficity (SP) positive predicted value (PPV) and the running timeon data sets with different scales

Here we summarized the performance evaluation of theeight ChIP-seq-based motif-finding tools as a sketch in Table 2Overall no method can outperform others in all situationsThese eight tools showed their preferences on different datasizes different motif lengths and advantages on special contextof motif finding Generally DREME has a good performance onprimary motif finding from large-scale data set along withother two functions discriminative motif finding and cofactormotif finding which will be discussed in next section Howeverit prefers short motif identification and only outputs profilematrices without information of corresponding motif instancesIn this situation we have to rely on additional motif scanningtools to identify the concrete TF-binding sites MICSA the ear-liest among the eight tools takes MEME as the search enginethus its speed and performance are highly limited by MEMERSAT peak-motifs and FMotif handle large data sets with a goodprediction accuracy and run faster than DREME SpecificallyFMotif shows good tolerance on noise rate of input sequencesSIOMICS works well on finding motif modules and runs fasterthan DREME but slower than RSAT peak-motifs Discrover usesDREME to identify motif seeds and has a better performance ondiscriminative motif finding than DREME RPMCMC shows agreat potential on finding diverse motifs in real biological datasets More details can be found in Table 2 from which usersmay get a clue to appropriate tool selection in specific contextof motif-finding applications

Development of advanced functions withbiological insightsDiscriminative motif finding

Traditional motif-finding tools aim to identify a group of con-served motifs in query sequences which are expected to containinstances of the to-be-detected motifs Besides these motifssome other randomly conserved DNA patterns may also exist inthese sequences Distinguishing these false positives from realmotifs has always been a big challenge in motif finding Thisissue is getting more serious because the high-throughputsequencing techniques bring in more false positives One solutionto this problem is to carry out the discriminative motif findingwhich is to find motifs whose occurrence frequencies vary be-tween the query sequence set and several well-defined controlsets The simplest contrast for discriminative motif finding con-tains a positive set (query sequence set) and a negative set (se-quences without binding activity or randomly generated

6 | Liu et al

sequences) [88] The discriminative ideas have been integratedinto some traditional motif-finding tools to reduce false positivesin prediction For example CONSENSUS [102] calculates P-valuesfor motif profiles indicating the probabilities for the same motifsoccurring in a hypothetical random sequence set with a similarsize BOBRO [10] calculates such a P-value by practically simulat-ing random sequences set in the program as contrast underPoisson distribution BBR [24] takes coding genomic sequences asbackground data to reevaluate candidate motifs and an experi-ment on Escherichia coli genome showed that this strategy reducesthe number of false positives Certainly the function of discrim-inative motif finding is not limited to dealing with the false-posi-tive issue Some biology-driven applications specifically identifymotifs overrepresented in a positive set but not in a negative set(called decoy motif problem) Hence a motif without occurrencedifferences between two sets is not considerable no matter howconserved it is [108]

Application of discriminative motif finding is even more im-portant in ChIP-seq data analysis because the data size is usu-ally large and the peaks with or without binding activitiesnaturally compose the positive and negative sets respectivelyIn DREME [73] the first step is counting how many positive se-quences and negative sequences contain a query word Thenthe P-value from Fisherrsquos exact test is calculated for this wordThe words with significant P-values will be kept as seeds for

motif merging The same evaluation process will be conductediteratively in generation of motifs by combing similar motifseeds In application DREME performed discriminative motiffinding by taking Sox2- and Oct4-binding ChIP-seq data inmouse embryonic stem cell (mESC) as positive and negativedata set and found that Oct4 data set has significantly moreOct4-binding sites than the Sox2 data set The results over-turned the previous opinion that Sox2 and Oct4 bind their tar-gets exclusively as a heterodimer [109]

Maaskola and Rajewsky [88] developed an improved HMM-based tool Discrover for discriminative motif finding on largedata set The method consists of three parts lsquoseed findingrsquolsquoHMM optimizationrsquo and lsquosignificance filteringrsquo The seed findingpart is similar to DREME but uses more objective functions be-sides the Fisherrsquos exact test P-values Among these objectivefunctions mutual information of condition and motif occur-rence (MICO) and maximum mutual information estimationcan be used for discriminative motif evaluation on the contrastwith more than two conditions which extended the applicationof Discrover The HMM optimization step starts from a back-ground state trained by the BaumndashWelch algorithm Then theinitial emission probabilities of the motif chains are centered onthe seed sequences The posterior probability of all sequences isevaluated to calculate the expected number of sequences thathave motif occurrences Discrover conducts discriminative

Table 2 Performance summary of motif-finding tools on ChIP-seq data

Programs Criteria Preferences Speed Performance and special characteristics

MICSA [79] Speed motifcoverage

Prefers small datasets

Slow on large data sets Performance is limited by MEME performswell on filtering weak peaks

ChIPMunk [69] Speed motifcoverage

NA Faster than MEME slowerthan DREME quasi-linearcomplexity (power 127)

More accurate than MEME

DREME [73] Speed motifcoverage

Handles large ChIP-seq data (thou-sands of se-quences) prefersshort motifs

Much faster than WEEDERand MEME linear timecomplexity

Performs well at cofactor motif finding anddiscriminative motif finding only providesmotif profile matrix no output of motifsites

RSAT peak-motifs[87]

Speed motifcoverage

Handles large ChIP-seq data sets (upto 1 000 000 peaksof 100 bp each)

Faster than DREME MEMEand ChIPMunk

High prediction accuracy

FMotif [74] Speed motif cover-age SN and SP onmotif sites

Handles large ChIP-seq data handleslong motifs

Fast but requires morememory

High sensitivity and high specificity pre-dicted accurate motif profiles with highrank (1 or 2) for 12 mESC data tolerates upto 30 noise sequences

SIOMICS [78] Speed motif cover-age and specificityon motifs

NA Slower than RSAT peak-motifs much faster thanDREME especially onlarge data sets

Competitive performance with DREME andRSAT in motif finding and cofactor motiffinding competitive performance withDREME and better than RSAT peak-motifson false-positive control good motif mod-ule prediction

Discrover [88] Speed MCC(Matthews corre-lation coefficient)on nucleotide-level SN and PPVon binding sitelevel

NA Comparable with DREME inspeed

Better than Bioprospector and MDscan on SNand PPV of prediction better than DREMEon discriminative motif finding

RPMCMC [89] Speed SN and PPV NA Similar to DREME Better than WEEDER on SN and PPV of predic-tion great potential to mine many reliablediverse motifs


learning by iterative gradient optimization based on the chosenobjective functions which has a linear run time complexitybased on the length of the input data Finally Discrover refinesthe final models by a threshold on a corrected MICO-based P-value regardless of if MICO or another objective function wasused in previous steps In addition Discrover makes extra ef-forts in detection of multiple motifs which will be discussed inthe next section

In the application of discriminative motif finding on ChIP-seq data the various choices of reference data can benefit dif-ferent biological analysis about gene regulation and is the mainfactor affecting the prediction performance [108 110]Specifically the reference data can be (i) randomly generatedsequences using a uniform distribution or a Markov process (ii)the sequences with no binding evidence (iii) the ChIP-seq peaksfor another TF (iv) the peaks of the same TF but under a differ-ent condition even (v) multiple-level reference data sets basedon a detailed grade framework of the signal strength etc [94]This information can reduce false positives and find other spe-cial motifs including decoy motifs variant motifs for one TFimpoverished motifs and collaborated motifs [108] and thuscan improve the state-of-the-art studies about complex regula-tion mechanism eg combinatorial regulation and behaviorchange of TFs among different conditions [110] It is worth not-ing that the binding activity of a TF could be affected by epigen-etic modifications in a complex fashion eg through localchromatin structure caused by the packaging of DNA histonesand other proteins [111] Cuellar-Partida et al [111] have usedthe signal from histone modification ChIP-assay to filter thescanned motifs The corresponding pitfall raised in discrimina-tive motif finding is that the sequences in a negative data setwith no binding evidence may contain potential binding sitesSome algorithms try to mitigate this effect by a special designwhich can tolerate a few motif occurrences in negative set [110]Another phenomenon is that the binding motifs of a TF couldbe affected by the binding of other TFs Mason et al [112] provedthat Oct4 preferentially binds to different motifs with or withoutSox2 bindings within 5000 bp indicating that the binding motifof Oct4 is context-dependent

Find cofactors of TFs from ChIP-seq data

In animals and plants TFs usually regulate gene expression withcooperation of other partner TFs (cofactors) [113] Growing evi-dence indicates that cofactors which interacted directly or indir-ectly play an important role in transcription regulation Thebinding sites of multiple TFs in a short DNA region often deter-mine the temporal spatial expression pattern of the downstreamgenes thus will help to elucidate underlying patterns of combina-torial regulation [114] The analysis of cofactors has attracted sub-stantial attention in traditional motif finding For example BBA[24] in the DMINDA Web server designed a probability model toevaluate the co-occurrences among identified motifs in a givenset of regulatory sequences which can reveal joint regulation re-lationships by multiple TFs The cofactor motif finding on ChIP-seq data is usually formulated by detecting multiple nonredun-dant motifs in a set of ChIP-seq peak regions [73] or detecting TFpatterns whose binding motifs frequently co-occurred on thesame set of regulatory regions [113] Recently some algorithmshave tried to take advantage of the large-scale ChIP-seq data todiscover the cofactor motifs

Researchers want to identify multiple nonredundant motifsin query ChIP-seq data The most straightforward way is to scanpeak sequences by documented motifs [84 115] However the

number of documented motifs is limited compared with thenumber of TFs leading to a loss of some real binding sitesSome de novo motif-finding methods try to first identify mul-tiple motifs and then build relationships for them with knownTFs One example is the aforementioned DREME [73] whichfeeds the whole sequences into the corresponding model ratherthan using only a small portion of the collected ChIP-seq data Itidentifies multiple motifs by removing the most statistically sig-nificant identified motif derived from Fisherrsquos exact testand then repeats the search for motifs DREME achieved re-markable performance in cofactor motifs discovery in mousecell ChIP-seq data and dramatically outperformed other preva-lent algorithms In particular DREME discovered gt2 times thenumber of cofactor motifs compared with Amadeus [116] andgt10 times compared with Trawler [117] In addition the algo-rithm peak-motifs in the RSAT package also returns additionalmotifs potentially bound by cofactors and its application onmouse cell ChIP-seq data successfully detected some cofactormotifs [76]

Rather than simply considering individual motifs separatelySIOMICS [78 94] models the cofactor motifs as motif modulesie a group of TFs with their TFBSs co-occurring in significantlylarge number of ChIP-seq peaks Cofactor motifs may occur onlyin a small set of peaks and thus can be underrepresented indi-vidually in all peak regions Therefore evaluation of motif mod-ules simultaneously could make the significance of cofactorsstand out from the massive background sequences Asdescribed in previous section SIOMICS identifies conservedmotifs by a word-based strategy based on Equation (4) Toidentify nonredundant motifs it removes redundant candidateswith lower scores and the remaining ones are used to identifyputative motifs Instead of considering motif candidatesindividually SIOMICS represents them as nodes in a tree inwhich the more frequent candidates are closer to the root whilethe branch of the tree represents a group of co-occurring motifsBy evaluating each branch based on a Poisson clumpingheuristic strategy SIOMICS identifies statistically significantmotif modules and takes motif candidates in them as final pu-tative motifs The performance comparison on the same ChIP-seq data set used by DREME and peak-motifs indicates thatmore documented cofactor motifs were identified by SIOMICSrather than DREME and RSAT Peak-motifs with comparabletime efficiency

The above methods identify cofactors from ChIP-seq data fora single TF and have obtained many valuable results Howeversynthetic analysis of diverse ChIP-seq data could provide moreinformation about the co-regulation mechanism of TFs FCOPs[113] is a method for identifying combinatorial occupancy pat-terns of multiple TFs from diverse ChIP-seq data FCOPs inte-grates two kinds of ChIP-seq data sets as input from TFsoccupancy as well as chromatin modification and it considerstwo kinds of uncertainty How possible a peak from ChIP-seqdata represents an actual binding activity and how confidentthe predicted enhancer sequences are FCOPs generate probabil-ities from these uncertainties and use them to evaluate the stat-istical significance of a given set of TFs co-occurring in a givenset of enhancers Obviously enumerating all possible subset ofenhancers has an exponential complexity which is inefficientThis method adopts a dynamic programming-based method tocalculate the probabilities which leads to linear complexitywith respect to the number of the enhancers both in time andspace Despite this method not providing any DNA motifs itcan be easily integrated into ChIP-seq motif analysis to identifycofactors

8 | Liu et al

Citation analysis of ChIP-seq motif-findingtools

To analyze the usage of existing ChIP-seq-based motif-findingtools we collected the citation information of eight tools inTable 1 and a Web server MEME-ChIP which implants MEMEand DREME The relative citations of the eight tools since 2011are shown in Figure 2 where we can see that DREME has anoverwhelming superiority among these tools and RSAT peak-motifs and ChIPMunk also have higher relative citation ratesAlthough the four tools that were developed within the past 3years (RPMCMC Discover SIOMICS and FMotif) introducedsome new ideas and obtained better prediction performancethey still fail to get more attention from biologists and almostall of their citations are from the papers focusing on develop-ment of new methods The main reason is the lack of a user-friendly Web server and a platform with multiple related func-tions limiting their usages by people without adequate compu-tational background

DREME as one motif-finding engine of MEME-ChIP has beenintegrated in the MEME suite which is the most popular motif-finding and analysis platform MEME-ChIP obtained more add-itional citations than DREME (Figure 3) and most of theseshould be attribute to DREME RSAT peak-motifs is part of RSATplatform where a series of modular computer programs is

integrated for regulatory signal detection in noncoding se-quences And the relatively high number of citations ofChIPMunk also benefited from its Web service The methodsthat only provide source code are difficult to obtain continuousattention Taking MICSA as an example it was published in2010 and was cited frequently within 2 years following publica-tion but its citation count decreased sharply in recent yearsThis phenomenon indicated that the user experience of a newalgorithm is as important as the quality of algorithm itself

Discussion

The ChIP-seq peaks provide a foundation for the identificationof motifs on a genome scale In return the predicted motifs canbe used to reevaluate the peaks and even find more potentialTF-binding sites that were missed by a ChIP-seq experimentFor example MICSA [79] takes advantage of de novo motif identi-fication to reevaluate the ChIP-seq peaks In MICSA only a sub-set of the highest peaks is fed into MEME for motifsidentification for which the probability of observing a motif bychance (P-value) is calculated Then the predicted motifs willbe used to scan all previously identified peaks and a score willbe assigned to each peak by combining the P-values of motifsthat occurred on it with its depth The experimental results

0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]

RSAT peak-mofs 2012[94]

DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]

Figure 2 Relative citation of the ChIP-seq motif-finding tools until October 2016 The numbers after a tool name indicate the year of its original release and the corres-

ponding reference Citation of each tool was extracted from the Bing Academy (httpcnbingcomacademic)

0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a

Figure 3 The average annual citation of eight motif-finding tools and MEME-ChIP aMEME-ChIP is a Web server that implants DREME and MEME to detect motifs


showed that MICSA offered a distinct improvement in TF-bind-ing prediction accuracy with fewer false positives comparedwith the other 10 approaches Certainly searching potentialbinding sites using the predicted motifs is not limited to thepeak sequences It could even be applied on the whole-genomesequences as some binding activities may not occur under thecondition of ChIP experiments and these binding sites are alsovaluable for studying the full regulation picture of the interestedTFs

Another issue that should be discussed concerning ChIP-seqmotif finding is the difference of its application on various datasources For example the binding motifs in eukaryotes and pro-karyotes often have their respective characteristics and thelength of the binding regions in eukaryotes is much longer thanthose in prokaryotes Therefore there is a need of special designin motif-finding and analysis techniques

As with the traditional motif finding there are some ensem-ble methods for ChIP-seq-based motif finding [85 118] It was al-ways recommended that using several tools in motif finding isa better strategy as diverse techniques may capture differentcharacteristics of motifs This idea leads to the development ofensemble motif-finding methods Most of these methods sim-ply assemble several algorithms together and rank the outputof them For example W-ChIPMotif [119] includes MEMEWEEDER and MaMF [120] and assesses the output by comparingwith a randomized initial input CompleteMotifs [118] incorpor-ates three de novo tools (ChIPMunk WEEDER and CUDA-MEME)[121] and calculates the P-values of predicted motifs by a back-ground random model However using more algorithms intro-duces more false positives while they have high probability tocover real motifs Therefore further analyses are still requiredfor the output of multiple algorithms

Certainly the accuracy of prediction and the efficiency ofmethods are still the most important features in ChIP-seq motiffinding and the application of current methods shows that theyare still far from completely satisfactory More deep thinkingand advanced techniques are still needed to improve the intrin-sic algorithms For example the evolution information fromphylogenetic footprinting can be useful as prior knowledge inevaluation of candidate binding sites Besides additional func-tions such as cofactor analysis evaluating peaks based onmotifs and multiple types of discriminative motif finding are at-tracting more attention in applications They should be wellintegrated in motif-finding methods in future studies Finallyan integrated Web server or platform is essential for the appli-cation of new designed methods as it can provide more func-tions about upstream data collection and downstream resultanalysis

Most recently deep learning has demonstrated its unprece-dented performance in regulatory genomics especially in ChIP-seq-based motif analysis [122] This marked performance ismainly achieved by taking advantage of graphics processingunits and the capability to extract complex features of motifsfrom ChIP-seq data However it can be improved further ifmore information (eg DNA shape) can be taken into consider-ation rather than only the ChIP-seq peaks sequences

Key Points

bull The algorithmic strategies adopted by current motif-finding tools can be generally divided into two catego-ries word-based and profile-based which could be im-plemented with other models like tree graph and

clustering The combination of these two strategies isa better choice to break the current bottleneck in pre-diction accuracy and efficiency

bull The new features of current tools including but notlimited to discriminative motif identification (DREAMRSAT lsquopeak-motifsrsquo and Discrover) cofactor motifdetection (DREME RSAT lsquopeak-motifsrsquo SIOMICSRPMCMC and Discrover) etc are essential for applica-tion of ChIP-seq data analysis and are encouraged tobe integrated into newly designed algorithms

bull Existing tools still have a much room for improvementwith respect to prediction accuracy and efficiency thecontradictory relationship between the time efficiencyand space complexity as well as between predictionaccuracy and application universality is generally pre-sent in these tools

bull An integrated Web server or platform providing vari-ous analysis functions in data collection and the resultanalysis is essential for the spread of new designedmethods in their applications

bull In future studies more mathematical models eggraph model statistics and combinatorial optimiza-tion could be developed and integrated to design newmotif signal detection techniques and new evaluationcriteria In addition more information eg the de-pendency between nucleotides and the height of peaksfrom ChIP-seq data should be considered to improvethe prediction accuracy

Funding

The State of South Dakota Research Innovation Center andthe Agriculture Experiment Station of South Dakota StateUniversity the National Nature Science Foundation ofChina (grant numbers 61303084 and 61432010 to BL) andthe Young Scholars Program of Shandong University (grantnumbers YSPSDU 2015WLJH19 to BL)

References1 DrsquoHaeseleer P What are DNA sequence motifs Nat

Biotechnol 200624423ndash52 Brohee S Janky R Abdel-Sater F et al Unraveling networks

of co-regulated genes on the sole basis of genome se-quences Nucleic Acids Res 2011396340ndash58

3 Davidson E Levin M Gene regulatory networks Natl Acad SciUSA 20051024935

4 Liu B Zhou C Li G et al Bacterial regulon modeling and pre-diction based on systematic cis regulatory motif analysesSci Rep 2016623030

5 Baumbach J On the power and limits of evolutionary con-servationndashunraveling bacterial gene regulatory networksNucleic Acids Res 2010387877ndash84

6 Lawrence CE Altschul SF Boguski MS et al Detecting subtlesequence signals a Gibbs sampling strategy for multiplealignment Science 1993262208ndash14

7 Pevzner PA Sze SH Combinatorial approaches to findingsubtle signals in DNA sequences Proc Int Conf Intell Syst MolBiol 20008269ndash78

8 Nakaki R Kang J Tateno M A novel ab initio identificationsystem of transcriptional regulation motifs in genome DNAsequences based on direct comparison scheme of signalnoise distributions Nucleic Acids Res 2012408835ndash48

10 | Liu et al

9 Zambelli F Pesole G Pavesi G Motif discovery and transcrip-tion factor binding sites before and after the next-generation sequencing era Brief Bioinform 201314225ndash37

10 Li G Liu B Ma Q et al A new framework for identifying cis-regulatory motifs in prokaryotes Nucleic Acids Res201139e42

11 Chen X Guo L Fan Z et al W-AlignACE an improved Gibbssampling algorithm based on more accurate position weightmatrices learned from sequence and gene expressionChIP-chip data Bioinformatics 2008241121ndash8

12 Sinha S PhyME a software tool for finding motifs in sets oforthologous sequences Methods Mol Biol 2007395309ndash18

13 Das MK Dai HK A survey of DNA motif finding algorithmsBMC Bioinformatics 20078(Suppl 7)S21

14 Wang T Stormo GD Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics2003192369ndash80

15 Liu X Brutlag DL Liu JS BioProspector discovering con-served DNA motifs in upstream regulatory regions of co-expressed genes Pac Symp Biocomput 2001127ndash38

16 Hertz GZ Stormo GD Identifying DNA and protein patternswith statistically significant alignments of multiple se-quences Bioinformatics 199915563ndash77

17 Liu XS Brutlag DL Liu JS An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments NatBiotechnol 200220835ndash9

18 Bailey TL Boden M Buske FA et al MEME SUITE tools for motifdiscovery and searching Nucleic Acids Res 200937W202ndash8

19 Olman V Xu D Xu Y CUBIC identification of regulatorybinding sites through data clustering J Bioinform Comput Biol2003121ndash40

20 Li X Wong WH Sampling motifs on phylogenetic trees ProcNatl Acad Sci USA 20051029481ndash6

21 Blanchette M Tompa M Discovery of regulatory elementsby a computational method for phylogenetic footprintingGenome Res 200212739ndash48

22 Blanchette M Tompa M FootPrinter a program designed forphylogenetic footprinting Nucleic Acids Res 2003313840ndash2

23 Li G Liu B Xu Y Accurate recognition of cis-regulatorymotifs with the correct lengths in prokaryotic genomesNucleic Acids Res 201038e12

24 Ma Q Liu B Zhou C et al An integrated toolkit for accurateprediction and analysis of cis-regulatory motifs at a genomescale Bioinformatics 2013292261ndash8

25 Tompa M Li N Bailey TL et al Assessing computationaltools for the discovery of transcription factor binding sitesNat Biotechnol 200523137ndash44

26 McCue LA Thompson W Carmack CS et al Factors influenc-ing the identification of transcription factor binding sites bycross-species comparison Genome Res 2002121523ndash32

27 Simcha D Price ND Geman D The limits of de novo DNAmotif discovery PLoS One 20127e47836

28 Siddharthan R Siggia ED van Nimwegen E PhyloGibbs aGibbs sampling motif finder that incorporates phylogenyPLoS Comput Biol 20051e67

29 Blanchette M Schwikowski B Tompa M Algorithms forphylogenetic footprinting J Comput Biol 20029211ndash23

30 Neph S Tompa M MicroFootPrinter a tool for phylogeneticfootprinting in prokaryotic genomes Nucleic Acids Res200634W366ndash8

31 Carmack CS McCue LA Newberg LA et al PhyloScan identi-fication of transcription factor binding sites using cross-species evidence Algorithms Mol Biol 200721

32 Zhang S Xu M Li S et al Genome-wide de novo prediction ofcis-regulatory binding sites in prokaryotes Nucleic Acids Res200937e72

33 Katara P Grover A Sharma V Phylogenetic footprinting aboost for microbial regulatory genomics Protoplasma2012249901ndash7

34 Tagle DA Koop BF Goodman M et al Embryonic epsilon andgamma globin genes of a prosimian primate (Galago crassi-caudatus) Nucleotide and amino acid sequences develop-mental regulation and phylogenetic footprints J Mol Biol1988203439ndash55

35 Borneman AR Gianoulis TA Zhang ZD et al Divergence oftranscription factor binding sites across related yeast spe-cies Science 2007317815ndash9

36 Odom DT Dowell RD Jacobsen ES et al Tissue-specific tran-scriptional regulation has diverged significantly betweenhuman and mouse Nat Genet 200739730ndash2

37 Boyle AP Araya CL Brdlik C et al Comparative analysis ofregulatory information and circuits across distant speciesNature 2014512453ndash6

38 Bingqiang Liu CZ Zhang H Guojun L et al An integrativeand applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomesBMC Genomics 201617578

39 Park PJ ChIP-seq advantages and challenges of a maturingtechnology Nat Rev Genet 200910669ndash80

40 Song L Crawford GE DNase-seq a high-resolution tech-nique for mapping active gene regulatory elements acrossthe genome from mammalian cells Cold Spring Harb Protoc201025384

41 Wang Z Gerstein M Snyder et al RNA-Seq a revolutionarytool for transcriptomics Nat Rev Genet 20081057ndash63

42 Tsankov AM Gu H Akopian V et al Transcription factorbinding dynamics during human ES cell differentiationNature 2015518344ndash9

43 Wu F Olson BG Yao J DamID-seq genome-wide mapping ofprotein-DNA interactions by high throughput sequencing ofadenine-methylated DNA fragments J Vis Exp 2016107e53620

44 Maragkakis M Alexiou P Nakaya T et al CLIPSeqTools-anovel bioinformatics CLIP-seq analysis suite RNA2015221ndash9

45 Hafner M Landthaler M Burger L et al PAR-CliPmdasha methodto identify transcriptome-wide the binding sites of RNAbinding proteins J Vis Exp 201041e2034

46 Ingolia NT Ribosome profiling new views of translationfrom single codons to genome scale Nat Rev Genet201415205ndash13

47 Giresi PG Kim J Mcdaniell RM et al FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates activeregulatory elements from human chromatin Genome Res200717877ndash85

48 Nutiu R Friedman RC Luo S et al Direct measurement ofDNA affinity landscapes on a high-throughput sequencinginstrument Nat Biotechnol 201129659ndash64

49 Collas P Dahl JA Chop it ChIP it check it the currentstatus of chromatin immunoprecipitation Front Biosci200813929ndash43

50 Kimura H Sato Y DNA Replication and Histone ModificationJapan Springer 2016

51 Suganuma T Workman JL Histone modification as a reflec-tion of metabolism Cell Cycle 2016 15481ndash2

52 Qu H Fang X A brief review on the human encyclopedia ofDNA elements (ENCODE) project Genomics ProteomicsBioinformatics 201311135ndash41


53 Consortium EP The ENCODE (ENCyclopedia Of DNAElements) project Science 2004306636ndash40

54 Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910R25

55 Li H Durbin R Fast and accurate long-read alignment withBurrowsndashWheeler transform Bioinformatics 201026589ndash95

56 Szalkowski AM Schmid CD Rapid innovation in ChIP-seqpeak-calling algorithms is outdistancing benchmarking ef-forts Brief Bioinform 201112626ndash33 628

57 Kharchenko PV Tolstorukov MY Park PJ Design and ana-lysis of ChIP-seq experiments for DNA-binding proteins NatBiotechnol 2008261351ndash9

58 Zhang Y Liu T Meyer CA et al Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089R137

59 Jiang H Wang F Dyer NP et al CisGenome browser a flex-ible tool for genomic data visualization Bioinformatics2010261781ndash2

60 Fejes AP Robertson G Bilenky M et al FindPeaks 31 atool for identifying areas of enrichment from massivelyparallel short-read sequencing technology Bioinformatics2008241729ndash30

61 Valouev A Johnson DS Sundquist A et al Genome-wideanalysis of transcription factor binding sites based on ChIP-Seq data Nat Methods 20085829ndash34

62 Xin F Grossman R Stein L PeakRanger a cloud-enabledpeak caller for ChIP-seq data BMC Bioinformatics201112139

63 Kuan PF Chung D Pan G et al A statistical frameworkfor the analysis of ChIP-Seq data J Am Stat Assoc2011106891ndash903

64 Mathelier A Wasserman WW The next generation of tran-scription factor binding site prediction PLoS Comput Biol20139e1003214

65 Cheng C Min R Gerstein M TIP a probabilistic method foridentifying transcription factor target genes from ChIP-seqbinding profiles Bioinformatics 2011273221ndash7

66 Wu S Wang J Zhao W et al ChIP-PaM an algorithm to iden-tify protein-DNA interaction using ChIP-Seq data Theor BiolMed Model 20107(1)18

67 van Heeringen SJ Veenstra GJC GimmeMotifs a de novomotif prediction pipeline for ChIP-sequencing experimentsBioinformatics 201127270ndash1

68 Machanick P Bailey TL MEME-ChIP motif analysis of largeDNA datasets Bioinformatics 2011271696ndash7

69 Kulakovskiy IV Boeva VA Favorov AV et al Deep and widedigging for binding motifs in ChIP-Seq data Bioinformatics2010262622ndash3

70 Jothi R Cuddapah S Barski A et al Genome-wide identifica-tion of in vivo proteinndashDNA binding sites from ChIP-Seqdata Nucleic Acids Res 2008365221ndash31

71 Mercier E Droit A Li L et al An integrated pipeline for thegenome-wide analysis of transcription factor binding sitesfrom ChIP-Seq PLoS One 20116e16432

72 Hu M Yu J Taylor JM et al On the detection and refinementof transcription factor binding sites using ChIP-Seq dataNucleic Acids Res 2010382154ndash67

73 Bailey TL DREME motif discovery in transcription factorChIP-seq data Bioinformatics 2011271653ndash9

74 Jia C Carson MB Wang Y et al A new exhaustive methodand strategy for finding motifs in ChIP-enriched regionsPLoS One 20149e86044

75 Pavesi G Mereghetti P Mauri G et al Weeder Web discov-ery of transcription factor binding sites in a set of

sequences from co-regulated genes Nucleic Acids Res200432W199ndash203

76 Thomas-Chollier M Herrmann C Defrance M et al RSATpeak-motifs motif analysis in full-size ChIP-seq datasetsNucleic Acids Res 201240e31

77 Tran NT Huang CH A survey of motif finding web tools fordetecting binding site motifs in ChIP-Seq data Biol Direct201494

78 Jun Ding HH Xiaoman L SIOMICS a novel approach for sys-tematic identification of motifs in ChIP-seq data NucleicAcids Res 2014421645ndash35

79 Boeva V Surdez D Guillon N et al De novo motif identifica-tion improves the accuracy of predicting transcription factorbinding sites in ChIP-Seq data analysis Nucleic Acids Res201038e126

80 Bailey TL Williams N Misleh C et al MEME discovering andanalyzing DNA and protein sequence motifs Nucleic AcidsRes 200634369ndash73

81 Hartmann H Guthhrlein EW Siebert M et al P-value-basedregulatory motif discovery using positional weight matricesGenome Res 201323181ndash94

82 Niu M De novo prediction of cis-regulatory modules in eu-karyotic organisms Dissertations amp ThesesmdashGradworksThe university of North Carolina at Charlotte 2014

83 Bolouri H Ruzzo WL Integration of 198 ChIP-seq datasetsreveals human cis-regulatory regions J Comput Biol201219989ndash97

84 Sun H Guns T Fierro AC et al Unveiling combinatorial regu-lation through the combination of ChIP information and insilico cis-regulatory module detection Nucleic Acids Res201240e90

85 Lihu A Holban S A review of ensemble methods for denovo motif discovery in ChIP-Seq data Brief Bioinform201617731

86 Kulakovskiy IV Makeev VJ Motif discovery and motif find-ing in ChIP-Seq data Genome Analysis Current Proceduresand Applications Caister Academic Press Norfolk UK2014 83

87 Medina-Rivera A Defrance M Sand O et al RSAT 2015Regulatory Sequence Analysis Tools Nucleic Acids Res201543W50ndash6

88 Maaskola J Rajewsky N Binding site discovery from nucleicacid sequences by discriminative learning of hidden Markovmodels Nucleic Acids Res 20144212995ndash3011

89 Ikebata H Yoshida R Repulsive parallel MCMC algorithm fordiscovering diverse motifs from large sequence setsBioinformatics 2015311561ndash8

90 Stormo GD DNA binding sites representation and discov-ery Bioinformatics 20001616ndash23

91 Nishida K Frith MC Nakai K Pseudocounts for transcriptionfactor binding sites Nucleic Acids Res 200937939ndash44

92 Weirauch MT Cote A Norel R et al Evaluation of methodsfor modeling transcription factor sequence specificity NatBiotechnol 201331126ndash34

93 Bailey TL Elkan C Fitting a mixture model by expectationmaximization to discover motifs in biopolymers Proc IntConf Intell Syst Mol Biol 1994228ndash36

94 Ding J Dhillon V Li X et al Systematic discovery of cofactormotifs from ChIP-seq data by SIOMICS Methods 201579ndash8047ndash51

95 van Helden J Andre B Collado-Vides J Extracting regulatorysites from the upstream region of yeast genes by computa-tional analysis of oligonucleotide frequencies J Mol Biol1998281827ndash42

12 | Liu et al

96 van Helden J Rios AF Collado-Vides J Discovering regula-tory elements in non-coding sequences by analysis ofspaced dyads Nucleic Acids Res 2000281808ndash18

97 van Helden J del Olmo M Perez-Ortin JE Statistical analysisof yeast genomic downstream sequences reveals putativepolyadenylation signals Nucleic Acids Res 2000281000ndash10

98 Kulakovskiy I Levitsky V Oshchepkov D et al From bindingmotifs in ChIP-Seq data to improved models of transcriptionfactor binding sites J Bioinform Comput Biol 2013111340004

99 Levitsky VG Kulakovskiy IV Ershov NI et al Application ofexperimentally verified transcription factor binding sitesmodels for computational analysis of ChIP-Seq data BMCGenomics 20141580

100 Eskin E Pevzner PA Finding composite regulatory patternsin DNA sequences Bioinformatics 200218 Suppl 1S354ndash63

101 Sharov AA Ko MS Exhaustive search for over-representedDNA sequence motifs with CisFinder DNA Res 200916261ndash73

102 Hertz GZ Hartzell GW III Stormo GD Identification of con-sensus patterns in unaligned DNA sequences known to befunctionally related Comput Appl Biosci 1990681ndash92

103 Kingsford C Zaslavsky E Singh M a compact mathematicalprogramming formulation for DNA motif finding Lect NotesComput Sci 2006400913

104 Reid JE Wernisch L STEME efficient EM to find motifs inlarge data sets Nucleic Acids Res 201139e126

105 Jin VX OrsquoGeen H Iyengar S et al Identification of an OCT4 andSRY regulatory module using integrated computational and ex-perimental genomics approaches Genome Res 200717807ndash17

106 Favorov AV Gelfand MS Gerasimova AV et al A Gibbs sam-pler for identification of symmetrically structured spacedDNA motifs with improved estimation of the signal lengthBioinformatics 2005212240ndash5

107 Zhao Y Ruan S Pandey M et al Improved models for tran-scription factor binding site identification using noninde-pendent interactions Genetics 2012191781ndash90

108 Redhead E Bailey TL Discriminative motif discovery in DNAand protein sequences using the DEME algorithm BMCBioinformatics 20078385

109 Chen X Xu H Yuan P et al Integration of external signalingpathways with the core transcriptional network in embry-onic stem cells Cell 20081331106ndash17

110 Huggins P Zhong S Shiff I et al DECOD Fast and AccurateDiscriminative DNA Motif Finding Bioinformatics2011272361ndash7

111 Cuellar-Partida G Buske FA McLeay RC et al Epigenetic pri-ors for identifying active transcription factor binding sitesBioinformatics 20122856ndash62

112 Mason MJ Plath K Zhou Q Identification of context-dependent motifs by contrasting ChIP binding dataBioinformatics 2010262826ndash32

113 Teng L He B Gao P et al Discover context-specificcombinatorial transcription factor interactions byintegrating diverse ChIP-Seq data sets Nucleic Acids Res201442e24

114 Vaquerizas JM Kummerfeld SK Teichmann SA et al A cen-sus of human transcription factors function expressionand evolution Nat Rev Genet 200910252ndash63

115 Ding J Cai X Wang Y et al ChIPModule systematic discov-ery of transcription factors and their cofactors from ChIP-seq data Pac Symp Biocomput 2013320ndash31

116 Ettwiller L Paten B Ramialison M et al Trawler de novoregulatory motif discovery pipeline for chromatin immuno-precipitation Nat Methods 20074563ndash5

117 Linhart C Halperin Y Shamir R Transcription factor andmicroRNA motif discovery the Amadeus platform and acompendium of metazoan target sets Genome Res2008181180ndash9

118 Kuttippurathu L Hsing M Liu Y et al CompleteMOTIFsDNA motif discovery platform for transcription factor bind-ing experiments Bioinformatics 201127715ndash7

119 Ho JW Bishop E Karchenko PV et al ChIP-chip versus ChIP-seq lessons for experimental design and data analysis BMCGenomics 201112134

120 Che D Li G Mao F et al Detecting uber-operons in prokary-otic genomes Nucleic Acids Res 2006342418ndash27

121 Liu Y Schmidt B Liu W et al CUDAndashMEME acceleratingmotif discovery in biological sequences using CUDA-enabled graphics processing units Pattern Recognit Lett2010318

122 Alipanahi B Delong A Weirauch MT et al Predicting the se-quence specificities of DNA- and RNA-binding proteins bydeep learning Nat Biotechnol 201533831ndash8


the motifs in promoters tend to evolve at a lower rate and there-fore be more conserved than nonfunctional surrounding se-quences some phylogenetic footprinting-based algorithmshave been developed to reduce the false-positive rate such asPhyloGibbs Footprinter PhyloCon and MicroFootprinter [14 28ndash32] A phylogenetic footprinting strategy was firstly proposed in1988 [33 34] and has significantly improved the state-of-the-artperformance in this field However the majority of programsbased on phylogenetic footprinting did not make full use of thephylogenetic relationship of query promoter sequences fromvarious genomes [21] Owing to this limitation some promotersfrom highly divergent species could be included and the motifinstances are not conserved enough to carry out motif predic-tion [35ndash37] Most recently Liu et al [4 38] developed two com-putational pipelines aiming to break this bottleneckSpecifically they extracted phylogenetic relationships fromregulatory sequences using a combinatorial framework basedon 216 selected representative genomes to refine the ortholo-gous promoter set It is noteworthy that all the methods men-tioned above could be potentially improved by integratingadditional experimental data

The rapid development of high-throughput biotechnologies[39ndash48] has provided new insight and powerful support for regu-latory mechanism analyses and genome-scale regulatory net-work elucidation In particular chromatin immunoprecipitationsequencing(ie ChIP-seq followed by high-throughput DNAsequencing) provides massive proteinndashDNA interactive infor-mation and has been successfully applied to genome-wide ana-lyses of transcription factor (TF) binding histone modificationmarkers and polymerase binding [39 49] This technology usesone chosen protein (ie TF histone or DNA polymerase of inter-est) to bind the whole-genome sequences [50 51] then somephysical techniques such as ultrasonic are used to shear thecross-linked DNA sequences into segments after proteinndashDNAinteraction [39] finally these segments will be sequenced intoshort reads [52 53] These reads could be mapped onto their ref-erence genome if available using Bowtie [54] BWA [55] etcBased on the mapping results the motif-enriched genomic re-gions could be identified by peak-calling tools [56] such as SPP[57] MACS [58] CisGenome [59] FindPeaks [60] QuEST [61] andPeakRanger [62] These regions will be the potential bindingsites of the chosen protein

Recent studies suggest that ChIP-seq can be effectively inte-grated into and benefit TFBS discovery tools [57 61 63ndash73] Itprovides high-throughput motif signals and allows genome-scale discovery in a cell More accurate binding regions (peaks)can be derived from ChIP-seq experiments thus leading tomore reliable prediction performance [9] However the peaksdetected from ChIP-seq data can be up to a few hundred basepairs while the documented cis-regulatory motifs are usuallyonly as long as 8ndash20 bp [74] Therefore an ab initio motif discov-ery method is still indispensable to (i) identify the accuratebinding sites from these ChIP-seq peaks and (ii) build con-served motif profiles for further study in transcriptional regula-tion Unfortunately some widely used motif discovery toolseg MEME and WEEDER [75] cannot be directly used on ChIP-seq peaks as they are designed for co-regulated promoter se-quences with limited size Recently some efforts have beenmade to rectify this problem by modifying traditional motif-finding tools to adapt to the ChIP-seq data [68 74 76] or design-ing specific strategies for ChIP-seq-based motif finding [73 77]The computational challenges of these tools include but notlimited to (i) huge amounts of sequenced ChIP-seq reads canmake motif finding a computationally infeasible problem [9] (ii)

failure to identify the motifs associated with cofactors of theChIP-ed TF [73] or cis-regulatory modules [78] (iii) lack of insightin integration of ChIP-seq data sets from multiple TFs [79] (iv)the traditional false-positive issue in motif prediction causedby the noise in ChIP-seq technology [74] (v) lack of an efficientway to determine the correct lengths of motifs except exhaust-ively enumerating each length within an interval [18 80 81]and (vi) weak support in elucidation of the mutual interactionsamong multiple motifs from larger ChIP-ed data sets [82ndash84]which is important in disease diagnosis through gene regula-tory network construction

So far there have been a few reviews on motif findingfocused on the analysis of ChIP-seq data [77 85 86] Tran et al[77] introduced nine Web tools with numerous details in theirusages and applications For each of them the authors show-cased the requirement of the input format default parametersoutput format basic characteristics of the tool and a brief intro-duction of the algorithm Rather than focusing on individual al-gorithms or tools Lihu et al [85] were dedicated to reviewingseven ensemble methods with input and output requirementsand a brief introduction of the included algorithms Kulakovskiyet al provided a systematic history of tool development in motiffinding before the ChIP-seq technology and advantages as wellas computational challenges of using ChIP-seq in motif findingSeveral ChIP-seq-based methods and their applications were re-viewed focusing on their overall stories or workflows [86]However it is still under investigation in a detailed algorithmicaspect as the performance of existing methods is far from satis-factory based on both efficiency and performance From the al-gorithmic point of view this article presents the mainchallenges and characteristics of de novo motif finding in ChIP-seq peaks through systematically reviewing eight representa-tive tools (Table 1) Specifically the review mainly focuses onthe motif-finding techniques adopted by these methods as wellas such additional specific functions as discriminative motiffinding and cofactor motif identification Through summarizingthe existing limitations and revealing algorithmic potentialsseveral promising directions for further improvements of ChIP-seq-based motif finding have been proposed both in methodo-logical aspects and concrete applications

Motif representation and identification

A motif represents a set of DNA segments with the same lengthwhich are binding sites for the same TF Each segment of themotif is called an instance and different instances of the samemotif tend to be similar with each other on sequence level(Figure 1A) A representation model of a motif to demonstratethe similarity of its instances is expected to accurately capturethe characteristics of proteinndashDNA binding activity of its corres-ponding TF [90]

The most straightforward model to denote the binding pref-erence of a TF on each position along a motif is the lsquoconsensusrsquosequence (eg AGTCA or AGTCG for the motif in Figure 1A) whichis composed of the concatenation of the most frequent nucleo-tide on each position It can be seen as the ancestor of the bind-ing sites of the same TF with an assumption that these sitesevolved from it Although the consensus presents the character-istics of a motif in each position in a simple and clear way thevariations in this motif are absent in this model The lsquodegener-ate consensusrsquo was proposed to fill this gap using IUPAC wildcards to replace the exact nucleotides (A G C and T) For ex-ample W means both A and T in this position could be recog-nized by the TF of this motif (Figure 1B) [9]

2 | Liu et al



06 0 0 02 04

0 06 0 0 04

0 04 0 08 02

04 0 1 0 0

2666664

3777775

(1)



jfrac141

X4

ifrac141

pij logpij

bi(2)



jfrac141

X4

ifrac141

pij logpij

bi dij (3)


able

1Fe

atu

res

of

the

too

lsfo

rC

hIP

-seq

mo

tif

fin

din

g

Pro

gram

sM

ICSA

Ch

IPM

un

kD

REM

ER

SAT

pea

k-m

oti

fsFM

oti

fSI

OM

ICS

Dis

cro

ver

RPM

CM

C

Wo

rd-b

ased

Yes

Yes

Yes

Yes

Yes

Pro

file

-bas

edY

esY

esY

esT

ech

niq

ues

and

crit

eria

MEM

EthornPS

SMsc

ore

IC-l

ike

fun

ctio

ns

thornse

qu

ence

wei

ght

Wo

rdco

un

tthorn

Fish

erte

stR

SAT

oli

go-a

nal

ysis

d

yad

-an

alys

isl

oca

lwo

rdan

alys

isM

EME

and

Ch

IPM

un

k

Wo

rden

um

erat

ion

thornsu

ffix

treethorn

IC-b

ased

z-sc

ore

IC-l

ike

fun

ctio

ns

thorntr

eem

eth

od

sH

MM thornm

ult

iple

test

ing

corr

ecte

dP-

valu

e

MC

MC

Dis

crim

inat

ive

mo

tif

fin

din

gY

esY

esY

es

Co

fact

ors

Mu

ltip

lem

oti

fsM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fm

od

ule

sM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fco

mp

aris

on

RSA

Tco

mp

are-

mo

tifs

STA

MP

TO

MT

OM

Yea

ro

fre

leas

ean

dre

fere

nce

s20

10[7

9]20

10[6

9]20

11[7

3]20

12[8

7]20

14[7

4]20

14[7

8]20

14[8

8]20

15[8

9]

A B

C












Word-based methods




4 | Liu et al




log ethnTHORNl

Xl

jfrac141

X4

i

pij log pij 1n

Xallsegments


24

35 (4)









KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)










6 | Liu et al


















Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al































06 0 0 02 04

0 06 0 0 04

0 04 0 08 02

04 0 1 0 0

2666664

3777775

(1)



jfrac141

X4

ifrac141

pij logpij

bi(2)



jfrac141

X4

ifrac141

pij logpij

bi dij (3)


able

1Fe

atu

res

of

the

too

lsfo

rC

hIP

-seq

mo

tif

fin

din

g

Pro

gram

sM

ICSA

Ch

IPM

un

kD

REM

ER

SAT

pea

k-m

oti

fsFM

oti

fSI

OM

ICS

Dis

cro

ver

RPM

CM

C

Wo

rd-b

ased

Yes

Yes

Yes

Yes

Yes

Pro

file

-bas

edY

esY

esY

esT

ech

niq

ues

and

crit

eria

MEM

EthornPS

SMsc

ore

IC-l

ike

fun

ctio

ns

thornse

qu

ence

wei

ght

Wo

rdco

un

tthorn

Fish

erte

stR

SAT

oli

go-a

nal

ysis

d

yad

-an

alys

isl

oca

lwo

rdan

alys

isM

EME

and

Ch

IPM

un

k

Wo

rden

um

erat

ion

thornsu

ffix

treethorn

IC-b

ased

z-sc

ore

IC-l

ike

fun

ctio

ns

thorntr

eem

eth

od

sH

MM thornm

ult

iple

test

ing

corr

ecte

dP-

valu

e

MC

MC

Dis

crim

inat

ive

mo

tif

fin

din

gY

esY

esY

es

Co

fact

ors

Mu

ltip

lem

oti

fsM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fm

od

ule

sM

ult

iple

div

erse

mo

tifs

Mu

ltip

led

iver

sem

oti

fsM

oti

fco

mp

aris

on

RSA

Tco

mp

are-

mo

tifs

STA

MP

TO

MT

OM

Yea

ro

fre

leas

ean

dre

fere

nce

s20

10[7

9]20

10[6

9]20

11[7

3]20

12[8

7]20

14[7

4]20

14[7

8]20

14[8

8]20

15[8

9]

A B

C












Word-based methods




4 | Liu et al




log ethnTHORNl

Xl

jfrac141

X4

i

pij log pij 1n

Xallsegments


24

35 (4)









KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)










6 | Liu et al


















Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al




































Word-based methods




4 | Liu et al




log ethnTHORNl

Xl

jfrac141

X4

i

pij log pij 1n

Xallsegments


24

35 (4)









KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)










6 | Liu et al


















Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al
































log ethnTHORNl

Xl

jfrac141

X4

i

pij log pij 1n

Xallsegments


24

35 (4)









KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)










6 | Liu et al


















Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al





























KDIC frac14Xl

jfrac141

X4

ifrac141

log fij log N

NXl

jfrac141

X4

ifrac141

fij log bi (5)










6 | Liu et al


















Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al














































Speed motifcoverage
























8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al





































8 | Liu et al





Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al

































Discussion


0

10

20

30

40

50

60

70

80

90

100

2011 2012 2013 2014 2015 2016

Rela

ve

annu

al ci

tao

n (

)

Year of citaon

RPMCMC 2015[91]

Discrover 2014[93]

SIOMICS 2014[78]

Fmof 2014[74]


DREME 2011[73]

ChIPMunk 2010[69]

MICSA 2010[79]



0

5

10

15

20

25

30

35

40

45

50

Cita

ons

pery

ears

ince

publ

ished

MICSA 2010[79]

ChIPMunk 2010[69]

DREME 2011[73]


Fmof 2014[74]

SIOMICS 2014[78]

Discrover 2014[93]

RPMCMC 2015[91]

MEME-ChIP 2011[68]a








Key Points







Funding











10 | Liu et al


























































































12 | Liu et al


































Key Points







Funding











10 | Liu et al


























































































12 | Liu et al






















































































































12 | Liu et al





























An algorithmic perspective of de novo cis-regulatory motif ... us...sequencing (ChIP-seq) technology...

Documents

Transcript of An algorithmic perspective of de novo cis-regulatory motif ... us...sequencing (ChIP-seq) technology...