Statistical Techniques for Examining Gene Regulation
A thesis presented
by
Shane Tyler Jensen
to
The Department of Statistics
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in the subject of
Statistics
Harvard University
Cambridge, Massachusetts
May 2004
©2004 - Shane T. Jensen
All rights reserved.
Thesis Advisor: Professor Jun S. Liu
Author: Shane Tyler Jensen
Abstract
Genes are often regulated in living cells by proteins called transcription factors
(TFs) that bind directly to short segments of DNA in close proximity to certain tar-
get genes. These short segments have a conserved appearance, which is called a
motif. The experimental determination of TF binding sites is expensive and time-
consuming. Many motif-finding programs have been developed but no program
is clearly superior in all situations, making it difficult to judge which of the motifs
predicted by these algorithms is biologically relevant.
This thesis provides a review of previous approaches to the problem of motif dis-
covery. We derive a comprehensive scoring function based on a full Bayesian
model, which can handle unknown site abundance, unknown motif width, and
two-block motifs with variable-length gaps. In addition, this scoring function for-
mulation enables us to objectively compare different predicted motifs and select
the optimal ones, effectively combining the strengths of existing programs.
An algorithm, BioOptimizer, is proposed to optimize a scoring function, thereby
reducing noise in the motif signal found by any motif-finding program. The accu-
racy of BioOptimizer, when used in conjunction with several existing programs,
is shown to be superior to any of these motif-finding programs alone when eval-
uated by simulation studies and real-data applications in bacteria.
We then propose a Bayesian hierarchical clustering model for the common struc-
ture between a set of discovered motifs. This clustering model is implemented,
using a Gibbs sampling strategy, on a dataset of 116 TF motifs and several ap-
proaches to analyzing the clustering results are discussed. A Uniform clustering
prior is also considered and is compared to the Dirichlet process prior. Our clus-
tering strategy is general enough to be appropriate and useful in a variety of other
statistical settings.
Finally, our techniques for motif discovery and motif clustering are used in combination
to predict co-regulated genes in the bacterium Bacillus subtilis. Sequences
from several closely related species are used to discover motifs conserved by evo-
lution, and these conserved motifs are then used to cluster genes together into
putative co-regulated groups. This clustering is validated and examined in detail
using several external measures of cell regulation.
Contents
Title page
Abstract
Table of Contents
List of Figures
List of Tables
Acknowledgments
1 Introduction and Previous Work
1.1 The Biology of Transcription Regulation
1.2 Consensus Sequence Formulation
1.3 Position-Specific Weight Matrix Formulation
1.4 Motif Discovery for the PSWM Formulation
1.5 Problems with Existing Motif Discovery Methods
1.6 Modeling Motif Similarity by Clustering
1.7 Combining Motif Discovery and Clustering
2 Bayesian Motif Discovery Models
2.1 A Full Bayesian Model
2.2 Markov Chain Monte Carlo Implementation
2.3 Fixed Number of Sites in A
2.4 Unrestricted Model for A
2.5 Dealing with Multiple Motif Types
2.6 Extensions of the Bayesian Motif Model
2.6.1 Variable motif abundance p0
2.6.2 Variable motif width w
2.6.3 Two-Block Motifs
3 Scoring Function Optimization
3.1 Bayesian scoring functions
3.2 Non-Bayesian scoring functions
3.3 Optimizing a scoring function
3.4 Using Scoring Functions to Extend the Model
3.4.1 Overlapping Motif Sites
3.4.2 Unknown Motif Site Abundance
3.4.3 Unknown Motif Width
3.4.4 Two-Block Motifs
3.5 Detecting Poor Motifs with the Null Score
4 Motif Discovery Results
4.1 Simulation Comparison of Scoring Functions
4.2 Real Data Comparison of Scoring Functions
4.3 Simulation Comparison of Motif-Finding Programs
4.4 Real Data BioOptimizer Evaluation: One-Block
4.5 Real Data BioOptimizer Evaluation: Two-Block
4.6 Using Different Motif Width Prior Distributions
4.7 Special Restrictions on A in Real Data
5 Bayesian Motif Clustering Model
5.1 Hierarchical Framework
5.2 Clustering of Observations
5.3 Gibbs Sampling Implementation
5.4 Motif Alignment
5.5 Clustering of Two-Block Motifs
5.6 Advantages of our Clustering Model
5.7 Comparison with Other Clustering Priors
6 Analyzing Motif Clustering Results
6.1 Clustering Trees
6.2 Best Clustering Partition
6.3 Strength of Clusters
6.4 Examining Particular Clusters in Detail
6.5 Effect of Prior Specification on Clustering Results
6.6 Effect of w on Clustering Results
7 Prediction of Co-Regulated Genes
7.1 Collection of Orthologous Gene Sets
7.2 Motif Discovery
7.3 Clustering Genes Based on Discovered Motifs
7.3.1 Validation of Gene Clusters
7.4 Studyset Clustering Results
7.5 Detailed Examination of Studyset Clusters
7.6 Whole Genome Clustering Results
7.7 Detailed Examination of Whole Genome Clusters
8 Discussion and Future Work
List of Figures
1.1 Sequence logo of a motif
1.2 Four different discovered motifs
2.1 Graphical representation of the motif discovery parameters
2.2 Graphical representation of a two-block motif
4.1 Sequence logo of known CRP sites
4.2 Comparison of different prior width penalty terms
5.1 Comparison of clustering statistics between DP and Uniform priors
6.1 Clustering tree for dataset based on a motif width of 8 bps
6.2 Sequence logos for clusters 1 and 2, with families
6.3 Clustering statistics between Uniform and DP models
6.4 Comparison of clustering trees between Uniform and DP models
6.5 Distribution of motif widths in dataset
6.6 Comparison of clustering trees using different motif widths
7.1 Microarray and sequence-based gene clustering procedures
7.2 Phylogenetic tree of seven related bacterial species
7.3 Flowchart for motif discovery procedure
7.4 Clustering tree for studyset joint-block motifs
7.5 Flowchart for studyset motif clustering procedure
7.6 Distribution of cluster sizes for studyset best partitions
7.7 Graph of connected studyset clusters
7.8 Flowchart for genome motif clustering procedure
7.9 Distribution of cluster sizes for whole genome partition
7.10 Graph of connected and significant whole genome clusters, part 1
7.11 Graph of connected and significant whole genome clusters, part 2
List of Tables
1.1 IUPAC nomenclature for consensus sequences
1.2 Matrix representations of a motif
4.1 Simulation comparison of scoring function optimizations
4.2 Comparison of scoring function optimizations on the CRP dataset
4.3 Simulation comparison of motif-finding programs
4.4 Comparison of motif predictions for one-block datasets
4.5 Comparison of motif predictions for two-block datasets
4.6 Performance of different motif width priors
6.1 Protein Families in Dataset
6.2 Best partition of clusters for dataset
6.3 Top five clusters for all three motif widths
7.1 Bacterial species included in the study
7.2 Orthologous gene pairs with B. subtilis
7.3 Sequence distributions for each dataset
7.4 Significant studyset predicted clusters
7.5 Genome clusters significant on multiple measures
Acknowledgments
This dissertation would not have been possible without the guidance and insight
of my advisor, Professor Jun S. Liu. Although I was able to match his enthusiasm,
I was totally incapable of keeping up with all the ideas and possible directions
that he would share with me. The result is that I still have a “TO-DO”
list that rivals my thesis in terms of length. I could not have asked for a more
supportive and generous advisor.
Many thanks go to Professor Donald Rubin for his advice and insight as well as
ample amounts of his excellent scotch. Also, thanks to Don, I will never again
attempt to present my implementation method before presenting my model. I
am also grateful to the rest of the statistics faculty for their teaching and helpful
discussions throughout my time here at Harvard.
Also essential to my thesis were the biological applications provided by Profes-
sor Richard Losick and his molecular biology group, especially Patrick Eichen-
berger. Many of the novel statistical techniques I present in this thesis evolved
from the interesting scientific questions posed by Rich and Patrick. I am thankful
for my other collaborators as well, with special thanks to Cristian Castillo-Davis
for helpful discussions and the use of his program GeneMerge, Lei Shen for his
computational assistance, and Xiaole Liu for use of her program BioProspector.
The single person who has borne the brunt of my stress and anxiety over the
last five years is Ms. Aline Normoyle. My work would not have been possi-
ble without her love, support and sense of humour. I also thank my family for
encouraging me and helping to keep my life in perspective.
I was very lucky to arrive at Harvard at the same time as the statistics ladies,
Liz Stuart and Sam Cook. I cannot even speculate where I would be without
their friendship and support. Hosung Kang, Gopi Goswami, and Byron Ellis
arrived the year after we did. All my friends have been patient with me through
five years of paranoid diatribes and the experiences we shared together were
seriously good times. I am also grateful to Jim Greiner, Mayetri Gupta, Nondas
Sourlas, Claudia Pedroza, and the rest of the students in the Harvard statistics
department for their friendship and statistics help.
I am very appreciative of the non-Harvard perspective of my friends Tal Nawy
and Azadeh Akhavan who have been incredibly supportive of me for as long as
I can remember. I also thank my Masters advisor, Professor George Styan, for
encouraging me to continue my graduate education.
Finally, I would like to thank the Boston Red Sox and New England Patriots for
providing me with endless distraction and, in the case of the Pats, inspiration
to succeed. The Red Sox, on the other hand, taught me that you don’t have to
succeed to still be entertaining, and I am, if nothing else, very entertaining.
Chapter 1
Introduction and Previous Work
1.1 The Biology of Transcription Regulation
The complete information that defines the characteristics of living cells within an
organism is encoded in the form of a moderately simple molecule, deoxyribonu-
cleic acid, or DNA. The building blocks of DNA are four nucleotides, abbrevi-
ated by their attached organic bases as A, C, G, and T. A-T and C-G are com-
plementary bases between which hydrogen bonds can form. A DNA molecule
consists of two long chains of nucleotides that are complementary to each other,
joined by hydrogen bonds, and twisted into a double helix. This structure gives
rise to the term “base pair” when describing a DNA sequence. The specific order-
ing of these nucleotides, the “genetic code”, is the means by which information is
stored that completely defines all functions within a cell. With the recent development
of high-throughput sequencing technology, the National Institutes of Health
genetic sequence database, GenBank, has sustained an exponential growth rate
since 1982. GenBank currently contains the complete genomic sequences of over
1000 organisms (Benson et al., 2002) with approximately 22 billion DNA bases.
The central dogma of molecular biology dictates that certain segments of the
DNA (i.e., genes) are transcribed into another molecule, RNA, which serves as
a transient template to make the basic building blocks of cellular life, proteins.
Although all the cells in the same organism possess exactly the same DNA se-
quences (i.e., genetic information), they display different physiological character-
istics in different tissues, developmental stages, and environmental conditions.
This “differentiation” is caused by the differences among the collections of pro-
teins that are synthesized in different cells or at different cell states. If a protein
is being synthesized at a certain state, its coding DNA (called a gene) is termed
“active” or “expressed”. Thus, a cell in a particular physiological state can be
roughly viewed as a mechanical system where each different gene is switched
either on (active) or off (inactive).
In many organisms, the DNA that codes for proteins (genes) is only a small por-
tion of the total genomic DNA. For example, genes make up only about 1.5% of
the human genome (IUPAC, 1986). The non-coding components of DNA, which
were initially considered as “junk” sequences, actually contain the control mech-
anisms for activating and deactivating the genes, and thus the synthesis and non-
synthesis of proteins. Most of the control sequences for a gene lie in the upstream
regulatory region, which is the region of a few hundred or thousand base pairs
directly before the gene. Transcribing or activating a gene requires not only the
DNA sequence in the upstream region, but also many proteins called transcription
factors (TFs). When these TFs are present, they bind to specific DNA patterns
in the upstream sequence of genes, and either induce or repress the transcription
of these genes by recruiting other necessary proteins (Lodish et al., 1995).
One transcription factor can bind to many different upstream regions, thus regu-
lating the transcription of many genes. The binding sites of the same transcription
factor show significant sequence conservation, which is often summarized
as a short (5-20 bases long) common pattern called a transcription factor binding
motif (TFBM) or binding consensus, although some variability is tolerated. It is
the main focus of this thesis to discover the locations of these motif sites and to
model the common patterns shared by different motifs.
In prokaryotes (lower organisms without nuclei), there are fewer TFs, their motifs
tend to be relatively long, and the strength of regulation for a particular gene
often depends on how closely a particular site matches the consensus for the motif.
The more mismatches to the consensus in a binding site, the less often the TF will
bind and therefore the less control it will exert on the target gene. The variability
between sites is sometimes crucial to the regulatory process, since TF binding
sites that are perfect matches to the optimal pattern would bind the TF too tightly,
preventing the subsequent steps of transcription (Pfahl, 1981).
In eukaryotes (higher organisms with nuclei), many more transcription factors
are involved in the regulation of a gene, and their binding motifs tend to be
shorter. Eukaryotic upstream regions usually contain regulatory modules, a col-
lection of adjacent binding sites (sometimes multiple binding sites) of several
transcription factors. Transcription regulation not only relies on the combina-
tion of the TFs involved, but also on the number of site copies in the upstream
regions (Werner, 1999).
Characterizing the motifs of TFs and locating TF binding sites are crucial tasks
for understanding how the cell regulates its genes in response to developmental
and environmental changes. However, the gold standard experimental proce-
dures to determine binding sites are inefficient, sometimes impractical, and can
only discover one transcription factor binding site at a time. With the availability
of complete genome sequences, biologists are using techniques such as DNA
microarrays (Schena et al., 1995) or serial analysis of gene expression (Velculescu
et al., 1995) to measure the expression level of every gene in an organism in vari-
ous conditions.
A genome can be divided into gene clusters according to similarities in their gene
expression (Eisen et al., 1998). Genes in the same expression cluster respond simi-
larly to environmental and developmental changes and thus may be co-regulated
by the same TF or the same group of TFs. Therefore, our computational analysis
can be focused on the search for TF binding sites in the upstream regions of genes
contained in a particular cluster.
Another experimental procedure called Chromatin Immuno-Precipitation follow-
ed by microarray (ChIP-array or ChIP-on-chip) can measure where a particular
TF binds to DNA in the whole genome, although at a coarse resolution of 1-2
kbps. Again, computational analysis is required to pinpoint the short binding
sites of a transcription factor from all the long TF binding targets.
A focused version of these experiments involves a comparison between normal
(“wild-type”) organisms in a particular species and mutant organisms that have
had a specific regulatory protein “knocked out” of their genome. These mutant
organisms cannot produce this particular regulatory protein of interest, and so
whichever genes are normally regulated by this protein will not be regulated in
these mutant organisms. Thus, any genes that show large differences in gene
expression (as measured by DNA microarrays) are considered as possible targets
of this regulatory protein.
However, in practice it is often difficult to measure differential expression from
microarray data (Tseng et al., 2001), and often arbitrary expression thresholds are
used to classify genes as either differentially regulated or not. As a consequence,
the set of genes that is used in order to search for TF binding sites can contain
several “false-positive” sequences corresponding to genes that were judged to be
under the control of a protein of interest, but in reality are not. Thus, the dis-
covery of TF binding sites can serve as an important validation technique when
attempting to elucidate the set of genes controlled by a particular protein.
With the ever-expanding number of sequenced whole genomes and the growth of
high-throughput gene expression and protein-DNA binding data, motif finding and
transcription regulatory network elucidation have become major research topics in
computational biology.
There are two ways of discovering novel binding sites of a TF: scanning meth-
ods and de novo methods. In a scanning method, one uses a motif representation
resulting from experimentally determined binding sites to scan the genome se-
quence to find more matches. In de novo methods, one attempts to find novel
motifs that are “enriched” in a set of upstream sequences. This thesis focuses on
the latter class of methods. The de novo methods can also be divided into two
classes, according roughly to two general data formulations for representing a
motif: the consensus sequence or a position-specific weight matrix (PSWM).
1.2 Consensus Sequence Formulation
The consensus sequence shows the motif as a string of IUPAC (1986) characters as
shown in Table 1.1. For example, the Mse motif consensus CRCAAAW suggests
that the Mse protein binds to sites starting with a C, followed by A or G, followed
by CAAA, and followed by A or T. In this section, we use word and segment interchangeably
to mean a short DNA sequence being tested by our motif model as a
potential binding site. When scanning a set of sequences against a consensus, all
words matching the consensus are considered putative binding sites. This some-
times results in many false positive sites, and it may miss some true sites with
variability that isn’t represented by the sequence.
Table 1.1: IUPAC nomenclature for consensus sequences
A  Adenine                     C  Cytosine
G  Guanine                     T  Thymine
R  Purines (A,G)               Y  Pyrimidines (C,T)
W  Weak hydrogen bond (A,T)    S  Strong hydrogen bond (C,G)
M  Amino Group (A,C)           K  Keto Group (G,T)
B  not A (C,G,T)               D  not C (A,G,T)
H  not G (A,C,T)               V  not T (A,C,G)
N  any (A,C,G,T)
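The scanning step described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names and the example sequence are hypothetical, and the consensus is translated into a regular expression simply because that is a convenient way to express "any word matching the consensus".

```python
import re

# IUPAC nucleotide codes from Table 1.1, mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join("[" + IUPAC[ch] + "]" for ch in consensus)

def scan(sequence, consensus):
    """Return the 0-based start positions of all words matching the consensus."""
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# Hypothetical sequence containing two words that match the Mse consensus.
hits = scan("TTCGCAAATGCACAAAACC", "CRCAAAW")
```

Every word that matches is reported as a putative site, which is exactly why this formulation can produce false positives and cannot recover sites that deviate from the consensus.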
Early research on discovering motifs was usually simplified to finding a sequence
pattern enriched or over-represented in the sequence dataset compared to the
genome background. Therefore, many computational algorithms for finding mo-
tif consensus sequences adopted a “pattern-driven” or “word enumeration” ap-
proach by enumerating predefined consensus patterns to see which is signifi-
cantly enriched in the sequence dataset.
The first consensus sequence enumeration method was developed by Galas et al.
(1985) to search for a TATA-box motif that appears once in each upstream region.
They first align all the upstream sequences at the transcription start site. Then, for
every aligned position, they search the 9-base windows centered at that position
in all the sequences. In this window, every possible pattern $b_i$ of width 6 is
scored according to $S(b_i) = (6/6)\,q_{i6} + (5/6)\,q_{i5} + (4/6)\,q_{i4}$, where $q_{ik}$ is the number
of sequences whose best-matching 6-mer (subsequence of length 6) to $b_i$ in the
9-base window has $k$ matched positions. The highest-scoring pattern is considered
a potential motif, and the positions corresponding to it are considered
potential binding locations.
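As a minimal sketch of this score, assuming the 9-base windows have already been extracted from the aligned sequences, the following Python computes S(b) for one candidate pattern of width 6. The function names are ours and are not taken from Galas et al. (1985).

```python
def best_match_count(window, pattern):
    """Largest number of matched positions between the pattern and any
    word of the same width inside the window."""
    w = len(pattern)
    return max(
        sum(a == b for a, b in zip(window[s:s + w], pattern))
        for s in range(len(window) - w + 1)
    )

def galas_score(windows, pattern):
    """S(b) = (6/6) q_6 + (5/6) q_5 + (4/6) q_4 for a pattern of width 6.

    Each window contributes k/6 when its best-matching 6-mer has k >= 4
    matched positions, and nothing otherwise.
    """
    if len(pattern) != 6:
        raise ValueError("this sketch assumes patterns of width 6")
    score = 0.0
    for window in windows:
        k = best_match_count(window, pattern)
        if k >= 4:
            score += k / 6
    return score
```

A full search would evaluate this score for all 4^6 = 4096 candidate patterns at every aligned position and keep the highest-scoring one.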
In most motif finding problems, the binding site locations are unknown and
their distances from the transcription start site vary extensively. Therefore, oligo-analysis
(van Helden et al., 1998) was developed to find sequence patterns enriched
in the whole upstream region. This method enumerates every possible
pattern $b_i$ of a given width to determine whether it occurs in the dataset more
often than expected. Sinha and Tompa (2000) later extended this method to allow for
a one-base mismatch and to use the IUPAC alphabet to find motifs with more flexible
base substitutions. To speed up computation, Sinha and Tompa calculated
the mean and variance of the number of occurrences of $b_i$ and determined its
significance by a Z-test. Their calculations were based on a 3rd order Markov
model for non-coding sequences in the genome. As shown in Liu et al. (2001),
the Markov model discriminates against meaningless patterns such as AAAA or
ATAT that are frequently found in the non-coding sequences and therefore in-
creases the specificity of the discovered motifs.
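The idea of scoring a word by how far its observed count exceeds its expectation can be illustrated with a deliberately simplified z-score. The sketch below assumes an independent-base background and a binomial variance that ignores overlapping occurrences; it is therefore not the 3rd order Markov calculation of Sinha and Tompa (2000), and the function name and arguments are hypothetical.

```python
from collections import Counter
from math import sqrt

def word_z_scores(sequences, width, base_freq):
    """Simplified over-representation z-score for every observed word.

    Assumptions (ours): bases are independent with frequencies base_freq,
    and the count of a word is treated as binomial over all start positions.
    """
    counts = Counter()
    positions = 0
    for seq in sequences:
        for s in range(len(seq) - width + 1):
            counts[seq[s:s + width]] += 1
            positions += 1
    scores = {}
    for word, observed in counts.items():
        p = 1.0
        for base in word:
            p *= base_freq[base]          # probability of the word at one position
        expected = positions * p
        variance = positions * p * (1.0 - p)
        scores[word] = (observed - expected) / sqrt(variance)
    return scores
```

Under an independent-base background, low-complexity words such as AAAA can score highly in AT-rich non-coding DNA, which is the weakness that a Markov background is meant to address.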
The time to enumerate all possible consensus patterns increases exponentially as
the pattern width increases, so finding longer motif patterns is a challenge. Since
many long motifs are more conserved near the two ends, van Helden et al. (2000)
proposed to detect long motifs as spaced dyad patterns such as $w_1 \cdot n_s \cdot w_2$, where
$w_1$ and $w_2$ are the dyad motif words of sufficiently short width, and $n_s$ is the $s$-base
spacer of unspecified sequence. The expected occurrences of a spaced dyad
can either be calculated from the joint distribution of $w_1$ and $w_2$, assuming that $w_1$ and
$w_2$ are conditionally independent, or by counting $w_1 \cdot n_s \cdot w_2$ occurrences in the
whole genome non-coding sequences.
Another method encodes nucleotides using a 2-bit binary number instead of an 8-bit
character, and converts the sequence into a much shorter array for quick access
(Hampson et al., 2000). A third method uses a suffix tree to represent all patterns
of all widths that exist in the whole genome non-coding regions (Brazma et al.,
1998). Each node contains a sequence pattern that reflects the path from the root
to the node, and stores information on the count and location of all the sequences
matching that pattern. In addition, each node can branch into A, C, G, and T to
form patterns one base longer. Although building the full tree is extremely time and
memory intensive, one can trim many “rare” nodes to speed up tree-building. Keich
and Pevzner (2002) introduce models for more refined consensus pattern searching,
which are useful in the case of very subtle motifs.
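The 2-bit encoding idea can be illustrated as follows. This is a sketch of the general technique only: the particular bit assignment (A=0, C=1, G=2, T=3), the function names, and the rolling update are our assumptions, not a reconstruction of the implementation in Hampson et al. (2000).

```python
# Assumed 2-bit codes for the four nucleotides.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode(word):
    """Pack a DNA word into an integer, two bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value

def decode(value, width):
    """Recover the DNA word of the given width from its integer code."""
    bases = []
    for _ in range(width):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))

def count_words(sequence, width):
    """Count every word of the given width, indexed by its integer code."""
    counts = [0] * (4 ** width)
    mask = (1 << (2 * width)) - 1
    value = 0
    for i, base in enumerate(sequence):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= width - 1:
            counts[value] += 1
    return counts
```

Because each word maps to an array index, looking up or updating the count of a pattern takes constant time, at the cost of an array of size 4^width.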
A recent method called MobyDick builds longer motifs by concatenating shorter
ones (Bussemaker et al., 2000). MobyDick models the sequence dataset as being
generated by concatenations of words drawn independently from a dictionary
with their respective “usage” frequencies. The initial motif dictionary contains
individual bases A, C, G and T, with their frequencies estimated from genome
non-coding sequences. Longer patterns are formed by adding into the dictionary
those concatenated word pairs that have occurred more than expected (e.g., “CG”
would be treated as a new word if its occurrence is significantly more than what
is expected from the independent pairing). The frequencies are re-estimated for
all the words in the new dictionary to maximize the likelihood of generating the
sequence dataset. The process is repeated until no new words can be added. This
method has recently been generalized to a stochastic dictionary model (Gupta
and Liu, 2003).
1.3 Position-Specific Weight Matrix Formulation
An alternative motif formulation is a position-specific weight matrix (PSWM) or
simply motif matrix, which measures the desirability of each base at each position
of the motif. The simplest matrix is an alignment matrix $N_{jk}$, which records the
occurrence of base $k$ at position $j$ of all the aligned sites for this motif (Table 1.2).
Also shown in Table 1.2 are the corresponding frequency matrix ($f_{jk} = N_{jk}/N$),
where $N$ is the number of motif sites, and the weight matrix $\log[f_{jk}/\theta_{0k}]$ (Hertz and
Stormo, 1999), where $\theta_{0k}$ is the proportion of base $k$ in the non-motif (background)
positions.
Table 1.2: Matrix representations of a motif
       Alignment matrix       Frequency matrix            Weight matrix
Pos     A    C    G    T      A     C     G     T       A     C     G     T
1       0    4    7    1     0.00  0.33  0.58  0.08   -2.6   0.3   0.8  -1.0
2       2    1    8    1     0.17  0.08  0.67  0.08   -0.4  -1.0   0.9  -1.0
3       0    0   12    0     0.00  0.00  1.00  0.00   -2.6  -2.6   1.3  -2.6
4      12    0    0    0     1.00  0.00  0.00  0.00    1.3  -2.6  -2.6  -2.6
5       0    0    0   12     0.00  0.00  0.00  1.00   -2.6  -2.6  -2.6   1.3
6       0    0    0   12     0.00  0.00  0.00  1.00   -2.6  -2.6  -2.6   1.3
7      12    0    0    0     1.00  0.00  0.00  0.00    1.3  -2.6  -2.6  -2.6
8       6    1    2    3     0.50  0.08  0.17  0.25    0.7  -1.0  -0.4   0.0
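The relationship between the three matrices can be made concrete with a short Python sketch that starts from the alignment counts of Table 1.2. Two choices here are assumptions on our part: a uniform background of 0.25 per base, and a pseudocount of 0.25 added to every count before taking natural logarithms so that zero counts do not produce log(0). Neither is stated in the text, although together they appear to reproduce the rounded weight entries of the table.

```python
from math import log

# Alignment matrix N_jk from Table 1.2: rows are positions 1-8,
# columns are the bases A, C, G, T.
ALIGNMENT = [
    [0, 4, 7, 1],
    [2, 1, 8, 1],
    [0, 0, 12, 0],
    [12, 0, 0, 0],
    [0, 0, 0, 12],
    [0, 0, 0, 12],
    [12, 0, 0, 0],
    [6, 1, 2, 3],
]

def frequency_matrix(alignment):
    """f_jk = N_jk / N, where N is the number of aligned sites."""
    return [[n / sum(row) for n in row] for row in alignment]

def weight_matrix(alignment, background, pseudocount=0.25):
    """log[f_jk / theta_0k], with an assumed pseudocount to avoid log(0)."""
    weights = []
    for row in alignment:
        total = sum(row) + pseudocount * len(row)
        weights.append([
            log(((n + pseudocount) / total) / b)
            for n, b in zip(row, background)
        ])
    return weights

freq = frequency_matrix(ALIGNMENT)
weights = weight_matrix(ALIGNMENT, [0.25, 0.25, 0.25, 0.25])
```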
Schneider and Stephens (1990) used the position-specific weight matrix to construct
a Sequence Logo as a means by which to visualize the appearance of the motif.
Figure 1.1 gives the sequence logo corresponding to the matrix in Table 1.2.
The height of each position is equal to its information content ($\sum_k f_{jk} \log[f_{jk}/\theta_{0k}]$)
and the size of each letter is proportional to the letter’s relative frequency.
Figure 1.1: Sequence logo of a motif
A formal statistical model for the position-specific weight matrices was described
in Lawrence and Reilly (1990) and a complete Bayesian method was given in
Liu (1994) and Liu et al. (1995). In this model, the sequence data is represented
as an array $S$, where $S_{ij}$ is the base in position $j$ of sequence $i$. Each base can
take on $K = 4$ different values corresponding to the nucleotides A, C, G, and
T. To reflect the fact that the motif sites within $S$ are substrings of length $w$ that
are conserved relative to each other, we model them as independent realizations
from a common Motif model. That is,
$$(r_1, \ldots, r_w) \sim \text{ProductMultinomial}(\Theta = (\theta_1, \theta_2, \ldots, \theta_w))$$
if $(r_1, \ldots, r_w)$ is an observed motif site in $S$, where $\theta_j = (\theta_{jA}, \theta_{jC}, \theta_{jG}, \theta_{jT})$ is a
probability vector for the preference of the four nucleotide types in position $j$.
This model means that, for example, the motif site “TTACTAA” is generated with
probability $\theta_{1T}\theta_{2T}\theta_{3A}\theta_{4C}\theta_{5T}\theta_{6A}\theta_{7A}$.
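The product-multinomial probability of a site is simply a product of one matrix entry per position. A minimal sketch, in which the representation of Θ as a list of per-position dictionaries is our own choice:

```python
def site_probability(site, theta):
    """Probability of a motif site under the product-multinomial model.

    theta[j][base] is the probability of `base` at motif position j
    (0-based here, whereas the text numbers positions from 1).
    """
    if len(site) != len(theta):
        raise ValueError("site width must equal the motif width w")
    prob = 1.0
    for j, base in enumerate(site):
        prob *= theta[j][base]
    return prob
```

For a motif of width 7, calling `site_probability("TTACTAA", theta)` multiplies the seven entries listed in the text.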
The remainder of the sequences are classified as nonsites or background, for
which the simplest model is the multinomial distribution with the “null” frequency
$\theta_0 = (\theta_{0A}, \ldots, \theta_{0T})$. Since the motif sites are only a tiny fraction of the
whole sequence data, we can estimate $\theta_0$ first (e.g., direct counting of the 4 nucleotide
types) and subsequently treat it as known. It has been shown recently
that using a Markov chain to model the nonsite positions can improve the motif
specificity (Liu et al., 2001).
From the alignment of a set of binding sites, we can easily derive a frequency
matrix $f_{jk}$, which is the MLE of $\theta_{jk}$, and the weight matrix given in Table 1.2.
These matrices can be used to scan the whole genome sequence, by computing
for each segment the likelihood of that segment being generated from the motif
model, to discover novel realizations of the binding motif. This strategy tends to
be more accurate in capturing the correct sites than using the matching criterion
based upon the consensus sequence formulation.
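A matrix scan of this kind can be sketched as follows. We assume here that each segment is scored by the log-ratio of its likelihood under the motif model to its likelihood under the background model, which amounts to summing weight matrix entries; the function name, the data layout, and the requirement that all entries of the motif matrix be strictly positive (for example, after adding pseudocounts) are our assumptions.

```python
from math import log

def scan_scores(sequence, theta, theta0):
    """Log-likelihood ratio (motif vs. background) of every segment of width w.

    theta[j][base] is the motif probability at position j and theta0[base]
    is the background probability; all entries are assumed to be positive.
    """
    w = len(theta)
    scores = []
    for s in range(len(sequence) - w + 1):
        segment = sequence[s:s + w]
        scores.append(sum(
            log(theta[j][base] / theta0[base])
            for j, base in enumerate(segment)
        ))
    return scores
```

Segments with the highest scores are the candidate sites; unlike consensus matching, a segment with one unfavorable base is penalized rather than excluded outright.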
1.4 Motif Discovery for the PSWM Formulation
In a majority of gene regulation analysis problems, we know neither the locations
of the motif sites nor the motif pattern (i.e., Θ or an estimate of it). Thus, we need
to simultaneously estimate the motif matrix and locate the possible motif sites in
the sequence data. A particularly successful class of computational algorithms
for this problem adopts a “data-driven” or “matrix update” approach based ei-
ther on the EM algorithm (Dempster et al., 1977) or Gibbs sampling (Geman and
Geman, 1984). These methods typically initiate a motif matrix randomly and use
the sequence dataset to gradually refine the motif. It is the focus of Chapters 2-3
to give an overview and extension of this class of algorithms, providing for them
a rigorous Bayesian foundation, and to discuss possible improvements.
The first algorithm for discovering novel motifs was Consensus (Stormo and
Hartzell, 1989). Assuming that each sequence contains one motif site, the algo-
rithm starts by examining all possible locations of the motif sites in the first two
sequences (a total of (n1 − w + 1)(n2 − w + 1) comparisons), and chooses the top
X pairs of motif sites according to the relative entropy scores of their corresponding
motif matrix, where the score is defined as
\[ \psi_{ENT} = \sum_{j=1}^{w} \sum_{k=A}^{T} f_{jk} \log \frac{f_{jk}}{\theta_{0k}}, \]
where fjk is the frequency matrix entry and log fjk/θ0k is the weight matrix entry given in
Table 1.2.
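The relative entropy score can be computed directly from a frequency matrix and the background frequencies. The sketch below is illustrative only; the function name and the convention of skipping zero frequencies (treating 0 log 0 as 0) are assumptions.

```python
from math import log

def relative_entropy(freq, theta0):
    """Sum over columns j and nucleotides k of f_jk * log(f_jk / theta0_k)."""
    total = 0.0
    for col in freq:
        for k, f in col.items():
            if f > 0:  # convention: a zero frequency contributes nothing
                total += f * log(f / theta0[k])
    return total

uniform = {b: 0.25 for b in "ACGT"}
conserved_column = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
score_conserved = relative_entropy([conserved_column], uniform)  # log(4) per column
score_background = relative_entropy([uniform], uniform)          # no information
```

A column that matches the background contributes nothing to the score, while a perfectly conserved column contributes the maximum, so higher scores correspond to motifs that stand out more strongly from the background.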
Later, another scoring function was deduced to estimate the p-value of each mo-
tif, which is the probability of observing a motif from random alignment of the
same size that scores equally or higher (Hertz and Stormo, 1999). Only motifs
with high information content or low p-value are retained, and each is aligned
with every possible w-mer (subsequence of length w) in the third sequence to
form a set of new matrices and the top K matrices are retained. The algorithm
cycles through all the sequences in the same fashion and the best-scoring motifs
are reported at the end as potential TF binding motifs. When there are more motif
sites in the first few sequences in the dataset, especially the first two sequences,
Consensus is effective. Otherwise, a number of runs using different sequence
orders are needed.
Lawrence and Reilly (1990) developed another matrix motif discovery algorithm
based on a missing data formulation, which will be detailed in the next chapter,
and the EM algorithm (Dempster et al., 1977). The original algorithm restricts
each sequence to contain one TF site. A later method called Meme (Bailey and
Elkan (1994); Grundy et al. (1996)) overcomes this limitation by introducing a
prior probability for every position to be the start of a motif site. The algorithm
also uses every existing w-mer in the sequence data set to initialize the EM it-
eration, thus improving the convergence properties of the original method of
Lawrence and Reilly (1990).
About the same time, a Bayesian method and several related Gibbs sampling al-
gorithms for motif discovery were also developed (Lawrence et al. (1993); Liu
(1994); Liu et al. (1995)) and these Bayesian approaches together with powerful
Markov chain Monte Carlo tools demonstrate more modeling and computational
flexibilities. For example, many new methods have been explored to extend the
functionality of Gibbs sampling. Gibbs Motif Sampler incorporates a prior probability
of motif occurrence in the sampling, thus allowing a variable number of motif sites in
each input sequence (Liu et al., 1995). By only considering the k positions out of
w in the motif with the richest information content, it allows the motif to contain
small gaps.
AlignAce continues to improve the Gibbs Motif Sampler by iteratively mask-
ing out aligned sites in order to find multiple different motifs (Roth et al., 1998).
BioProspector is a Gibbs sampler that uses a Markov model estimated from the
whole genome non-coding sequences to represent the non-motif background in
order to improve the motif specificity (Liu et al., 2001). It can also find motifs that
have two conserved blocks separated by a non-conserved gap of variable length.
All of these procedures are more or less statistically formulated, in contrast to the
word enumeration methods of van Helden et al. (1998), van Helden et al. (2000),
Sinha and Tompa (2000), Hampson et al. (2000), and Brazma et al. (1998).
Algorithms based on word matches are usually exhaustive in finding motifs, but
are limited by the maximum width of the motif that can be enumerated. Pro-
grams based on matrix update algorithms can find motifs of any specified width,
but none can guarantee convergence or a globally optimal motif. To strike a balance
between the two, a recent algorithm MDscan (Liu et al., 2002) first uses a word
enumeration method to search motifs from the top L sequences that biologists
are most confident contain the motif. Using every existing w-mer in these se-
quences as a seed, MDscan finds all w-mers in the L sequences that are similar to
the seed and constructs from them a motif matrix. All the motif matrices are eval-
uated by a semi-Bayesian scoring function and the best ones are further refined
using all the sequences in the data set. When the motif is weak and the data are
noisy, searching for motifs first from sequences with high signal to background
ratio increases the chance of success.
An extensive presentation of Bayesian motif discovery models as well as possible
model extensions is given in Chapter 2.
1.5 Problems with Existing Motif Discovery Methods
These algorithms are all fairly fast, easy to use, and reasonably accurate, although
their relative performances do vary depending on the real-data situation since
each is implementing a different model. In addition, the results from stochastic
motif-finding algorithms can vary between independent runs. When these al-
gorithms do give different motif predictions, practitioners have a difficult time
deciding which results are “best” for a real data situation.
In addition, each model has certain limitations, for example, the need to input a
site abundance parameter, restrictions on the number of sites per sequence, or a
fixed motif width.
In Chapter 3, we present a scoring function optimization approach that provides
a principled means by which to compare and improve motif site predictions from
previous motif-finding algorithms. The scoring function approach has the advan-
tage of being simple to understand as well as easy to implement and extend to
eliminate several tenuous assumptions. Our procedure, implemented in the pro-
gram BioOptimizer, can be used in conjunction with any motif-finding programs
currently available and to compare different prediction results.
In Chapter 4, we demonstrate improved motif-finding accuracy for BioOptimizer
over other motif-finding programs in both simulation and real-data studies. We
also show that BioOptimizer can provide extra flexibility compared to other motif-
finding programs, e.g., inferring the motif site abundance parameter and the mo-
tif width, and allowing for motifs consisting of two conserved blocks separated
by a variable-length gap of non-conserved nucleotides.
1.6 Modeling Motif Similarity by Clustering
Although the discovery and characterization of a single motif is often the goal of
a particular biological investigation, it is common for biologists and statisticians
to be interested in examining the similarities and differences between an entire
collection of discovered TF motifs. Figure 1.2 shows the sequence logos for four
motifs that resulted from four separate motif discovery procedures.
[Sequence logos for four motifs: Tal1beta-E47S, AGL3, ARNT, and MEF2; vertical axis: information content in bits (0 to 2); horizontal axis: motif position]
Figure 1.2: Four different discovered motifs
It is clear that there exists some similarity or common structure between some of
these discovered motifs, and one could argue for grouping of Tal1beta-E47s and
ARNT together (based on the common CA and TG positions) along with a possible
grouping for AGL3 and MEF2 (based on the final four ATAG positions).
However, this grouping strategy is based on ad-hoc personal judgement.
The statistical problem of interest here is to model the common structure between
these different motifs and find a principled means by which to group, or cluster,
motifs together based upon their similarity.
This general idea of grouping similar motifs has been applied by Gordon et al.
(2004), where structural and biochemical database information is used to group
motifs in order to further improve motif discovery. However, this method is
labor-intensive and requires substantial additional information beyond just se-
quence information. We desire a statistical approach that utilizes only the se-
quence data available in our usual motif discovery setting.
There are several traditional statistical techniques for clustering observations to-
gether which are reviewed in Hartigan (1975). Hierarchical Tree Clustering joins ob-
servations together into successively larger clusters based upon some sort of sim-
ilarity measure. K-means Clustering groups observations into a pre-determined
number of clusters by minimizing a within-cluster distance measure.
Each of these techniques has elements that are not ideally suited for our desired
goal of motif clustering. Hierarchical tree clustering requires the user to specify
a distance metric between the observations (in this case, motif matrices), and it
is not clear for comparing motifs what type of simple distance metric should be
used. In addition, the result of this algorithm is a tree that joins all observations
together, and it is not clear where the tree should be “cut” in order to produce
a set of clusters. K-means clustering is useful in situations where the number of
clusters is known a priori, but this is also not the case here, since we have very
little idea how many motifs might cluster together in a particular collection of
motifs.
In addition, these techniques consider the given observations as fixed and known,
which is not the case for our applications where each motif in a collection is only
an estimate generated by whatever motif discovery procedure was used.
Recognizing that our discovered motifs themselves are estimated parameters, we
need to model both within-motif and between-motif variability. In Chapter 5, we
outline a Bayesian hierarchical clustering model that encompasses both levels of
variability and does not require prior knowledge of the number of clusters.
We present an implementation of this model based upon Gibbs sampling. In
addition to eliminating the clustering problems mentioned above, our stochastic
implementation strategy allows us to examine not only optimal clustering results,
but also the variability in those clustering results.
In Chapter 6, we present various techniques for summarizing and understand-
ing the results from our clustering procedure, with an application to a dataset
containing 116 TF motifs.
1.7 Combining Motif Discovery and Clustering
Biologists are interested in inferring the regulation network of a living cell by
deducing the sets of genes that are co-regulated by specific transcription factors.
As mentioned in Section 1.1 above, microarray data is often collected in order
to deduce co-regulated clusters of genes based on similarity of gene expression
patterns. Often, gene-knockout experiments are used to detect genes regulated
by a specific known transcription factor, as in Eichenberger et al. (2003) and Molle
et al. (2003). In other studies, such as Eisen et al. (1998), genes are grouped into
co-regulated clusters based on similarity of gene expression patterns over a time-
series of experiments.
In either case, microarray data collection is expensive and time-consuming. As
well, gene expression studies are restricted to a limited number of species for
which microarray chips have been designed. For these reasons, it would be
very beneficial to biologists to develop computational techniques for inferring
sets of co-regulated genes that avoid the need for gene expression information
and instead utilize more widely available data sources, namely genomic DNA
sequence information.
When the sequence information from several closely-related species is available,
an alternative motif discovery strategy is to look for binding sites that are con-
served between sets of orthologous genes across different species, rather than
across different genes within the same species. Genes which are orthologous to
each other produce proteins with the same function in their respective species
and therefore are probably regulated in a similar fashion.
This strategy, often referred to as “phylogenetic footprinting”, is based upon the
idea that subsets of genomic DNA that are biologically important are more likely
to be conserved by evolution. Genes are examples of DNA sequences which are
usually conserved by evolution, since the proteins which are produced by these
genes can be drastically altered by changes in the gene sequence. Similarly, we
expect that transcription factor binding sites will be conserved by evolution, since
any changes to the binding site sequence that alter the ability of the TF to bind
could have a dramatic effect on gene regulation.
Phylogenetic footprinting has the advantage that clusters of co-regulated genes
do not have to be inferred beforehand (e.g., by microarray data), since we are
looking for motifs that are conserved across species instead of across genes within
a single species. The disadvantage of this method is that the complete sequence
information from several related species must be known and orthologous genes
within these species must be identified.
Fortunately, the genomes of many related bacterial species have been completely
sequenced and are available publicly from the National Center for Biotechnology
Information (www.ncbi.nlm.nih.gov). McCue et al. (2001) used the sequence
information from 9 bacterial species to identify TF-binding sites in Escherichia coli.
Our organism of interest is the bacterium Bacillus subtilis, which also has several
related species for which complete genome information is available.
Building on top of the concept of phylogenetic footprinting is the idea that many
of the motifs discovered within each of these orthologous gene sets will be similar
enough in appearance that we will be able to group them into clusters. If the
motifs found upstream of several Bacillus subtilis genes are similar enough to be
clustered together, then it is possible that the same TF (recognizing that common
motif) is targeting each of the genes in that cluster.
Thus, by combining statistical techniques for both motif discovery and motif clus-
tering, we can infer potentially co-regulated gene clusters solely on the basis of
the sequence information from several closely-related species.
Qin et al. (2003) used a similar framework combining motif discovery and clustering,
together with the motifs discovered by McCue et al. (2001), to cluster Escherichia
coli genes into potentially co-regulated clusters. Their motif-discovery procedure
was restricted to single-block motifs with fixed width even though many bacterial
transcription factor binding motifs consist of two blocks with a variable-length
gap. As well, very little external information was used to validate the gene clus-
ters which were inferred by their procedure. Wang and Stormo (2003) introduce
an algorithm, Phylocon, which combines sequence information between related
species with sequence information between co-regulated genes within a single
species to improve motif discovery. Although it was not their intended goal,
their general framework (comparing motifs between genes that were discovered
by cross-species sequence comparison) is similar to our strategy for inferring co-
regulated genes.
As a final application of our improved methods for motif discovery (Chapters 2-4)
and motif clustering (Chapters 5-6), we will use a combined procedure to predict
co-regulated genes in the bacterium Bacillus subtilis. Our model extensions for
motif discovery permit us to focus not just on the discovery of single-block motifs
but also on two-block motifs with a variable-length gap. As well, in
addition to an optimized motif signal (Chapter 3), we allow for variable motif
width and unknown motif abundance. Our clustering model is designed to al-
low for both one and two-block motifs, which will also enable us to study the
interaction between clusters that are formed based upon either one-block or two-
block discovered motifs.
In addition, using B. subtilis as our target organism means that we can utilize
external information about gene regulation in B. subtilis to validate our inferred
gene clusters. We will use four different validation measures, based upon gene
expression data, functional classifications, and known TF interactions, to test our
gene clusters for biological relevance.
Chapter 2
Bayesian Motif Discovery Models
2.1 A Full Bayesian Model
As in Chapter 1, we let S denote the set of sequences under investigation, where
each Sij takes value in an alphabet of size K (K=4 for DNA sequences). Within
S, we postulate that there are substrings (r1, . . . , rw) of length w that are sites of
an unknown motif model.
The locations of these sites are unknown, so we introduce a missing array of
indicators A, where Aij is either one or zero indicating whether or not position j
in sequence i is the starting point of a motif site.
The composition of the motif is represented by the frequency matrix Θ, where
θjk is the frequency of nucleotide k in column j of the motif. The nucleotide
composition of the background (portions of the sequences that are not motif sites)
is represented by the vector θ0 where θ0k is the frequency of nucleotide k in the
background. A graphical representation of these quantities is given in Figure 2.1.
A particular realization of A (i.e., a particular set of motif sites) allows us to break
our sequence data S into two parts, one which consists only of the bases in the motif
sites, and the complementary subset which consists of the remaining background bases.

Sequence Data S              Site Indicators A
aaaacatcgatacctacttttggtcgt  000000000001000000000000000
aacctacgtctagcatcgaaatcgacg  010000000000000000000000000
aattatgctacgtacgcggtcgtacgt  000000000000000000010000000

Motif Θ
θ1a θ2a · · · θwa
θ1c θ2c · · · θwc
θ1g θ2g · · · θwg
θ1t θ2t · · · θwt

Figure 2.1: Graphical representation of the motif discovery parameters
We let N be the matrix of the counts of the different nucleotides in all of the motif
sites, i.e., Njk is the number of sites with nucleotide k (k = 1, . . . , 4) in position j of
the motif. For now, we assume that the motif width w is known so that N and Θ
have fixed dimensions of w × 4. We will discuss generalizations to variable motif
width later in Chapter 3.
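To make the bookkeeping concrete, the following Python sketch builds the motif count matrix N and the background count vector from sequences and site indicators. It is an illustration under stated assumptions: the function name, the toy sequences, and the width w = 3 are hypothetical, indicators are written as strings of 0s and 1s marking site starts, and sites are assumed not to overlap or run past the end of a sequence.

```python
def count_matrices(seqs, indicators, w, alphabet="ACGT"):
    """Split the bases of each sequence into motif-site counts and background counts.

    Returns (motif, background) where motif[j][k] is the number of sites with
    nucleotide k in motif position j and background[k] counts the remaining bases.
    """
    motif = [dict.fromkeys(alphabet, 0) for _ in range(w)]
    background = dict.fromkeys(alphabet, 0)
    for seq, ind in zip(seqs, indicators):
        covered = set()  # sequence positions that fall inside a motif site
        for start, flag in enumerate(ind):
            if flag == "1":
                for j in range(w):
                    motif[j][seq[start + j]] += 1
                    covered.add(start + j)
        for pos, base in enumerate(seq):
            if pos not in covered:
                background[base] += 1
    return motif, background

# Toy data (hypothetical): two sequences, each containing the site "TTA".
seqs = ["AATTACGG", "TTACCAAG"]
indicators = ["00100000", "10000000"]
N, N0 = count_matrices(seqs, indicators, w=3)
```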
As mentioned in Chapter 1, we assume that each motif site is an independent
realization from a Product-Multinomial distribution parameterized by Θ, which
means that each vector of column counts Nj = (Nj1, . . . , Nj4) independently fol-
lows a multinomial distribution parameterized by θj = (θj1, . . . , θj4)
(Nj1, . . . , Nj4) ∼ Multinomial((θj1, . . . , θj4)),
The corresponding vector of background nucleotide counts is denoted by N0
where N0k is the count of nucleotide k in the background portion of the sequence
dataset. The simplest model for the background counts is that every background
nucleotide is an independent realization from a Multinomial distribution param-
eterized by θ0
(N01, . . . , N04) ∼ Multinomial((θ01, . . . , θ04)),
Viewing A as missing data, we can write down the likelihood of S as
\[ p(S \mid \Theta, \theta_0, A) \propto \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j} \]
with the notation $\theta_j^{N_j} = \prod_{k=1}^{4} \theta_{jk}^{N_{jk}}$. To enable a Bayesian analysis, we employ the
following conjugate prior distributions for Θ and θ0,
Θ ∼ ProductDirichlet(B = (β1, . . . ,βw)) and θ0 ∼ Dirichlet(β0)
where βj = (βj1, . . . , βj4). For a brief review of multinomial models with Dirichlet
prior distributions, refer to Gelman et al. (1995).
Our Dirichlet prior parameters B = (βjk) and β0 = (β0k) can be interpreted as
a matrix of pseudo-counts which are being added to our motif count matrix N
and background count vector N0. This can be seen in the conditional posterior
distribution
\[ p(\Theta, \theta_0 \mid S, A, B, \beta_0) \propto \theta_0^{N_0 + \beta_0 - 1} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \]
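The pseudo-count interpretation can be illustrated with a short Python sketch for a single motif column. Because the Dirichlet prior is conjugate, the conditional posterior for column j is Dirichlet with parameters Njk + βjk. The function names and the example counts below are assumptions made for illustration, and the draw uses the standard construction of a Dirichlet variate from normalized Gamma variates.

```python
import random

def posterior_parameters(counts, pseudo):
    """Dirichlet posterior parameters: observed counts plus prior pseudo-counts."""
    return {k: counts[k] + pseudo[k] for k in counts}

def posterior_mean(counts, pseudo):
    """Posterior mean of theta_j: (N_jk + beta_jk) / (sum of counts + sum of pseudo-counts)."""
    alpha = posterior_parameters(counts, pseudo)
    total = sum(alpha.values())
    return {k: a / total for k, a in alpha.items()}

def draw_dirichlet(alpha, rng):
    """One Dirichlet draw obtained by normalizing independent Gamma(alpha_k, 1) variates."""
    gammas = {k: rng.gammavariate(a, 1.0) for k, a in alpha.items()}
    total = sum(gammas.values())
    return {k: g / total for k, g in gammas.items()}

# Hypothetical column: 10 aligned sites, 8 of which have an A in this position.
counts = {"A": 8, "C": 0, "G": 1, "T": 1}
pseudo = {k: 0.5 for k in counts}
mean = posterior_mean(counts, pseudo)
draw = draw_dirichlet(posterior_parameters(counts, pseudo), random.Random(1))
```

Note that the pseudo-counts keep the estimate for C away from zero even though no C was observed in this column.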
We can consider more general models for our background counts N0, such as
modelling each background nucleotide as a realization from an l-th order Markov
chain (empirically l=3 works the best). In this more general situation, we can
write the above model as
\[ p(\Theta, \theta_0 \mid S, A, B) \propto p(N_0 \mid \theta_0) \times p(\theta_0) \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \]
where θ0 now denotes the parameters in the background Markov model, and
p(θ0) is some prior distribution for these parameters.
In general, it is relatively easy to estimate the background parameters θ0 since the
vast majority of the sequence dataset is background sequence. For this reason, we
will assume that our background parameters θ0 are fixed and known a priori. For
simplicity of exposition, we will assume the simple Multinomial model for N0 as
a realization of θ0, though the models that follow are easily generalized to more
complicated background models.
The model has thus far been described as conditional on a particular set of known
motif sites A, but in reality the matrix of site locations A is also unknown and
should also be considered as a set of random variables. In the Bayesian frame-
work, we prescribe a particular prior distribution for our unknown A, which we
will assume is a priori independent of our other set of unknown parameters, the
motif frequency matrix Θ.
In the following sections, we will describe several specific prior distributions for A,
but generally, we now have the following joint posterior distribution of our un-
known motif site locations and motif frequency matrix:
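\[ p(\Theta, A \mid S, \theta_0, B) \propto p(S \mid \Theta, A, \theta_0) \times p(\Theta \mid B) \times p(A) \]
\[ \propto \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \times p(A) \]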
Our goal for motif discovery is inference based upon this joint posterior distribu-
tion. For those more comfortable with the likelihood framework, this posterior
inference is equivalent to maximum likelihood inference under vague prior infor-
mation. There are advantages to using the Bayesian framework, however, since
it allows for the easy incorporation of prior information and for removal of nui-
sance parameters.
2.2 Markov Chain Monte Carlo Implementation
In a typical data augmentation-based Gibbs sampling algorithm (Tanner and
Wong, 1987), the desired posterior distribution p(Θ,A | S, θ0,B) can be simulated
by starting with arbitrary initial values of the unknown parameters Θ^{(0)}, and
then for t = 0, 1, . . ., iteratively sampling from the two conditional distributions:
1. p(A^{(t)} | Θ^{(t)}, θ0, S, B);
2. p(Θ^{(t+1)} | A^{(t)}, θ0, S, B).
Given enough time steps, the draws simulated in this fashion will converge to
draws from the desired posterior distribution.
Typically, we are most interested in the draws from p(A | S, θ0,B) which would
indicate the most likely positions of the unknown conserved sites. For this reason,
and since Θ is a high dimension (w×4) matrix, drawing the Θ parameters at every
iteration can be both time-consuming and inefficient.
As demonstrated by Liu (1994), the algorithm can be improved by integrating
over Θ so that we can simulate draws via Gibbs sampling from the posterior
distribution p(A | S, θ0,B) directly, where
\[ p(A \mid S, \theta_0, B) = \int p(\Theta, A \mid S, \theta_0, B) \, d\Theta \]
We now give variations on the basic motif model under different assumptions
and the algorithmic consequences of these assumptions. First, we present the
simplest model where the total number of sites is fixed. Then, we present an
improved model where the total number of sites is allowed to vary. We briefly
discuss extending the model to multiple motifs. Finally, we discuss relaxing the
assumptions of fixed motif abundance and motif width.
2.3 Fixed Number of Sites in A
In the early methods (e.g., Lawrence and Reilly (1990); Cardon and Stormo (1992);
Lawrence et al. (1993)), it was assumed that each sequence must contain one and
only one motif site, which corresponds to assuming that Aij = 0 for all but one
entry in the ith row. Thus, no explicit prior distribution for A is needed if we
suppose that the motif site can be anywhere in the sequence with equal probabil-
ities. These algorithms, as described in Lawrence et al. (1993) and Liu (1994), are
based on the following assumptions
(a) there is only one type of motif present in the sequence data, and
(b) there is one and only one motif site per sequence.
In this case, the missing indicator array A reduces to a vector a = (a1, . . . , am)
where ai gives the location of the single site within sequence i (out of m total
sequences).
The marginal posterior distribution of interest, p(A | S, θ0,B), can be simulated
by drawing iteratively from the distribution of each ai conditional on the site
locations in the other sequences, A∗,
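\[ p(a_i \mid A^*, \theta_0, S, B) \propto \prod_{k=1}^{4} \left( \frac{\theta_{0k}^{N_{0k}}}{\theta_{0k}^{N^*_{0k}}} \right) \times \prod_{j=1}^{w} \frac{\Gamma(m + |\beta_j|)}{\Gamma(m - 1 + |\beta_j|)} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\prod_k \Gamma(N^*_{jk} + \beta_{jk})} \approx \prod_{j=1}^{w} \left( \frac{\hat\theta_{j r_j}}{\theta_{0 r_j}} \right) \qquad (2.1) \]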
where the site starting at ai is (r1, . . . , rw). N∗ and N∗0 are the motif and background
counts from the (m − 1) sequences besides sequence i, and |βj| is the
number of prior pseudo-counts added to position j of the motif matrix. θ̂jk is
the best estimate of the motif frequencies θjk from the (m − 1) sequences besides
sequence i,
\[ \hat\theta_{jk} = \frac{N^*_{jk} + \beta_{jk}}{m - 1 + |\beta_j|} \]
which is also given in Lawrence et al. (1993). Γ(·) is the Gamma function (Γ(x) =
(x− 1)! for integer x) which results from integrating Θ out of our full conditional
posterior distribution p(A,Θ | S, θ0,B).
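The reduction of the Gamma-function ratio to an estimated frequency can be sketched as a short worked step. The sketch assumes that adding the candidate site (r1, . . . , rw) raises exactly one count in each motif column, and it uses the identity Γ(x + 1) = xΓ(x); the leading factor in the last line does not depend on ai and is absorbed into the proportionality constant.

```latex
% Assumption: N_{jk} = N^*_{jk} + 1 if k = r_j, and N_{jk} = N^*_{jk} otherwise.
\begin{align*}
\frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\prod_k \Gamma(N^*_{jk} + \beta_{jk})}
  &= \frac{\Gamma(N^*_{j r_j} + \beta_{j r_j} + 1)}{\Gamma(N^*_{j r_j} + \beta_{j r_j})}
   = N^*_{j r_j} + \beta_{j r_j} \\
  &= (m - 1 + |\beta_j|) \, \hat\theta_{j r_j}
\end{align*}
```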
Thus, ai can be randomly drawn from all possible starting points in sequence i
with probability proportional to p(ai | A∗, θ0,S,B) given in (2.1), in either exact
or approximate form.
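One sweep of this site sampler, using the approximate form of (2.1), can be sketched in Python as follows. This is a simplified illustration rather than the implementation of any cited program: the function names are hypothetical, a single symmetric pseudo-count beta is assumed for every cell (so that |βj| = 4 beta), and the background frequencies θ0 are treated as known.

```python
import random

ALPHABET = "ACGT"

def estimate_theta(seqs, starts, w, skip, beta=0.5):
    """theta_hat[j][k] = (N*_jk + beta) / (m - 1 + 4 * beta), leaving out sequence `skip`."""
    counts = [dict.fromkeys(ALPHABET, beta) for _ in range(w)]
    for i, (seq, a) in enumerate(zip(seqs, starts)):
        if i != skip:
            for j in range(w):
                counts[j][seq[a + j]] += 1
    denom = len(seqs) - 1 + 4 * beta
    return [{k: c / denom for k, c in col.items()} for col in counts]

def site_weights(seq, theta_hat, theta0):
    """Unnormalized weight prod_j theta_hat[j][r_j] / theta0[r_j] for every start position."""
    w = len(theta_hat)
    weights = []
    for a in range(len(seq) - w + 1):
        ratio = 1.0
        for j in range(w):
            base = seq[a + j]
            ratio *= theta_hat[j][base] / theta0[base]
        weights.append(ratio)
    return weights

def gibbs_sweep(seqs, starts, w, theta0, rng):
    """Redraw each a_i in turn, conditional on the current sites in the other sequences."""
    starts = list(starts)
    for i, seq in enumerate(seqs):
        theta_hat = estimate_theta(seqs, starts, w, skip=i)
        weights = site_weights(seq, theta_hat, theta0)
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy data (hypothetical): the site "TTAC" planted at positions 2, 0, and 4.
seqs = ["GGTTACGG", "TTACGGGG", "GGGGTTAC"]
theta0 = {k: 0.25 for k in ALPHABET}
new_starts = gibbs_sweep(seqs, [0, 0, 0], 4, theta0, random.Random(0))
```

In practice many sweeps would be run and the sampled site locations summarized across iterations; a single sweep is shown only to make the conditional update explicit.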
Liu (1994) gives a version of the distribution (2.1) in the case where θ0 is unknown
with prior distribution Dirichlet(β0 = (β01, . . . , β04)),
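\[ p(a_i \mid A^*, S, B, \beta_0) \propto \frac{\Gamma(|N_0| + |\beta_0|)}{\Gamma(|N^*_0| + |\beta_0|)} \frac{\prod_k \Gamma(N_{0k} + \beta_{0k})}{\prod_k \Gamma(N^*_{0k} + \beta_{0k})} \times \prod_{j=1}^{w} \frac{\Gamma(m + |\beta_j|)}{\Gamma(m - 1 + |\beta_j|)} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\prod_k \Gamma(N^*_{jk} + \beta_{jk})} \approx \prod_{j=1}^{w} \left( \frac{\hat\theta_{j r_j}}{\hat\theta_{0 r_j}} \right) \qquad (2.2) \]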
where |N0| is the total number of background counts in all m sequences and |N∗0|
is the total number of background counts in the m − 1 sequences excluding sequence
i. θ̂0k is the best estimate of the background frequencies θ0k from the m − 1
sequences besides sequence i,
\[ \hat\theta_{0k} = \frac{N^*_{0k} + \beta_{0k}}{|N^*_0| + |\beta_0|} \]
To avoid being trapped in a phase-shift mode, Liu (1994) also included a Metropolis
step that allows all the motif sites to move to the left or right by a few positions.
2.4 Unrestricted Model for A
As pointed out in Liu et al. (1995), it is often too restrictive an assumption to hold
the total number of unknown sites as fixed and known.
In this unrestricted model, we consider each Aij as an independent random in-
dicator variable with an a priori probability p0 that Aij = 1 (and hence is a motif
start site). This probability p0 is referred to as the motif abundance parameter.
Since each Aij is independent, we allow for the possibility that some sequences
will have multiple motif sites (i.e., several Aij = 1 in sequence i) as well as the
possibility that some sequences may have no motif sites (i.e., all Aij = 0 in sequence
i).
This flexibility to allow some sequences to contain no sites is especially important
when analysing sequences within studies where many sequences in a dataset
could be “false-positives”, as described in Section 1.1.
Under this model, the full posterior distribution of our unknown Θ and A is
\[ p(A, \Theta \mid S, \theta_0, p_0, B) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \qquad (2.3) \]
where |A| is the total number of sites, now assumed to be unknown. The quantity
L = N − (w − 1)m, where N is the total number of nucleotides and m is the number
of sequences, is the total number of possible site positions, since sites are not
allowed to overlap the ends of a sequence.
Integrating out Θ, we have our marginal posterior distribution of interest
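\[ p(A \mid S, \theta_0, p_0, B) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \qquad (2.4) \]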
Liu et al. (1995) considered θ0 as unknown in which case the marginal posterior
distribution of interest is
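\[ p(A \mid S, p_0, B, \beta_0) \propto p_0^{|A|} (1 - p_0)^{L - |A|} \times \frac{\prod_k \Gamma(N_{0k} + \beta_{0k})}{\Gamma(|N_0| + |\beta_0|)} \times \prod_{j=1}^{w} \frac{\prod_k \Gamma(N_{jk} + \beta_{jk})}{\Gamma(|A| + |\beta_j|)} \qquad (2.5) \]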
and from this distribution constructed a predictive updating algorithm using
the probability equation
\[ \frac{p(A_{ij} = 1 \mid A^*, S, p_0, B, \beta_0)}{p(A_{ij} = 0 \mid A^*, S, p_0, B, \beta_0)} \propto \frac{p_0}{1 - p_0} \times \prod_{j=1}^{w} \left( \frac{\hat\theta_{j r_j}}{\hat\theta_{0 r_j}} \right) \qquad (2.6) \]
where (r1, . . . , rw) is the site sequence starting at Aij and A∗, θ̂jk, θ̂0k are the same
as in the previous section.
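The predictive update in (2.6) amounts to computing posterior odds for each candidate start position and converting them to a probability. The Python sketch below is illustrative: the function names and the toy frequency values are assumptions, and the right-hand side of (2.6) is treated as the odds themselves, with the proportionality constant taken to be one.

```python
def site_odds(segment, theta_hat, theta0_hat, p0):
    """Odds that a segment is a motif site: p0/(1-p0) times prod_j theta_hat[j][r_j]/theta0_hat[r_j]."""
    odds = p0 / (1.0 - p0)
    for j, base in enumerate(segment):
        odds *= theta_hat[j][base] / theta0_hat[base]
    return odds

def odds_to_probability(odds):
    """Convert odds o into a probability o / (1 + o)."""
    return odds / (1.0 + odds)

# Hypothetical estimates for a motif of width 3 with consensus "ACG".
theta0_hat = {b: 0.25 for b in "ACGT"}
theta_hat = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "ACG"]
p_site = odds_to_probability(site_odds("ACG", theta_hat, theta0_hat, p0=0.01))
p_nonsite = odds_to_probability(site_odds("TTT", theta_hat, theta0_hat, p0=0.01))
```

Each indicator Aij would then be drawn as a Bernoulli variable with this probability. When the motif estimate equals the background, the probability reduces to the prior abundance p0, which is a useful sanity check on the calculation.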
2.5 Dealing with Multiple Motif Types
Although this situation is not the focus of this paper, it is worth mentioning that
the model (2.3) can be extended to the situation where we suspect that multiple
distinct motif patterns exist in the same set of sequences. The simplest strategy is
to introduce more motif matrices, one for each motif type, and to let the variable
Aij indicate not only the start of a motif site, but also the motif type (Liu et al.,
1995). Another strategy is to mask out the discovered sites of the first motif and
repeat the usual motif-finding procedure (Roth et al., 1998).
As pointed out in Lawrence et al. (1993), searching for several patterns simulta-
neously permits the sharing of information between them to aid in the discovery
of unknown sites of each. They present a multiple-motif version of the multino-
mial sampler, where the multiple motifs are restricted to have the same ordering
(collinearity) between different sequences. Potential modeling of the spacing be-
tween motifs is also mentioned but not implemented.
Liu et al. (1999) mention that this early model for collinearity is computationally
inefficient, and propose that the models for a single motif be combined with a
Hidden Markov Model (HMM) for insertions/deletions between different motifs.
This unified model, called the propagation model, capitalizes on the collinearity
properties inherent to hidden Markov models but does not require the large
number of free parameters that a typical HMM would. There is the additional
model selection issue (Gelman et al. (1995); Kass and Raftery (1995)) for deter-
mining the appropriate total number of different motif patterns.
More recently, Xing et al. (2003) presented LOGOS, a hidden Markov model for
the occurrence of multiple motifs combined with a separate hierarchical Bayesian
Markovian model for each different motif. Frith et al. (2003) introduce software,
Cluster-Buster, which combines the information from known motif patterns to
find dense clusters of motifs in genome-wide searches.
2.6 Extensions of the Bayesian Motif Model
In many situations, very little information is known a priori about either the motif
abundance or the motif width. In these cases, it is preferable to treat both quanti-
ties as random variables instead of fixed and known quantities. As well, we can
consider extending the model beyond the concept of a single block of contiguous
nucleotides.
2.6.1 Variable motif abundance p0
The statistical model summarized by (2.4) assumes known motif site abundance
p0. However, in practice one might not have a very good idea how many motif
sites to expect in a given dataset. Current motif-finding algorithms often use ad
hoc estimates of p0, such as assuming a particular number of sites per sequence.
With our continued focus on full Bayesian modeling, we instead consider p0 as a
random variable with a Beta(a, b) prior distribution. Jensen et al. (2004) demon-
strate, via a simulation study, that treating p0 as a random variable leads to better
performance than using a fixed p0.
If we assume that the motif abundance ratio p0 is unknown with a Beta(a, b) prior
distribution, then the full posterior distribution (2.3) becomes
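\[ p(A, \Theta, p_0 \mid S, \theta_0, B) \propto p_0^{|A| + a - 1} (1 - p_0)^{L - |A| + b - 1} \times \theta_0^{N_0} \times \prod_{j=1}^{w} \theta_j^{N_j + \beta_j - 1} \qquad (2.7) \]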
A specific prior distribution for p0 would be a Uniform(0, 1) distribution, which
corresponds to a Beta(1, 1) distribution. This prior distribution is non-informative
in the sense that it will have very little influence on the results compared to the
influence of the observed sequence data.
We can then integrate out the random variable p0 as well as the parameters Θ to
get
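\[
p(A \mid S, \theta_0, B) \propto B_{a,b}(|A|, L-|A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w} \frac{\prod_k \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta_j|)} \tag{2.8}
\]
where Ba,b(c, d) is the Beta function
\[
B_{a,b}(c,d) = \frac{\int_0^1 x^{a+c-1}(1-x)^{b+d-1}\,dx}{\int_0^1 x^{a-1}(1-x)^{b-1}\,dx}.
\]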
This marginal posterior distribution can be used to construct a predictive updat-
ing algorithm similar to the predictive updating algorithm based on (2.6).
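As a small illustration, the Beta term in (2.8) can be evaluated on the log scale with log-gamma functions. The sketch below is not taken from any existing software; the function names log_beta and log_beta_ratio are hypothetical, and the default a = b = 1 corresponds to the Uniform(0, 1) prior mentioned above.

```python
from math import lgamma, log

def log_beta(x: float, y: float) -> float:
    """Log of the ordinary Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def log_beta_ratio(c: float, d: float, a: float = 1.0, b: float = 1.0) -> float:
    """log B_{a,b}(c, d) = log B(a + c, b + d) - log B(a, b)."""
    return log_beta(a + c, b + d) - log_beta(a, b)

# Example: 20 sites among L = 3860 candidate positions (illustrative numbers)
example = log_beta_ratio(20, 3860 - 20)
```

Working on the log scale avoids the underflow that the raw integrals would produce for realistic values of L.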
2.6.2 Variable motif width w
Liu et al. (1995) suggest that the assumption of fixed motif width w can be relaxed
somewhat to allow so-called fragmentation of motifs. In a fragmentation model,
only J columns of a motif of width w are selected to form the motif pattern.
This is accomplished by positing additional missing indicator variables for whe-
ther or not each of the w positions of a motif are considered as part of a conserved
motif pattern. This new missing data can be incorporated into a larger model
and a Gibbs sampling strategy can again be used for implementation. This frag-
mentation model is useful for correcting the problem that earlier Gibbs sampling
strategies could get stuck in local modes that were phase-shifted versions of the
true signal.
A slightly different approach to correcting this same phase shift problem is to
insert a Metropolis step within the Gibbs sampler that shifts each motif in one
direction or the other (Liu, 1994).
If we view w as an unknown variable and treat it directly, then we face a Bayesian
model selection problem (Gelman et al., 1995) since, for different width w, the
dimensionality of the motif parameter Θ is different. Lawrence et al. (1993) use
an ad hoc information per parameter criterion to select the best motif width.
Noting that Θ can be integrated out from the model to avoid the dimensionality
change, Gupta and Liu (2003) place a prior distribution on w, and use a Metropo-
lis step to update w based on the joint distribution.
If we posit w as a random variable with a prior distribution p(w), then our marginal
posterior distribution (2.8) becomes
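\[
p(A, w \mid S, \theta_0, B) \propto p(w) \times B_{a,b}(|A|, L-|A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w} \left[ \frac{\prod_k \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta_j|)} \cdot \frac{\Gamma(|\beta_j|)}{\prod_k \Gamma(\beta_{jk})} \right] \tag{2.9}
\]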
which has both A and w as unknown variables. One possible prior distribution for
w is a Poisson(w0) distribution, with w0 representing an a priori expectation
for the motif width. Other possible prior distributions are the Geometric(w0) or
Exponential(w0).
2.6.3 Two-Block Motifs
We consider a final extension for the possibility that a particular regulatory pro-
tein binds to the DNA strand in two places instead of just one. In this case, the
binding motif can be summarized by two conserved blocks that are separated by
a gap of non-conserved nucleotides that can vary slightly in length, as depicted
in Figure 2.2.
[Figure 2.2: Graphical representation of a two-block motif. Block 1 (width w1) and Block 2 (width w2) are separated by a gap (width g).]
We let Θ1 and Θ2, with widths w1 and w2, be the frequency matrices of the two
motif blocks, respectively. If we assume that the nucleotide compositions of the
two blocks are independent of each other, it is not difficult to extend our Bayesian
model to accommodate two-block motifs.
The only complication is that we must account for the gap between the two
blocks, which can be of different length between different sites. If our current
configuration of A has m sites, the gap lengths of these two-block motif sites are
denoted as G = (g1, . . . , gm). We assume a priori that each gi is independent and
that gi ∼ Uniform(G1, G2). In other words, each gap length can be anywhere from
a minimum of G1 to a maximum of G2, with equal probabilities for each gap in
that range.
Due to the rotation of the DNA double-helix, in many studies G1 and G2 are
typically separated by about 3 nucleotides. We now have a marginal posterior
distribution where A, G and the motif widths w1 and w2 are all allowed to vary,
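\[
p(A, G, w_1, w_2 \mid S, \theta_0, B) \propto p(w_1) \times p(w_2) \times B_{a,b}(|A|, L-|A|) \times \theta_0^{N_0} \times \prod_{j=1}^{w_1+w_2} \left[ \frac{\prod_k \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta_j|)} \cdot \frac{\Gamma(|\beta_j|)}{\prod_k \Gamma(\beta_{jk})} \right] \tag{2.10}
\]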
with the implicit restriction that each gi lies within the interval [G1, G2].
Chapter 3
Scoring Function Optimization
As described in Chapter 1, there are several existing motif-finding programs that
are more or less related to the models presented in Chapter 2. However, each
algorithm does differ in various parameter settings and model assumptions and
in many cases the user does not have the freedom to alter these settings.
As a consequence, the performance of each program will vary between different
sequence datasets and each program could give different sets of predicted sites.
In addition, most of these programs are stochastic algorithms, so independent
runs within the same program on the same dataset can also lead to different sets
of predicted sites.
Practitioners are disconcerted by these differences, since they lack the means by
which to compare the sets of predicted sites from different programs. Thus, our
initial motivation for this research was to provide a simple but principled rule
for deciding, out of a collection of different configurations of A (different sets of
predicted sites), which configuration of A was the “best”.
In this situation where the single “best” answer to a motif-finding problem is
desired (i.e. the “best” set of site predictions or the “best” consensus matrix),
our goal is to find the optimal value of a particular scoring function. Under our
Bayesian formulation, we focus on scoring functions which are values of an ap-
propriate posterior distribution.
This scoring function formulation enables us to quantify the “goodness” of dif-
ferent configurations of A in terms of their fit to our posterior distribution (and
hence our Bayesian probability model).
Because of the need for a speedy algorithm, it is sensible to seek strategies, such
as optimizing a scoring function, instead of a full posterior analysis. In addition,
due to the intrinsic presence of multiple modes in the marginal distribution of A,
summarizing this distribution with a posterior mean or posterior interval can be
misleading. This is because Gibbs sampling chains started from different initial
values can get stuck in different modes, leading to a posterior mean estimate
which might not be in an area of high posterior mass.
Here we examine several scoring functions that have been used in practice to
evaluate a discovered motif, as well as some novel generalizations.
3.1 Bayesian scoring functions
We begin by assuming for now that the motif width w and the abundance ratio p0
are known, in addition to our running assumption that the background parame-
ters θ0 are fixed and known.
For simplicity, we also assume that the number of prior counts in each column
of the motif matrix is constant, i.e., |βj| = |β| for all j. In each scoring function,
we ignore the collection of terms that are constant with respect to the unknown
parameters.
The first scoring function is the exact log-posterior marginal density for A:
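\[
\begin{aligned}
\psi_{\mathrm{exact}}(A) &= \log p(A \mid S, \theta_0, p_0, w, B) \\
&= |A|\,\mathrm{logit}(p_0) + \sum_k N_{0k} \log \theta_{0k} + \sum_{j=1}^{w} \log \left[ \frac{\prod_k \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta|)} \right]
\end{aligned}
\tag{3.1}
\]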
Although this exact scoring function may not appear very intuitive to the reader,
it is closely related to the following intuitive scoring function through a series of
approximations including Stirling’s formula (Stirling, 1730),
\[
\Gamma(x+1) = x! \approx x^{x} e^{-x} (2\pi x)^{1/2} \tag{3.2}
\]
Using Stirling’s formula (3.2), we can approximate ψexact as
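\[
\begin{aligned}
\psi_{\mathrm{stir}}(A) &= |A|\,\mathrm{logit}(p_0) - \frac{3}{2} w \log(|A|+|\beta|-1) \\
&\quad + \sum_{j=1}^{w} \sum_k \left[ \left(N_{jk}+\beta_{jk}-\tfrac{1}{2}\right) \log\left( \frac{N_{jk}+\beta_{jk}-1}{|A|+|\beta|-1} \right) - N_{jk} \log \theta_{0k} \right] \\
&\approx |A| \left[ \mathrm{logit}(p_0) + \sum_{j=1}^{w} \sum_k \theta_{jk} \log\left( \frac{\theta_{jk}}{\theta_{0k}} \right) \right] - \frac{3}{2} w \log(|A|+|\beta|-1),
\end{aligned}
\tag{3.3}
\]
where θjk = (Njk + βjk)/(|A| + |β|).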
Our empirical results show that the Stirling scoring function ψstir tracks ψexact
very well for realistic values of |A| and Njk.
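A minimal sketch of this comparison is given below, under stated assumptions: the function names are hypothetical, the same pseudo-count vector β is used for every column, and the background term of (3.1) is rewritten as −ΣΣ Njk log θ0k, which differs from Σ N0k log θ0k only by a constant. Under those assumptions the two scores should differ by roughly the constant w(3 + 1.5 log 2π) that the Stirling form drops, so they rank configurations of A almost identically.

```python
from math import lgamma, log, pi

def logit(p):
    return log(p / (1.0 - p))

def psi_exact(counts, beta, theta0, p0):
    """Exact score (3.1), up to a constant; counts[j][k] = N_jk."""
    n_sites = sum(counts[0])          # |A|, identical for every column
    beta_total = sum(beta)            # |beta|
    score = n_sites * logit(p0)
    for col in counts:
        score += sum(lgamma(n + b) for n, b in zip(col, beta))
        score -= lgamma(n_sites + beta_total)
        score -= sum(n * log(t0) for n, t0 in zip(col, theta0))
    return score

def psi_stirling(counts, beta, theta0, p0):
    """First line of (3.3); requires N_jk + beta_jk > 1 for every entry."""
    n_sites = sum(counts[0])
    denom = n_sites + sum(beta) - 1.0
    score = n_sites * logit(p0) - 1.5 * len(counts) * log(denom)
    for col in counts:
        for n, b, t0 in zip(col, beta, theta0):
            score += (n + b - 0.5) * log((n + b - 1.0) / denom) - n * log(t0)
    return score

# Illustrative counts for a width-3 motif with 20 sites
counts = [[10, 4, 3, 3], [3, 11, 3, 3], [4, 3, 3, 10]]
beta, theta0, p0 = [1.1] * 4, [0.25] * 4, 0.01
gap = psi_exact(counts, beta, theta0, p0) - psi_stirling(counts, beta, theta0, p0)
dropped_constant = len(counts) * (3.0 + 1.5 * log(2.0 * pi))
```

The approximation degrades when Njk + βjk − 1 is close to zero, which is one reason the choice of pseudo-counts matters for the Stirling score.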
Another scoring function approximation that we can consider is based on the en-
tropy distance between the frequency matrix entries θjk and the fixed background
frequencies θ0k
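\[
\psi_{\mathrm{ent}}(A) = |A| \left[ \mathrm{logit}(p_0) + \sum_{j=1}^{w} \sum_k \theta_{jk} \log\left( \frac{\theta_{jk}}{\theta_{0k}} \right) \right] \tag{3.4}
\]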
Compared with this heuristic-based scoring function, ψstir has an additional term,
−(3/2) w log(|A| + |β| − 1), which gives an additional penalty to a large number of motif sites.
The entropy distance is also called the Kullback-Leibler information (for discrete
measures) in the statistics literature (Kullback and Leibler, 1951). A form similar
to the entropy scoring function is mentioned in Lawrence et al. (1993).
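The entropy score (3.4) is short enough to sketch directly. In the sketch below the function name psi_entropy is hypothetical and, as an assumption for illustration, the same pseudo-count vector β is applied to every column.

```python
from math import log

def psi_entropy(counts, beta, theta0, p0):
    """Entropy score (3.4); counts[j][k] = N_jk, theta_jk includes pseudo-counts."""
    n_sites = sum(counts[0])                      # |A|
    beta_total = sum(beta)                        # |beta|
    kl = 0.0
    for col in counts:
        for n, b, t0 in zip(col, beta, theta0):
            theta = (n + b) / (n_sites + beta_total)
            kl += theta * log(theta / t0)         # Kullback-Leibler contribution
    return n_sites * (log(p0 / (1.0 - p0)) + kl)

theta0 = [0.25] * 4
flat = psi_entropy([[5, 5, 5, 5]], [1.0] * 4, theta0, 0.01)       # no conservation
conserved = psi_entropy([[17, 1, 1, 1]], [1.0] * 4, theta0, 0.01)  # strong conservation
```

A column that matches the background contributes nothing to the bracket, so only conserved columns can offset the negative logit(p0) term.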
3.2 Non-Bayesian scoring functions
It is interesting to note that scoring functions related to the entropy approxima-
tion (3.4) have arisen in the motif-finding literature outside of the context of a
Bayesian formulation.
In developing their Consensus algorithm, Stormo and Hartzell (1989) introduced
a scoring function very similar to ψent which they call the information content:
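\[
\psi_{\mathrm{info}}(A) = \sum_{j=1}^{w} \sum_k \theta_{jk} \log \frac{\theta_{jk}}{\theta_{0k}}, \quad \text{where } \theta_{jk} = \frac{N_{jk}}{m}, \tag{3.5}
\]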
where m is the number of sequences. This function is equivalent to all the foregoing
scoring functions when the motif width w is assumed known and the
total number of motif sites |A| is assumed to be equal to the number of sequences
m, which was the case in Stormo and Hartzell (1989), Lawrence and Reilly (1990),
and Lawrence et al. (1993), and in the model presented in Section 2.3.
However, when |A| is unknown, function ψinfo cannot be used to find a proper
set of motif sites — it will converge to a set of very few motif sites with high con-
servation and ignore potential sites that are less conserved. A Bayesian remedy
is to give a prior distribution f(A), and then construct
\[
\psi'_{\mathrm{info}}(A) = \log f(A) + |A| \sum_{j=1}^{w} \sum_k \theta_{jk} \log \frac{\theta_{jk}}{\theta_{0k}}.
\]
This scoring function is nearly equivalent to the entropy one we have shown
earlier except that a more flexible prior of A is allowed here. A temptation here
is to use a prior on |A| directly, but this overlooks the “entropy number”, i.e., the
number of different A’s that can give rise to the same value of |A|.
Liu et al. (2002) present an algorithm called MDscan for motif-finding based not
only on sequence data but also on gene expression information from microarray
experiments. Since the true p0 is rarely known in practice, they propose to opti-
mize the following scoring function:
\[
\psi_{\mathrm{md}}(A) = \frac{\log(|A|)}{w} \sum_{j=1}^{w} \sum_k \theta_{jk} \log \frac{\theta_{jk}}{\theta_{0k}}. \tag{3.6}
\]
The functional form again shares some similarities to the entropy approximation
given above. Although function ψmd is not intended as an approximation to the
posterior distribution p(A | θ0, p0,S), it can still be used as a scoring function to
evaluate different configurations of A.
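For comparison with the Bayesian scores, (3.6) can be sketched as follows. This is an illustration only, not the MDscan implementation: the function name psi_md is hypothetical, and we assume θjk = Njk/|A| without pseudo-counts, with the usual convention that 0 log 0 = 0.

```python
from math import log

def psi_md(counts, theta0):
    """Score (3.6) with theta_jk = N_jk / |A| (an assumption for this sketch)."""
    n_sites = sum(counts[0])   # |A|
    width = len(counts)        # w
    total = 0.0
    for col in counts:
        for n, t0 in zip(col, theta0):
            if n > 0:          # skip empty cells: 0 * log(0) is taken as 0
                theta = n / n_sites
                total += theta * log(theta / t0)
    return log(n_sites) / width * total

# Two perfectly conserved columns, 10 sites, uniform background
score = psi_md([[10, 0, 0, 0], [0, 0, 10, 0]], [0.25] * 4)
```

Because the score grows only with log(|A|) rather than |A|, adding sites is rewarded more weakly than in the entropy score.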
3.3 Optimizing a scoring function
Now that our scoring function formulation gives us a means by which to compare
the quality of different configurations of motif start sites A, we will now outline
a simple algorithm for finding the optimal configuration of start sites by finding
the A with the locally best possible score.
We accomplish our optimization of one of the scoring functions described above
by using a Metropolis algorithm-based approach (Metropolis and Ulam (1949);
Metropolis et al. (1953)).
In the Metropolis steps, we systematically scan through every element of the
matrix A, and decide whether the indicator variable at this position should be
“changed” to its opposite value. If Aij is a motif start site (Aij = 1), we remove
that site. If Aij = 0, we add a site starting at that position.
If we denote A′ as A with this change made, then we calculate the following
Metropolis ratio:
\[
r = \min\left\{ 1, \exp\left\{ \frac{\psi(A') - \psi(A)}{T} \right\} \right\}
\]
The decision to accept the change or to keep A unchanged is made with proba-
bility r and 1 − r, respectively. The scoring function ψ can be taken as any of the
scores discussed earlier in this chapter. The parameter T is called the temperature
of the algorithm, with low temperatures restricting the algorithm to accept only
small jumps and high temperatures allowing for more freedom to move around
the parameter space.
We focus on the following optimization strategy. The Temperature=0 strategy
forces the algorithm to accept only changes that immediately improve the score,
since forcing T to approach 0 then forces r to equal 0 if ψ(A′) < ψ(A) or r to equal
1 if ψ(A′) ≥ ψ(A). With this type of deterministic strategy, it is important that
we start the algorithm in an area near the mode of the density, or else our simple
hill-climbing algorithm is guaranteed to get stuck in an inferior local mode.
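The acceptance rule and its T = 0 limit can be sketched in a few lines. The function name is hypothetical, and the temperature is assumed to divide the score difference inside the exponent, which is the form consistent with the limiting behaviour described above.

```python
from math import exp

def acceptance_probability(score_new, score_old, temperature):
    """Metropolis acceptance probability r for a proposed change from A to A'."""
    delta = score_new - score_old
    if delta >= 0:
        return 1.0                 # never reject a change that does not lower the score
    if temperature == 0:
        return 0.0                 # T = 0 limit: reject every change that lowers the score
    return exp(delta / temperature)

r_cold = acceptance_probability(-5.0, -4.0, 0)     # worse score, T = 0
r_warm = acceptance_probability(-5.0, -4.0, 1.0)   # worse score, T = 1
```

At T = 1 a drop of one unit in log-posterior is still accepted about 37% of the time, whereas at T = 0 it is never accepted.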
Therefore, one would first want to run the dataset through a “first-pass” motif-finding
program (e.g. BioProspector, Consensus, AlignAce, or Meme) which
would give a set of predicted sites that are near the area of high posterior density,
and then use these predicted sites as the starting point of a T = 0 optimization
algorithm. In this scenario, our optimization strategy is intended to “clean up”
the output produced by algorithms such as BioProspector, Consensus, AlignAce
or Meme.
Those with experience in the fields of statistical computing or physics must have
recognized that our procedure is just a local hill-climbing method and can be
viewed as a special case of simulated annealing (with immediate freezing).
To optimize the scoring functions outlined in the previous section, we developed
a software package called BioOptimizer which is currently available for Unix plat-
forms. BioOptimizer takes as input both the sequence data as well as a starting
set of motif sites, such as those provided by BioProspector, Consensus, AlignAce
or Meme.
BioOptimizer systematically scans through every element of the matrix A, and
changes the indicator variable at the position Aij to its opposite value only if the
value of the scoring function is improved, i.e., accepting the change from A to A′ only if
ψ(A′) − ψ(A) > 0
where ψ is a scoring function from the previous sections.
Thus, BioOptimizer only introduces small changes to A and only accepts changes
that immediately improve the score ψ, so if the algorithm is started near an inferior
local mode then it will converge only to that inferior local mode.
Thus, the output of BioOptimizer is the set of predicted sites with the locally best
possible score. When using the exact scoring function (3.1), this set of predicted
sites is also the locally best possible fit to our model. Thus, when the exact
scoring function is used, BioOptimizer is essentially trying to improve the fit of
the predicted sites to our Bayesian motif discovery model.
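The scan described above can be sketched as a generic hill-climbing loop. This is an illustrative sketch rather than the BioOptimizer source: the function name hill_climb is hypothetical, A is assumed to be flattened into a list of 0/1 indicators, and sweeps are repeated until a full pass accepts no change.

```python
def hill_climb(indicators, score):
    """Flip one indicator at a time, keeping a flip only if the score strictly improves."""
    current = list(indicators)
    best = score(current)
    improved = True
    while improved:                          # stop after a sweep with no accepted change
        improved = False
        for i in range(len(current)):
            current[i] = 1 - current[i]      # propose A' by flipping position i
            candidate = score(current)
            if candidate > best:
                best = candidate             # accept: psi(A') - psi(A) > 0
                improved = True
            else:
                current[i] = 1 - current[i]  # reject: restore the old value
    return current, best

# Toy score: negative Hamming distance to a hypothetical "true" configuration
target = [1, 0, 1, 0, 0, 1]
toy_score = lambda a: -sum(x != y for x, y in zip(a, target))
final, final_score = hill_climb([0, 0, 0, 0, 0, 0], toy_score)
```

The loop always terminates because the score strictly increases with every accepted flip and there are finitely many configurations. On a multimodal score it stops at whichever local mode is reachable from the starting point.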
If we use the exact scoring function (3.1), then the difference ψexact(A′)−ψexact(A)
that determines whether or not the change is accepted is given by the intuitive
and simple formula (3.7) in the case of adding a motif site, and a corresponding
formula (3.8) in the case of removing a motif site.
If we let A′ be the same as A except for the addition of one site with nucleotides
(r1, r2, . . . , rw), then the difference in exact scores reduces to
\[
\psi_{\mathrm{exact}}(A') - \psi_{\mathrm{exact}}(A) = \log\left( p \prod_{j=1}^{w} \frac{\theta_{j r_j}}{\theta_{0 r_j}} \right) \tag{3.7}
\]
where p = (|A|+1)/(N−|A|) and N is the total number of potential site locations.
For the site we are potentially adding, we can view the numerator of the product
in (3.7) as the probability of the site being a motif, while the denominator is the
probability of the site being background. Thus, we only accept the addition of the
site if the ratio of the probability of motif to background, weighted by the estimated
motif abundance p, exceeds one (equivalently, if the ratio exceeds 1/p), which has an
intuitive appeal.
Removing a motif site involves an analogous formula to (3.7). If we instead let A′
be the same as A except for the removal of one site with nucleotides (r1, r2, . . . , rw),
then the difference in exact scores reduces to
\[
\psi_{\mathrm{exact}}(A') - \psi_{\mathrm{exact}}(A) = \log\left( \frac{1}{p} \prod_{j=1}^{w} \frac{\theta_{0 r_j}}{\theta_{j r_j}} \right) \tag{3.8}
\]
where now p = |A|/(N − |A| + 1) and N is the total number of potential site
locations. Again, we see that accepting the change involves comparing the ratio
of the motif vs background to the estimated motif abundance.
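The shortcut (3.7) can be checked numerically against a full recomputation of the score. The sketch below assumes the version of the exact score with p0 integrated out under a Uniform(0, 1) prior, since that is the form whose Beta term yields p = (|A|+1)/(N−|A|). It also assumes a common pseudo-count vector β for all columns. All names are hypothetical and the counts are invented for illustration.

```python
from math import lgamma, log

def psi_exact_integrated(counts, bg_counts, beta, theta0, n_positions):
    """Exact log-posterior of A with p0 integrated out; n_positions is N in (3.7)."""
    n_sites = sum(counts[0])
    beta_total = sum(beta)
    # log B(|A| + 1, N - |A| + 1); the normalizing B(1, 1) equals 1
    score = lgamma(n_sites + 1) + lgamma(n_positions - n_sites + 1) - lgamma(n_positions + 2)
    score += sum(n0 * log(t0) for n0, t0 in zip(bg_counts, theta0))
    for col in counts:
        score += sum(lgamma(n + b) for n, b in zip(col, beta)) - lgamma(n_sites + beta_total)
    return score

def delta_add(site, counts, beta, theta0, n_positions):
    """Right-hand side of (3.7); site[j] is the nucleotide index r_j at motif position j."""
    n_sites = sum(counts[0])
    beta_total = sum(beta)
    total = log((n_sites + 1) / (n_positions - n_sites))       # log p
    for col, r in zip(counts, site):
        theta = (col[r] + beta[r]) / (n_sites + beta_total)    # theta_{j r_j} before adding
        total += log(theta / theta0[r])
    return total

beta, theta0, n_positions = [1.1] * 4, [0.3, 0.2, 0.2, 0.3], 400
before = ([[5, 1, 1, 1], [1, 5, 1, 1]], [100, 120, 90, 110])
after = ([[6, 1, 1, 1], [1, 6, 1, 1]], [99, 119, 90, 110])      # one added site "A, C"
full_difference = (psi_exact_integrated(*after, beta, theta0, n_positions)
                   - psi_exact_integrated(*before, beta, theta0, n_positions))
shortcut = delta_add([0, 1], before[0], beta, theta0, n_positions)
```

The shortcut needs only w table lookups per candidate site, which is what makes a full scan over every position of A affordable.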
Although BioOptimizer only does local optimization, it has two basic advan-
tages: (a) it can compare motifs predicted by different motif-finding algorithms
and find the best one among them; and (b) it further improves the motif predic-
tion resulting from any of the current algorithms we have tested, e.g., BioProspec-
tor, Consensus, AlignACE, and Meme. These existing algorithms are proficient
at finding good configurations of A, and each program has its own advantages
in real-data situations. In Chapter 4, we demonstrate that BioOptimizer has im-
proved the motif site predictions in almost all cases.
Briefly considering other Metropolis strategies, the Temperature=1 strategy is equi-
valent to sampling from the posterior distribution, if the score function is the ex-
act log-posterior. But for other types of score functions, this approach imposes a
target density on the parameter space, which may or may not be desirable. One
can run this algorithm over many iterations and analyze the Monte Carlo samples
so obtained. We did not implement this strategy because of an overlap of
the effort with previous approaches such as Gibbs Motif Sampler, AlignAce, and
BioProspector.
A Simulated Annealing (Kirkpatrick et al., 1983) strategy combines deterministic
and stochastic strategies by starting the algorithm at a high temperature such as
T = 4 and then slowly decreasing the temperature down to T = 0 as the al-
gorithm continues through many iterations through all positions of A. For the
current thesis, we restrict ourselves to the goal of the T = 0 strategy, i.e., de-
terministic improvement upon the output from current motif-finding algorithms
such as BioProspector.
3.4 Using Scoring Functions to Extend the Model
In addition to comparing and optimizing sets of predicted motif sites, the scor-
ing function formulation can also be used to extend our model to relax several
assumptions which are necessary for the current motif-finding programs such as
BioProspector, Consensus, AlignAce or Meme to operate. Several of these exten-
sions were discussed in the previous chapter and scoring function formulation
now allows us to implement these more general models.
3.4.1 Overlapping Motif Sites
The posterior distribution of interest (2.4) and the corresponding exact scoring
function (3.1) are based upon the implicit assumption that motifs are not allowed
to overlap each other and so any given nucleotide can contribute at most once to
the motif matrix. A simple modification to the exact scoring function (3.1) can be
made to allow some nucleotides to contribute more than once to the motif matrix
and thus correctly allow for overlapping motifs. The current implementation of
our BioOptimizer software does not have a restriction against overlapping motifs.
3.4.2 Unknown Motif Site Abundance
Most current motif-finding programs have an assumption of known motif abundance,
which is either fixed or can be entered by the user. In Section 2.6.1 we
described a posterior distribution with a motif abundance parameter p0 that was
not fixed and known, but was allowed to vary with a certain prior distribution
p0 ∼ Uniform(0, 1).
We can then integrate the random variable p0 out of our model,
which will leave us with a posterior distribution that no longer depends on pre-
specified site abundance, and the corresponding scoring function
\[
\psi'_{\mathrm{exact}}(A) = \log B_{1,1}(|A|, L-|A|) + \sum_{k=1}^{4} N_{0k} \log \theta_{0k} + \sum_{j=1}^{w} \log \frac{\prod_{k=1}^{4} \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta_j|)} \tag{3.9}
\]
where Ba,b(c, d) is the Beta function as in the previous chapter. Here again L =
N − (w − 1)m, where N is the total number of nucleotides and m is the number
of sequences. L is the total number of possible site positions, since sites are not
allowed to overlap the ends of a sequence.
We can again use the Stirling formula (3.2) to approximate the Γ(·) functions as
well as log[B1,1(|A|, L− |A|)] so that we have
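\[
\psi'_{\mathrm{stir}}(A) = |A| \left[ \mathrm{logit}(p_0) - 1 + \sum_{j=1}^{w} \sum_k \theta_{jk} \log\left( \frac{\theta_{jk}}{\theta_{0k}} \right) \right] - \frac{3}{2} w \log(|A|+|\beta|-1) \tag{3.10}
\]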
where p0 = |A|/L is the estimated motif abundance ratio and θjk is the same as in
Section 3.1. Removing the same additional penalty term as noted in Section 3.1,
we have our entropy scoring function with a variable motif abundance p0,
\[
\psi'_{\mathrm{ent}}(A) = |A| \left[ \mathrm{logit}(p_0) - 1 + \sum_{j=1}^{w} \sum_{k=1}^{4} \theta_{jk} \log \frac{\theta_{jk}}{\theta_{0k}} \right] \tag{3.11}
\]
Despite their more complicated mathematical form, these scoring functions are
easy to compute for any A and can be easily implemented in a BioOptimizer
program.
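The extra "− 1" inside the brackets of (3.10) and (3.11) comes from the Beta term. A rough sketch of that step follows, assuming leading-order Stirling (log n! ≈ n log n − n), discarding lower-order terms such as log L, and assuming a small abundance ratio p0 = |A|/L so that log(1 − p0) ≈ −p0:

```latex
\begin{align*}
\log B_{1,1}(|A|, L-|A|)
  &= \log \frac{|A|!\,(L-|A|)!}{(L+1)!} \\
  &\approx |A|\log|A| + (L-|A|)\log(L-|A|) - L\log L \\
  &= |A|\log p_0 + (L-|A|)\log(1-p_0) \\
  &= |A|\,\mathrm{logit}(p_0) + L\log(1-p_0) \\
  &\approx |A|\left[\mathrm{logit}(p_0) - 1\right].
\end{align*}
```

The last line uses L log(1 − p0) ≈ −L p0 = −|A|, so the approximation is best when motif sites are sparse relative to the number of candidate positions.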
3.4.3 Unknown Motif Width
We can also consider extending our model to allow the width of our unknown
motif to vary. This extension is useful since, in real datasets, there is often very
little known about the motif width a priori. Current motif-finding programs such
as BioProspector, Consensus or AlignAce force the user to input a motif width
that is fixed for the entire run of the program. Meme is the lone exception that
allows the motif width to vary.
We can instead let the motif width w be a random variable that has a prior distri-
bution p(w), which will give us several extra terms in the scoring function (3.9)
for our exact log-posterior density,
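\[
\begin{aligned}
\psi_{\mathrm{exact}}(A, w) &= \log p(w) + \log B_{1,1}(|A|, L-|A|) + \sum_k N_{0k} \log \theta_{0k} \\
&\quad - w \log\left[ \frac{\Gamma(|A|+|\beta|)}{\Gamma(|\beta|)} \right] + \sum_{j=1}^{w} \sum_k \log\left[ \frac{\Gamma(N_{jk}+\beta_{jk})}{\Gamma(\beta_{jk})} \right]
\end{aligned}
\tag{3.12}
\]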
This exact scoring function also has a corresponding Stirling approximation,
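\[
\begin{aligned}
\psi_{\mathrm{stir}}(A, w) &\approx \log p(w) + |A| \left[ \mathrm{logit}(p_0) - 1 + \sum_{j=1}^{w} \sum_k \theta_{jk} \log\left( \frac{\theta_{jk}}{\theta_{0k}} \right) \right] \\
&\quad - \sum_{j=1}^{w} \sum_k \left(\beta_{jk} - \tfrac{1}{2}\right) \log\left[ \frac{\beta_{jk}-1}{|\beta|-1} \right] - \frac{3}{2} w \log\left[ \frac{|A|+|\beta|-1}{|\beta|-1} \right]
\end{aligned}
\tag{3.13}
\]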
where again θjk is the same as in Section 3.1.
A natural prior distribution for w would be the Poisson(w0), where w0 represents
our a priori expectation for the motif width. One could also consider other prior
distributions for w, such as Geometric(w0) or Exponential(w0).
Note that these scores are not only a function of the set of predicted sites A but
also the motif width w, so we can now consider an optimization algorithm for
not only the set of predicted sites, but also the motif width.
In addition to the Metropolis algorithm for optimizing A presented in Section
3.3, we can now also propose small changes to the motif width w and accept
these changes if they improve the score. Specifically, we consider either adding
(w′ = w + 1) or deleting (w′ = w − 1) a position from the current motif and check
whether such a change increases the score, i.e., we accept the change only if
ψ(A, w′) − ψ(A, w) > 0
where again ψ can be any scoring function. Our BioOptimizer software uses
the exact scoring function (3.12). The above procedure and the usual procedure
for optimizing A are iteratively repeated until no further changes to A or w are
accepted, at which point we consider A and w to be “optimized.”
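The width step can be sketched as a greedy search over w with the site configuration A held fixed. This is an illustration under assumptions: the function name optimize_width and the bounds w_min and w_max are hypothetical, and the sketch does not specify at which end of the motif a position is added or deleted.

```python
def optimize_width(score, w, w_min=1, w_max=50):
    """Move to w - 1 or w + 1 while doing so strictly improves score(w)."""
    best = score(w)
    while True:
        candidates = [c for c in (w - 1, w + 1) if w_min <= c <= w_max]
        if not candidates:
            return w, best
        top_score, top_w = max((score(c), c) for c in candidates)
        if top_score <= best:          # no neighbouring width improves the score
            return w, best
        w, best = top_w, top_score

# Toy score peaked at width 12, standing in for psi(A, w) with A fixed
toy_score = lambda width: -(width - 12) ** 2
width, value = optimize_width(toy_score, 8)
```

In a full procedure this step would alternate with the site scan of Section 3.3 until neither step accepts a change.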
3.4.4 Two-Block Motifs
As mentioned in Section 2.6.3, our model can be extended to motifs that consist of
two contiguous blocks separated by a gap of non-conserved nucleotides that can
vary in length. Of the current motif-finding programs introduced in Chapter 1,
only BioProspector has the flexibility to find two-block motifs with variable gap,
but with the usual restrictions that the width of each block must be fixed and
known.
In Section 2.6.3, we presented a model that not only allows the gaps to vary, but
also the widths of each block as well. The exact scoring function for the posterior
distribution (2.10) is
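\[
\begin{aligned}
\psi_{\mathrm{exact}}(A, G, w) &= \log p(A, G, w_1, w_2 \mid S, \theta_0, B) \\
&= \log p(w_1) + \log p(w_2) + \log B_{1,1}(|A|, N-|A|) + \sum_{k=1}^{4} N_{0k} \log \theta_{0k} \\
&\quad + \sum_{j=1}^{w_1+w_2} \log\left[ \frac{\Gamma(|\beta_j|)}{\prod_{k=1}^{4} \Gamma(\beta_{jk})} \cdot \frac{\prod_{k=1}^{4} \Gamma(N_{jk}+\beta_{jk})}{\Gamma(|A|+|\beta_j|)} \right]
\end{aligned}
\tag{3.14}
\]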
with the implicit restriction that each gi lies within the interval [G1, G2].
This exact score (3.14) is a function of the predicted sites A, the gaps G between
the two blocks at each site, and the two block widths w1 and w2. We have im-
plemented a two-block version of BioOptimizer that optimizes not only the set of
predicted sites A, but also the gap lengths G (within pre-specified limits [G1, G2])
as well as the two block widths w1 and w2.
3.5 Detecting Poor Motifs with the Null Score
A disadvantage of motif-finding programs such as BioProspector, Consensus,
AlignAce or Meme is that they will always output a predicted motif, even when
a real motif is not present. As mentioned earlier in the chapter, the scoring
function formulation gives us a principled method by which to compare differ-
ent sets of predicted sites. We can utilize this same benefit to provide a diagnostic
for very poor motif signals found by the usual motif-finding programs.
In addition to comparing different motifs based on their BioOptimizer score, we
can also compare these motifs to the BioOptimizer score generated by a matrix
A containing no predicted sites, i.e., Aij = 0 for all i and j. This “null score”
serves as a minimal criterion for any predicted motif. We should not be confident
about any predicted motifs that have a score lower than the “null score”, since
this essentially means the motif signal has less posterior value than no signal at
all.
As in our simulation study in Section 4.3, BioOptimizer will often converge to a
null motif with no sites when given a poor starting point, which implies that the
starting point had a lower score than the null score.
In the case where BioOptimizer does not converge to a motif with no sites, the
null score is still calculated for comparison. The final BioOptimizer score may
still be less than the null score if the algorithm is stuck in an inferior local mode
from which it cannot reach the null configuration. Since BioOptimizer also allows the
motif width to vary, the null score is calculated based on a motif of width w0, the
a priori expected motif width.
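With no sites, the last two terms of (3.12) vanish and log B1,1(0, L) reduces to −log(L + 1), so the null score has a simple closed form. The sketch below assumes a Poisson(w0) prior on the width; the function names are hypothetical and the nucleotide counts are invented for illustration.

```python
from math import lgamma, log

def log_poisson(w, w0):
    """Log of the Poisson(w0) probability mass at w."""
    return w * log(w0) - w0 - lgamma(w + 1)

def null_score(total_counts, theta0, n_positions, w0):
    """Score (3.12) for an empty A at width w0; all nucleotides count as background."""
    score = log_poisson(w0, w0)                  # log p(w0)
    score -= log(n_positions + 1)                # log B_{1,1}(0, L) = -log(L + 1)
    score += sum(n * log(t) for n, t in zip(total_counts, theta0))
    return score

def beats_null(motif_score, total_counts, theta0, n_positions, w0):
    """True if a predicted motif scores above the empty configuration."""
    return motif_score > null_score(total_counts, theta0, n_positions, w0)

# 20 sequences of 200 bp with w0 = 8, so L = N - (w0 - 1) m = 4000 - 7 * 20
n_positions = 20 * 200 - (8 - 1) * 20
baseline = null_score([1000] * 4, [0.25] * 4, n_positions, 8)
```

Any motif score passed to beats_null must be computed with the same constants retained, since the comparison is on the absolute log-posterior scale.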
Chapter 4
Motif Discovery Results
4.1 Simulation Comparison of Scoring Functions
In Sections 3.1 and 3.2, we outlined a few scoring functions that could be used in
a motif-finding algorithm: the exact log-posterior as in (3.1), its Stirling approxi-
mation as in (3.3), its entropy approximation as in (3.4), the scoring function (3.6)
used by MDscan (Liu et al., 2002), and the information-content function (3.5)
used by Consensus. We designed the following simulation study to investigate
the relative ability of each scoring function to find unknown motif sites under
various sequence conditions.
Since ψinfo is only suitable for the case in which the number of sites is known, we
only compared the effectiveness of the first four scoring functions. We include
the MDscan scoring function here since we are interested in evaluating its perfor-
mance against the other scoring functions, though it is not an approximation to a
posterior distribution.
In order to investigate specifically the ability of each scoring function to improve
motif site prediction, our scoring functions for this simulation study assumed a
known motif width fixed at the true value.
Each simulated dataset consisted of 20 sequences of 200 bps each, with each se-
quence containing exactly one true motif. Datasets were generated multiple (200)
times under each combination of the following conditions:
1. Width of motif: short (8 base pairs) or long (16 base pairs)
2. Degree of motif conservation: high (91%) or low (70%)
High conservation means that each column of the true motif matrix had a dom-
inant nucleotide with 91% probability (all others 3% equally). Low conservation
means that each motif position had a dominant nucleotide with 70% probability
(all others 10% equally). These values of 91% and 70% were chosen somewhat
arbitrarily, but are reasonable when compared to discovered motifs in bacteria
(Sections 4.4-4.5). The number of simulated datasets was limited to 200 due to
the time required by BioProspector to discover motifs in each dataset.
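One way to generate a dataset of this kind is sketched below. Several details are assumptions not stated in the text: a uniform background, a randomly chosen consensus, and a uniformly random site location within each sequence. The function name simulate_dataset is hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def simulate_dataset(n_seqs=20, seq_len=200, width=8, conservation=0.91, seed=0):
    """Return (consensus, sequences, starts) with exactly one planted site per sequence."""
    rng = random.Random(seed)
    consensus = [rng.choice(NUCLEOTIDES) for _ in range(width)]
    off_target = (1.0 - conservation) / 3.0      # 3% each at 91%, 10% each at 70%
    sequences, starts = [], []
    for _ in range(n_seqs):
        seq = [rng.choice(NUCLEOTIDES) for _ in range(seq_len)]   # uniform background
        start = rng.randrange(seq_len - width + 1)                # site fits inside sequence
        for j, dominant in enumerate(consensus):
            weights = [conservation if nt == dominant else off_target for nt in NUCLEOTIDES]
            seq[start + j] = rng.choices(NUCLEOTIDES, weights=weights)[0]
        sequences.append("".join(seq))
        starts.append(start)
    return "".join(consensus), sequences, starts

consensus, sequences, starts = simulate_dataset(width=16, conservation=0.70, seed=1)
```

Fixing the seed makes each simulated dataset reproducible, which is convenient when the same data must be passed to several programs.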
We also compared the effects of the prior distribution on Θ by using two different
sizes of pseudo-counts, βjk = 2 vs. βjk = 1.1. This comparison will affect the three
scoring functions derived from our complete Bayesian model, but will not affect
ψmd since no prior distribution was involved in its derivation.
We tested the effect of the scoring function optimization strategy for improv-
ing the results from BioProspector. BioProspector was run on each dataset and
the best motif result was retained. We then applied our optimization algorithm,
based upon each of the four scoring functions mentioned above, to this best Bio-
Prospector result. The motif result from each optimization algorithm was also
retained after the optimization algorithm had converged.
Table 4.1 gives the accuracy of the results from algorithms using each of the four
scoring functions. Accuracy is measured by two statistics: the percentage of correct
sites found, and how closely the motif consensus found matches the true motif
consensus. Accuracy of Predicted Sites is the percentage of true sites found in
each simulated dataset, averaged over all simulated datasets. Shifting of up to
3 base pairs was allowed. Consensus Match is the proportion of datasets where
the consensus found matches the true consensus (up to 2 mismatched/shifted
letters allowed when w = 8 and 4 allowed when w = 16). The average number of
predicted sites is given in parentheses in the table.
Table 4.1: Simulation comparison of scoring function optimizations
Accuracy of Predicted Sites (Average |A|)
Prior   Motif  Conser-  BioProspector  Scoring Function Optimization
Counts  Width  vation   Results        Exact      Stirling   Entropy    MDscan
1.1     8      91       79 (18)        80 (18)    81 (19)    81 (20)    80 (18)
2       8      91       79 (18)        80 (18)    80 (18)    67 (15)    80 (18)
1.1     8      70       9 (15)         8 (8)      10 (11)    3 (2)      12 (19)
2       8      70       9 (15)         1 (0)      1 (0)      0 (0)      12 (19)
1.1     16     91       85 (17)        91 (19)    91 (20)    91 (23)    80 (16)
2       16     91       84 (17)        91 (20)    91 (20)    91 (24)    80 (16)
1.1     16     70       41 (11)        51 (14)    59 (17)    62 (20)    43 (11)
2       16     70       41 (11)        51 (13)    54 (14)    41 (10)    43 (11)

Consensus Match (Average |A|)
Prior   Motif  Conser-  BioProspector  Scoring Function Optimization
Counts  Width  vation   Results        Exact      Stirling   Entropy    MDscan
1.1     8      91       98 (18)        98 (18)    98 (19)    98 (20)    98 (18)
2       8      91       98 (18)        98 (18)    98 (18)    82 (15)    98 (18)
1.1     8      70       22 (15)        18 (8)     22 (11)    10 (2)     26 (19)
2       8      70       22 (15)        6 (0)      6 (0)      2 (0)      26 (19)
1.1     16     91       100 (17)       100 (19)   100 (20)   100 (23)   100 (16)
2       16     91       100 (17)       100 (20)   100 (20)   100 (24)   100 (16)
1.1     16     70       86 (11)        88 (14)    90 (17)    88 (20)    88 (11)
2       16     70       86 (11)        86 (13)    88 (14)    62 (10)    88 (11)
The first conclusion we can reach is that the optimization strategy seems to improve
the accuracy of the predicted sites in comparison with the BioProspector result.
Regardless of motif width or conservation, the “accuracy of predicted sites”
is almost always higher for each scoring function compared to the BioProspector
output, except in the case of a short motif/low conservation, where no method
seems to work. Based on our “sample size” of 200, the simulation standard errors
for the percentages in Table 4.1 range from 1.5% (assuming a true percentage of
95%) up to 3.5% (assuming a true percentage of 50%).
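These standard errors follow from the usual binomial formula sqrt(p(1 − p)/n) with n = 200 datasets, as the short check below illustrates (the function name is hypothetical).

```python
from math import sqrt

def simulation_se(p, n=200):
    """Standard error of an estimated proportion p based on n simulated datasets."""
    return sqrt(p * (1.0 - p) / n)

se_high = simulation_se(0.95)   # true percentage of 95%
se_mid = simulation_se(0.50)    # true percentage of 50%, the worst case
```

Differences of a few percentage points between neighbouring columns of Table 4.1 are therefore within simulation noise.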
The results are not as dramatic for the consensus match, suggesting that the scor-
ing function optimization is primarily refining the signal that has already been
found by the Gibbs sampling-based BioProspector. Thus, it seems that this op-
timization strategy has accomplished its intended goal of “cleaning up” the Bio-
Prospector output.
In general, the algorithms do not do nearly as well for low conservation as high
conservation, especially in the case of the shorter motif. This is partly due to
the fact that the optimization algorithm is deterministically restricted to stay in
the same local mode that the BioProspector output is stuck in, and so these algo-
rithms do not have the freedom to correct a poor starting point.
For the low conservation datasets, performance is much better for a longer mo-
tif than for a shorter motif, suggesting that a certain threshold of information is
needed for the Gibbs sampling algorithm BioProspector, and consequently our
optimization algorithm, to be successful. If conservation is reduced, one needs
a longer motif for the algorithms to do well. In the case of a short motif and
low conservation, extra information (such as prior information about the motif
locations or Θ) is clearly needed.
The exact, Stirling and entropy scoring functions display similar performance in
most situations, although the entropy scoring function appears to do noticeably
worse in some cases with the larger prior pseudo-counts and is in general most
affected by a change in prior pseudo-counts.
MDscan in general doesn’t perform as well as the three Bayesian scores, except in
the case where the signal is very weak (low conservation and short motif). This
may be because in the case of a really weak signal, the prior distributions used
for the Bayesian scores swamp the weak signal so that it can’t be detected. This is
also shown by the slightly improved performance in Table 4.1 when the prior
pseudo-counts are smaller. However, in situations where prior information is ac-
tually available, the formal use of a prior distribution will allow us to incorporate
that information properly.
Overall, these simulation results for the predicted sites suggest that there is al-
most always a benefit associated with using a deterministic optimization algo-
rithm to further improve the output from a stochastic algorithm such as Bio-
Prospector, and that this benefit seems generally to be the greatest when using
the exact scoring function or one of its approximations, in terms of a reason-
able number of predicted sites and the accuracy of those sites. The additional
computational cost of the optimization algorithm is small (≈ 2 minutes for each
simulated dataset).
4.2 Real Data Comparison of Scoring Functions
We examine the performance of our different scoring functions on a dataset consisting
of 18 E. coli sequences that contain cyclic-AMP receptor protein (CRP)
binding sites. Each sequence is 105 bps long and each contains at least one 22 bps
motif site that has been experimentally determined via the footprinting method
(Lawrence and Reilly, 1990), for a total of 24 known sites in this dataset.
This dataset has been previously analyzed by Lawrence and Reilly (1990) using
an EM algorithm and Liu (1994) using a Gibbs sampler.
Similar to our strategy with the simulated datasets, we first used the program
BioProspector to find a set of initial motif sites, and then used our T = 0 optimization
strategy with one of the four scoring functions to further improve the
BioProspector result. For the first three scoring functions, prior pseudo-counts of
βjk = 1.1 were used.
Table 4.2 shows the results from these optimization algorithms, in terms of the
consensus sequence for the motif, the number of sites predicted, and the number
of predicted sites that corresponded to one of the 24 experimentally established
(“correct”) positions of the CRP binding sites. Nucleotides in the consensus se-
quence are capitalized only if they are over 75% conserved in that position.
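The capitalization rule for the consensus sequences can be sketched as follows. This is a minimal illustration, assuming the consensus letter is the most frequent nucleotide in each column of the aligned predicted sites; the function name and the toy sites are hypothetical and are not the CRP data.

```python
from collections import Counter

def consensus(sites, threshold=0.75):
    """Consensus string for equal-length aligned sites.

    The most frequent nucleotide in each column is written in upper case
    if its frequency exceeds `threshold`, and in lower case otherwise.
    """
    width = len(sites[0])
    out = []
    for j in range(width):
        counts = Counter(site[j].upper() for site in sites)
        base, count = counts.most_common(1)[0]
        conserved = count / len(sites) > threshold
        out.append(base if conserved else base.lower())
    return "".join(out)

# Hypothetical aligned sites, for illustration only.
toy_sites = ["TGTGA", "TGTGA", "TGAGA", "TCACA", "TCTGA"]
print(consensus(toy_sites))
```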
Table 4.2: Comparison of scoring function optimizations on the CRP dataset
Scoring Function   Consensus Sequence       Number of Predicted Sites   Number of Correct Sites
BioProspector      ttaTtTgAtcgaggTCACActt   9                           9 / 24
Exact              ttaTgTgAacgagtTCACAttt   15                          15 / 24
Stirling           tttTgTGAtcgagcTCACAttt   18                          18 / 24
Entropy            taaTgTgAtcgaggTCACAttt   20                          17 / 24
MDscan             ttaTgTGAacgaggTCACActt   11                          11 / 24
These real data results are similar to the ones from our simulation study. For
each scoring function, the optimization algorithm improved upon the original
BioProspector signal in terms of the number of correct sites predicted.
As shown in Table 4.2, the consensus sequences of the motifs found by using dif-
ferent scoring functions are similar. The three scoring functions (exact, Stirling
and entropy) that are closely related to the complete Bayesian model seem to per-
form noticeably better than the MDscan score, with the Stirling scoring function
performing the best in this example.
As a comparison, the “true” motif based on the alignment of the 24 experimental
sites is displayed in Figure 4.1 in the form of a sequence logo. It is seen that the
differing positions of the 5 consensus sequences in Table 4.2 correspond to the
information-weak or ambiguous positions shown in the sequence logo.
Figure 4.1: Sequence logo of known CRP sites (vertical axis: information content in bits, 0 to 2; horizontal axis: motif positions 1 to 22)
4.3 Simulation Comparison of Motif-Finding Programs
The results of the previous section suggest that our scoring function optimiza-
tion procedure improves prediction accuracy in simulated datasets, and that the
best scoring functions in this regard were the ones derived from our exact log-
posterior distribution (2.4) or a close approximation.
However, the results of the previous section were all based upon comparisons
with a single motif-finding program, BioProspector, and so an additional sim-
ulation study was undertaken to validate the superior performance of our op-
timization program BioOptimizer in motif site prediction over all four current
motif-finding programs: BioProspector, Consensus, AlignAce and Meme.
Again, since this simulation study was designed to specifically examine only
the motif site prediction of BioOptimizer relative to the other motif-finding pro-
grams, we used a version of BioOptimizer based on the exact scoring function
(3.1) for a known motif width, fixed at the true value.
In this simulation study, two hundred sequence datasets were generated under
each combination of several conditions:
1. Number of sequences: small (20 sequences) or large (100 sequences)
2. Width of motif: short (8 base pairs) or long (16 base pairs)
3. Degree of motif conservation: high or low
In each dataset, a true motif site was placed in each sequence. As in the previous
study, high conservation means that each column of the true motif matrix
had a dominant nucleotide with 91% probability (all others 3% equally), while
low conservation means that each motif position had a dominant nucleotide with
70% probability (all others 10% equally). The number of simulated datasets was
again limited to 200 due to the time required by each of the motif-finding programs.
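The two conservation levels can be illustrated with a short sampler. This is only a sketch of how one motif site could be drawn column by column under the stated probabilities; the consensus string, the seed, and the function names are hypothetical, and the thesis does not show its simulation code.

```python
import random

NUCLEOTIDES = "ACGT"

def column_probs(dominant, p_dominant):
    """Probability of each nucleotide in one motif column."""
    p_other = (1.0 - p_dominant) / 3.0
    return [p_dominant if b == dominant else p_other for b in NUCLEOTIDES]

def sample_site(dominant_bases, p_dominant, rng):
    """Draw one motif site, one independent column at a time."""
    return "".join(
        rng.choices(NUCLEOTIDES, weights=column_probs(d, p_dominant))[0]
        for d in dominant_bases
    )

rng = random.Random(0)        # arbitrary seed
true_consensus = "TGTGACGA"   # hypothetical 8 bp motif
high = [sample_site(true_consensus, 0.91, rng) for _ in range(20)]  # 91% / 3% / 3% / 3%
low = [sample_site(true_consensus, 0.70, rng) for _ in range(20)]   # 70% / 10% / 10% / 10%
```

Each sampled site would then be placed at a random position in a background sequence; that step is omitted here.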
For each simulated dataset, we applied the motif-finding programs BioProspec-
tor, Consensus, AlignAce and Meme, and compared the results in terms of pre-
dicted sites to the true site locations. BioOptimizer was then applied separately
to each set of BioProspector, Consensus, AlignAce, and Meme results, and the
optimized results were also compared to the true site locations.
We compared the performances (Table 4.3) of these algorithms in terms of accu-
racy of predicted sites, which again is the percentage of true sites found (shifting
of up to 3 bps again allowed) in each simulated dataset averaged over all simu-
lated datasets, and total number of predicted sites (|A|).
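The accuracy measure for a single simulated dataset can be sketched as below. This is a minimal illustration, assuming sites are recorded as (sequence index, start position) pairs and that a true site counts as found when some prediction in the same sequence starts within 3 bps of it; the thesis does not spell out its matching code, so the details are assumptions.

```python
def percent_true_sites_found(true_sites, predicted_sites, max_shift=3):
    """Percentage of true sites matched by a prediction in the same sequence.

    Both arguments are collections of (sequence_index, start_position) pairs.
    """
    found = 0
    for seq, start in true_sites:
        if any(s == seq and abs(p - start) <= max_shift for s, p in predicted_sites):
            found += 1
    return 100.0 * found / len(true_sites)

# Hypothetical site positions, for illustration only.
true_sites = [(0, 40), (1, 12), (2, 77), (3, 5)]
predicted = [(0, 42), (1, 30), (2, 74), (2, 10)]
accuracy = percent_true_sites_found(true_sites, predicted)
n_predicted = len(predicted)  # the total number of predicted sites, |A|
```

The table entries would then be averages of these two quantities over the 200 simulated datasets for each condition.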
Table 4.3: Simulation comparison of motif-finding programs
                                                                 Average % of true sites found (Average |A|)
Motif Width   Conservation   True # of Sites   First-Pass Program   First-Pass   BioOptimizer
short         high           20                AlignACE             59 (17)      64 (15)
short         high           20                BioProspector        79 (18)      81 (19)
short         high           20                Consensus            79 (17)      81 (18)
short         high           20                MEME                 81 (18)      81 (18)
short         high           100               AlignACE             50 (55)      70 (76)
short         high           100               BioProspector        68 (74)      81 (88)
short         high           100               Consensus            17 (24)      27 (30)
short         high           100               MEME                 49 (50)      80 (87)
long          high           20                AlignACE             90 (18)      93 (19)
long          high           20                BioProspector        85 (17)      92 (19)
long          high           20                Consensus            90 (18)      92 (19)
long          high           20                MEME                 91 (18)      92 (19)
long          high           100               AlignACE             89 (90)      91 (92)
long          high           100               BioProspector        85 (86)      91 (92)
long          high           100               Consensus            50 (50)      91 (92)
long          high           100               MEME                 50 (50)      91 (92)
long          low            20                AlignACE             27 (14)      30 (8)
long          low            20                BioProspector        39 (11)      46 (12)
long          low            20                Consensus            37 (9)       44 (11)
long          low            20                MEME                 45 (11)      48 (12)
long          low            100               AlignACE             34 (44)      44 (48)
long          low            100               BioProspector        38 (41)      54 (59)
long          low            100               Consensus            45 (48)      54 (59)
long          low            100               MEME                 45 (48)      54 (58)
As shown in Table 4.3, this simulation study demonstrates again that BioOptimizer
matches or improves the accuracy of the motif site prediction of AlignAce, BioProspector,
Consensus, and Meme alone for all combinations of motif length,
conservation level, and number of sequences. As in Section 4.1, the simulation
standard errors for the percentages in Table 4.3 range from around 1.5% up
to 3.5%, which are generally quite small compared to the differences in percent-
age between BioOptimizer and each first-pass program. The number of predicted
sites is also generally closer to the truth for BioOptimizer over any of the motif-
finding programs alone.
In addition to a clear gain in accuracy from using BioOptimizer, it is also worth
noting that the accuracy seems to be generally best when using BioProspector or
Meme as a starting point compared to Consensus and AlignAce.
For the cases with short motifs and low conservation, the performance of all
motif-finding programs was very poor. In most of these cases, none of the first-pass
algorithms was able to detect the true motif signal, and BioOptimizer
did not improve upon these results.
In many of these weak signal cases, it was observed that the BioOptimizer algo-
rithm would start from the incorrect signal (based completely on false positive
motif sites) found by a first-pass algorithm and converge to a motif configuration
A with no sites. This may be an added benefit of BioOptimizer over other motif-
finding programs in that BioOptimizer will not tend to give the false impression
of a real motif signal when in fact the correct motif signal has not been found.
4.4 Real Data BioOptimizer Evaluation: One-Block
We examined two sequence datasets, each of which contains a one-block tran-
scription factor binding motif. The first dataset is for the transcription factor
Spo0A in the bacterium B. subtilis. The Spo0A sequence dataset consists of the
200 bp upstream regions of 70 genes that showed preferential hybridization to
the Spo0A protein in Chromatin Immuno-Precipitation experiments (Molle et al.,
2003). We have 20 Spo0A binding sites that have been confirmed experimentally
and can be used to validate our strategy.
There is some prior information about the Spo0A binding motif. The literature
consensus (Strauch et al., 1990) is thought to be a 7-mer, although the true width
of the motif has not been firmly established. Also, it is not known whether or
not the orientation of the bound protein (relative to the gene) is relevant, so we
need to look for sites in both the forward (5′ → 3′) and the reverse complement
strands.
The second dataset is the same CRP dataset used in Section 4.2. This dataset has
been previously analyzed by Stormo and Hartzell (1989), Lawrence and Reilly
(1990) and Liu (1994). Their analyses focused on detecting sites for a motif with a
width of 22 base pairs, which we will use as our prior expectation w0, but we will
let the true width be inferred by the algorithm.
As outlined in Section 3.3, our basic strategy is to use a current motif-finding
program, such as AlignAce, BioProspector, Consensus, or Meme to find a good
configuration of motif start sites A, and then use our optimization program BioOp-
timizer to improve the score of the motif.
When the exact scoring function is used, BioOptimizer is essentially trying to im-
prove the fit of the predicted sites to our Bayesian motif discovery model. Since
the performances of these motif-finding programs vary with datasets, BioOpti-
mizer has the advantage of being able to build upon motif results from all of
these different first-pass programs.
In most cases, the motif width is not known a priori but must be fixed when using
a first-pass program such as BioProspector, Consensus, or AlignAce. Our strategy
is to collect the motif results from each first-pass program using each of several
different motif widths, and then apply our optimization program BioOptimizer
to each result separately. BioOptimizer will then optimize each motif result with
respect to both the predicted sites and the unknown motif width, as well as pro-
viding an optimal score for each motif result that can be used to compare between
motif results. The “best” motif would then be the motif result with the greatest
BioOptimizer score.
For the spo0A dataset, we ran BioProspector separately for motif widths varying
from 7 to 12 bps, each time collecting the top 5 motif predictions. We also ran
Consensus and AlignAce for each of these motif widths and collected the top 5
motif results. For the CRP dataset, we collected the top 5 motif results from Bio-
Prospector, Consensus, and AlignAce for each fixed motif width between 20 and
24 bps.
BioOptimizer was then applied to each of these motif results, giving us a total
of 30 optimized spo0A motifs (6 widths × top 5) and 25 optimized CRP motifs
(5 widths × top 5) for each of our first-pass programs BioProspector, Consensus
and AlignAce. Meme has the built-in capability to try different motif widths, so
we collected the top 5 motifs from Meme directly.
Table 4.4 shows the BioOptimizer results with the best score from each first-pass
program for both datasets. Motif predictions from the first-pass programs (used
as BioOptimizer input) are also shown. In addition to the motif width w, con-
sensus sequence and number of predicted sites |A|, we also provide “% True”,
which is the percentage of experimentally-confirmed sites in each dataset that
was found by each algorithm.
Table 4.4: Comparison of motif predictions for one-block datasets
                                     Results from First-Pass Program   Best BioOptimizer Result
TF      # of Seqs   Program          w    |A|   % True                 w    |A|   % True   Consensus
spo0A   70          BioProspector    12   40    35                     12   50    60       TTTGTCGAAaaa
                    Consensus        11   38    50                     11   47    50       TTTGTCgAAaa
                    AlignAce         9    28    30                     12   49    60       tTTGTCGAAaaa
                    Meme             15   38    35                     14   50    55       TTTGTCGAAaaatg
CRP     18          BioProspector    22   11    43                     24   13    57       AtttaTgTGAtcgaggTCACActt
                    Consensus        24   13    57                     24   13    57       AtttaTgTGAtcgaggTCACActt
                    AlignACE         24   10    43                     24   13    57       AtttaTgTGAtcgaggTCACActt
                    Meme             20   18    70                     19   18    70       TGTgAacgagttCACAttt
We see from the table that the identical optimal CRP motif resulted from three
different starting configurations in terms of both motif width and actual binding
sites. However, as noted in Section 3.3, different starting points are not
guaranteed to converge to the exact same optimal configuration, as we see in the
Spo0A results where very different starting configurations led to very similar but
not identical optimal motifs.
In general, BioOptimizer leads to more consistent results even when started from
BioProspector, AlignAce, Meme, or Consensus results that differ in both motif
width and consensus sequence. This is a reassuring result, since there are many
cases in practice where little is known a priori about a binding motif, including its
width.
For both datasets, the optimal motif width seems to be longer than our prior
expectations. It also appears that the binding motif of CRP actually consists of
two highly-conserved blocks with a gap of less-conserved nucleotides, so we will
revisit the CRP dataset in Section 4.5 for our two-block motif results.
Upon examination of the Spo0A results, we see that the number of predicted sites
is generally close to the number of sequences, but there are some sequences in
both of these datasets that do not have site predictions. This is a natural conse-
quence of our model, since we might expect several false-positive sequences in
our dataset, as mentioned in Section 2.4.
For both datasets, the use of BioOptimizer increased the proportion of true sites
found compared to the motif results from one of the first-pass programs alone,
suggesting that BioOptimizer has improved the accuracy of the motif results for
both CRP and Spo0A.
4.5 Real Data BioOptimizer Evaluation: Two-Block
We examined datasets for four two-block transcription factors σE, σF, σH and σK
in the bacterium B. subtilis. Given the results in the previous section, we also
re-examined the CRP dataset to see if we could find the CRP binding motif in two
short blocks instead of one long block.
Microarray experiments (Eichenberger et al., 2003) comparing wild-type B. subtilis
cells to cells where the gene for σE had been inactivated and to cells where σE
was overexpressed were used to identify 155 transcriptional units (operons) as
direct targets of the σE binding protein. Our σE dataset consisted of the 200 bp
upstream regions from these 155 transcriptional units. Our σF dataset (S. Wang,
P. Eichenberger and R. Losick, personal communication), σH dataset (Britton et al.,
2002), and σK dataset (P. Eichenberger and R. Losick, personal communication),
consisted of 38, 46 and 76 upstream regions respectively, each found by a similar
set of experiments.
Some prior information is available for each of these two-block binding motifs.
Helmann and Moran Jr. (2002) give the consensus of σE as ATa (block 1) and
cATAcanT (block 2) with a gap of 16-18 bps, the consensus of σF binding mo-
tif as GywTA and GgnrAnAnTw with a gap of 15 bps, the consensus of σH as
RnAGGAawWW and RnnGAAT with a gap of 11-12 bps, and the consensus of σK
as AC and CATAnnnT with a gap of 16-18 bps.
Since neither AlignAce, Consensus, nor Meme can be used to find a two-block
motif, we used only BioProspector as a first-pass program.
For each σ dataset, BioProspector was used to find good starting configurations
under a variety of fixed block widths ranging from 5 to 9 base pairs, and several
different gap ranges (11-13 bps, 12-14 bps, 13-15 bps), resulting in 75 predicted
motifs (the top 5 motifs for each of 5 different widths and 3 different gap ranges).
For the CRP dataset, we specified fixed block widths from 5 to 7 bps with shorter
gap ranges (4-6 bps, 5-7 bps, 6-8 bps), so 45 motifs in total were predicted by
BioProspector.
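The motif counts quoted above follow from enumerating one BioProspector run per combination of block width and gap range and keeping the top five motifs from each run. A minimal sketch of that bookkeeping, assuming both blocks share the same fixed width within a run (the helper name is ours):

```python
from itertools import product

TOP_MOTIFS = 5  # top motifs kept from each BioProspector run

def n_starting_motifs(block_widths, gap_ranges, top=TOP_MOTIFS):
    """Number of starting motifs: one run per (block width, gap range) pair."""
    runs = list(product(block_widths, gap_ranges))
    return len(runs) * top

# Sigma-factor datasets: block widths 5 to 9, three gap ranges.
sigma_total = n_starting_motifs(range(5, 10), [(11, 13), (12, 14), (13, 15)])
# CRP dataset: block widths 5 to 7, three shorter gap ranges.
crp_total = n_starting_motifs(range(5, 8), [(4, 6), (5, 7), (6, 8)])
print(sigma_total, crp_total)
```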
BioOptimizer was used to optimize and choose the best motif among all the Bio-
Prospector motif predictions for a dataset. For all the five datasets, we used a
prior expected width of 7 bps for both blocks in the BioOptimizer runs.
These best BioOptimizer motifs are shown in Table 4.5, along with the BioProspec-
tor motif result that served as their starting point. The consensus and the number
of predicted sites are given along with the dimension attribute “Dim” of the mo-
tif, defined as w1 − (gap range) − w2.
Just as in our one-block motif results, each of these two-block datasets has a cer-
tain number (“# True Sites”) of experimentally verified sites which were used to
validate our motif prediction accuracy. The column “% True” indicates the percentage
of experimentally-confirmed sites found by that motif result.
The BioOptimizer motif results for all four σ datasets resemble their prior con-
sensus sequence. In all five datasets, the use of BioOptimizer increased the pro-
portion of confirmed sites found when compared with the BioProspector motif
result alone. This improvement in accuracy is especially dramatic in the larger
datasets of σE and σK as well as the CRP dataset.
Table 4.5: Comparison of motif predictions for two-block datasets
                                 Results from BioProspector     Best BioOptimizer Result
TF    # of Seqs   # True Sites   Dim           |A|   % True     Dim             |A|   % True   Consensus
σE    155         59             8-(11-13)-8   106   46         11-(10-12)-11   145   80       ttgtcaTattt-ttcATAtaatg
σF    38          11             9-(11-13)-9   25    64         7-(10-12)-11    38    91       GtaTaaa-tGgcaAtAcTa
σH    46          19             7-(13-15)-7   39    68         6-(13-15)-8     80    74       aaAGGa-tagaGAAt
σK    76          35             7-(13-15)-7   58    17         5-(14-16)-11    58    57       gcACa-gcATAtgaTaa
CRP   18          23             6-(6-8)-6     17    70         5-(7-9)-7       27    91       tGTcA-CAcattt
It is also worth noting that the accuracy for the CRP dataset was improved in the
two-block analysis relative to our results in Section 4.4.
In addition to this improved accuracy, BioOptimizer also has the important fea-
ture that the motif width is treated as an unknown quantity that can vary. In the
datasets studied here, the optimal motif width found by BioOptimizer was often
substantially different from our a priori expectations.
4.6 Using Different Motif Width Prior Distributions
It is interesting to examine how different prior specifications for motif width w
affect the performance of BioOptimizer. The effect on BioOptimizer of using dif-
ferent prior distributions is a different functional form for the term log p(w) in our
variable-width exact scoring function (3.12).
We consider three different prior distributions, each with E(w) = w0:
1. w ∼ Poisson(w0): $\log p(w) = w \log(w_0) - w_0 - \log \Gamma(w + 1)$
2. w ∼ Exponential(w0): $\log p(w) = -\log(w_0) - w / w_0$
3. w ∼ Geometric(w0): $\log p(w) = -\log(w_0) + (w - 1) \log(1 - w_0^{-1})$
BioOptimizer locally optimizes the motif width by proposing changes of the form
w′ = w+1 and w′ = w−1, ie. adding or removing a position from one of the ends
of the motif. We can examine how the use of these different prior distributions
penalizes the addition of columns to our motif matrix.
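\[ \psi_{\mathrm{Pois}}(w + 1) - \psi_{\mathrm{Pois}}(w) = \log\left( \frac{w_0}{w + 1} \right) \]
\[ \psi_{\mathrm{Exp}}(w + 1) - \psi_{\mathrm{Exp}}(w) = -w_0^{-1} \]
\[ \psi_{\mathrm{Geo}}(w + 1) - \psi_{\mathrm{Geo}}(w) = \log(1 - w_0^{-1}) \]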
There are several things worth noting based upon these functional forms. The
first thing to note is that as w0 becomes larger, the exponential and geometric
penalty terms become very similar and small, since log(1 − x) ≈ −x as x → 0.
Figure 4.2 below shows the behaviour of the exponential and geometric penalties
for smaller w0.
The second thing to note is that only the Poisson penalty term involves both the
expected motif width w0 and the current motif width w. Figure 4.2 also shows
the contour plot for the Poisson penalty term as a function of both w and w0. Not
surprisingly, the penalty for increasing w is largest when w is already much larger
than w0.
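The behaviour of the three penalty terms can be checked numerically from the log prior densities listed earlier in this section. The sketch below is only an illustration: the function names are ours, and the current width w = 10 is an arbitrary choice.

```python
import math

def log_prior_poisson(w, w0):
    return w * math.log(w0) - w0 - math.lgamma(w + 1)

def log_prior_exponential(w, w0):
    return -math.log(w0) - w / w0

def log_prior_geometric(w, w0):
    return -math.log(w0) + (w - 1) * math.log(1.0 - 1.0 / w0)

def column_penalty(log_prior, w, w0):
    """Change in log p(w) when one column is added to a motif of width w."""
    return log_prior(w + 1, w0) - log_prior(w, w0)

PRIORS = [("poisson", log_prior_poisson),
          ("exponential", log_prior_exponential),
          ("geometric", log_prior_geometric)]

# Penalty for growing an (arbitrarily chosen) width-10 motif under several w0.
for w0 in (1.1, 2, 7, 22):
    penalties = {name: round(column_penalty(f, 10, w0), 3) for name, f in PRIORS}
    print(w0, penalties)
```

For large w0 the exponential and geometric penalties printed by this loop nearly coincide, while the Poisson penalty depends on both w and w0.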
We examined the performance of BioOptimizer on two one-block motif datasets,
one with a long true motif (CRP) and one with a short true motif (spo0A), using
the three different prior distributions. For each prior distribution, we also exam-
ined the use of three different values of w0. We used our a priori expected motif
widths (w0 = 22 for CRP, w0 = 7 for spo0A), as well as two small values w0 = 2
and w0 = 1.1 that were intended to provide extra penalty to the addition of more
columns.
Figure 4.2: Comparison of different prior width penalty terms (panels: “Exp and Geo Penalty”, plotted against w0, and “Poisson Penalty”, plotted against w0 and w)
We used BioOptimizer under each of these conditions (3 different priors × 3 different
w0) separately on several BioProspector starting points (25 for CRP, 30 for
spo0A), and the average change in the optimal motif width w made by BioOptimizer
for each of these conditions is given in Table 4.6.
Table 4.6: Performance of different motif width priors
CRP dataset                            Spo0A dataset
Prior     w0    Final w - Start w      Prior     w0    Final w - Start w
exp       22    0.00                   exp       7     0.20
geo       22    0.00                   geo       7     0.20
poisson   22    -0.03                  poisson   7     0.13
exp       2     -0.33                  exp       2     0.07
geo       2     -0.77                  geo       2     0.03
poisson   2     -2.30                  poisson   2     -0.23
exp       1.1   -0.90                  exp       1.1   -0.03
geo       1.1   -2.30                  geo       1.1   -0.57
poisson   1.1   -2.80                  poisson   1.1   -0.43
When using reasonable, literature-based expected motif widths of w0 = 7 for
spo0A and w0 = 22 for CRP, we observe almost no difference between the per-
formance of BioOptimizer when using the three different prior distributions. The
priors show greater differences in performance when using a small w0, which is
not surprising given the results in Figure 4.2.
It is interesting to note that the Poisson prior seems to give final motif widths that
are smaller generally than the other two prior distributions, which implies that
the penalty for adding a column is largest for the Poisson prior when w0 = 2 and
w0 = 1.1.
4.7 Special Restrictions on A in Real Data
For some applications, special restrictions on the unknown matrix of site posi-
tions A can be built into our motif discovery model (and scoring function for-
mulation) to accommodate different levels of uncertainty about different parts of
a sequence dataset. An example is the search for the SpoIIID motif in Bacillus
subtilis. The consensus sequence for the SpoIIID motif was hypothesized by Halberg
and Kroos (1994) to be aaggACAanc, based on 10 experimentally-confirmed
binding sites.
Two different types of microarray experiments were performed to compare gene
expression between wild type B. subtilis and B. subtilis with the SpoIIID protein
removed (Eichenberger et al., 2004). In addition to the usual cDNA microarray
experiment, which provides a list of genes that are potentially regulated by SpoIIID,
another microarray experiment was performed using Chromatin-Immunoprecipitation
(ChIP) technology, which provides much more certain evidence that SpoIIID
has a binding site near particular genes.
Together, these experiments lead to a total list of 89 genes that are potentially regulated
by SpoIIID, but we are more certain about 40 of these genes due to the
additional ChIP experiments. This extra information needs to be incorporated
into our procedure for finding the SpoIIID motif in the upstream sequences of
these genes.
Our solution is a motif discovery model that is a compromise between the re-
stricted model of Section 2.3 and the unrestricted model of Section 2.4 where our
model forces some of the sequences (the ones identified by the ChIP experiment)
to contain at least one site, while the other sequences (from the usual microar-
ray experiment) are unrestricted. This model is implemented using a version of
BioOptimizer that restricts specific rows of the matrix A to always contain at least
one Aij = 1.
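The row restriction can be pictured as a validity check on the 0/1 site matrix A. The sketch below is hypothetical: it does not reproduce BioOptimizer's implementation, which is not shown in the text, and the function names and the toy matrix are ours.

```python
def satisfies_restriction(A, restricted_rows):
    """True if every restricted row of the 0/1 site matrix A contains at least one site."""
    return all(any(A[i]) for i in restricted_rows)

def allowed_site_removal(A, i, j, restricted_rows):
    """True if removing the site A[i][j] keeps the configuration valid."""
    if A[i][j] != 1:
        return False  # there is no site at this position to remove
    if i in restricted_rows and sum(A[i]) == 1:
        return False  # removal would leave a ChIP sequence with no site
    return True

# Toy configuration: rows are sequences, columns are candidate start positions.
A = [
    [0, 1, 0, 0],  # row 0: ChIP sequence with one site
    [1, 0, 0, 1],  # row 1: ChIP sequence with two sites
    [0, 0, 0, 0],  # row 2: microarray-only sequence with no site
]
chip_rows = {0, 1}
print(satisfies_restriction(A, chip_rows))
```

An optimizer that proposes single-site changes could consult such a check before accepting a removal, so that unrestricted rows remain free to end up with no sites.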
However, BioProspector must still be used to find a good starting point for BioOptimizer.
BioProspector cannot be used to fit this compromise model, but can
be used to fit the restricted model where all sequences are forced to contain at
least one site, so BioProspector was used to find an initial motif using only the
ChIP sequences. BioOptimizer was then run on the full dataset using this Bio-
Prospector starting point. This procedure was repeated for a variety of motif
widths (w = 6, 7, . . . , 12), using the top five motifs found by BioProspector for
each width.
Seventeen experimentally-verified sites were available to validate our discovered
motifs, and the BioOptimizer motif with the highest proportion of known sites
predicted (9/17) was an 8 bp motif with consensus sequence gGACAaGc and a total
of 68 predicted sites in our 89 sequences. It should be noted that this motif
was not the motif with the highest BioOptimizer score. The BioOptimizer final
motif result with the highest score was a 12 bp motif with consensus sequence
ataaaAcAaGca, with 100 predicted sites across the 89 total sequences. This
“best” motif correctly predicted fewer experimentally-verified sites (7/17) and is
not as good a match to the consensus sequence postulated by Halberg and
Kroos (1994).
Chapter 5
Bayesian Motif Clustering Model
As mentioned in Chapter 1, the procedures for motif discovery described in Chapters
2-4, applied across several sequence datasets, result in a collection of discovered
motifs with count matrices {N1, . . . ,Nn}. Our focus is now to investigate
this collection for similarity between motifs based on their discovered count ma-
trices.
We use a Bayesian hierarchical model to infer common structure, in the form of
clusters, within our collection of motifs. The data for each discovered motif is
a count matrix Ni which can have different widths and number of counts com-
pared to other TF motifs. Our clustering will be based on a motif matrices with a
fixed width w, so we assume each of these n raw motif matrices should contain a
submatrix Yi, i = 1, . . . , n of dimension w× 4 that will be considered the ”central
motif” upon which the clustering will be based.
5.1 Hierarchical Framework
Hierarchical models are useful in a variety of scientific problems when the struc-
ture of the data suggests multiple levels of uncertainty. We want to include com-
ponents for both within-motif and between-motif variability of the nucleotide
counts Yijk where i indexes the motif, j indexes the w columns within each motif,
and k indexes the four possible nucleotides within each column.
Our model on the within-motif variability between different binding sites for a
count motif Yi is the same product-multinomial model assumed for motif dis-
covery in Chapter 2. We assume that each position (column) of the count matrix
Yi follows an independent multinomial distribution parameterized by the same
column of an unknown frequency matrix Θi ie.
Within-motif level:
\[ p(Y_i \mid \Theta_i) = \prod_{j=1}^{w} p(Y_{ij} \mid \theta_{ij}) \]
\[ Y_{ij} = (Y_{ija}, \ldots, Y_{ijt}) \sim \mathrm{Multinomial}(n_i, \theta_{ij} = (\theta_{ija}, \ldots, \theta_{ijt})) \]
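The within-motif level can be made concrete by evaluating the product-multinomial log-likelihood of a small count matrix. This is a minimal sketch with hypothetical counts and frequencies; the function name is ours, and matrices are stored as lists with one row of four entries per motif column.

```python
import math

def log_product_multinomial(Y, theta):
    """Log-likelihood of a w x 4 count matrix Y given a w x 4 frequency matrix theta.

    Each motif column is treated as an independent multinomial draw.
    """
    total = 0.0
    for counts, probs in zip(Y, theta):
        n = sum(counts)
        # log multinomial coefficient: n! / (y_a! y_c! y_g! y_t!)
        total += math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
        # log of the product of theta_jk ** Y_jk over nucleotides with nonzero counts
        total += sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)
    return total

# Hypothetical motif of width w = 2 built from n_i = 10 sites.
Y = [[9, 0, 1, 0], [0, 10, 0, 0]]
theta = [[0.91, 0.03, 0.03, 0.03], [0.03, 0.91, 0.03, 0.03]]
loglik = log_product_multinomial(Y, theta)
```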
For our between-motif variability, we simply assume that the motif frequency
matrices Θi in our collection share a common but completely unknown distribution,
denoted F(·), ie.
Between-motif level: Frequency matrices p(Θi)
\[ \Theta_i = (\theta_{i1}, \ldots, \theta_{iw}) \sim F(\cdot) \]
where F(·) is an unknown distribution with w dimensions for the columns × 4
dimensions for the nucleotides (constrained to sum to one).
This unknown distribution F(·) represents the common structure between the
different motifs in the dataset. Estimation of this unknown distribution is com-
plicated by the fact that our frequency matrices Θi are unknown, with only the
count matrices Yi being observed.
A common prior for an unknown distribution F(·) is a Dirichlet process D(γ)
with characteristic smooth measure γ. Here, we have a multidimensional F(·),
so we use a Dirichlet process prior D(γ1 × · · · × γw) where each smooth measure
γj is four dimensional but constrained to sum to one. We take a uniform γj =
Dirichlet(α, . . . , α) for each smooth measure j = 1, . . . , w.
5.2 Clustering of Observations
An important consequence of our model is that it enables similar motifs to be
clustered together into groups with identical frequency matrices. Ferguson (1974)
states that if x1, . . . , xn are n observations, taking on K distinct values ζ1, . . . , ζK
drawn from F(·) with prior D(γ), then
\[ F(\cdot) \mid \zeta_1, \ldots, \zeta_K \sim D(\gamma^*) = D\left( \gamma + \sum_{i=1}^{K} \delta_{\zeta_i} \right) \]
So the distribution of F(·) conditional on K distinct observations is a mixture of
the smooth measure γ and K point masses. This point mass component allows
for the clustering of similar observations.
If we were to draw an additional (n + 1)-th observation x from this distribution
D(γ∗), that new observation would either come from the smooth measure γ, or
would take on a value exactly equal to one of the current ζi’s, say ζk, in which
case ζk and x are defined as being in the same cluster.
The conditional distribution p(ζi|ζ−i) of one current observation ζi, given all other
observations ζ−i, is also a mixture between the smooth measure and K point
masses at each of the ζ−i that represent the unique values within ζ−i. Any obser-
vations ζm and ζn in ζ−i that have the same value are defined as being in the same
cluster.
This conditional distribution allows us to implement our model via a Gibbs sam-
pling algorithm (Geman and Geman, 1984), which is a Markov Chain Monte
Carlo strategy for simulating unknown parameters (or sets of parameters) one
at a time by conditioning on the current values of all the other parameters.
Liu (1996) examines the use of Dirichlet processes as a prior in a binomial hi-
erarchical setting. Green and Richardson (2001) discuss the use of the Dirichlet
process as a flexible model for clustering observations, and present an extended
class of Dirichlet-Multinomial allocations for which the Dirichlet process is a limiting
case. Medvedovic and Sivaganesan (2002) use the clustering properties of
the Dirichlet process prior as part of a hierarchical model for gene expression
profiles from microarray data.
5.3 Gibbs Sampling Implementation
For our motif clustering model, the Gibbs sampler could intuitively be based on
p(Θi|Θ−i). However, since our Θi’s are actually unknown, a more efficient clus-
tering procedure involves drawing values of the clustering indicators directly,
without dealing with drawing a frequency matrix Θi for each motif i at each iter-
ation.
We denote our clustering indicators as zi where zi = k if Θi takes on the same
value as Θk (and hence is in the k-th cluster) or zi = 0 if Θi is drawn from the
smooth measure γ (and hence forms a new cluster).
We would like to sample directly from the conditional distribution of these clustering
indicators, i.e., we want to sample from
\[
p(z_i \mid z_{-i}, Y)
\]
where we again use the notation z−i or Θ−i to mean all the z or Θ parameters
except the i-th one.
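\[
\begin{aligned}
p(z_i \mid z_{-i}, Y) &= \int\!\!\int p(z_i, \Theta_i, \Theta_{-i} \mid z_{-i}, Y)\, d\Theta_i\, d\Theta_{-i} \\
&\propto \int\!\!\int p(Y_i \mid \Theta_i, \Theta_{-i}, z, Y_{-i}) \times p(\Theta_i \mid \Theta_{-i}, z, Y_{-i}) \\
&\qquad \times\, p(\Theta_{-i} \mid z, Y_{-i}) \times p(z_i \mid z_{-i}, Y_{-i})\, d\Theta_i\, d\Theta_{-i}
\end{aligned}
\]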
Now, as we mentioned previously, Yi|Θi follows a product multinomial distri-
bution independent of the other count matrices:
\[
p(Y_i \mid \Theta_i, z_i, \Theta_{-i}, z_{-i}, Y_{-i}) = p(Y_i \mid \Theta_i) = \Theta_i^{Y_i}
\]
where we again use the notation $\Theta^{Y}$ to mean $\prod_{j=1}^{w} \prod_{k=1}^{4} \theta_{jk}^{Y_{jk}}$.
Also, as we mentioned, the conditional distribution of Θi is a mixture of the
smooth measure γ and K clusters, indexed by zi,
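\[
\begin{aligned}
p(\Theta_i \mid z_i = 0) &= \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1} \\
p(\Theta_i \mid z_i = k, \Theta_{-i}, z_{-i}) &= \delta(\Theta_i = \Theta_k), \qquad k = 1, \ldots, K
\end{aligned}
\]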
We have the conditional prior distribution of zi, which for the Dirichlet process
prior is
\[
\begin{aligned}
p(z_i = 0 \mid z_{-i}) &= \frac{c}{c + n - 1} \\
p(z_i = l \mid z_{-i}) &= \frac{n_l}{c + n - 1}
\end{aligned}
\tag{5.1}
\]
where nl is the number of z’s in z−i which are equal to l. It is evident from
the prior probabilities (5.1) that the probability for joining a particular cluster
increases as the number of observations in that cluster increases, implying that
the Dirichlet process prior favors unequal allocations of observations.
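As an illustration only, the prior probabilities (5.1) can be sketched in a few lines of Python. The function name `dp_conditional_prior` and the convention that existing clusters carry nonzero integer labels (with 0 reserved for a new cluster) are assumptions made for this sketch, not part of the original implementation.

```python
from collections import Counter

def dp_conditional_prior(z_others, c):
    """Prior probabilities (5.1) for z_i given the other n - 1 indicators.

    z_others holds the (nonzero) cluster labels of all motifs except i.
    Key 0 in the result stands for opening a new cluster.
    """
    n = len(z_others) + 1          # total number of motifs, including motif i
    denom = c + n - 1
    probs = {0: c / denom}
    for label, n_l in Counter(z_others).items():
        probs[label] = n_l / denom  # larger clusters get more prior mass
    return probs

probs = dp_conditional_prior([1, 1, 1, 2], c=1.0)
```

With three motifs in cluster 1 and one in cluster 2, cluster 1 receives three times the prior mass of cluster 2, which is the unequal-allocation behaviour described in the text.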
Returning to our posterior probability calculation, if we first consider the case
zi = 0 (ie. forming a new cluster), then we have
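\[
\begin{aligned}
p(z_i = 0 \mid z_{-i}, Y) &\propto \int p(Y_i \mid \Theta_i)\, p(\Theta_i \mid z_i = 0)\, p(z_i = 0 \mid z_{-i})\, d\Theta_i \\
&\propto \int \Theta_i^{Y_i} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1}\, \frac{c}{c+n-1}\, d\Theta_i \\
&\propto \frac{c}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk} + \alpha)}{\Gamma\big(\sum_k Y_{ijk} + 4\alpha\big)} \cdot \frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}
\end{aligned}
\tag{5.2}
\]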
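For numerical work, the weight (5.2) is most safely evaluated on the log scale. The following is a minimal sketch under the assumption that a count matrix is stored as a list of w rows with four counts each; the function names are hypothetical and `math.lgamma` replaces the Gamma function to avoid overflow.

```python
import math

def log_dirichlet_marginal(counts, alpha):
    """log of prod_j [ G(4a)/G(a)^4 * prod_k G(Y_jk + a) / G(sum_k Y_jk + 4a) ]."""
    total = 0.0
    for row in counts:  # one row of four nucleotide counts per motif position
        total += math.lgamma(4 * alpha) - 4 * math.lgamma(alpha)
        total += sum(math.lgamma(y + alpha) for y in row)
        total -= math.lgamma(sum(row) + 4 * alpha)
    return total

def log_weight_new_cluster(counts_i, n, c, alpha):
    """Unnormalized log of (5.2): prior weight c / (c + n - 1) times the marginal."""
    return math.log(c) - math.log(c + n - 1) + log_dirichlet_marginal(counts_i, alpha)
```

As a sanity check, with α = 1 a single observed nucleotide at one position has marginal probability 1/4, and an empty count matrix has marginal probability 1.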
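Now, for the case where zi = l ≠ 0 (ie. joining an existing cluster that already has
a count matrix Yl), then we have
\[
\begin{aligned}
p(z_i = l \mid z_{-i}, Y) &\propto \int\!\!\int p(Y_i \mid \Theta_i)\, \delta(\Theta_i = \Theta_l)\, p(\Theta_{-i} \mid z_i = l, z_{-i}, Y_{-i}) \\
&\qquad \times\, p(z_i = l \mid z_{-i})\, d\Theta_i\, d\Theta_{-i} \\
&\propto \int p(Y_i \mid \Theta_l)\, p(\Theta_l \mid z, Y_{-i})\, p(z_i = l \mid z_{-i})\, d\Theta_l \\
&\propto \int \Theta_l^{Y_i}\, \frac{p(Y_l \mid \Theta_l, z)\, p(\Theta_l \mid z)}{p(Y_l \mid z)}\, \frac{n_l}{c+n-1}\, d\Theta_l \\
&\propto \frac{n_l}{c+n-1} \cdot \frac{\int \Theta_l^{Y_i + Y_l + \alpha - 1}\, d\Theta_l}{\int \Theta_l^{Y_l + \alpha - 1}\, d\Theta_l} \\
&\propto \frac{n_l}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk} + Y_{ljk} + \alpha)}{\Gamma\big(\sum_k (Y_{ijk} + Y_{ljk}) + 4\alpha\big)} \cdot \frac{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}{\prod_k \Gamma(Y_{ljk} + \alpha)}
\end{aligned}
\tag{5.3}
\]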
A complete iteration of our Gibbs sampling algorithm results in a complete sam-
ple z of our clustering indicators, which also represents a complete partition of
our motif matrices.
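One sweep of this sampler can be sketched as follows. This is only an illustrative reading of (5.2) and (5.3), not the original code: the function names, the nested-list layout of the count matrices, and the use of `None` (rather than 0) to mark a new cluster are all assumptions of the sketch.

```python
import math
import random

def log_gamma_ratio(counts, alpha):
    """sum_j [ sum_k lgamma(Y_jk + alpha) - lgamma(sum_k Y_jk + 4*alpha) ]."""
    return sum(
        sum(math.lgamma(y + alpha) for y in row) - math.lgamma(sum(row) + 4 * alpha)
        for row in counts
    )

def add_counts(a, b):
    """Element-wise sum of two w x 4 count matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def conditional_probs(i, z, Y, c, alpha):
    """Normalized p(z_i | z_-i, Y) from (5.2) and (5.3).

    The key None stands for a new cluster (z_i = 0 in the text).
    """
    pooled, sizes = {}, {}
    for m, label in enumerate(z):
        if m == i:
            continue
        pooled[label] = add_counts(pooled[label], Y[m]) if label in pooled else Y[m]
        sizes[label] = sizes.get(label, 0) + 1
    w = len(Y[i])
    log_norm = w * (math.lgamma(4 * alpha) - 4 * math.lgamma(alpha))
    # The common factor 1 / (c + n - 1) cancels on normalization.
    log_w = {None: math.log(c) + log_norm + log_gamma_ratio(Y[i], alpha)}
    for label, Y_l in pooled.items():
        log_w[label] = (math.log(sizes[label])
                        + log_gamma_ratio(add_counts(Y[i], Y_l), alpha)
                        - log_gamma_ratio(Y_l, alpha))
    top = max(log_w.values())  # subtract the maximum before exponentiating
    weights = {k: math.exp(v - top) for k, v in log_w.items()}
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def gibbs_sweep(z, Y, c, alpha, rng):
    """One full pass over all indicators; z (integer labels) is updated in place."""
    for i in range(len(z)):
        probs = conditional_probs(i, z, Y, c, alpha)
        labels = list(probs)
        choice = rng.choices(labels, weights=[probs[k] for k in labels])[0]
        z[i] = max(z) + 1 if choice is None else choice  # fresh label for a new cluster
    return z
```

A call such as `gibbs_sweep(z, Y, 1.0, 1.0, random.Random(0))` performs one iteration; repeating it and storing each resulting `z` gives the sequence of sampled partitions.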
5.4 Motif Alignment
An additional missing component of the analysis is the fact that we do not neces-
sarily know which “central motif” Yi of length w to use within the raw alignment
matrix of length ni > w for motif i. For example, if our clustering algorithm is
based on a fixed width of w = 6 and our i-th raw motif matrix Ni has 8 positions,
then we have three possible choices for our central motif: Yi = columns 1 to 6 of
Ni, Yi = columns 2 to 7 of Ni, or Yi = columns 3 to 8 of Ni.
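The candidate central motifs are simply the contiguous windows of width w in the raw matrix. A minimal sketch, assuming the raw matrix is stored as a list of per-position rows and using the hypothetical name `central_windows`:

```python
def central_windows(raw_matrix, w):
    """Return (start, window) pairs for every contiguous block of w positions."""
    n_i = len(raw_matrix)
    # There are n_i - w + 1 candidate windows; none if the raw motif is too short.
    return [(s, raw_matrix[s:s + w]) for s in range(n_i - w + 1)]

raw = [[p, 0, 0, 0] for p in range(8)]   # toy raw matrix with 8 positions
windows = central_windows(raw, 6)
```

For the example in the text (8 positions, w = 6) this yields the three windows starting at positions 1, 2 and 3 (indices 0, 1 and 2 in the zero-based code).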
Our hierarchical clustering model assumes that the Yi for each motif is known,
so we need an additional step where, for each raw data matrix, the best location
of the central motif Yi is drawn conditional on the other motifs Y−i and clustering
indicators z−i for the other motifs.
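\[
\begin{aligned}
p(Y_i \mid z_{-i}, Y_{-i}) &= \int p(Y_i \mid \Theta_i)\, p(\Theta_i \mid z_{-i}, Y_{-i})\, d\Theta_i \\
&= \int \Theta_i^{Y_i}\, \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \Theta_i^{\alpha-1}\, d\Theta_i \\
&\quad + \sum_{l=1}^{K} \int \Theta_i^{Y_i}\, \frac{n_l}{c+n-1}\, \delta(\Theta_i = \Theta_l)\, \frac{p(Y_l, \Theta_l \mid z_{-i})}{p(Y_l \mid z_{-i})}\, d\Theta_i \\
&= \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \int \Theta_i^{Y_i+\alpha-1}\, d\Theta_i
 + \sum_{l=1}^{K} \frac{n_l}{c+n-1}\, \frac{\int \Theta_i^{Y_i+Y_l+\alpha-1}\, d\Theta_i}{\int \Theta_i^{Y_l+\alpha-1}\, d\Theta_i} \\
&= \frac{c}{c+n-1} \left[\frac{\Gamma(4\alpha)}{\Gamma(\alpha)^4}\right]^{w} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk}+\alpha)}{\Gamma\big(\sum_k Y_{ijk} + 4\alpha\big)} \\
&\quad + \sum_{l=1}^{K} \frac{n_l}{c+n-1} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ijk}+Y_{ljk}+\alpha)}{\Gamma\big(\sum_k (Y_{ijk}+Y_{ljk}) + 4\alpha\big)} \cdot \frac{\Gamma\big(\sum_k Y_{ljk}+4\alpha\big)}{\prod_k \Gamma(Y_{ljk}+\alpha)}
\end{aligned}
\tag{5.4}
\]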
This alignment procedure is performed every tenth iteration of the marginal Gibbs
sampler described in the previous section.
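The alignment draw can be sketched by evaluating (5.4) for each candidate window and normalizing. This is an illustrative sketch only: the function names are hypothetical, and the clusters formed by the other motifs are assumed to be passed in as `(n_l, pooled_counts)` pairs.

```python
import math

def log_gamma_ratio(counts, alpha):
    """sum_j [ sum_k lgamma(Y_jk + alpha) - lgamma(sum_k Y_jk + 4*alpha) ]."""
    return sum(
        sum(math.lgamma(y + alpha) for y in row) - math.lgamma(sum(row) + 4 * alpha)
        for row in counts
    )

def add_counts(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def log_marginal_window(window, clusters, c, alpha):
    """log of (5.4) for one candidate central motif."""
    n = 1 + sum(n_l for n_l, _ in clusters)
    w = len(window)
    log_norm = w * (math.lgamma(4 * alpha) - 4 * math.lgamma(alpha))
    # First term of (5.4): the window starts its own cluster.
    terms = [math.log(c) + log_norm + log_gamma_ratio(window, alpha)]
    # Remaining terms: the window joins existing cluster l.
    for n_l, Y_l in clusters:
        terms.append(math.log(n_l)
                     + log_gamma_ratio(add_counts(window, Y_l), alpha)
                     - log_gamma_ratio(Y_l, alpha))
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms)) - math.log(c + n - 1)

def alignment_probs(raw, w, clusters, c, alpha):
    """Normalized probability of each window start within the raw matrix."""
    logs = [log_marginal_window(raw[s:s + w], clusters, c, alpha)
            for s in range(len(raw) - w + 1)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [v / total for v in weights]
```

A window start can then be drawn with probabilities `alignment_probs(...)`, for instance via `random.choices`.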
5.5 Clustering of Two-Block Motifs
As described in the motif discovery Chapters 2-4, our motif discovery strategy
may focus not only on single-block motifs but also on two-block motifs with a
variable-length gap. A natural follow-up question is how to use the two-block information
in our clustering model. We propose two different strategies for clustering
two-block motifs.
The first strategy is to separate the two-block motif into two independent single-block
motifs, and cluster these new single-block motifs together with any original
one-block motifs. The disadvantage of this strategy is that we are ignoring the
linkage between the two blocks, with the advantage being that we are able to
cluster both two-block and one-block motif results together.
The alternative strategy is to cluster the two-block motif as a single entity, while still
allowing separate alignment steps within each block. This strategy acknowledges
the inherent link between the two blocks, but does not allow us to cluster
the two-block motifs together with the one-block motif results.
We are interested in examining the similarities and differences in the clustering
results under these two different strategies. Utilizing both strategies gives us
additional power to detect motifs that are similar based on combinations of one
or two blocks.
This type of situation can occur in practice, such as in Table 4.5 where the two-
block motifs for TFs σE and σK seem to share similar second blocks but quite
different first blocks. The combination of these two strategies will play a critical
role when we combine these clustering methods with our motif discovery meth-
ods to predict co-regulated gene clusters in Chapter 7.
5.6 Advantages of our Clustering Model
This model has several advantages over the traditional clustering methods
briefly discussed in Chapter 1. First of all, our hierarchical framework lets us
account for uncertainty in the count matrices that represent each TF motif by
assuming a product multinomial distribution. Most clustering methods, such as
hierarchical tree clustering or K-means clustering, assume the count matrices
are fixed and known without error.
The second advantage is that our clustering strategy does not need to use any ad
hoc distance measures in order to compare motifs. At each iteration of the Gibbs
sampling algorithm, the decision to cluster a particular observation is determined
by the conditional distribution of zi given all other information (z−i,Y).
Thus, our distance metric is exactly equal to the conditional posterior distribution
under our full Bayesian clustering model, which is analogous to our motif discov-
ery strategy (Chapter 3) of using a scoring function based on the exact posterior
distribution instead of an ad hoc scoring function.
A third advantage is that our clustering model allows not only the clusters themselves
to vary (in terms of which motifs are members of which clusters) but also
the number of clusters. This is a key improvement over a clustering
technique that requires the number of clusters to be fixed (such as K-means
clustering) in this situation, since we have very little idea a priori about how many
motifs we might expect to be similar to each other.
Another advantage of our Bayesian formulation and stochastic implementation is
that it allows us to summarize the model with posterior sampling which gives us
an idea of the variability of our clustering results, whereas traditional clustering
methods typically give only a point estimate.
In Chapter 6 when we discuss strategies for analyzing our clustering results, we
will also focus primarily on point estimates, such as the posterior mode, but we
also will address the variability of our results.
In Chapters 2-4, we extended the usual motif discovery models to the case where
motif width w is allowed to vary. Although our clustering model assumes a
known motif width w, our additional motif alignment steps allow the central
matrix Yi within each raw data matrix Ni to vary, which effectively means that
our clustering results can be based on the “most-conserved” portions of each mo-
tif count matrix, regardless of the differences in width between each discovered
motif.
We also extended motif discovery models to motifs that consist of two conserved
blocks separated by a gap of variable length. Our strategy of combining two different
clustering procedures, independent-block and joint-block, allows information
to be shared between one-block and two-block motifs, while still acknowledging
the natural linkage between each block of a two-block motif.
5.7 Comparison with Other Clustering Priors
As mentioned in Section 5.3, the Dirichlet Process prior favors unequal allocation
of observations, meaning that each new observation has a greater prior probabil-
ity of being placed in a cluster that already has many observations. If we already
have n observations divided into L clusters z with n1, . . . , nL members, then
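\[
\begin{aligned}
P(z_{n+1} = l \mid z, \mathrm{DP}) &= \frac{n_l}{c + n}, \qquad l = 1, \ldots, L \\
P(z_{n+1} = L + 1 \mid z, \mathrm{DP}) &= \frac{c}{c + n}
\end{aligned}
\tag{5.5}
\]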
An alternative is a uniform clustering prior which favors equal allocations of ob-
servations ie. the prior probability that a new observation is placed in any one of
the existing clusters is uniform,
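\[
\begin{aligned}
P(z_{n+1} = l \mid z, \mathrm{Unif}) &= \frac{1}{c + L}, \qquad l = 1, \ldots, L \\
P(z_{n+1} = L + 1 \mid z, \mathrm{Unif}) &= \frac{c}{c + L}
\end{aligned}
\tag{5.6}
\]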
In fact, we can consider both the Dirichlet process and uniform clustering specifi-
cations as particular cases of a more general clustering prior distribution, where
\[
\begin{aligned}
P(z_{n+1} = l \mid z) &\propto f(n_l), \qquad l = 1, \ldots, L \\
P(z_{n+1} = L + 1 \mid z) &\propto c
\end{aligned}
\tag{5.7}
\]
This general clustering model reduces to the Dirichlet process when f(nl) = nl
and the uniform clustering prior when f(nl) = 1, but more general functions may
be desirable in particular situations.
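The general specification (5.7) is easy to express with the weight function f passed in as an argument. The sketch below uses a hypothetical function name and returns the probabilities for the L existing clusters followed by the probability of a new cluster.

```python
def general_prior_probs(sizes, c, f):
    """Normalized version of (5.7): existing clusters first, new cluster last."""
    weights = [f(n_l) for n_l in sizes] + [c]
    total = sum(weights)
    return [wt / total for wt in weights]

sizes = [3, 1]                                            # n = 4 observations in L = 2 clusters
dp = general_prior_probs(sizes, 1.0, lambda n_l: n_l)     # f(n_l) = n_l gives (5.5)
unif = general_prior_probs(sizes, 1.0, lambda n_l: 1)     # f(n_l) = 1 gives (5.6)
```

With c = 1 the Dirichlet process choice gives probabilities 3/5, 1/5 and 1/5, while the uniform choice gives 1/3 for each of the three options.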
The prior density of a partition z under either our Dirichlet process or uniform
clustering model can be calculated recursively using either formulas (5.5) or (5.6).
For the Dirichlet process, for the first cluster with members (z1, z2, . . . , zn1), we
have
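\[
\begin{aligned}
p(z_1, \ldots, z_{n_1}) &= p(z_1)\, p(z_2 \mid z_1) \cdots p(z_{n_1} \mid z_1, \ldots, z_{n_1 - 1}) \\
&= \frac{c}{c} \cdot \frac{1}{c+1} \cdot \frac{2}{c+2} \cdots \frac{n_1 - 1}{c + n_1 - 1} = \frac{c \cdot (n_1 - 1)!}{\prod_{i=1}^{n_1} (c + i - 1)}
\end{aligned}
\]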
Continuing this process through all L clusters in the partition z, we finally have
\[
p(z \mid \mathrm{DP}) = \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)}
\tag{5.8}
\]
It is worth noting that the prior density (5.8) under the Dirichlet process model
does not depend on the ordering of our recursive calculations, i.e., the ordering
in which our conditional probabilities were calculated. This means that different
partitions with the same cluster sizes are exchangeable under the Dirichlet process
model.
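This invariance can be checked numerically. The sketch below, with hypothetical function names, computes the log of (5.8) in closed form and also by multiplying the sequential probabilities (5.5) in the order the observations are listed; the two agree for any ordering.

```python
import math
from collections import Counter

def log_prior_dp(z, c):
    """Closed-form log of (5.8); lgamma(n_l) equals log((n_l - 1)!)."""
    sizes = Counter(z).values()
    n = len(z)
    log_num = len(sizes) * math.log(c) + sum(math.lgamma(n_l) for n_l in sizes)
    log_den = sum(math.log(c + i) for i in range(n))  # c + i - 1 for i = 1..n
    return log_num - log_den

def log_prior_dp_sequential(z, c):
    """Product of the conditional probabilities (5.5), taken in the order given."""
    seen = Counter()
    total = 0.0
    for i, label in enumerate(z):
        numerator = seen[label] if seen[label] > 0 else c  # join vs. new cluster
        total += math.log(numerator) - math.log(c + i)
        seen[label] += 1
    return total

closed = log_prior_dp([1, 1, 2, 1, 3, 2], 0.7)
ordered = log_prior_dp_sequential([1, 1, 2, 1, 3, 2], 0.7)
shuffled = log_prior_dp_sequential([2, 1, 1, 3, 1, 2], 0.7)
```

All three values coincide up to floating-point error, since only the cluster sizes (3, 2, 1) enter the result.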
For the uniform clustering model, starting from the first cluster with members
(z1, z2, . . . , zn1), we have
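\[
\begin{aligned}
p(z_1, \ldots, z_{n_1}) &= p(z_1)\, p(z_2 \mid z_1) \cdots p(z_{n_1} \mid z_1, \ldots, z_{n_1 - 1}) \\
&= \frac{c}{c} \cdot \frac{1}{c+1} \cdot \frac{1}{c+1} \cdots \frac{1}{c+1} = \frac{c}{c\,(c+1)^{n_1 - 1}}
\end{aligned}
\]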
Continuing these recursive calculations through all L clusters in the partition z,
we have
\[
p(z \mid \mathrm{Unif}) = \frac{c^{L-1} \cdot (c + L)}{\prod_{l=1}^{L} (c + l)^{n_l}}
\tag{5.9}
\]
Examining the prior density (5.9), we see that the denominator does depend on
the ordering in which our recursive calculations were performed. Thus, we will
get different values of (5.9) for different orderings of unequally-sized clusters,
which should actually be exchangeable.
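A small numerical check makes the order dependence concrete. In this sketch (function names are hypothetical) the observations are assumed to be listed cluster by cluster, as in the recursion in the text, and the sequential product of (5.6) is compared with the closed form (5.9).

```python
import math

def log_prior_uniform_sequential(z, c):
    """Product of the conditional probabilities (5.6), in the order given."""
    seen = set()
    total = 0.0
    for label in z:
        L = len(seen)                      # number of clusters opened so far
        if label in seen:
            total += -math.log(c + L)      # join an existing cluster
        else:
            total += math.log(c) - math.log(c + L)  # open a new cluster
            seen.add(label)
    return total

def log_prior_uniform_closed(sizes_in_order, c):
    """Closed form (5.9); sizes are listed in the order the clusters were filled."""
    L = len(sizes_in_order)
    log_den = sum(n_l * math.log(c + l) for l, n_l in enumerate(sizes_in_order, start=1))
    return (L - 1) * math.log(c) + math.log(c + L) - log_den

big_first = math.exp(log_prior_uniform_sequential([1, 1, 1, 2], 1.0))
small_first = math.exp(log_prior_uniform_sequential([2, 1, 1, 1], 1.0))
```

With c = 1, filling the cluster of size 3 first gives 1/8, whereas filling the singleton first gives 1/18, even though both partitions have cluster sizes 3 and 1.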
As suggested by Green and Richardson (2001), to ensure exchangeability of our
uniform clustering model, we make our prior density p(z|Unif) a function of a
“signature” of the partition that is identical for exchangeable partitions. For example,
if we let p(z|Unif) = k · p(z′|Unif) where z′ is z with the zi’s arranged in order
from the largest cluster to the smallest, then the calculation of (5.9) for z′ will be
the same for all exchangeable values of z. All of these complications are avoided
in the Dirichlet process model, which automatically gives the same prior density
value for exchangeable partitions.
We can also compare the behaviour of these two clustering prior specifications
with a simple simulation study, where 1000 complete partitions z = (z1, . . . , zn)
with n = 1000 and c = 1 were generated under both sets of probabilities (5.5)
and (5.6) above. In Figure 5.1, we see the distributions of both the number of
clusters as well as the size of the multiple-member (nl > 1) clusters over all of our
simulated partitions.
[Figure 5.1 panels (histograms, vertical axis: Frequency): Number of Clusters, DP; Number of Clusters, Uniform; Size of Clusters, DP; Size of Clusters, Uniform.]
Figure 5.1: Comparison of clustering statistics between DP and Uniform priors
As expected, the number of clusters (with multiple members) is much larger under
the uniform prior, and some clusters generated from the Dirichlet process are
much larger than any generated from the uniform prior specification.
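A prior simulation of this kind can be sketched as follows. The function name, the seed, and the smaller n used here are assumptions chosen for illustration; the sketch does not attempt to reproduce the exact histograms of Figure 5.1.

```python
import random

def simulate_partition(n, c, f, rng):
    """Allocate n observations sequentially using weights f(n_l) and c, as in (5.7).

    Returns the list of cluster sizes of the simulated partition.
    """
    sizes = []
    for _ in range(n):
        weights = [f(n_l) for n_l in sizes] + [c]   # last entry: open a new cluster
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes

rng = random.Random(1)
dp_sizes = simulate_partition(500, 1.0, lambda n_l: n_l, rng)    # Dirichlet process
unif_sizes = simulate_partition(500, 1.0, lambda n_l: 1, rng)    # uniform prior
```

Repeating the two calls many times and tabulating `len(sizes)` and the individual cluster sizes gives summaries comparable in spirit to the panels of Figure 5.1.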
Chapter 6
Analyzing Motif Clustering Results
In this chapter, we will apply our clustering model to a dataset of 116 different
transcription factors and discuss different strategies for visualizing and analyz-
ing our clustering results. The raw data was provided (C. Lawrence, personal
communication) in the form of 116 nucleotide-count matrices that differed substantially
in appearance, number of counts, and motif width. Between different
motifs, the number of counts varies from fewer than 10 to 185.
The motifs were generally short, with an average motif width of approximately
11 bps, so only a one-block clustering strategy was used. We focus on a central
motif of width 8, though we will also examine the clustering for a central motif
of width 6 and 10. In Chapter 7, we present an application that combines both
one-block and two-block motif clustering in combination with motif discovery.
For this 116 motif dataset, we also have the extra information that the TF for
each motif has been classified a priori into a particular “protein family” based on
the common physical structure of their DNA-binding domains. For example, one
family of transcription factors is the helix-loop-helix family, which has two DNA-
binding helix domains that bind directly to the DNA strand and are joined by a
loop domain. Table 6.1 shows each protein family in the dataset, along with the
number of motifs, the average number of sites (|A|) and the average width (w).
Table 6.1: Protein Families in Dataset
Family         Number   Average |A|   Average w
TEA            1        12            12
MADS           5        64            11
TATA-BOX       1        54            16
RUNT           1        38            9
bHLH           6        31            10
FORKHEAD       8        25            13
NUCLEAR        16       26            13
HOMEO-ZIP      1        25            8
T-BOX          1        40            11
ZN-FINGER      24       28            9
PAIRED         3        29            14
bZIP           9        26            10
REL            6        19            10
ETS            7        31            8
HOMEO          6        36            9
TRP-CLUSTER    5        36            11
HMG            6        31            10
bHLH-ZIP       4        25            8
CAAT-BOX       1        116           16
PAIRED-HOMEO   1        21            30
IPT/TIG        1        10            16
P53            1        17            20
AP2            1        185           9
UNKNOWN        1        10            8
We will use this extra “protein family” information in order to validate the results
produced by our clustering model. For our Dirichlet process prior described in
Sections 5.1-5.3 above, we chose prior parameters α and prior weight c to both be
equal to 1.
As described in Section 5.3, our Bayesian hierarchical clustering model was im-
plemented using a Gibbs sampling algorithm. Each iteration of the Gibbs sam-
pler produces two vectors, one giving the alignment of each central matrix Yi
within the raw motif matrix Ni for each motif, and the other giving the clustering
indicator zi for each motif.
Since our clustering model is implemented by using a Markov-chain Monte-
Carlo algorithm, it is important to evaluate whether or not our Gibbs sampling
iterations have converged to our desired posterior distribution.
Following the recommendation of Gelman and Rubin (1992), we started separate
chains of our Gibbs sampling algorithm from several different starting points ie.
different initial partitions $z^0$. Examples of our starting partitions are the “each-in-own”
partition: $z^0_i = i$ for i = 1, . . . , n, the “all-in-one” partition: $z^0_i = 1$ for
i = 1, . . . , n, and a “random” partition, where the clustering indicators were randomly
drawn from a discrete uniform distribution.
Each Gibbs sampling chain was run for 500 iterations. The within-chain versus
between-chain variance measure R (Gelman and Rubin, 1992) was calculated for
two functions of the clusters: the average cluster size and the number of clusters.
R was less than 1.1 for both of these quantities after 500 iterations, leading us to
conclude that our MCMC algorithm had converged.
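For reference, a basic version of this diagnostic can be sketched as below. It uses the common simplified form of the potential scale reduction (pooled variance estimate over within-chain variance) without the degrees-of-freedom correction of the original paper, so it should be read as an approximation; the function name is hypothetical.

```python
import math
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Simplified Gelman-Rubin statistic for one scalar summary.

    chains: list of equal-length lists, one per Gibbs chain, e.g. the number
    of clusters recorded at each iteration. Requires nonzero within-chain variance.
    """
    n = len(chains[0])
    chain_means = [mean(ch) for ch in chains]
    W = mean([variance(ch) for ch in chains])      # within-chain variance
    B = n * variance(chain_means)                  # between-chain variance
    var_plus = (n - 1) / n * W + B / n             # pooled variance estimate
    return math.sqrt(var_plus / W)

r_mixed = potential_scale_reduction([[1, 2, 3, 4], [1, 2, 3, 4]])
r_apart = potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]])
```

Chains that explore the same region give a value near (or slightly below) 1, while chains stuck in different regions give a value far above the 1.1 threshold used in the text.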
We now discuss several strategies for analyzing the results from our Gibbs sam-
pling implementation.
6.1 Clustering Trees
An intuitive means for examining our overall clustering results is the posterior
probability pij that a particular pair of motifs i and j are in the same cluster. The
value of pij for any two motifs i and j can be estimated by the proportion of
iterations that have motifs i and j in the same cluster. This quantity is a Monte
Carlo estimate of the posterior mean of the indicator variable for motifs i and j
being in the same cluster.
Based on these pairwise clustering probabilities pij, a pairwise distance measure
can be calculated between each pair of motifs in the dataset, dij = 1 − pij. The
distance matrix for an entire dataset can then be analyzed by a single-linkage,
average-linkage or complete-linkage hierarchical tree algorithm. The result of
this procedure is a tree structure, which visualizes the clustering pattern for the
entire dataset.
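The pairwise probabilities and distances are straightforward to compute from the stored partitions. A minimal sketch, assuming each sampled partition is a list of cluster labels (one per motif) and using hypothetical function names:

```python
def coclustering_probabilities(samples):
    """Estimate p_ij as the fraction of sampled partitions with z_i == z_j."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for z in samples:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:
                    counts[i][j] += 1
    return [[v / len(samples) for v in row] for row in counts]

def distance_matrix(p):
    """Pairwise distances d_ij = 1 - p_ij."""
    return [[1 - v for v in row] for row in p]

samples = [[1, 1, 2], [1, 2, 2], [1, 1, 1], [3, 3, 4]]
p = coclustering_probabilities(samples)
d = distance_matrix(p)
```

The matrix `d` can then be handed to any standard single-, average- or complete-linkage routine to build the tree; only the labels' equality pattern matters, so relabeling clusters between iterations has no effect.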
The clustering tree, based on the average-linkage hierarchical tree algorithm for
our dataset with central motif width of 8 bps is given in Figure 6.1. With the
restriction of an 8 bp central motif width, our dataset was reduced to 90 valid
motifs (ie. motifs with width greater than or equal to 8 bps).
The motifs are labeled by both the motif “name”, which is of the form MAxxxx
where xxxx is a number, as well as the protein family to which that motif belongs.
Any motifs that did not cluster with any other motifs are not shown. The length of
each tree “branch” shared by a group of motifs is proportional to the probability
that the group of motifs are in the same cluster.
We can see several interesting relationships from this clustering tree. There are
several groups of motifs that always cluster together but do not cluster with any
other motifs, such as the (ETS-MA0062, ETS-MA0028, ETS-MA0076) group in the
middle of the tree. The clustering tree also allows us to see weaker, more variable
clustering relationships between motifs, such as bZIP-MA0102 in the middle of
the tree, which has a low but non-zero probability of being grouped with the
much tighter pair of motifs bZIP-MA0025 and bZIP-MA0043.
[Figure 6.1: average-linkage clustering tree (dendrogram); each leaf is labeled with the protein family and motif name, e.g. NUCLEAR-MA0074.]
Figure 6.1: Clustering tree for dataset based on a motif width of 8 bps
Looking at the protein family information, it is clear that most of the high-probab-
ility clusters of motifs all belong to the same family, providing a strong indica-
tion that TFs in the same protein family can have very similar motifs. There
are interesting exceptions, such as HMG-MA0084, which has a high probability of
clustering with ZN-FINGER-MA0010 and ZN-FINGER-MA0012. Also, it seems
that ZN-FINGER-MA0013 has a moderately high probability of clustering with
the FORKHEAD cluster consisting of motifs MA0030-MA0033. Finally, these two
larger groupings, both shown on the right in Figure 6.1 have a low probability of
clustering with each other, as indicated by the short common branch at the top of
the figure. These relationships may merit further examination to see if there is a
biologically significant reason behind the similarity of these groups of motifs.
6.2 Best Clustering Partition
Although it allows us to examine the clustering structure of the entire dataset,
the clustering tree is not ideal for deducing the “best partition” or best set of
clusters in the dataset, since the clustering tree represents a posterior mean across
many different partitions. This is the same problem that was mentioned for the
technique of hierarchical tree clustering in Chapter 1. One could “cut the tree” at
any number of different threshold distances and thereby produce any number of
possible partitions, but a less arbitrary alternative is to take our best estimate of
the posterior mode of our clusters.
We estimate this posterior mode by calculating the posterior value of the partition
z at the end of each iteration of our sampler, and retaining the partition z with the
highest posterior value as our best estimate of the mode.
The posterior value p(z|Y) of z is calculated as the product of the likelihood value
p(Y|z) and the prior value p(z|α). If our partition z has L clusters, each with nl
members and count matrix Yl (the sum of all w × 4 count matrices in cluster l),
then the likelihood value is
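\[
p(Y \mid z) \propto \prod_{l=1}^{L} \int p(Y_l \mid \Theta_l)\, p(\Theta_l \mid z)\, d\Theta_l
\propto \prod_{l=1}^{L} \int \Theta_l^{Y_l}\, \Theta_l^{\alpha-1}\, d\Theta_l
\propto \prod_{l=1}^{L} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ljk} + \alpha)}{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}
\]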
The prior value of a partition z (conditional on our Dirichlet process prior with
measure α and prior weight c) was calculated in Section 5.7 to be
\[
p(z \mid \alpha) = \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)}
\]
So, our posterior value for a particular partition z with L clusters, each with nl
members and count matrices Yl, is
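\[
p(z \mid Y) \propto \prod_{l=1}^{L} \prod_{j=1}^{w} \frac{\prod_k \Gamma(Y_{ljk} + \alpha)}{\Gamma\big(\sum_k Y_{ljk} + 4\alpha\big)}
\times \frac{c^{L} \cdot \prod_{l=1}^{L} (n_l - 1)!}{\prod_{i=1}^{n} (c + i - 1)}
\tag{6.1}
\]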
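The mode search can be sketched by scoring every sampled partition with the log of (6.1) and keeping the maximum. The sketch follows (6.1) as printed, so the per-cluster Dirichlet normalizing constant that appears in (5.2) is not included; the function names and the nested-list layout of the count matrices are assumptions.

```python
import math
from collections import defaultdict

def log_posterior_partition(z, Y, c, alpha):
    """Unnormalized log posterior (6.1) of a partition z of the count matrices Y."""
    n = len(z)
    members = defaultdict(list)
    for i, label in enumerate(z):
        members[label].append(i)
    # Prior part: L*log(c) - sum_i log(c + i - 1); the (n_l - 1)! terms are added below.
    log_post = len(members) * math.log(c) - sum(math.log(c + i) for i in range(n))
    for idx in members.values():
        log_post += math.lgamma(len(idx))            # log((n_l - 1)!)
        for j in range(len(Y[idx[0]])):
            pooled = [sum(Y[i][j][k] for i in idx) for k in range(4)]  # row j of Y_l
            log_post += sum(math.lgamma(y + alpha) for y in pooled)
            log_post -= math.lgamma(sum(pooled) + 4 * alpha)
    return log_post

def best_partition(samples, Y, c, alpha):
    """Return the sampled partition with the highest value of (6.1)."""
    return max(samples, key=lambda z: log_posterior_partition(z, Y, c, alpha))
```

In practice the score would be evaluated at the end of each Gibbs iteration and only the current best partition retained, rather than storing all samples.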
For our dataset, the partition z with the highest posterior value consisted of 16
multiple-member clusters containing 42 out of 90 total motifs. These 16 clusters
are listed in Table 6.2, along with the cluster size, cluster strength, total number
of sites in the cluster (|A|) and the consensus sequence for the cluster. The clus-
ter strength statistic will be explained in Section 6.3. The consensus sequence is
a representation of the total count matrix for the cluster, giving the nucleotide
with the highest count in each position. A nucleotide is only capitalized if its
nucleotide frequency is greater than 0.75 in that position.
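The consensus rule can be written compactly. In this sketch the column order A, C, G, T and the function name are assumptions, and ties are broken in favour of the first nucleotide in that order.

```python
def consensus_sequence(counts, threshold=0.75, alphabet="ACGT"):
    """Most frequent nucleotide per position; upper case only if frequency > threshold."""
    letters = []
    for row in counts:
        total = sum(row)
        k = max(range(len(row)), key=lambda idx: row[idx])  # first maximum on ties
        freq = row[k] / total if total > 0 else 0.0
        letter = alphabet[k]
        letters.append(letter if freq > threshold else letter.lower())
    return "".join(letters)

example = consensus_sequence([[8, 1, 1, 0], [2, 5, 2, 1], [0, 0, 0, 10], [3, 3, 2, 2]])
```

Here the first and third positions exceed the 0.75 threshold and are capitalized, while the second and fourth are shown in lower case.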
Also given in Table 6.2 are the motifs contained in each cluster, and the proportion
of each protein family present in that cluster. As suggested by the clustering tree
in Section 6.1, most of our “best” clusters contain motifs from within a single TF
protein family. Three exceptions are: cluster 2 which is mostly FORKHEAD mo-
tifs but also contains a ZN-FINGER motif MA0013, cluster 7 which contains two
ZN-FINGER motifs and one HMG motif MA0084, and cluster 16, which contains
a HOMEO motif and a HMG motif.
Table 6.2: Best partition of clusters for dataset
Clus  Size  Strength  |A|  Consensus  Families            Motifs
1     5     187.7     145  gTAGGTCA   NUCLEAR (5/5)       MA0066 MA0071 MA0072 MA0117 MA0111
2     5     140.9     93   gTAAACAa   FORKHEAD (4/5)      MA0030 MA0033 MA0032 MA0031
                                      ZN-FINGER (1/5)     MA0013
3     3     72.0      44   GgaTTTCC   REL (3/3)           MA0023 MA0101 MA0107
4     3     70.9      55   aCCGGAAg   ETS (3/3)           MA0028 MA0062 MA0076
5     2     46.0      32   AAgcGAAA   TRP-CLUSTER (2/2)   MA0050 MA0051
6     2     45.6      48   taaGaACa   NUCLEAR (2/2)       MA0007 MA0109
7     3     45.6      49   taaACAAt   ZN-FINGER (2/3)     MA0010 MA0012
                                      HMG (1/3)           MA0084
8     2     43.6      38   acCACGTG   bHLH-ZIP (2/2)      MA0058 MA0059
9     2     38.6      49   TGTTTaTt   FORKHEAD (2/2)      MA0042 MA0040
10    3     38.1      59   TTacGtAA   bZIP (3/3)          MA0025 MA0043 MA0102
11    2     37.8      20   cGaGTTCA   NUCLEAR (2/2)       MA0074 MA0116
12    2     34.8      64   TgTTtgtT   FORKHEAD (2/2)      MA0041 MA0047
13    2     30.7      56   GGGgatTc   REL (2/2)           MA0061 MA0105
14    2     27.4      49   gTGACGTG   bZIP (2/2)          MA0018 MA0097
15    2     20.2      70   CAGCTGcg   bHLH (2/2)          MA0055 MA0048
16    2     13.2      23   gTtGTact   HOMEO (1/2)         MA0027
                                      HMG (1/2)           MA0044
Although this best partition has reduced our dataset to a list of interesting clusters,
we have lost information about the variability of these clusters by focusing
on a point estimate. The first two clusters mentioned as exceptions in the previ-
ous paragraph are also discussed in Section 6.1, but with the additional informa-
tion that MA0084 is very strongly linked to the other members of its cluster while
MA0013 seems to have a somewhat lower probability of being included in its
cluster. In the next section, we discuss characteristics that allow us to summarize
some of the variability present within our best partition.
6.3 Strength of Clusters
We can also examine cluster-level and observation-level clustering characteristics
within this best partition of our motif matrices. We can measure the strength of
each cluster by calculating the Bayes factor (Kass and Raftery, 1995) for the current
cluster l, with members z = (z1, z2, . . . , znl), versus each member of the cluster
forming its own cluster,
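\[
\begin{aligned}
\text{Strength}(\text{Cluster } l) &= \log\left[ \frac{P(z \text{ all same} \mid Y)}{P(z \text{ all different} \mid Y)} \right] \\
&= \log\left[ \frac{P(Y \mid z \text{ all same})}{P(Y \mid z \text{ all different})} \times \frac{P(z \text{ all same})}{P(z \text{ all different})} \right]
\end{aligned}
\]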
For a cluster of motifs (Y1, . . . ,Ym) and clustering indicators z = (z1, . . . , zm),
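\[
\begin{aligned}
\text{Strength} &= \log\left[ \frac{\int \Theta^{Y+\alpha-1}\, d\Theta}{\prod_{i=1}^{m} \int \Theta_i^{Y_i+\alpha-1}\, d\Theta_i} \times \frac{(m-1)!}{c^{m-1}} \right] \\
&= \log\left[ \frac{\prod_{j=1}^{w} \dfrac{\prod_k \Gamma(Y_{jk}+\alpha)}{\Gamma\big(\sum_k Y_{jk}+4\alpha\big)}}{\prod_{i=1}^{m} \prod_{j=1}^{w} \dfrac{\prod_k \Gamma(Y_{ijk}+\alpha)}{\Gamma\big(\sum_k Y_{ijk}+4\alpha\big)}} \times \frac{(m-1)!}{c^{m-1}} \right]
\end{aligned}
\]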
where Y and Θ again denote the count and frequency matrices for the entire
cluster together.
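The strength statistic can be sketched directly from the last expression. The sketch follows the formula as printed (without the Dirichlet normalizing constants) and uses hypothetical function names, with each motif stored as a list of w rows of four counts.

```python
import math

def log_gamma_ratio(counts, alpha):
    """sum_j [ sum_k lgamma(Y_jk + alpha) - lgamma(sum_k Y_jk + 4*alpha) ]."""
    return sum(
        sum(math.lgamma(y + alpha) for y in row) - math.lgamma(sum(row) + 4 * alpha)
        for row in counts
    )

def cluster_strength(motif_counts, c, alpha):
    """Log Bayes factor of 'all in one cluster' versus 'each in its own cluster'."""
    m = len(motif_counts)
    w = len(motif_counts[0])
    # Pooled count matrix Y for the whole cluster.
    pooled = [[sum(Y[j][k] for Y in motif_counts) for k in range(4)] for j in range(w)]
    log_like_ratio = log_gamma_ratio(pooled, alpha) - sum(
        log_gamma_ratio(Y, alpha) for Y in motif_counts
    )
    log_prior_ratio = math.lgamma(m) - (m - 1) * math.log(c)  # log[(m-1)! / c^(m-1)]
    return log_like_ratio + log_prior_ratio

similar = cluster_strength([[[10, 0, 0, 0]], [[10, 0, 0, 0]]], 1.0, 1.0)
dissimilar = cluster_strength([[[10, 0, 0, 0]], [[0, 0, 0, 10]]], 1.0, 1.0)
```

Two motifs with the same conserved nucleotide give a positive strength, while two motifs conserved at different nucleotides give a negative one; a cluster with a single member has strength 0 by construction.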
The clusters within our best partition can then be ranked by this measure of clus-
ter strength, giving us an extra measure of confidence/uncertainty about infer-
ence based upon a specific cluster. In Table 6.2 above, the 16 clusters from our
best partition are ranked from strongest to weakest. It is clear from the table that
this measure of cluster strength is quite dependent upon the size of the cluster:
larger clusters tend to have a higher value of cluster strength.
We can also measure clustering strength at the level of individual motifs within
our best partition by calculating, for each motif, the posterior probability that it
should belong to that cluster, as opposed to any of the other existing clusters or
being its own cluster. For each motif i, this posterior probability p(zi|z−i,Y) is
the same calculation that is performed during each iteration of our Gibbs sam-
pling algorithm, but in this case we are conditioning on the best partition i.e.,
p(zi|z−i,Y).
For most of the motifs in Table 6.2, the individual clustering probabilities are very
close to 1. The two motifs MA0084 and MA0013 mentioned in the previous section
have individual clustering probabilities of 1.000 and 0.983 respectively, indicating
some variability but otherwise an overwhelming tendency to be in their assigned
cluster.
There are also a few motifs that show a much higher variability for being in their
particular cluster. MA0025 and MA0102 have probabilities of 0.575 and 0.071 of
being in Cluster 10, while MA0027 and MA0044 both have probabilities of 0.240
of being in Cluster 16.
Given that both motifs MA0027 and MA0044 have low individual clustering prob-
abilities, it is not surprising that Cluster 16 is also the weakest cluster on our
Cluster Strength measure. In many large clustering datasets, such as the ones we
will encounter in Chapter 7, it may be advisable to eliminate these weaker motifs
from the best partition.
6.4 Examining Particular Clusters in Detail
We can examine our best partition in detail by looking at the sequence logos for
individual clusters. Figure 6.2 shows the sequence logos for cluster 1 (containing
only NUCLEAR motifs) and cluster 2 (containing mostly FORKHEAD motifs)
from Table 6.2 along with the sequence logos across the entire NUCLEAR and
FORKHEAD families.
[Figure 6.2: four sequence-logo panels over positions 1 to 8 (vertical axis from 0 to 2): Cluster 1; Entire NUCLEAR family; Cluster 2; Entire FORKHEAD family.]
Figure 6.2: Sequence logos for clusters 1 and 2, with families
Not surprisingly, clusters 1 and 2 show much higher motif conservation than
the motifs that represent the entire NUCLEAR and FORKHEAD families, respectively.
The application of our clustering model has allowed us to identify
highly-conserved subgroups within several of the protein families present in our
dataset. This subgroup information can be used to further improve motif discov-
ery for additional motifs belonging to this protein family, since motifs based on
these clustered subgroups should be easier to detect in large sequence databases
than the weaker motif based on the entire protein family.
6.5 Effect of Prior Specification on Clustering Results
In Section 5.7, we observed dramatic differences in terms of number of clusters
and average cluster size between partitions z generated directly from the Dirich-
let process prior compared to the Uniform clustering prior. Even though the two
priors seem quite different, we should also examine the posterior clustering re-
sults for the TF motif data between models with the Dirichlet Process prior and
the Uniform prior.
The distribution (over all partitions produced by the Gibbs sampler) of the num-
ber of multiple-member clusters and the average size of these clusters is given in
Figure 6.3.
[Figure 6.3 panels (histograms, vertical axis: Frequency): Number of Clusters, DP; Number of Clusters, Unif; Average Cluster Size, DP; Average Cluster Size, Unif.]
Figure 6.3: Clustering statistics between Uniform and DP models
The uniform clustering model tends to produce somewhat larger numbers of
clusters with a somewhat smaller average cluster size. However, although some
difference is evident between the two models in terms of these cluster characteristics,
the differences are not nearly as dramatic as those seen in the prior simulations
in Section 5.7.
We also examined the differences between our Dirichlet process and uniform
clustering models in terms of the clustering trees (Section 6.1) and best parti-
tions (Section 6.2). Figure 6.4 gives the clustering trees produced under both the
Dirichlet process and uniform clustering models.
The clustering trees in Figure 6.4 are nearly identical except for a couple of arbitrary
differences in the ordering of the branches. The best partitions found with the
Uniform and Dirichlet process models are identical both in terms of the clusters
themselves as well as their ranking by strength.
[Figure 6.4: two clustering-tree panels, titled "Clustering Tree, DP" and "Clustering Tree, Unif"; each leaf is labeled with the protein family and motif name.]
Figure 6.4: Comparison of clustering trees between Uniform and DP models
95
The Dirichlet process prior and Uniform prior give dramatically different clus-
tering results based upon prior simulation alone, but show very slight differ-
ences in the posterior clustering results of our TF motif application. However,
other datasets may show a larger influence of the prior specification on the pos-
terior clustering results. Green and Richardson (2001) demonstrate with several
datasets that the unequal allocations favored by the Dirichlet process priors can
persist in the posterior distribution.
6.6 Effect of w on Clustering Results
A natural question that arises with our clustering model is whether the clustering
results would be dramatically different if the width of the central motif w were
different. Some of the differences resulting from using different motif widths
should be negated by the motif alignment steps (Section 5.4), which can vary the
central motif within each raw motif matrix, but some effects of using different
motif widths will still persist.
In order to examine the effect of motif width, our model was used to cluster
our 116-motif dataset using widths of 6 bps and 10 bps, in addition to the 8-bp
model studied thus far. Any motifs that were shorter than the central motif
width were excluded from the clustering procedure. Using a motif width of more
than 10 bps would exclude a substantial portion of our 116 motifs, as can be
seen from Figure 6.5, which gives the distribution of motif widths in the dataset.
The obvious trend in Figure 6.6 is that a lower motif width leads to a higher
number of clusters and more motifs included in clusters. This same trend can be
seen in the best partitions for each width. The best partition for the w = 6 model
has 29 clusters containing 80 motifs, the best partition for the w = 8 model has 16
clusters containing 42 motifs, and the best partition for the w = 10 model has 11
clusters containing 30 motifs.
[Figure: histogram of motif widths; x-axis "Motif Width" (4 to 30 bps), y-axis counts (0 to 15)]
Figure 6.5: Distribution of motif widths in dataset
As mentioned in Section 6.1, using an 8-bp central motif includes 90 out of 116
motifs in our dataset, compared with 111 out of 116 motifs when using a width
of 6 bps and 72 out of 116 motifs when using a width of 10 bps. Figure 6.6 shows
the clustering trees for our motif dataset with central motif widths of 6, 8 and 10
bps.
This trend is a result of two separate factors. The first is that a smaller motif
width allows more motifs to be included in the clustering procedure since we
only exclude motifs that have width less than our central motif width. The second
factor is that a smaller motif width essentially relaxes the criteria for two motifs
to cluster together, since the number of motif positions that need to be similar
between the two motifs is reduced.
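
The first factor can be illustrated with a minimal sketch. The motif widths below are made-up placeholders rather than the widths of our 116 motifs, and the function name `eligible` is hypothetical.

```python
def eligible(widths, w):
    """Return the raw motif widths that are at least as long as the central motif width w."""
    return [m for m in widths if m >= w]


# Hypothetical raw motif widths in bps (NOT the thesis dataset).
toy_widths = [5, 6, 7, 8, 9, 10, 12, 14, 20]

for w in (6, 8, 10):
    kept = eligible(toy_widths, w)
    print(f"w = {w}: {len(kept)} of {len(toy_widths)} motifs retained")
```

A smaller w can only keep the same motifs or more, which is the direction of the trend described above.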
[Figure: three dendrogram panels, "Clustering Tree with w = 6", "Clustering Tree with w = 8" and "Clustering Tree with w = 10"; leaves are motif labels of the form FAMILY−MAxxxx]
Figure 6.6: Comparison of clustering trees using different motif widths
We see the effect of both of these factors when examining the best partitions for
each width in detail. Table 6.3 shows the five strongest clusters in the best parti-
tions for each of the three motif widths.
Table 6.3: Top five clusters for all three motif widths
w    Clus  Size  Strength  Consensus    Families            Motifs
6    1     10    341.3     aGGTCA       NUCLEAR (10/10)     MA0016 MA0066 MA0071 MA0072 MA0074
                                                            MA0117 MA0111 MA0113 MA0114 MA0115
6    2     5     179.5     CACGTG       bHLH-ZIP (4/5)      MA0058 MA0059 MA0093 MA0104
                                        bHLH (1/5)          MA0004
6    3     4     117.9     TAAACA       FORKHEAD (4/4)      MA0030 MA0033 MA0032 MA0031
6    4     4     98.9      CGGAAg       ETS (4/4)           MA0026 MA0028 MA0062 MA0076
6    5     3     89.0      TGACGT       bZIP (3/3)          MA0018 MA0096 MA0097
8    1     5     187.7     gTAGGTCA     NUCLEAR (5/5)       MA0066 MA0071 MA0072 MA0111 MA0117
8    2     5     140.9     gTAAACAa     FORKHEAD (4/5)      MA0030 MA0033 MA0032 MA0031
                                        ZN-FINGER (1/5)     MA0013
8    3     3     72.0      GgaTTTCC     REL (3/3)           MA0023 MA0101 MA0107
8    4     3     70.9      aCCGGAAg     ETS (3/3)           MA0028 MA0062 MA0076
8    5     2     46.0      AAgcGAAA     TRP-CLUSTER (2/2)   MA0050 MA0051
10   1     6     183.9     GGGgaTTtCC   REL (6/6)           MA0022 MA0023 MA0061 MA0101 MA0105 MA0107
10   2     4     102.6     acgTAAAcAa   FORKHEAD (3/4)      MA0030 MA0033 MA0032
                                        ZN-FINGER (1/4)     MA0013
10   3     3     65.8      aTgTTTgtTT   FORKHEAD (3/3)      MA0042 MA0041 MA0047
10   4     2     60.6      gTaGGTCAcg   NUCLEAR (2/2)       MA0066 MA0117
10   5     2     55.0      gAAAgcGAAA   TRP-CLUSTER (2/2)   MA0050 MA0051
The top five clusters share many common elements across the partitions
corresponding to the different width models. The strongest cluster in the w = 6
results contains the NUCLEAR motifs that are present in the strongest cluster in
the w = 8 results as well as in the fourth strongest cluster in the w = 10 results
but, as noted from the clustering trees, the w = 6 NUCLEAR cluster contains a larger
number of motifs. This is partly due to our first factor, since two of these motifs
are short enough (MA0114: 9 bps, MA0115: 7 bps) to be excluded by one or two
of the clustering procedures. The second factor is present in this case as well,
since the extra motif MA0016 has a consensus sequence of GGGGTCACGg, whose
central matrix matches the other NUCLEAR motifs well enough in the
w = 6 case, but not if the central motif is expanded to w = 8 or w = 10.
The FORKHEAD cluster seen in Table 6.3 is also common to all three motif width
partitions, though again some differences are observed. Almost all the other
clusters appearing in Table 6.3 share some common motifs with clusters in the
other best partitions, but in several cases these clusters were not among the five
strongest. The one exception is the ETS cluster, which is fourth strongest for both
w = 6 and w = 8, but not among the w = 10 clusters.
The w = 8 clustering model has been the focus of our investigation because it
serves as a compromise between the w = 6 model and the w = 10 model. The
w = 6 model, though allowing extra motifs to be included in the clustering
procedure, has the potential disadvantage that it may not be specific enough
to pick out biologically interesting clusters of motifs, i.e., many spurious clusters
may result from making the motif width too short. On the other hand, the w = 10
model will have a greatly reduced chance of spurious clusters, but may be too
restrictive for the clustering of similar motifs, in addition to removing a large
proportion (44 out of 116) of short motifs in our dataset from the clustering
procedure.
Chapter 7
Prediction of Co-Regulated Genes
In this final application, we combine the statistical methods for motif discovery,
presented in Chapters 2-4, and motif clustering, presented in Chapters 5-6, to
predict sets of co-regulated genes in our target organism, the bacterium Bacillus subtilis.
This combined procedure relies solely on publicly available genomic sequence
information, and thus avoids the limitations of gene expression microarray data
that were mentioned in Chapter 1.
In Figure 7.1, a flowchart of our sequence-based technique is presented along
with the contrasting steps of a typical microarray procedure. Note that the
ordering of the final two steps is reversed between the two procedures, indicating
that our sequence-based strategy will cluster genes based upon discovered
motifs, whereas the usual microarray strategy clusters genes together before motif
discovery.
[Figure: two parallel flowcharts.
Microarray Experiment: Cellular mRNA -> (Microarray Experiment) -> Gene Expression -> (Clustering) -> Co-regulated Genes -> (Motif Discovery) -> Motifs
Sequence-only Strategy: Whole Genome -> (Orthologue Detection) -> Orthologous Genes -> (Motif Discovery) -> Motifs -> (Clustering) -> Co-regulated Genes]
Figure 7.1: Microarray and sequence-based gene clustering procedures
In Section 7.1, we present our procedure for forming sets of orthologous genes
between bacterial species related to B. subtilis and our focus on both a Studyset
dataset as well as a whole genome dataset. The application of our statistical
model for motif discovery to the datasets is presented in Section 7.2. In
Section 7.3, we discuss the application of our statistical model for motif clustering
to the motifs discovered in Section 7.2. The results of our clustering procedures
are analysed and validated in Section 7.4 for our Studyset and Section 7.6 for our
whole genome datasets.
7.1 Collection of Orthologous Gene Sets
The complete genome sequences and gene annotations for our target organism,
Bacillus subtilis, and an additional 6 bacterial species (summarized in Table 7.1)
were downloaded from the National Center for Biotechnology Information
website (www.ncbi.nlm.nih.gov).
Table 7.1: Bacterial species included in the study
Species                      Reference               Genome Size  Number of Genes
Bacillus anthracis           Read et al. (2003)      5.2 Mb       5738
Bacillus halodurans          Takami et al. (2000)    4.2 Mb       4066
Bacillus subtilis            Kunst et al. (1997)     4.2 Mb       4103
Clostridium acetobutylicum   Nolling et al. (2001)   3.9 Mb       3739
Clostridium perfringens      Shimizu et al. (2002)   3.0 Mb       2660
Listeria innocua             Glaser et al. (2001)    3.0 Mb       2989
Oceanobacillus iheyensis     Takami et al. (2002)    3.6 Mb       3496
The complete genome was also available for Listeria monocytogenes (Glaser et al.,
2001), but this species was excluded from the study since it was determined to be
almost identical to Listeria innocua, and therefore would contribute little extra
information.
Orthologous genes between B.subtilis and each of the other species listed in Ta-
ble 7.1 were identified using a reciprocal BLAST best-hit procedure (Remm et al.,
2001) consisting of three basic steps:
1. For each gene in B.subtilis, the gene in the other species that had the most
significant protein sequence similarity was found by using the program
BLASTP (Altschul et al., 1990).
2. For each gene in the other species, the gene in B.subtilis that had the most
significant protein sequence similarity was found, again using the program
BLASTP.
3. Any B.subtilis genes that were matched with the same gene from the other
species in both step 1 and step 2 were classified, along with their matched
gene from the other species, as an “orthologous gene pair”.
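
The three steps above can be sketched as follows. This is a minimal illustration, not the code used in this study: it assumes that the BLASTP output has already been parsed into (query, subject, score) tuples in which a smaller score means a more significant match, the gene names are invented, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical.

```python
def best_hits(hits, threshold=1e-10):
    """Map each query gene to its most significant subject gene.

    hits: iterable of (query, subject, score) tuples, where a smaller score is
    assumed to mean a more significant match. Hits above the threshold are ignored.
    """
    best = {}
    for query, subject, score in hits:
        if score > threshold:
            continue
        if query not in best or score < best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits, threshold=1e-10):
    """Return sorted (target gene, other-species gene) orthologous pairs."""
    forward = best_hits(forward_hits, threshold)   # step 1
    reverse = best_hits(reverse_hits, threshold)   # step 2
    # step 3: keep only pairs that are each other's best hit
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented toy hits (target species -> other species, and the reverse direction).
forward_hits = [("bsu1", "ban1", 1e-50), ("bsu1", "ban2", 1e-20),
                ("bsu2", "ban3", 1e-30), ("bsu3", "ban4", 1e-5)]
reverse_hits = [("ban1", "bsu1", 1e-48), ("ban3", "bsu9", 1e-40),
                ("ban3", "bsu2", 1e-25), ("ban4", "bsu3", 1e-60)]

pairs = reciprocal_best_hits(forward_hits, reverse_hits)
print(pairs)
```

In this toy input, bsu2 is dropped because its best hit ban3 points back to a different gene, and bsu3 is dropped because its forward hit does not pass the threshold.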
For each of the BLASTP matching procedures, an arbitrary threshold for
significance must be specified. We used a very conservative significance threshold of
$10^{-10}$. The seven species in our study are also summarized by the phylogenetic
tree in Figure 7.2, which gives an indication of which species are more closely
related to our target organism B.subtilis.
[Figure: phylogenetic tree with leaves C.acetobutylicum, L.innocua, C.perfringens, O.iheyensis, B.anthracis, B.halodurans, B.subtilis]
Figure 7.2: Phylogenetic tree of seven related bacterial species
To construct this phylogenetic tree, 530 sets
of orthologous genes that contained all seven species were globally aligned by
ClustalW (Thompson et al., 1994) using the amino-acid sequence of each gene.
Next, a phylogenetic tree for each gene was inferred from protein alignments
using a parsimony optimality criterion with the software program PHYLIP v3.573
(Felsenstein, 1993). Finally, a majority-rule consensus tree was constructed based
on the 530 separate gene phylogenies using PHYLIP (Felsenstein, 1993).
Table 7.2 gives the number and proportion of orthologous genes between B.subtilis
and each species.
Table 7.2: Orthologous gene pairs with B.subtilis
Species compared             Number of           Proportion of
with B.subtilis              Orthologous Genes   Orthologous Genes
Bacillus anthracis           1128                0.20
Bacillus halodurans          1022                0.25
Clostridium acetobutylicum   531                 0.14
Clostridium perfringens      482                 0.18
Listeria innocua             689                 0.23
Oceanobacillus iheyensis     962                 0.28
The orthologous gene pairs for each gene in B.subtilis were collected across all six
other species into 1516 “orthologous gene sets”.
We collected the regulatory region sequence for each gene in our orthologous
gene sets, which we defined as the 500-bp sequence located immediately
upstream of the translation start site for each gene. However, this regulatory region
sequence was not allowed to overlap with the coding sequence of the previous
gene in the genome. The rationale behind this restriction is that the coding
regions of genes rarely contain binding sites for regulatory proteins,
and even when they do, the coding regions will show a generally high degree of
conservation across species, making the discovery of conserved elements more
difficult.
In the case where the end of the coding sequence of the previous gene was within
500 basepairs of the translation start site, the regulatory region sequence would
be restricted to only the sequence between the two coding regions (i.e., the
intergenic region). In bacterial genomes, genes are often organized into operons, which
consist of several genes that are transcribed together under the control of a single
regulatory region. To avoid including sequences contained within an operon, any
sequences which had an intergenic length of less than 50 basepairs were excluded
from the study.
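
The length rules above can be summarized in a short sketch. It is only an illustration under simplifying assumptions that are not part of our procedure as stated: genes lie on the forward strand, coordinates are 0-based and half-open, and the function name `upstream_region` and its parameter names are hypothetical.

```python
def upstream_region(start, prev_end, max_len=500, min_len=50):
    """Return (left, right) coordinates of the regulatory region, or None if excluded.

    start:    position of the translation start site of the gene
    prev_end: position just past the last coding base of the previous gene
    Assumes a forward-strand gene and 0-based, half-open coordinates.
    """
    intergenic = start - prev_end
    if intergenic < min_len:
        # Short intergenic region: likely inside an operon, so the gene is skipped.
        return None
    # Take at most max_len bases, never reaching into the previous coding sequence.
    left = max(start - max_len, prev_end)
    return (left, start)


print(upstream_region(1000, 200))   # previous gene far away: full 500-bp region
print(upstream_region(1000, 800))   # previous gene within 500 bps: intergenic region only
print(upstream_region(1000, 970))   # intergenic length under 50 bps: excluded
```

A gene on the reverse strand would need the mirror-image calculation, which is omitted here.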
The number of sequences in each orthologous gene set (OGS) varied due to both
these length restrictions as well as the fact that orthologous genes were found in
some species but not others. At least two sequences were required to form an
orthologous gene set, a B.subtilis sequence and at least one sequence in another
species.
A second dataset considered in this investigation was a subset of 172 ortholo-
gous gene sets for which verified TF binding sites in B.subtilis were known. This
“Studyset” was used to validate and fine-tune our methods of motif discovery
and motif clustering before applying these methods to the full OGS dataset.
The distributions of the number of sequences in the full or “whole genome” OGS
dataset and the “Studyset” OGS dataset are given in Table 7.3 below.
Table 7.3: Sequence distributions for each dataset
        Whole Genome                      Studyset
X       Number with    Proportion with    Number with    Proportion with
        X seqs         X seqs             X seqs         X seqs
2       163            0.11               14             0.08
3       405            0.27               40             0.23
4       388            0.26               47             0.27
5       256            0.17               23             0.13
6       171            0.11               28             0.16
7       133            0.09               20             0.12
Total   1516           1.00               172            1.00
7.2 Motif Discovery
The upstream regulatory regions for each orthologous gene set enumerated in
Table 7.3 form a small (2-7 sequences with a maximum length of 500 bps each)
sequence dataset which we hypothesize contains TF binding motifs that have
been conserved by evolution.
We will apply the motif discovery strategies outlined in Chapters 2-4 to find these
conserved motifs. Specifically, our motif discovery procedure involves the motif-
finding program BioProspector (Section 1.4) and our scoring function optimiza-
tion algorithm, BioOptimizer (Chapter 3). BioProspector was used as the motif-
finding program because it has the capability to find both one and two-block
motifs.
However, our motif discovery techniques (Chapters 2-4) have previously focused
on the situation where a single conserved motif is present in a sequence dataset,
whereas each of these OGS sequence datasets could contain multiple different
conserved motifs. We will adapt our single-motif methods to this multiple mo-
tif situation by using an iterative-masking strategy. The best single motif will be
found by our single-motif methods, and then this motif will be removed from
the dataset and our single-motif methods reapplied to the modified dataset. This
process can be repeated several times in order to find multiple motif signals. This
same strategy is applied in the program AlignAce (Roth et al., 1998) and is men-
tioned in Section 2.5 on multiple motif strategies.
For a particular OGS sequence dataset, the following motif discovery procedure
was applied to find conserved one-block motifs.
1. The motif-finding program BioProspector (Liu et al., 2001) was used to find
the top five one-block motifs. Since the motif width w must be pre-specified
for BioProspector, the program was run separately for 12 different widths
(8, 10, . . . , 30). For each width, the top five motifs were collected.
2. Since BioProspector is a stochastic motif-finding algorithm, independent
runs of the program may give different results. To account for this fact, we
repeated Step 1 three times for each width, resulting in a total of 3×5×12 =
180 BioProspector motifs, many of which might be identical or close to iden-
tical.
3. Each of these 180 motifs was separately scored and optimized using
BioOptimizer. BioOptimizer also allows the motif width to vary in order to find
the best possible motif signal. The motif with the highest BioOptimizer
score was retained as the “best motif”.
4. BioOptimizer also calculates a “null score” (Section 3.5) which is the score
that a motif with no sites would have in the given sequence dataset. If the
best motif had a score less than the null score, then it was removed from
consideration and the motif discovery procedure for that OGS sequence
dataset was stopped.
5. If the best motif had a score greater than the null score, it was retained for
the motif clustering procedure to follow. This best motif was then “masked
out” of the sequence dataset by replacing all the motif binding sites with
characters that are ignored by the BioProspector and BioOptimizer pro-
grams.
6. With this new “masked” sequence dataset, the entire motif-finding
procedure (steps 1-5) was repeated until no motifs were discovered that had a
BioOptimizer score greater than the null score.
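
The control flow of steps 1-6 can be sketched as below. This is a schematic illustration only: BioProspector and BioOptimizer are replaced by toy stand-ins (k-mer counting and an arbitrary null score), motifs are represented as plain strings rather than count matrices, and all function names are hypothetical. The treatment of a score exactly equal to the null score is also an assumption.

```python
from collections import Counter

MASK = "N"  # placeholder character assumed to be ignored by the motif tools


def iterative_masking(seqs, find_candidates, score, null_score, max_rounds=10):
    """Repeatedly find the best motif, stop if it fails the null score, else mask it."""
    seqs = list(seqs)
    found = []
    for _ in range(max_rounds):
        candidates = find_candidates(seqs)                     # steps 1-2
        if not candidates:
            break
        best = max(candidates, key=lambda m: score(m, seqs))   # step 3
        if score(best, seqs) <= null_score(seqs):              # step 4
            break
        found.append(best)                                     # step 5: retain, then mask
        seqs = [s.replace(best, MASK * len(best)) for s in seqs]
    return found                                               # step 6 is the loop itself


# Toy stand-ins, NOT the real programs.
def toy_candidates(seqs, width=4):
    counts = Counter(s[i:i + width] for s in seqs
                     for i in range(len(s) - width + 1)
                     if MASK not in s[i:i + width])
    return [kmer for kmer, _ in counts.most_common(5)]


def toy_score(motif, seqs):
    return sum(s.count(motif) for s in seqs)


def toy_null(seqs):
    return 2  # arbitrary cut-off for the toy example


toy_seqs = ["AATGACGG", "CCTGACTT", "GTTGACAC"]
found = iterative_masking(toy_seqs, toy_candidates, toy_score, toy_null)
print(found)
```

In the toy input the shared word TGAC is found once; after masking, no unmasked candidates remain and the loop stops.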
Since each discovered motif is iteratively masked out of the sequence dataset, we
avoid re-discovering the same strong motif signal over and over again. The null
score cut-off criterion helps to avoid the discovery of weak motif signals that are
not biologically relevant. Applying this iterative-masking one-block motif dis-
covery strategy to each OGS sequence dataset separately results in several dis-
covered one-block motifs (summarized as count matrices) associated with each
orthologous gene set. Our entire motif discovery procedure is summarized in
Figure 7.3. Since our final goal is to cluster B. subtilis genes using similarity
of discovered motifs, we require that any discovered motif contain at least one
predicted B.subtilis site. This criterion is necessary because neither BioProspector
nor BioOptimizer is restricted to finding sites in every sequence, and we do not
want to include any discovered motifs that are not present in B.subtilis.
An iterative-masking motif discovery procedure was also performed to find two-
block motifs with variable gaps. The two-block procedure was virtually identical
to the one-block procedure, except that in the first step, the top five two-block
motifs were found by BioProspector separately for 7 different two-block widths
(8−8, 10−10, . . . , 20−20) with a gap range of 12-15 bps, for a total of 3×5×7 = 105
two-block motifs that were subsequently optimized by BioOptimizer. The gap
range of 12-15 bps was used because that length roughly corresponds to a single
rotation of the DNA double helix, so that the two blocks of our motif are on the
same edge of the DNA double helix.
Figure 7.3: Flowchart for motif discovery procedure
7.3 Clustering Genes Based on Discovered Motifs
The one-block and two-block motifs found within each OGS sequence dataset
were combined to form a large collection of motifs, which will be clustered using
the motif clustering model described in Chapter 5.
For the 172 OGSs in the Studyset dataset, we found 81 one-block motifs and
168 two-block motifs that met our criterion (BioOptimizer score greater than null
score) for inclusion in the clustering procedures. For the 1516 OGSs in the whole
genome dataset, we found 1025 one-block motifs and 1416 two-block motifs that
met our criterion for inclusion in the clustering procedures. As mentioned in
Section 5.5, we have two clustering strategies for handling the two-block motifs.
In the Independent-Block strategy, we treat each of the two-block motifs as two
separate and independent single block motifs and cluster these single block motifs
together with the discovered one-block motifs, for a total of 417 independent-block
motifs in the Studyset dataset, and 3466 independent-block motifs in the whole
genome dataset. Our one-block independent motifs were clustered based on a
“central motif”, though each central motif is allowed to shift within the raw
motif matrices, as outlined in Section 5.4.
In the Joint-Block strategy, we treat each of the two blocks of a two-block motif to-
gether as one joined motif, and cluster these two-block motifs separately from the
discovered one-block motifs, for a total of 168 joint-block motifs in the Studyset,
and 1416 joint-block motifs in the whole genome dataset. Our joint-block motifs
were clustered based on a “central motif” in each of the two blocks, with shifting
again allowed as described in Section 5.4.
Following the guidelines mentioned in Section 6.6 for our TF dataset, our indepen-
dent-block clustering model was implemented with a central motif of 8 bps. A
longer motif width might be too restrictive, especially when considering that the
small size of each OGS sequence dataset may lead to generally weaker motif sig-
nals. However, a shorter motif width might not be restrictive enough, result-
ing in too many spurious clusters that are not biologically relevant. In the joint-
block clustering model, both blocks contribute to the clustering probabilities, so
a shorter central motif of 6 bps was used in either block of the two-block motif
matrices.
We used the same implementation procedure for our clustering model as de-
scribed in Chapter 6, except that this application has two different datasets (study-
set and whole genome), each with two sets of motifs (independent-block and
joint-block). Multiple Gibbs sampler chains were run from several different ini-
tial configurations described in Chapter 6, and convergence was evaluated using
the R statistic as described at the beginning of Chapter 6.
Our results for both clustering procedures are described in Sections 7.4-7.5 for
the studyset dataset, and Sections 7.6-7.7 for the whole genome dataset. Our
predicted clusters are subjected to detailed examination, and are also evaluated
using four external validation measures described in the following section.
7.3.1 Validation of Gene Clusters
In order to evaluate our motif discovery and clustering procedures in terms of
ability to predict co-regulated gene clusters, we constructed several validation
measures based upon external information:
1. Functional Category Over-Representation
2. Known TF Over-Representation
3. Gene Expression: Median Within-Cluster Correlation
4. Gene Expression: Average Within-Cluster Variance
The first validation measure is to examine whether or not the predicted
clusters tend to contain genes with the same function. Kunst et al. (1997) classified
B.subtilis genes using a set of functional categories, which are available on the
SubtiList website (Moszer et al., 1995). The functional categories can be tabulated
across all genes in each cluster, and the program GeneMerge (Castillo-Davis and
Hartl, 2003) can be used to calculate a p-value for over-representation of a
particular functional category in a given cluster. This p-value is calculated under the
assumption of a hypergeometric distribution with a Bonferroni correction.
If our clustering procedure is effective, we expect to find over-representation of
functional categories in our predicted clusters. This measure is limited by the
granularity and imperfections in the functional classifications.
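
As an illustration of this type of calculation, the sketch below computes an upper-tail hypergeometric p-value and applies a Bonferroni correction. It is not the GeneMerge implementation: the function names are hypothetical, the counts are invented, and the number of tests used for the correction (here, the number of categories examined) is an assumption.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for a cluster of n genes drawn without replacement from N genes,
    K of which belong to the functional category of interest."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Invented counts: 20 genes in total, 5 in the category, a cluster of 4 genes
# of which 3 are in the category, and 10 categories tested.
p_raw = hypergeom_upper_tail(N=20, K=5, n=4, k=3)
p_adj = bonferroni(p_raw, n_tests=10)
print(p_raw, p_adj)
```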
The second validation measure is to examine whether or not the predicted clus-
ters tend to contain genes that are known to be controlled by the same transcrip-
tion factor (TF) protein. A list of 650 known TF-gene interactions from the DBTBS
database was provided (Makita et al., 2004) and used to tabulate known TF inter-
actions for genes within each predicted cluster.
Again, the program GeneMerge (Castillo-Davis and Hartl, 2003) can be used to
calculate a p-value for over-representation of interactions with a particular TF
protein in a given cluster. If our clustering procedure has been effective, we ex-
pect to find that a large proportion of genes in a predicted cluster will have an
interaction with a known TF. This measure is limited, however, by the small size
of our interaction list, which presumably catalogues only a minuscule fraction of
the true gene-TF interactions.
Another validation measure is to examine gene expression patterns within
predicted clusters to see if genes within particular clusters are co-expressed across
a variety of conditions. Our expression dataset consists of ratios of differential
expression on cDNA microarrays from seven different experimental conditions
in B.subtilis (Conlon et al., 2004). For a particular cluster, we considered two
different measures of microarray co-expression.
The first expression measure was to calculate the median pairwise correlation S
within a cluster. The Pearson correlation was calculated between every pair of
genes in a particular cluster, and then the median value of these correlations
was taken to be the median within-cluster correlation, S. Since two genes
in the same cluster might be regulated by the same TF but in opposite ways (one
repressed while the other is activated), we use the absolute value of the
correlation.
The second expression measure is the average within-cluster variance, which is
calculated for a cluster with n members as
$$T = \frac{1}{7} \sum_{i=1}^{7} \left[ \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2 \right] \qquad (7.1)$$
where $x_{ij}$ is the differential expression ratio for gene j in experimental condition
i, and $\bar{x}_i$ is the differential expression ratio in experimental condition i averaged
over all genes in the cluster.
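
Both expression measures can be written compactly. The sketch below is illustrative only: the function names are hypothetical, the toy cluster uses three conditions instead of seven (the number of conditions is read from the data), and the expression values are invented.

```python
from itertools import combinations
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.
    Undefined (division by zero) if either profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def median_abs_correlation(cluster):
    """S: median over all gene pairs of the absolute Pearson correlation."""
    return median(abs(pearson(g, h)) for g, h in combinations(cluster, 2))


def average_within_variance(cluster):
    """T as in Equation 7.1; cluster[j][i] is the ratio for gene j in condition i."""
    n = len(cluster)
    n_cond = len(cluster[0])
    total = 0.0
    for i in range(n_cond):
        column = [gene[i] for gene in cluster]
        xbar = sum(column) / n
        total += sum((x - xbar) ** 2 for x in column) / n
    return total / n_cond


# Invented toy cluster: 3 genes measured in 3 conditions.
toy_cluster = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
S = median_abs_correlation(toy_cluster)
T = average_within_variance(toy_cluster)
print(S, T)
```

The third toy gene is perfectly anti-correlated with the first two, so taking absolute values still gives S close to 1, while T remains sizeable because the profiles differ in level.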
If our motif discovery and clustering procedure has effectively parsed our genes
into co-regulated clusters, then we would expect low values of the average within-
cluster variance, T , and high values of the median within-cluster correlation, S.
We can estimate p-values for S and T for our predicted clusters by simulation,
i.e., by comparing our observed S and T to many values of S and T calculated for
randomly-generated clusters. Since S and T will depend on the size of the pre-
dicted cluster, we simulated many random values of S and T for each possible
cluster size.
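
A simulation of this kind might look as follows. This is a sketch under stated assumptions rather than our actual code: random clusters are drawn without replacement from the full gene list, the add-one estimate of the p-value is a common convention and not necessarily the one used here, and the function names and toy statistic are hypothetical.

```python
import random


def simulated_p_value(observed, statistic, population, cluster_size,
                      n_sim=1000, high_is_extreme=True, seed=0):
    """Estimate a p-value by comparing `observed` with the statistic computed on
    random clusters of the same size drawn from `population`."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sim):
        cluster = rng.sample(population, cluster_size)
        value = statistic(cluster)
        if high_is_extreme:
            extreme += value >= observed   # e.g. S: large values are interesting
        else:
            extreme += value <= observed   # e.g. T: small values are interesting
    # Add-one estimate (an assumption), which avoids reporting a p-value of zero.
    return (extreme + 1) / (n_sim + 1)


# Toy example: each "gene" is a single number and the statistic is the range,
# standing in for a variance-like measure such as T.
def toy_spread(cluster):
    return max(cluster) - min(cluster)


population = list(range(100))
p_small = simulated_p_value(0, toy_spread, population, 5, high_is_extreme=False)
p_large = simulated_p_value(1000, toy_spread, population, 5, high_is_extreme=False)
print(p_small, p_large)
```

Because the cluster size is an argument, the same function can be called once per cluster size, matching the size dependence noted above.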
The main weaknesses of these expression-based measures are the limited number
of experimental conditions present in our dataset, as well as the inherent sources
of noise present in microarray data (reviewed by Tseng et al. (2001)).
Despite the fact that each measure has particular limitations, the use of several
measures simultaneously should give us a good idea of the effectiveness of our
motif discovery and clustering procedure.
7.4 Studyset Clustering Results
The overall studyset clustering results can be examined in the form of a clustering
tree (Section 6.1). As an example, the clustering tree for our 168 joint-block motifs
is given in Figure 7.4. Motifs are labelled in Figure 7.4 as genename-motifnum
since more than one motif was found in the upstream region of the OGS for many
of the B.subtilis genes.
The length of the branches is equal to $1 - p_{ij}$, where $p_{ij}$ is the pairwise posterior
clustering probability of motifs i and j. For example, the two motifs bsaA-m1 and
ywaC-m1 in the lower-left corner of the plot have $p_{ij} \approx 1$ of clustering together,
but have $p_{ij} \approx 0$ of clustering with any other motifs in the dataset.
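
The quantities behind the tree can be illustrated with a small sketch. It assumes, as a simplification not taken from this section, that the pairwise posterior clustering probability is estimated by the fraction of sampled partitions in which two motifs share a cluster; the function names and the sampled labels are invented.

```python
def pairwise_clustering_probs(samples):
    """samples[s][i] is the cluster label of motif i in sampled partition s.
    Returns the matrix of estimated pairwise co-clustering probabilities."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


def branch_lengths(probs):
    """Tree distances 1 - p_ij derived from the pairwise probabilities."""
    return [[1.0 - p for p in row] for row in probs]


# Four invented sampled partitions of three motifs.
toy_samples = [[0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 2]]
probs = pairwise_clustering_probs(toy_samples)
dists = branch_lengths(probs)
print(probs[0][1], dists[0][1])
```

Motifs that almost always share a cluster end up with a distance near 0, which is why such pairs appear joined by very short branches.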
The large number of motifs under consideration in this project limits the amount
of information that one can gain from a tree over all the motifs. The clustering
tree for our 417 independent-block motifs is far too dense to be a useful visual
summary of the clustering results.
The best partition for each strategy, calculated according to Section 6.2, resulted
in 125 independent block clusters (containing 374 out of 417 independent block
motifs) and 44 joint block clusters (containing 112 out of 168 joint block motifs).
Our best clustering partitions were then filtered to remove any motifs that had
individual clustering probabilities (Section 6.3) that were less than 0.75. As well,
the independent block clusters were filtered to remove any “redundant” motifs,
i.e., two motifs from the same gene in the same cluster, which may have arisen
from one of the blocks of a two-block motif found in a particular OGS being
similar enough to cluster with a one-block motif found in that same OGS (since the
motifs of two separate one-block and two-block motif-finding procedures were
combined). This redundancy was not an issue in the joint block clustering, since
only the two-block discovered motifs were used for this dataset, and the motif
discovery procedure was designed to avoid redundant motifs.
[Figure: clustering tree (dendrogram) whose leaves are the 168 studyset joint-block motifs, labelled genename-motifnum (e.g. bsaA-m1, ywaC-m1)]
Figure 7.4: Clustering tree for studyset joint-block motifs
After filtering out redundant motifs and motifs with low individual clustering
probabilities, our independent block partition was reduced to 97 clusters con-
taining 271 motifs, while the joint block partition was reduced to 40 clusters con-
taining 99 motifs. A graphical representation of our clustering procedure for the
studyset dataset is given in Figure 7.5.
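The filtering step can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the data layout, the rule of keeping the highest-probability motif per gene, and the rule of dropping clusters left with fewer than two motifs are all assumptions, and the probabilities shown are made up. The per-gene redundancy rule would apply only to the independent block clusters.

```python
def filter_partition(clusters, min_prob=0.75):
    """clusters maps a cluster id to a list of (motif_id, gene, prob) tuples.

    Assumed rules: drop motifs with prob < min_prob, keep only the
    highest-probability motif per gene within a cluster, and drop
    clusters that end up with fewer than two motifs.
    """
    filtered = {}
    for cluster_id, motifs in clusters.items():
        best_per_gene = {}
        for motif_id, gene, prob in motifs:
            if prob < min_prob:
                continue  # low individual clustering probability
            current = best_per_gene.get(gene)
            if current is None or prob > current[2]:
                best_per_gene[gene] = (motif_id, gene, prob)
        kept = list(best_per_gene.values())
        if len(kept) >= 2:
            filtered[cluster_id] = kept
    return filtered

# Hypothetical clusters with made-up probabilities.
clusters = {
    "c1": [("purE-m1", "purE", 0.98), ("purA-m1", "purA", 0.91),
           ("purA-m2", "purA", 0.80), ("ytiP-m2", "ytiP", 0.60)],
    "c2": [("glnR-m1", "glnR", 0.95), ("ackA-m1", "ackA", 0.70)],
}
filtered = filter_partition(clusters)
```

Here cluster c1 keeps two motifs (one redundant and one low-probability motif are removed), while c2 is dropped because only one motif survives.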
Step                Independent Blocks                       Joined Blocks
Motif Discovery     172 Orthologous Genes, giving 81 One-Block Motifs and 168 Two-Block Motifs
                    417 Indblock Motifs                      168 Jointblock Motifs
Clustering          125 Indblock Clusters                    44 Jointblock Clusters
Filtering           97 Indblock Clusters                     40 Jointblock Clusters
Evaluation          40% significant on at least one measure  45% significant on at least one measure
Further Evaluation  10% significant on multiple measures     10% significant on multiple measures
Figure 7.5: Flowchart for studyset motif clustering procedure
Figure 7.6 gives the distribution of cluster sizes for both the independent-block
and joint-block best partitions. We see that the joint-block clustering tends to
produce a higher proportion of small clusters, especially clusters that have only
two motif members. This is also reflected in the average cluster size, which is 2.8
motifs for the independent-block best partition and 2.5 motifs for the joint-block
best partition.
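As a quick arithmetic check, the average cluster sizes quoted above follow from the filtered cluster and motif counts reported earlier in this section. The dictionary layout is only illustrative.

```python
# Filtered partition sizes as reported in the text.
filtered_counts = {
    "independent-block": {"clusters": 97, "motifs": 271},
    "joint-block": {"clusters": 40, "motifs": 99},
}

# Average number of motifs per cluster, rounded to one decimal place.
average_size = {
    name: round(c["motifs"] / c["clusters"], 1)
    for name, c in filtered_counts.items()
}
```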
Figure 7.6: Distribution of cluster sizes for studyset best partitions
(Panels: “Cluster Sizes − Best Partition − Studyset Indblock − DP”, counts 0 to 50, and “Cluster Sizes − Best Partition − Studyset Jointblock − DP”, counts 0 to 25; horizontal axis: cluster size, 2 to 8.)
Both the independent block and joint block clusters were examined using the four
validation measures introduced in Section 7.3.1. All predicted clusters that were
significant (at the α = 0.1 level) on any of the four validation measures are shown
in Table 7.4. The independent block clusters are given first, followed by the joint
block clusters, and each list of clusters is ordered by the cluster strength statistic
described in Section 6.3, which is also given.
Table 7.4: Significant studyset predicted clusters
clus      size  str    S     p(S)   T     p(T)   func             num   p(func)  TF    num   p(TF)  mult
Ind 2     5     133.7  -     -      -     -      Metabolism-nucs  3/5   0.002    PurR  5/5   0.000  ***
Ind 3     5     127.5  -     -      -     -      Metabolism-nucs  3/5   0.002    PurR  5/5   0.000  ***
Ind 4     4     123.8  -     -      -     -      -                -     -        Zur   2/4   0.002
Ind 5     6     113.4  -     -      -     -      Sporulation      4/6   0.029    -     -     -
Ind 12    4     69.8   -     -      0.04  0.056  -                -     -        -     -     -
Ind 13    3     67.1   0.88  0.053  -     -      -                -     -        RocR  2/3   0.006  ***
Ind 16    4     58.8   0.94  0.011  -     -      -                -     -        -     -     -
Ind 18    3     52.0   -     -      0.02  0.026  -                -     -        -     -     -
Ind 20    3     51.7   -     -      0.02  0.046  -                -     -        -     -     -
Ind 21    3     51.2   0.94  0.028  -     -      -                -     -        -     -     -
Ind 22    3     48.3   0.84  0.076  -     -      Sporulation      3/3   0.013    -     -     -      ***
Ind 23    3     47.8   -     -      -     -      -                -     -        PurR  2/3   0.031
Ind 26    3     45.0   1.00  0.002  -     -      -                -     -        SigE  3/3   0.022  ***
Ind 31    3     42.9   1.00  0.002  -     -      -                -     -        -     -     -
Ind 32    3     42.9   0.88  0.054  -     -      -                -     -        -     -     -
Ind 33    3     41.9   -     -      -     -      -                -     -        SigW  2/3   0.062
Ind 36    3     41.3   0.84  0.076  -     -      -                -     -        -     -     -
Ind 39    3     40.2   -     -      -     -      -                -     -        SigW  2/3   0.062
Ind 41    3     39.2   0.89  0.047  0.02  0.04   -                -     -        SigW  2/3   0.062  ***
Ind 43    2     37.6   -     -      0.01  0.095  -                -     -        DinR  2/2   0.000  ***
Ind 47    3     33.9   -     -      -     -      Transport/bindi  2/3   0.047    -     -     -
Ind 49    2     30.2   -     -      -     -      -                -     -        SigA  2/2   0.065
                                                                                 TnrA  2/2   0.000
Ind 50    2     29.9   -     -      -     -      Adaptation       2/2   0.001    CtsR  2/2   0.000  ***
Ind 54    2     25.2   -     -      -     -      -                -     -        CcpA  2/2   0.050
Ind 55    2     25.1   -     -      -     -      -                -     -        SigB  2/2   0.017
Ind 56    2     25.0   -     -      -     -      -                -     -        PurR  2/2   0.004
Ind 58    2     24.4   -     -      -     -      Sporulation      2/2   0.055    -     -     -
Ind 63    2     23.1   -     -      -     -      Sporulation      2/2   0.055    -     -     -
Ind 70    2     21.5   -     -      0.00  0.031  -                -     -        AbrB  2/2   0.014  ***
Ind 71    2     21.5   -     -      -     -      RNA-synthesis    2/2   0.020    -     -     -
Ind 77    2     20.4   0.98  0.040  -     -      -                -     -        -     -     -
Ind 83    2     19.2   -     -      0.00  0.011  -                -     -        -     -     -
Ind 84    2     19.1   -     -      -     -      -                -     -        SigB  2/2   0.017
Ind 85    2     19.0   0.96  0.066  -     -      -                -     -        -     -     -
Ind 87    2     18.9   -     -      -     -      -                -     -        ComK  2/2   0.007
Ind 89    2     18.8   -     -      -     -      Transport/bindi  2/2   0.008    YqhN  2/2   0.000  ***
Ind 90    2     18.7   -     -      0.00  0.037  -                -     -        -     -     -
Ind 94    2     16.6   0.93  0.081  -     -      -                -     -        -     -     -
Ind 96    2     14.9   0.92  0.083  -     -      -                -     -        -     -     -
Joint 1   5     186.0  -     -      -     -      Metabolism-nucs  3/5   0.002    PurR  5/5   0.000  ***
Joint 3   4     95.3   -     -      0.02  0.014  -                -     -        -     -     -
Joint 4   4     92.9   0.87  0.026  -     -      -                -     -        -     -     -
Joint 5   3     67.1   -     -      -     -      Sporulation      3/3   0.012    -     -     -
Joint 6   3     65.4   -     -      -     -      Cell-Wall        2/3   0.021    -     -     -
Joint 7   3     65.3   0.95  0.024  -     -      -                -     -        -     -     -
Joint 10  3     63.8   -     -      -     -      Transport/bindi  2/3   0.042    -     -     -
Joint 13  3     58.1   -     -      0.02  0.029  -                -     -        -     -     -
Joint 15  2     40.7   -     -      -     -      Transport/bindi  2/2   0.007    Zur   2/2   0.000  ***
Joint 17  2     35.4   0.99  0.031  -     -      Sporulation      2/2   0.054    SigE  2/2   0.062  ***
Joint 20  2     33.3   1.00  0.022  -     -      Sporulation      2/2   0.054    SigE  2/2   0.062  ***
Joint 21  2     32.9   0.96  0.066  -     -      -                -     -        -     -     -
Joint 27  2     31.2   -     -      0.01  0.089  -                -     -        -     -     -
Joint 28  2     30.3   -     -      -     -      -                -     -        CcpA  2/2   0.069
Joint 31  2     28.9   0.96  0.071  -     -      -                -     -        -     -     -
Joint 32  2     28.9   -     -      -     -      RNA-synthesis    2/2   0.021    -     -     -
Joint 33  2     28.4   -     -      0.00  0.014  -                -     -        -     -     -
Joint 39  2     26.1   -     -      -     -      -                -     -        SigD  2/2   0.001
In addition to cluster size, strength, consensus sequence, and number of sites
(|A|) in each significant cluster, the measure on which the cluster is significant is
also given. If either of the expression measures T or S is significant, then the value
of T or S is given, along with the p-value calculated as described in Section 7.3.1.
If the functional category over-representation is significant for a predicted cluster,
that functional category and the proportion of genes with that category are given,
along with the p-value. If the TF over-representation is significant, then the TF is
given, along with the proportion of genes in the cluster that are regulated by that
TF, and the p-value for the over-representation, as described in Section 7.3.1.
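For readers who want a concrete picture of an over-representation p-value, one common choice is the upper tail of a hypergeometric distribution. The sketch below assumes that form purely for illustration; the actual calculation is the one defined in Section 7.3.1, which may differ. The function name and the count of five regulated genes in the dataset are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest (assumed test form)."""
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 172 genes in the dataset, 5 of them regulated by
# some TF, and a cluster of 5 genes that are all regulated by that TF.
p_value = hypergeom_tail(k=5, n=5, K=5, N=172)
```

Under these made-up counts, a cluster consisting entirely of the annotated genes would receive a p-value that rounds to 0.000 at three decimal places.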
Of the independent block predicted clusters, 39 out of 97 (40%) were significant
on at least one of the validation measures. The proportion of significant joint
block clusters was slightly higher, with 18 out of 40 (45%) predicted clusters
being significant on at least one measure.
Several clusters are significant on multiple measures, in which case we are even
more confident that these clusters are biologically relevant. Of the independent
block clusters, 10 out of 97 (10%) were significant on multiple measures, as were
4 out of 40 (10%) of the joint block clusters. These are indicated by a “***” symbol
in Table 7.4.
It could perhaps be argued that since we are using multiple validation measures,
each with a significance threshold of α = 0.1, we might expect to find a high
number (up to 40%) of clusters to be significant simply by chance. However,
although the chance of a cluster being significant on at least one measure is high,
the chance of being significant on more than one measure is quite low.
Assuming independence between measures, the probability of being significant
on any particular pair of measures is only 1%, and is only 0.1% for any particular
set of three measures. As given above, we observe a much higher rate of significant clusters on
multiple measures than would be expected by chance, in both the independent
and joint block clustering partitions.
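The chance calculation can be made explicit with a short sketch. It assumes four independent measures, each with a false-positive rate of exactly α = 0.1; discrete over-representation tests need not attain that level exactly, so these figures are only a rough guide. The variable names are illustrative.

```python
from math import comb

ALPHA = 0.1       # significance threshold used for each measure
N_MEASURES = 4    # number of validation measures

def prob_exactly(k, n=N_MEASURES, alpha=ALPHA):
    """Binomial probability of being significant on exactly k of n measures."""
    return comb(n, k) * alpha**k * (1 - alpha) ** (n - k)

p_specific_pair = ALPHA**2      # a given pair of measures
p_specific_triple = ALPHA**3    # a given triple of measures
p_at_least_one = 1 - prob_exactly(0)
p_at_least_two = 1 - prob_exactly(0) - prob_exactly(1)
```

Under these assumptions, `p_at_least_one` is about 0.34, which is slightly below the 40% upper bound quoted in the text, and `p_at_least_two` is about 0.05.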
Examining Table 7.4, there does not appear to be a very strong relationship be-
tween the cluster Strength measure and the significance of the cluster. For both
the joint and independent block clusters, it seems that both low-ranking and
high-ranking clusters (in terms of Strength) appear in the table of significant clus-
ters.
7.5 Detailed Examination of Studyset Clusters
In Figure 7.7, we present a graphical representation of all clusters that share at
least two genes between the independent-block and joint-block results. This
graph was created using the GraphViz software package (Gansner and North,
1999).
Each independent-block cluster is represented by elliptical nodes for each gene
connected by a dark line to a diamond which represents the TF that regulates
that gene cluster. Each joint-block cluster is represented similarly, except that
light lines are used and the TF is represented by a rectangle. The TF nodes are
labelled with an “i” in the case of independent-block clusters, and “j” in the
case of joint-block clusters. For each TF node label, the consensus sequence for
that cluster is also given. If one of the clusters in the graph was significant on at
least one of the validation measures, that node was given a double-lined border
instead of a single-line border. The TF nodes are numbered in the same order as
their cluster strength, e.g. iTF1 is the independent-block cluster with the highest
strength.
There are several types of interesting relationships summarized in Figure 7.7. We
see cases (e.g. jTF31 and iTF85, jTF40 and iTF72) where identical clusters
were predicted by both the independent and joint-block procedures. The first
of these cases is significant on the correlation-based gene expression measure S.
Figure 7.7: Graph of connected studyset clusters
Cluster nodes (label and consensus sequence): iTF1 aAaaGGgG; iTF2 AatgTTCG; iTF3 CGaAcaTT; iTF4 AATcATTA; iTF5 caccTcCt; iTF10 TttCttca; iTF11 cttTTtTC; iTF18 ATtATAca; iTF19 tcctcaGC; iTF20 acTTttTT; iTF21 gtAaGgAG; iTF72 AATagtat; iTF76 TtAcaTga; iTF85 tagacgTT; iTF97 GccTagaC; jTF1 CGaAca--tgTtCG; jTF2 taatAa--AAAGGg; jTF3 AtgTcA--TtAcaT; jTF4 TttcTt--AaGgag; jTF5 ctcCTt--aaGgag; jTF7 ttTtTC--Aaaaac; jTF13 TTttTT--AtaCTt; jTF15 CGTAAT--tAttat; jTF19 ACcTcC--TcCatt; jTF21 GTaaAa--ATtATA; jTF24 aCgtTt--GtCtag; jTF29 ctgagC--gCaGaa; jTF31 ccgCta--gacgTT; jTF40 AATagt--aaAggg.
Gene nodes: yqfC, ycsN, ydaP, ydcA, ypiB, yuxL, narK, bglP, purR, purE, ytiP, yumD, purA, ycdH, yciC, yqkL, dhbA, yabG, ftsA, yobO, spoIIM, spoIIP, cotH, opuE, yisK, lonB, wapA, spoIIAA, araA, bofA, glyA, sigH, yebB, ysdB, spoIID, yhcR, argC, ywoA, lrpC, spoIIGA, citB, gerE, ytxG, yqhZ, acuA, glnR, ackA, bsaA, ywaC, spoIIR, abrB, yhaR, spoIVA, spoIVCA, yjbF.
An additional pair of clusters (iTF18 and jTF21) are identical except for the
additional sigH gene in iTF18 and these two clusters are also significant on the
correlation-based gene expression measure S.
A particularly interesting result is that the genes in the strongest cluster (jTF1)
in the joint block set of clusters are identical with the set of genes in the second
(iTF2) and third (iTF3) strongest independent block clusters. In this case, the
same genes that clustered together based on a two-block upstream motif also
clustered together based on both blocks of this upstream motif separately. These
identical clusters were significant on multiple measures (over-representation of
functional categories and over-representation of a particular TF), providing strong
evidence that this cluster has biological relevance. The over-represented TF is
PurR, which was examined in Saxild et al. (2001) and found to bind each of the
genes in this cluster (purR, purE, ytiP, yumD, purA) and to have a two-block
motif with a highly conserved CGAA segment in the first block and TTCG in the
second block, which is verified by the consensus sequence for our joint-block
cluster: CGaAca--tgTtCG.
Not surprisingly, PurR is also known to be involved in the purine biosynthetic
pathway in Bacillus subtilis (Saxild et al., 2001), which confirms the over-represented
functional category, Nucleotide Metabolism. One of the PurR genes,
ytiP, is also clustered with the gene spoIIR in two identical clusters (iTF97 and
jTF24), although neither of these clusters is significant on any of the validation
measures.
According to the DBTBS database (Makita et al., 2004), spoIIR is regulated by the
transcription factor σF, but it is unknown whether σF also regulates the gene ytiP. The
consensus sequence for σF is given in Table 4.5 to be GtaTaaa--tGgcaAtAcTa,
which does not match closely the motifs for either iTF97 or jTF24.
In several other cases, similar clusters were predicted by both the independent
and joint-block procedures, but some genes are only found in either the independent
or joint block clusters. Examples of this are the relationships between iTF19
and jTF29 and between iTF4 and jTF15, where several more genes are present in the
independent block cluster. jTF3 and iTF76 is an example where more genes are
present in the joint block cluster.
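The relationships described above (identical clusters, or one cluster containing extra genes) can be sketched with simple set operations. The gene sets below are those stated in this section for iTF2, iTF4, iTF97, jTF1, jTF15 and jTF24; the function name, the relation labels, and the use of a two-gene threshold as a parameter are illustrative choices, not the procedure used to draw Figure 7.7.

```python
from itertools import product

PURR_GENES = {"purR", "purE", "ytiP", "yumD", "purA"}
independent = {
    "iTF2": set(PURR_GENES),
    "iTF4": {"ycdH", "yciC", "yqkL", "dhbA"},
    "iTF97": {"ytiP", "spoIIR"},
}
joint = {
    "jTF1": set(PURR_GENES),
    "jTF15": {"ycdH", "yciC"},
    "jTF24": {"ytiP", "spoIIR"},
}

def connected_pairs(ind, jnt, min_shared=2):
    """List (independent, joint, relation) for clusters sharing enough genes."""
    pairs = []
    for (i_name, i_genes), (j_name, j_genes) in product(ind.items(), jnt.items()):
        if len(i_genes & j_genes) < min_shared:
            continue
        if i_genes == j_genes:
            relation = "identical"
        elif j_genes < i_genes:
            relation = "extra genes in independent cluster"
        elif i_genes < j_genes:
            relation = "extra genes in joint cluster"
        else:
            relation = "partial overlap"
        pairs.append((i_name, j_name, relation))
    return pairs

pairs = connected_pairs(independent, joint)
```

Note that iTF2 and jTF24 share only ytiP, so they fall below the two-gene threshold and are not reported as connected.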
This type of relationship might indicate that the additional independent block
genes are bound by a TF that has a motif which resembles a portion of the joint-
block motif but not the entire joint-block motif. For example, Table 4.5 gives the
consensus sequence of the σE motif as ttgtcaTattt--ttcATAtaatg and the
σK motif as gcACa--gcATAtgaTaa, which share a similar second block but a
very different first block.
Another explanation for this behaviour could be that the joint-block motif in
some of these cases is not a true two-block binding motif, but rather consists
of binding sites for two single-block motifs that occur in close proximity to one
another in each of the genes in the joint-block cluster. In this case, the additional
independent-block motifs would represent genes that are bound by only one of
those TFs, but not the other, and so only are included in the one-block indepen-
dent clustering but not the joint clustering.
Further case-by-case evidence would be needed to confirm these or other theo-
ries. In the case of the clusters iTF4 (containing the genes ycdH, yciC, yqkL and
dhbA) and jTF15 (containing just the genes ycdH and yciC), we have the addi-
tional validation that the two common genes are bound by the TF protein Zur.
Gaballa et al. (2002) analyze the Zur regulon and demonstrate that the genes yciC
and ycdH are bound by Zur, and are in fact the genes with the highest differential
expression ratios from gene-knockout microarray experiments. They describe the
Zur protein as a regulator of genes involved in zinc uptake, which confirms the
over-representation of Transport/Binding proteins in the joint-block cluster.
Gaballa et al. (2002) also present a 28-bp long consensus sequence for the Zur
binding motif AAttTAAATCGTAATcATTacGaTTTAa which was based on four
genes. They note that the central region of this consensus sequence TAATnATTA
is shared by two other transcription factors, PerR and Fur. Examining the consensus
sequence of our iTF4 cluster, we see this same central sequence,
AATcATTA. This case seems to support the first of the two theories, i.e. the additional
independent block genes (yqkL and dhbA) have binding motifs resembling
the central region of the Zur motif, but do not have the entire Zur motif. Accord-
ing to the DBTBS database (Makita et al., 2004), yqkL is bound by PerR and dhbA
is bound by Fur, which further confirms this theory.
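The comparison between the cluster consensus and the published Zur motif can be checked with a simple case-insensitive substring search. Ignoring letter case (which marks the degree of conservation in these consensus strings) is a simplifying assumption, and the function name is illustrative.

```python
def find_core(core, consensus):
    """Return 0-based offsets at which core occurs in consensus, ignoring case."""
    core, consensus = core.upper(), consensus.upper()
    width = len(core)
    return [
        start
        for start in range(len(consensus) - width + 1)
        if consensus[start:start + width] == core
    ]

# 28-bp Zur consensus from Gaballa et al. (2002), as quoted in the text.
ZUR_CONSENSUS = "AAttTAAATCGTAATcATTacGaTTTAa"

# Consensus of the independent-block cluster containing ycdH, yciC, yqkL, dhbA.
offsets = find_core("AATcATTA", ZUR_CONSENSUS)
```

The single hit lies in the central region of the 28-bp motif, consistent with the observation that the cluster consensus resembles only the core of the Zur site.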
Most of the common clusters in Figure 7.7 are not interconnected with each other,
but a single large sub-graph is present, connecting 9 different independent and
joint block clusters. Two of the TFs in this subgroup (iTF5 and jTF5) have significant
over-representation of genes with Sporulation functions. Each of these two
clusters contains genes that are bound by either the σK TF or the σE TF, though neither
of the clusters has significant over-representation of these TFs. The consensus
sequence of iTF5 (caccTcCt) matches the first of the two blocks of the motif
for jTF5 (ctcCTt--aaGgag). Although these blocks do not match the known
motifs for σK or σE (given above), the second block of jTF5 does match the
known motif for the Ribosomal Binding Site, also known as the Shine-Dalgarno
sequence (Shine and Dalgarno, 1974), which is known to bind close to the binding
sites of σ TFs. It could be that the joint block motif jTF5 is actually a combination
of the second block of a two-block σ TF binding along with the ribosomal binding
site. Several of the other clusters in this large subgroup are significant on either
the variance-based or correlation-based expression measures, but do not show
over-representation of a particular function or TF.
7.6 Whole Genome Clustering Results
The resulting clustering trees, from both the independent-block and joint-block
strategies for the whole genome OGS dataset, were far too dense to be a useful
visual summary of the clustering results.
The best partition for each strategy, calculated according to Section 6.2, resulted
in 798 independent block clusters (containing 3369 out of 3466 independent block
motifs) and 407 joint block clusters (containing 1214 out of 1416 joint block motifs).
Again, both best partitions were filtered to remove any “redundant” motifs
or motifs with individual clustering probabilities less than 0.75. Results for both clustering
strategies are given in the supplementary materials. A graphical representation of our
clustering procedure for the genome dataset is given in Figure 7.8.
After filtering, the independent block best partition was reduced to 692 predicted
clusters containing 2480 motifs, while the joint block best partition was reduced
to 376 clusters containing 1097 motifs.
Figure 7.9 gives the distribution of cluster sizes for both the independent-block
and joint-block best partitions. Similar to our Studyset results, we see that the
joint-block clustering tends to produce a higher proportion of small clusters. The
average cluster size is 3.6 motifs for the independent-block best partition and 2.9
motifs for the joint-block best partition.
Both the independent block and joint block clusters were examined using the
four validation measures introduced in Section 7.3.1. Of the independent block
predicted clusters, 196 out of 692 (28%) were significant on at least one of the
validation measures. The proportion of significant joint block clusters was nearly
identical, with 104 out of 376 (28%) predicted clusters being significant on at least one
measure.
Step                Independent Blocks                       Joined Blocks
Motif Discovery     1516 Orthologous Genes, giving 771 One-Block Motifs and 1443 Two-Block Motifs
                    3466 Indblock Motifs                     1443 Jointblock Motifs
Clustering          798 Indblock Clusters                    407 Jointblock Clusters
Filtering           692 Indblock Clusters                    376 Jointblock Clusters
Evaluation          28% significant on at least one measure  28% significant on at least one measure
Further Evaluation  6% significant on multiple measures      5% significant on multiple measures
Figure 7.8: Flowchart for genome motif clustering procedure
Many of these clusters are significant on multiple measures, in which case we are
even more confident that these clusters are biologically relevant. Of the independent
block clusters, 41 out of 692 (6%) were significant on multiple
measures, as were 17 out of 376 (5%) of the joint block clusters. Just as in the Studyset
results (Section 7.4), these multiple significance figures are much higher than would
be expected by chance. All clusters which were significant on multiple measures
are shown in Table 7.5.
Figure 7.9: Distribution of cluster sizes for whole genome partition
(Panels: “Cluster Sizes − Best Partition − Genome Indblock” and “Cluster Sizes − Best Partition − Genome Jointblock”, counts 0 to 200; horizontal axis: cluster size, 2 to 16.)
The independent block clusters are given first, followed by the joint block clus-
ters, and each are ordered by cluster strength. In addition to cluster size, strength,
consensus sequence, and number of sites (|A|) in each significant cluster, the mea-
sure on which the cluster is significant is also given.
7.7 Detailed Examination of Whole Genome Clusters
All genes in our whole genome dataset were examined for relationships with
other genes within either the independent block or joint block clustering results,
as well as common relationships within both.
Table 7.5: Genome clusters significant on multiple measures
clus      size  str    S     p(S)   T     p(T)   func             num   p(func)  TF    num   p(TF)
Ind 2     15    409.5  0.58  0.036  -     -      -                -     -        SigE  2/15  0.068
Ind 13    7     224.2  0.60  0.090  -     -      Transport/bindi  5/7   0.001    Fur   4/7   0.000
Ind 14    8     213.9  -     -      -     -      Metabolism-nucs  3/8   0.004    PurR  5/8   0.000
                                                                                 DinR  2/8   0.001
Ind 19    9     193.8  -     -      -     -      Metabolism-carb  3/9   0.090    CcpA  2/9   0.061
                                                                                 SigA  3/9   0.032
Ind 22    7     191.4  -     -      -     -      Metabolism-nucs  3/7   0.002    PurR  6/7   0.000
Ind 25    8     182.1  -     -      0.06  0.064  -                -     -        CcpA  3/8   0.001
Ind 27    6     179.8  0.70  0.033  -     -      Transport/bindi  3/6   0.067    SigA  2/6   0.055
                                                                                 Fur   4/6   0.000
Ind 38    7     158.3  -     -      0.04  0.037  Protein-synthes  4/7   0.000    -     -     -
Ind 54    7     133.9  0.63  0.062  -     -      Detoxification   2/7   0.077    -     -     -
Ind 64    5     126.3  -     -      -     -      RNA-synthesis    3/5   0.013    RocR  2/5   0.000
Ind 86    4     113.1  -     -      -     -      Adaptation       2/4   0.007    CtsR  2/4   0.000
Ind 116   5     98.1   0.65  0.094  -     -      Membrane-bioene  2/5   0.021    ResD  2/5   0.001
Ind 117   5     97.9   0.69  0.060  -     -      Metabolism-aa    2/5   0.098    -     -     -
Ind 128   4     95.5   -     -      -     -      Metabolism-coen  2/4   0.009    SigA  2/4   0.023
Ind 130   5     95.1   -     -      -     -      Transport/bindi  3/5   0.027    SigG  2/5   0.004
Ind 147   5     89.7   -     -      0.02  0.036  Protein-synthes  2/5   0.031    -     -     -
Ind 157   4     85.2   -     -      0.03  0.097  Protein-synthes  2/4   0.019    -     -     -
Ind 166   4     82.2   -     -      0.01  0.006  RNA-synthesis    2/4   0.067    -     -     -
Ind 177   5     78.8   -     -      -     -      Sporulation      3/5   0.008    SigE  2/5   0.008
Ind 196   4     73.4   -     -      -     -      Sporulation      2/4   0.071    SigE  2/4   0.009
Ind 213   4     70.0   0.68  0.087  -     -      Sporulation      2/4   0.071    SigE  2/4   0.005
Ind 216   4     69.9   0.71  0.073  -     -      SimilartoBsub    3/4   0.019    -     -     -
Ind 237   4     66.9   0.79  0.025  0.02  0.042  -                -     -        -     -     -
Ind 239   4     66.8   -     -      0.02  0.037  RNA-synthesis    2/4   0.067    -     -     -
Ind 247   3     66.0   -     -      0.01  0.038  Metabolism-lipi  3/3   0.000    -     -     -
Ind 248   3     65.7   -     -      -     -      Metabolism-coen  2/3   0.003    SigA  2/3   0.012
Ind 276   3     60.5   0.99  0.001  -     -      Similartoother   2/3   0.080    -     -     -
Ind 287   4     56.4   0.67  0.089  0.02  0.021  -                -     -        -     -     -
Ind 290   3     55.2   -     -      -     -      Membrane-bioene  2/3   0.003    ResD  2/3   0.000
Ind 295   3     54.0   0.86  0.033  0.00  0      Protein-synthes  2/3   0.006    -     -     -
Ind 301   3     52.0   -     -      -     -      Transport/bindi  2/3   0.058    PurR  2/3   0.000
Ind 326   3     48.6   -     -      0.02  0.093  Similartoother   2/3   0.040    -     -     -
Ind 372   3     44.4   -     -      0.02  0.084  Transport/bindi  2/3   0.058    -     -     -
Ind 394   3     42.8   0.76  0.097  -     -      DNA-modificatio  2/3   0.001    -     -     -
Ind 407   3     42.1   0.78  0.087  -     -      NoSimilarity     2/3   0.012    -     -     -
Ind 424   3     39.9   -     -      0.02  0.074  Similartoother   2/3   0.080    -     -     -
Ind 435   2     38.1   0.92  0.057  0.00  0.015  -                -     -        DinR  2/2   0.000
Ind 439   3     37.7   -     -      0.01  0.022  Similartoother   2/3   0.080    -     -     -
Ind 456   2     28.9   0.93  0.048  -     -      Similartoother   2/2   0.015    -     -     -
Ind 480   2     26.0   -     -      0.00  0.04   Metabolism-lipi  2/2   0.001    -     -     -
Ind 621   2     20.5   0.89  0.080  0.01  0.069  -                -     -        -     -     -
Joint 3   8     300.9  -     -      -     -      Metabolism-nucs  3/8   0.003    PurR  6/8   0.000
Joint 42  4     109.4  -     -      -     -      Sporulation      2/4   0.071    SigE  2/4   0.005
Joint 85  3     82.2   -     -      -     -      Metabolism-carb  3/3   0.000    CcpA  2/3   0.003
                                                                                 SigA  2/3   0.019
Joint 86  3     82.0   0.77  0.071  -     -      Metabolism-lipi  2/3   0.004    -     -     -
Joint 89  3     78.4   0.94  0.008  0.01  0.018  -                -     -        -     -     -
Joint 96  3     71.9   -     -      -     -      Transport/bindi  2/3   0.055    Fur   2/3   0.000
Joint 135 3     62.8   0.95  0.005  0.01  0.018  RNA-synthesis    2/3   0.036    -     -     -
Joint 147 3     60.3   0.75  0.083  0.02  0.078  -                -     -        -     -     -
Joint 154 2     58.4   -     -      -     -      Metabolism-coen  2/2   0.001    SigA  2/2   0.002
Joint 177 2     51.1   -     -      -     -      Membrane-bioene  2/2   0.001    ResD  2/2   0.000
Joint 182 2     41.4   -     -      0.01  0.095  Protein-synthes  2/2   0.001    -     -     -
Joint 191 2     37.6   -     -      0.01  0.059  RNA-synthesis    2/2   0.006    -     -     -
Joint 195 2     36.8   1.00  0.005  -     -      -                -     -        SigE  2/2   0.002
Joint 251 2     32.2   0.92  0.048  0.01  0.1    -                -     -        -     -     -
Joint 296 2     29.9   0.93  0.044  0.01  0.039  -                -     -        -     -     -
Joint 321 2     28.4   -     -      0.00  0.017  SimilartoBsub    2/2   0.018    -     -     -
Joint 341 2     27.2   -     -      0.01  0.035  Transport/bindi  2/2   0.010    -     -     -
A graph of every independent and joint-block cluster (with at least two common
genes) in the whole genome partition was too dense to be informative, so we
restricted ourselves to only clusters that were significant on at least one of the
validation measures. Even with this restriction, the graph must be split into two
figures. Figure 7.10 gives a large set of interconnected clusters within the whole
genome best partition. The graph characteristics are the same as in Figure 7.7.
Figure 7.10: Graph of connected and significant whole genome clusters, part 1
Cluster nodes (label and consensus sequence): iTF3 cccTCCtt; iTF13 ATTaTCAt; iTF19 AAggtGaa; iTF25 TattaTaa; iTF27 AtAATGAT; iTF86 CTTTGACT; iTF116 ttttcAcA; iTF290 ttataTtT; jTF4 CtcCtt--TtTTaT; jTF85 TAtTaT--AggtGg; jTF96 AATgAT--gAtaat; jTF177 ttCACA--ataTtT; jTF188 TTTGAC--AaaaTa.
Gene nodes: rpoA, ycgA, nasF, ydiG, yetN, ygaI, hemE, yhfP, yhjR, yrrS, yrrM, yutC, yvbU, katX, yfiY, yfiZ, yfhC, ykvW, dhbA, yxeB, fhuB, kbaA, adaA, yhxB, ykrU, acsA, acuA, ywnE, ywfM, ctsR, ydeC, ylaN, yrzC, ytkK, dra, yclN, ykuN, yrdQ, clpE, ykvI, dnaJ, yjbH, ctaA, ctaB, ylmC, vpr, yflL, yjbK, ysfA, yvqJ, ywjG, yluB, ycdH.
Several of the other central clusters in this group (iTF19, iTF25, and jTF85)
have significant over-representation of genes bound by the transcription factor
CcpA, and two of these clusters (iTF19 and jTF85) are over-represented by
genes bound by the SigA transcription factor. CcpA is involved in the catabolite
repression pathway (Kim and Chambliss, 1997), which was noted by Weickert and
Chambliss (1990) to also be linked to the Sporulation process in B. subtilis. SigA
is the primary sigma factor of RNA polymerase and so is a necessary protein
for any cell growth.
iTF19 has a consensus sequence of AAggtGaa, 2 out of 9 genes known to be
under the control of CcpA, and 3 out of 9 genes known to be under the control
of SigA. iTF25 has a consensus sequence of TattaTaa and 3 out of 8 genes
known to be under the control of CcpA. jTF85 has a consensus sequence of
TAtTaT--AggtGg, 2 out of 3 genes under the control of CcpA, and 2 out of 3
genes under the control of SigA.
The literature consensus sequence for SigA (Helmann, 1995) is TTGACA--TATAAT
while the binding motif for CcpA, known as cre (catabolite response element), is
given by Weickert and Chambliss (1990) to be TGTAAGCGTTAACA. Although
iTF25 and the first block of jTF85 seem to be a reasonable match to the
second block of the SigA motif, it is unclear whether iTF19 or the second block of
jTF85 match the motifs for either SigA or CcpA, though these two blocks certainly
match each other. Since many other genes are included in the iTF19 cluster,
a possible explanation could be that iTF19 and the second block of jTF85
actually represent the ribosomal binding site, which normally would be in close
proximity to the second block of a Sigma factor motif. This same phenomenon
was postulated in the Studyset results (Section 7.5).
Another notable group in Figure 7.10 is the set of three clusters iTF116, iTF290 and
jTF177 on the right side of the figure, all of which are over-represented for the
transcription factor ResD. ResD is a transcription factor that, along with ResE,
forms a signal-transduction system with an important role in cellular respiration
(Sun et al., 1996), which confirms the functional category Membrane bioenergetics
that is also significantly over-represented in these three clusters. Several genes
are present in this group of clusters (ydiG, ylmC, yjbH, vpr) which seem to have
one of the single block motifs but not the other.
It is also worth noting that the group of three clusters iTF13, iTF27 and jTF96
at the top of Figure 7.10 are all over-represented for the transcription factor Fur
and the iTF27 cluster is also over-represented for the TFs Zur and SigA. This
group of clusters is analogous to the group of Zur/Fur clusters found in the
Studyset, and shares several genes in common. In addition, we again see over-representation
of the Transport/Binding functional category, for all three of these
clusters.
The remainder of the connected and significant whole genome clusters not in-
cluded in Figure 7.10 are presented in Figure 7.11. Again, the graph characteris-
tics are the same as in Figure 7.7.
Examining the clusters of Figure 7.11, we again see several of the characteristics
noted in the Studyset analysis: many clusters have several genes in common, but
also several genes that only seem to have either the joint-block or independent-block
motif, but not both. For example, iTF128, iTF248, and jTF154 (near the
middle of the graph) are connected clusters that are all over-represented for SigA,
which is also mentioned in the Studyset results. Both independent-block motifs
resemble either block of the joint-block motif, but each also contains genes (yqfZ,
spoVB, ylxM) that have one of the single-block motifs, but not the other.
The most notable feature of Figure 7.11 is the PurR controlled clusters (iTF14,
iTF22, jTF3) at the bottom of the figure, which have the same set of genes (purR,
purE, ytiP, yumD, purA) present that was also found within the Studyset dataset
(Section 7.4), but now additional genes are also included in this subgraph. The
gene yebB is common across all three of these whole genome clusters, but was not
present in the Studyset clusters. There are also several genes that are included in
one of the PurR clusters but not the others: ydiA, yitG, recA, uvrB, hom, and
abrB.
Figure 7.11: Graph of connected and significant whole genome clusters, part 2
Cluster nodes (label and consensus sequence): iTF14 CGAAcatt; iTF22 AatgTTCG; iTF38 AgGGaGga; iTF128 CTTGaCat; iTF148 gtgataac; iTF184 AacTctCc; iTF206 ctCCtttT; iTF247 TTAgtAcC; iTF248 TGTCaaGA; iTF254 tCtctTTT; iTF435 CatatGTT; jTF3 CGaAcA--tgTtCG; jTF15 CtccTT--AaaaaA; jTF19 TCtcct--AaggGa; jTF53 TGTTcg--tAtact; jTF86 ctaaat--TTAgTA; jTF111 CCTttt--tttgaA; jTF154 TCTTGa--tCAAGA; jTF180 CtCtTT--cgtttt; jTF182 AGGGag--CCcTtt.
Gene nodes: recA, uvrB, purR, yebB, purE, yumD, hom, purA, abrB, ytiP, ileS, yfhJ, alaS, pheS, ytzA, tyrS, hag, spoVB, ylxM, nadB, nifS, yfhF, yfhG, yjbQ, yqkD, ywqE, yteI, ytdI, leuS, ald, spoVG, yjbC, ylbP, rho, yhfB, yjaX, yjbW, yqfZ, yheI, yqxD, phoA, rpsD, lexA, yneA, ydiA, yitG, yhxA, csaA, comEA, ysdC, adk, ydbE, yfiO, ypjB, malS, ytkP, msmR, sigW.
Two of these genes, recA and uvrB, are also bound by the transcription factor
DinR (Makita et al., 2004), and in fact iTF14 which contains these two genes
is over-represented in terms of genes controlled by both PurR and DinR. DinR is
involved (along with recA) in the regulation of the SOS response to DNA damage
in Bacillus subtilis (Winterling et al., 1997). DinR is also over-represented in the
connected clusters iTF435 and jTF53. Winterling et al. (1998) present
the DinR binding sequence as CGAACRNRYGTTYC, which seems to contain parts
of the motif from the iTF435 and jTF53 group (tGTT) as well as the iTF14 and
jTF3 clusters (CGAACA).
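The DinR binding sequence is written with IUPAC degenerate codes (R for A or G, Y for C or T, N for any base), so checking whether a motif fragment is contained in it requires a degenerate comparison rather than plain string matching. The sketch below is one simple way to do this; the function name is illustrative, and ignoring the case of the motif fragments is a simplifying assumption.

```python
# IUPAC codes appearing in the DinR binding sequence.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any base
}

def iupac_offsets(fragment, pattern):
    """Return 0-based offsets where fragment is compatible with the IUPAC pattern."""
    fragment = fragment.upper()
    width = len(fragment)
    hits = []
    for start in range(len(pattern) - width + 1):
        window = pattern[start:start + width]
        if all(base in IUPAC[code] for base, code in zip(fragment, window)):
            hits.append(start)
    return hits

DINR_SITE = "CGAACRNRYGTTYC"  # as quoted from Winterling et al. (1998)

left_hits = iupac_offsets("CGAACA", DINR_SITE)   # fragment from iTF14 and jTF3
right_hits = iupac_offsets("tGTT", DINR_SITE)    # fragment from iTF435 and jTF53
```

The first fragment aligns with the start of the DinR site and the second with its downstream half, which is the sense in which the site "contains parts of" both motifs.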
Chapter 8
Discussion and Future Work
Motif discovery is an important problem in computational biology because the
binding of transcription factors to upstream region motifs is crucial to the mech-
anism of gene regulation. In Chapter 2, we have presented various techniques
used in the past for motif discovery, a set of Bayesian models useful for devel-
oping motif-finding tools, and generalizations of these models that allow for un-
known motif width w and unknown motif abundance ratio p0. We have also dis-
cussed the use of scoring functions for motif finding. Viewing Bayesian models
in terms of scoring functions has provided insight into the similarities between the
full Bayesian model-based approaches and some non-Bayesian methods, such as
Consensus (Stormo and Hartzell, 1989).
We have introduced a scoring function formulation in Chapter 3, implemented
in the software BioOptimizer, designed to improve the prediction of regulatory
binding motifs. The advantage of scoring functions is that they give us an in-
tuitive means by which to compare different possible configurations of motif lo-
cations and can serve as a framework for the comparative use of several motif-
finding programs, thereby benefiting from the advantages that different motif-
finding programs may offer in different situations. This general approach of us-
ing multiple methods to obtain different estimates of an unknown quantity that
are subsequently compared and improved can be useful beyond models for motif
discovery.
The usefulness of BioOptimizer was demonstrated in Chapter 4 by the uniformly
increased accuracy of predicted sites compared to BioProspector,
Consensus, Meme, and AlignAce. Although BioOptimizer is not guaranteed to
find a global best fit to our model, there is still a significant gain resulting from
its use with very little extra computational time. The best improvements were
obtained from the scoring functions that most closely approximated the posterior
distribution under our full Bayesian model.
BioOptimizer also allows for unknown motif abundance, unknown motif width,
and two-block motifs with variable-length gaps between the blocks. Allowing
the motif width to be inferred from the data has led to non-conventional results
when applied to datasets for the spo0A binding motif in B. subtilis and the CRP
binding motif in E. coli. The two-block version of BioOptimizer provided interesting
results when applied to the search for binding motifs for several σ-factors
in B. subtilis as well as the CRP binding motif. It is seen that the optimal motif
width found by BioOptimizer was often substantially different from our a priori
expectations.
There are still many interesting open problems in this field. The vast majority
of motif-finding research has assumed that all information about the interaction
between transcription factors and their DNA binding motifs can be summarized
just by looking at the one-dimensional nucleotide sequence. Benos et al. (2002b)
and Benos et al. (2002a) discuss one-dimensional nucleotide models and conclude
that although their fit is not perfect, they do provide a very good approximation
to the true nature of protein-DNA interactions. However, in actuality this inter-
action is occurring in three-dimensional space, so ideally motif models should
incorporate characteristics of DNA morphology.
Keles et al. (2003) propose a supervised motif detection method, COMODE, that
takes into account structural information about the DNA-binding protein by con-
straining the motif search to be similar to previously known information content
profiles. As an example, in eukaryotic organisms, DNA is stored in the form
of tightly-compacted chromosomes where substantial portions of the DNA se-
quence are wrapped around proteins called histones. This is important informa-
tion to include in future models, since portions of the sequence that are wrapped
around histones are less free to interact with DNA-binding proteins like tran-
scription factors.
In specific examples, where extra information about the distances between mo-
tif sites and the start of the coding region is available, this information should
be added into the model. McCue et al. (2001) demonstrate that incorporating a
model that takes into account the location of the motif site relative to the end of
each sequence can improve the sensitivity of the algorithm. As mentioned in Sec-
tion 2.5, the multiple motif models of Lawrence et al. (1993) take into account or-
dering information between motifs, but not spacing information. Extending our
scoring function framework to the multiple motif situation while incorporating
both ordering and spacing information between motifs (beyond the two-block
case already handled by BioOptimizer) may provide extra power to detect motifs
for multiple TFs that regulate the same target genes.
Another interesting problem is to establish a model-based approach for incor-
porating gene expression information, such as microarray results, into the motif
discovery problem. Bussemaker et al. (2001) and Keles et al. (2002) both propose
methods for integrating sequence analysis together with microarray information.
The MDscan program mentioned above gives one approach to this problem, since
the upstream regions that are examined for motifs are updated in an iterative
fashion, based on microarray information. A more recent method, Motif Regres-
sor (Conlon et al., 2003), directly uses the microarray expression values to help
screen out false positive findings of MDscan. However, model-based approaches
may still be desirable since these models may provide us a principled way to tune
relevant parameters and guide us to achieve the optimal combination of the two
sources of information (i.e., genome sequences and microarray values).
A Bayesian hierarchical clustering model was introduced in Chapter 5 as a sta-
tistical approach to summarizing the common structure within a collection of
discovered motifs, with a Gibbs sampling implementation. This model has sev-
eral advantages over traditional clustering techniques, such as hierarchical tree
clustering or K-means clustering. The clustering decisions are systematic and
model-based rather than based on ad hoc similarity measures. The number of clusters is
allowed to vary and does not have to be pre-specified. The hierarchical frame-
work allows us to account for variability in the observed units (motif matrices),
instead of assuming these units are fixed and known.
Another notable element of our clustering procedure is that our model very eas-
ily deals with the alignment issue that, within each raw motif matrix, it is not
obvious where the central motif is located. Our model allows us to condition
on the motif location in all other raw matrices within the dataset when we cal-
culate the most likely location of the motif within a particular matrix. In many
cases, other matrices may show very similar compositions to the matrix in ques-
tion, in which case the conditioning provides a substantial amount of information
pertaining to the motif location. Although we have presented our clustering pro-
cedure in the context of a specific application to motif matrices, these advantages
of our Bayesian clustering model are not specific to this particular type of data.
Bayesian hierarchical clustering models based on a Dirichlet process prior dis-
tribution should be considered an attractive approach, especially in cases where
the number of clusters is not known a priori. The model is easy to implement
using MCMC methods, which also allow for a full examination of the posterior
distribution instead of just focusing on a single point estimate.
Our motif clustering model was applied to a dataset of 116 TF binding motifs
in Chapter 6, and several approaches to analysing the clustering results were
discussed. Our posterior draws allowed us to summarize the variability of our
clustering results with a tree structure, as well as allowing us to estimate the best
partition of clusters. In addition to this best partition, we can calculate model-
based statistics to summarize the relative strength of our predicted clusters, as
well as observation-level probabilities for belonging to a particular cluster that
give us an indication of the variability in our point estimate. Two different clus-
tering priors, the Dirichlet process and the Uniform prior, were compared and
found to have quite different a priori clustering characteristics, but when applied
to our TF dataset did not show very different posterior results.
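The a priori clustering behaviour of a Dirichlet process prior can be made concrete with a short sketch. Under the standard Chinese restaurant process representation, the i-th item (counting from zero) starts a new cluster with probability α/(α + i), so the prior expected number of clusters for n items is the sum of these probabilities. The concentration value α = 1 used below is an assumption for illustration and is not taken from the thesis; only n = 116 comes from the text. The Uniform prior is not reproduced here because its exact definition is given in the earlier chapters.

```python
import random

def expected_clusters_dp(n, alpha):
    """Prior expected number of clusters for n items under a Dirichlet
    process with concentration alpha (Chinese restaurant process)."""
    return sum(alpha / (alpha + i) for i in range(n))

def crp_sample(n, alpha, rng):
    """Draw one partition from the Chinese restaurant process.

    Returns the list of cluster sizes. Item i (0-based) opens a new
    cluster with probability alpha / (alpha + i) and otherwise joins an
    existing cluster with probability proportional to its current size.
    """
    sizes = []
    for i in range(n):
        r = rng.random() * (alpha + i)
        if r < alpha:
            sizes.append(1)
            continue
        r -= alpha
        for k, size in enumerate(sizes):
            if r < size:
                sizes[k] += 1
                break
            r -= size
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes

# n = 116 motifs as in the text; alpha = 1.0 is an assumed value.
prior_mean = expected_clusters_dp(116, 1.0)
one_draw = crp_sample(116, 1.0, random.Random(0))
```

With α = 1 the expected count is the harmonic number of n, which grows only logarithmically in n, so this prior favours a small number of clusters unless the data argue otherwise.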
The clustering results that we observed suggest that the motifs within various
TF families can be organized into sub-groups based upon their tendency to clus-
ter together as a consequence of having very similar motifs. An area of future
research is to use the clustering information gained from a collection of motifs
to further improve subsequent motif discovery. A scoring function optimiza-
tion framework was presented in Chapter 3 based on a Bayesian motif discovery
model where very little is known a priori about the appearance of an unknown
motif. However, once a set of motifs has been discovered (and clustered), we
should incorporate this information into the motif discovery procedure. One pro-
posal would be to use the posterior predictive distribution from our motif clus-
tering model as the scoring function for motif discovery, which would increase
the ability of our motif-finding algorithms to detect a motif that is similar to mo-
tifs that have already been discovered elsewhere.
In Chapter 7, we combined our techniques for motif discovery and motif clus-
tering to predict co-regulated clusters of genes in the bacterium Bacillus subtilis.
We used the whole genome sequences of seven related bacterial species to dis-
cover transcription factor binding motifs in the upstream regions of Bacillus sub-
tilis genes, and then used similarities between these discovered motifs to
group these genes into possibly co-regulated gene clusters. This procedure can be
regarded as a sequence-based gene clustering that complements gene clustering
procedures based on microarray gene expression experiments. Our framework
could also be useful for organisms for which no microarray chips are available, but
genome sequences from closely-related species are available.
Orthologous genes were identified between Bacillus subtilis and six other bacterial
species, and the upstream regulatory regions of these orthologous gene sets were
examined for elements that were possible transcription factor binding motifs con-
served by evolution, a technique often referred to as phylogenetic footprinting.
Our analysis focussed on two collections of these orthologous gene sets, the first
being a “Studyset” of gene sets for which some TF binding sites are known, and
the second being the “whole genome” of orthologous gene sets.
Our motif discovery strategy, as outlined in Chapters 3-4, was a combination
of the stochastic motif-finding program, BioProspector, and our deterministic op-
timization algorithm, BioOptimizer. The discovered one- and two-block motifs
from this procedure were then clustered using the Bayesian hierarchical cluster-
ing model presented in Chapters 5-6. Our strategy of separately clustering two-
block motifs as both independent blocks and joint blocks allowed us to examine
several interesting interactions between one- and two-block motif clusters within
both our Studyset (Section 7.5) and whole genome (Section 7.7) datasets. Many
of these relationships are confirmed within the biological literature.
Beyond these detailed examinations, we also performed a systematic evaluation
of our clustering results based on several external measures available for our tar-
get organism, Bacillus subtilis. Each Studyset and whole genome predicted cluster
was examined for over-representation of a particular functional category, over-
representation of a particular known transcription factor, and two gene expres-
sion statistics based on seven microarray experiments. The proportion of clus-
ters that were significant on multiple validation measures was much higher than
would be expected by chance in both the Studyset and whole genome datasets.
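A common way to test a gene cluster for over-representation of a category is a hypergeometric tail probability; the sketch below assumes that test purely for illustration, since the exact validation procedures are those described in Chapter 7. The function name and the example counts are hypothetical.

```python
from math import comb

def overrep_pvalue(genome_size, category_size, cluster_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    X is the number of category members in a cluster of `cluster_size`
    genes drawn without replacement from `genome_size` genes, of which
    `category_size` belong to the category.
    """
    total = comb(genome_size, cluster_size)
    upper = min(category_size, cluster_size)
    tail = sum(
        comb(category_size, k)
        * comb(genome_size - category_size, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 8 of 20 cluster genes fall in a category that
# covers 200 of 4000 genes genome-wide.
p = overrep_pvalue(4000, 200, 20, 8)
```

When many clusters and categories are tested, the resulting p-values would need a multiple-comparison adjustment before being called significant.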
One aspect of this investigation that could be improved by further study is the in-
corporation of the concept of evolutionary distances into our motif discovery pro-
cedures. Each sequence within a particular orthologous gene set was weighted
equally with every other sequence by our motif-finding algorithms, despite the
fact that these sequences came from different species with unequal phylogenetic
distances between them. A more sophisticated motif discovery procedure which
incorporates this additional information may have increased power to detect
weaker motif signals.
Bibliography
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic
local alignment search tool. Journal of Molecular Biology 215, 403–410.
Bailey, T. and Elkan, C. (1994). Fitting a mixture model by expectation maximiza-
tion to discover motifs in biopolymers. In Proceedings of the Second International
Conference on Intelligent Systems for Molecular Biology, 28–36, Menlo Park, Cali-
fornia. AAAI Press.
Benos, P., Lapedes, A., and Stormo, G. (2002a). Additivity in protein-dna interac-
tions: how good an approximation is it? Nucleic Acids Research 30, 4442–4451.
Benos, P., Lapedes, A., and Stormo, G. (2002b). Probabilistic code for dna recog-
nition by proteins of the egr family. Journal of Molecular Biology 323, 701–727.
Benson, D., Karsch-Mizrachi, I., Lipman, D., Ostell, J., Rapp, B., and Wheeler, D.
(2002). Genbank. Nucleic Acids Research 30, 17–20.
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regula-
tory elements in silico on a genomic scale. Genome Research 8, 1202–1215.
Britton, R., Eichenberger, P., Gonzalez-Pastor, J., Fawcett, P., Monson, R., Losick,
R., and Grossman, A. (2002). Genome-wide analysis of the stationary-phase
sigma factor (σh) regulon of bacillus subtilis. J. Bacteriol. 184, 4881–4890.
Bussemaker, H., Li, H., and Siggia, E. (2000). Building a dictionary for genomes:
identification of presumptive regulatory sites by statistical analysis. Proceedings
of the National Academy of Sciences (USA) 97, 10096–10100.
Bussemaker, H., Li, H., and Siggia, E. (2001). Regulatory element detection using
correlation with expression. Nature Genetics 27, 167–171.
Cardon, L. and Stormo, G. (1992). Expectation maximization algorithm for iden-
tifying protein-binding sites with variable lengths from unaligned dna frag-
ments. Journal of Molecular Biology 223, 159–170.
Castillo-Davis, C. and Hartl, D. (2003). Genemerge – post-genomic analysis, data
mining, and hypothesis testing. Bioinformatics 19, 891–892.
Conlon, E., Eichenberger, P., and Liu, J. (2004). Determining and analyzing differ-
entially expressed genes from cDNA microarray experiments with complemen-
tary designs. Journal of Multivariate Analysis.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from in-
complete data via the em algorithm. Journal of the Royal Statistical Society, B 39,
1–38.
Eichenberger, P., Fujita, M., Jensen, S., Conlon, E., Rudner, D., Wang, S., Ferguson,
C., Sato, T., Liu, J., and Losick, R. (2004). The entire program of gene expression for
a single differentiating cell type. PLoS Biology Accepted for publication.
Eichenberger, P., Jensen, S., Conlon, E., van Ooij, C., Silvaggi, J., Gonzalez-Pastor,
J., Fujita, M., Ben-Yehuda, S., Stragier, P., Liu, J., and Losick, R. (2003). The
σe regulon and the identification of additional sporulation genes in Bacillus
subtilis. Journal of Molecular Biology 327, 945–972.
Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis
and display of genome-wide expression patterns. Proceedings of the National
Academy of Sciences (USA) 95, 14863–14868.
Felsenstein, J. (1993). PHYLIP (phylogeny inference package) version 3.5c Dis-
tributed by the author. Department of Genetics, University of Washington,
Seattle.
Ferguson, T. (1974). Prior distributions on spaces of probability measures. Annals
of Statistics 2, 615–629.
Frith, M., Li, M., and Weng, Z. (2003). Cluster-Buster: Finding dense clusters of
motifs in DNA sequences. Nucleic Acids Research 31, 3666–3668.
Gaballa, A., Wang, T., Ye, R., and Helmann, J. (2002). Functional analysis of the
Bacillus subtilis Zur regulon. Journal of Bacteriology 184, 6508–6514.
Galas, D., Eggert, M., and Waterman, M. (1985). Rigorous pattern-recognition
methods for dna sequences. analysis of promoter sequences from escherichia
coli. Journal of Molecular Biology 186, 117–128.
Gansner, E. and North, S. (1999). An open graph visualization system and its
applications to software engineering. Software – Practice and Experience 00, 1–
29.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian Data Analysis.
Chapman and Hall/CRC, Boca Raton, FL.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using
multiple sequences. Statistical Science 7, 457–472.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and
the Bayesian restoration of images. IEEE Transaction on Pattern Analysis and
Machine Intelligence 6, 721–741.
Glaser, P., Frangeul, L., Buchrieser, C., Rusniok, C., Amend, A., Baquero, F.,
Berche, P., Bloecker, H., Brandt, P., Chakraborty, T., Charbit, A., Chetouani,
F., Couve, E., de Daruvar, A., Dehoux, P., Domann, E., Dominguez-Bernal,
G., Duchaud, E., Durant, L., Dussurget, O., Entian, K., Fsihi, H., Garcia-del Por-
tillo, F., Garrido, P., Gautier, L., Goebel, W., Gomez-Lopez, N., Hain, T., Hauf,
J., Jackson, D., Jones, L., Kaerst, U., Kreft, J., Kuhn, M., Kunst, F., Kurapkat,
G., Madueno, E., Maitournam, A., Vicente, J., Ng, E., Nedjari, H., Nordsiek,
G., Novella, S., de Pablos, B., Perez-Diaz, J., Purcell, R., Remmel, B., Rose, M.,
Schlueter, T., Simoes, N., Tierrez, A., Vazquez-Boland, J., Voss, H., Wehland, J.,
and Cossart, P. (2001). Comparative genomics of listeria species. Science 294,
849–852.
Gordon, D., Nekludova, L., Gifford, D., Jaakkola, T., and Fraenkel, E. (2004).
Combining motif discovery algorithms with information from structural and
biochemical databases to understand transcriptional regulation. Submitted for
publication.
Green, P. and Richardson, S. (2001). Modelling heterogeneity with and without
the Dirichlet process. Scandinavian Journal of Statistics 28, 355–375.
Grundy, W., Bailey, T., and Elkan, C. (1996). Parameme: a parallel implementation
and a web interface for a dna and protein motif discovery tool. Comput Appl
Biosci 12, 303–310.
Gupta, M. and Liu, J. (2003). Discovery of conserved sequence patterns using
a stochastic dictionary model. Journal of the American Statistical Association 98,
1–12.
Halberg, R. and Kroos, L. (1994). Sporulation regulatory protein SpoIIID from
Bacillus subtilis activates and represses transcription by both mother-cell-
specific forms of RNA polymerase. Journal of Molecular Biology 243, 425–436.
Hampson, S., Baldi, P., Kibler, D., and Sandmeyer, S. (2000). Analysis of yeast’s
orf upstream regions by parallel processing, microarrays, and computational
methods. In Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 8, 190–201.
Hartigan, J. (1975). Clustering algorithms. Wiley, New York, NY.
Helmann, J. (1995). Compilation and analysis of Bacillus subtilis σa-dependent
promotor sequences: evidence for extended contact between RNA polymerase
and upstream promotor DNA. Nucleic Acids Research 23, 2351–2360.
Helmann, J. and Moran Jr., C. (2002). Rna polymerase and sigma factors. In
A. Sonenshein, J. Hoch, and R. Losick, eds., Bacillus subtilis and its closest rela-
tives. ASM Press, Washington, D.C.
Hertz, G. and Stormo, G. (1999). Identifying dna and protein patterns with statis-
tically significant alignments of multiple sequences. Bioinformatics 15, 563–577.
IUPAC (1986). Nomenclature for incompletely specified bases in nucleic acid se-
quences. Recommendations 1984. Proceedings of the National Academy of Sciences
(USA) 83, 4–8.
Jensen, S., Liu, X., Zhou, Q., and Liu, J. (2004). Computational discovery of gene
regulatory binding motifs: a Bayesian perspective. Statistical Science 19, 188–
204.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical
Association 90, 773–795.
Keich, U. and Pevzner, P. (2002). Finding motifs in the twilight zone. Bioinformat-
ics 18, 1374–1381.
Keles, S., van der Laan, M., Dudoit, S., Xing, B., and Eisen, M. (2003). Supervised
detection of regulatory motifs in dna sequences. Paper 131, U.C. Berkeley Di-
vision of Biostatistics.
Keles, S., van der Laan, M., and Eisen, M. (2002). Identification of regulatory
elements using a feature selection method. Bioinformatics 18, 1167–1175.
Kim, J.-H. and Chambliss, G. (1997). Contacts between Bacillus subtilis catabolite
regulatory protein CcpA and amyO target site. Nucleic Acids Research 25, 3490–
3496.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated an-
nealing. Science 220, 671–680.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Ann. Math.
Stat. 22, 79–86.
Kunst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni, G., Azevedo, V.,
Bertero, M. G., Bessieres, P., Bolotin, A., Borchert, S., Borriss, R., Boursier, L.,
Brans, A., Braun, M., Brignell, S. C., Bron, S., Brouillet, S., Bruschi, C. V., Cald-
well, B., Capuano, V., Carter, N. M., Choi, S.-K., Codani, J.-J., Connerton, I. F.,
Cummings, N. J., Daniel, R. A., Denizot, F., Devine, K. M., Dusterhoft, A.,
Ehrlich, S. D., Emmerson, P. T., Entian, K. D., Errington, J., Fabret, C., Ferrari,
E., Foulger, D., Fritz, C., Fujita, M., Fujita, Y., Fuma, S., Galizzi, A., Galleron,
N., Ghim, S.-Y., Glaser, P., Goffeau, A., Golightly, E. J., Grandi, G., Guiseppi,
G., Guy, B. J., Haga, K., Haiech, J., Harwood, C. R., Henaut, A., Hilbert, H.,
Holsappel, S., Hosono, S., Hullo, M.-F., Itaya, M., Jones, L., Joris, B., Kara-
mata, D., Kasahara, Y., Klaerr-Blanchard, M., Klein, C., Kobayashi, Y., Koetter,
P., Koningstein, G., Krogh, S., Kumano, M., Kurita, K., Lapidus, A., Lardinois,
S., Lauber, J., Lazarevic, V., Lee, S.-M., Levine, A., Liu, H., Masuda, S., Maul,
C., Médigue, C., Medina, N., Mellado, R. P., Mizuno, M., Moestl, D., Nakai,
S., Noback, M., Noone, D., O’Reilly, M., Ogawa, K., Ogiwara, A., Oudega, B.,
Park, S.-H., Parro, V., Pohl, T. M., Portetelle, D., Porwollik, S., Prescott, A. M.,
Presecan, E., Pujic, P., Purnelle, B., Rapoport, G., Rey, M., Reynolds, S., Rieger,
M., Rivolta, C., Rocha, E., Roche, B., Rose, M., Sadaie, Y., Sato, T., Scanlan,
E., Schleich, S., Schroeter, R., Scoffone, F., Sekiguchi, J., Sekowska, A., Seror,
S. J., Serror, P., Shin, B.-S., Soldo, B., Sorokin, A., Tacconi, E., Takagi, T., Taka-
hashi, H., Takemaru, K., Takeuchi, M., Tamakoshi, A., Tanaka, T., Terpstra, P.,
Tognoni, A., Tosato, V., Uchiyama, S., Vandelbol, M., Vannier, F., Vassarotti,
A., Viari, A., Wambutt, R., Wedler, E., Wedler, H., Weitzenegger, T., Winters,
P., Wipat, A., Yamamoto, H., Yamane, K., Yasumoto, K., Yata, K., Yoshida, K.,
Yoshikawa, H.-F., Zumstein, E., Yoshikawa, H., and Danchin, A. (1997). The
complete genome sequence of the gram-positive bacterium Bacillus subtilis.
Nature 390, 249–256.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J.
(1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multi-
ple alignment. Science 262, 208–214.
Lawrence, C. and Reilly, A. (1990). An expectation maximization (em) algo-
rithm for the identification and characterization of common sites in unaligned
biopolymer sequences. Proteins 7, 41–51.
Liu, J. (1994). The collapsed Gibbs sampler in Bayesian computations with appli-
cations to a gene regulation problem. Journal of the American Statistical Associa-
tion 89, 958–966.
Liu, J. (1996). Nonparametric hierarchical Bayes via sequential imputations. An-
nals of Statistics 24, 911–930.
Liu, J., Neuwald, A., and Lawrence, C. (1995). Bayesian models for multiple
local sequence alignment and gibbs sampling strategies. Journal of the American
Statistical Association 90, 1156–1170.
Liu, J., Neuwald, A., and Lawrence, C. (1999). Markovian structures in biological
sequence alignments. Journal of the American Statistical Association 94, 1–15.
Liu, X., Brutlag, D., and Liu, J. (2001). Bioprospector: discovering conserved dna
motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium
on Biocomputing 6, 127–138.
Liu, X., Brutlag, D., and Liu, J. (2002). An algorithm for finding protein-dna in-
teraction sites with applications to chromatin immunoprecipitation microarray
experiments. Nature Biotechnology 20, 835–839.
Lodish, H., Baltimore, D., Berk, A., Zipursky, S., Matsudaira, P., and Darnell,
J. (1995). Regulation of transcription initiation. In Molecular Cell Biology, 405–
481. Scientific American Books, Inc., 4th edn.
Makita, Y., Nakao, M., Ogasawara, N., and Nakai, K. (2004). DBTBS: database of
transcriptional regulation in bacillus subtilis and its contribution to compara-
tive genomics. Nucleic Acids Research 32, 75–77.
McCue, L., Thompson, W., Carmack, C., Ryan, M., Liu, J., Derbyshire, V., and
Lawrence, C. E. (2001). Phylogenetic footprinting of transcription factor binding sites
in proteobacterial genomes. Nucleic Acids Research 29, 774–782.
Medvedovic, M. and Sivaganesan, S. (2002). Bayesian infinite mixture models
based clustering of gene expression profiles. Bioinformatics 18, 1194–1206.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical
Physics 21, 1087–1092.
Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the
American Statistical Association 44, 335–341.
Molle, V., Fujita, M., Jensen, S., Liu, J., and Losick, R. (2003). The spo0a regulon
in Bacillus subtilis. Molecular Microbiology 50, 1683–1701.
Moszer, I., Glaser, P., and Danchin, A. (1995). Subtilist: a relational database for
the Bacillus subtilis genome. Microbiology 141, 261–268.
Nolling, J., Breton, G., Omelchenko, M., Makarova, K., Zeng, Q., Gibson, R., Lee,
H., Dubois, J., Qiu, D., Hitti, J., Wolf, Y., Tatusov, R., Sabathe, F., Doucette-
Stamm, L., Soucaille, P., Daly, M., Bennett, G., Koonin, E., and Smith, D.
(2001). Genome sequence and comparative analysis of the solvent-producing
bacterium clostridium acetobutylicum. Journal of Bacteriology 183, 4823–4838.
Pfahl, M. (1981). Characteristics of tight binding repressors of the lac operon.
Journal of Molecular Biology 147, 1–10.
Qin, Z. S., McCue, L. A., Thompson, W., Mayerhofer, L., Lawrence, C. E., and Liu,
J. S. (2003). Identification of co-regulated genes through bayesian clustering of
predicted regulatory binding sites. Nature Biotechnology 21, 435–439.
Read, T., Peterson, S., Tourasse, N., Baillie, L., Paulsen, I., Nelson, K., Tettelin, H.,
Fouts, D., Eisen, J., Gill, S., Holtzapple, E., Okstad, O., Helgason, E., Rilstone,
J., Wu, M., Kolonay, J., Beanan, M., Dodson, R., Brinkac, L., Gwinn, M., DeBoy,
R., Madpu, R., Daugherty, S., Durkin, A., Haft, D., Nelson, W., Peterson, J.,
Pop, M., Khouri, H., Radune, D., Benton, J., Mahamoud, Y., Jiang, L., Hance,
I., Weidman, J., Berry, K., Plaut, R., Wolf, A., Watkins, K., Nierman, W., Hazen,
A., Cline, R., Redmond, C., Thwaite, J., White, O., Salzberg, S., Thomason, B.,
Friedlander, A., Koehler, T., Hanna, P., Kolsto, A., and Fraser, C. (2003). The
genome sequence of Bacillus anthracis ames and comparison to closely related
bacteria. Nature 423, 81–86.
Remm, M., Storm, C., and Sonnhammer, E. (2001). Automatic clustering of or-
thologs and in-paralogs from pairwise species comparisons. Journal of Molecu-
lar Biology 314, 1041–1052.
Roth, F., Hughes, J., Estep, P., and Church, G. (1998). Finding dna regulatory mo-
tifs within unaligned non-coding sequences clustered by whole-genome mrna
quantitation. Nature Biotechnology 16, 939–945.
Saxild, H., Brunstedt, K., Nielsen, K., Jarmer, H., and Nygaard, P. (2001). Def-
inition of the Bacillus subtilis PurR operator using genetic and bioinformatic
tools and expansion of the PurR regulon with glyA, guaC, pbuG, xpt-pbuX,
yqhZ-folD, and pbuO. Journal of Bacteriology 183, 6175–6183.
Schena, M., Shalon, D., Davis, R., and Brown, P. (1995). Quantitative monitoring
of gene expression patterns with a complementary dna microarray. Science 270,
467–470.
Schneider, T. D. and Stephens, R. M. (1990). Sequence logos: A new way to dis-
play consensus sequences. Nucleic Acids Research 18, 6097–6100.
Shimizu, T., Ohtani, K., Hirakawa, H., Ohshima, K., Yamashita, A., Shiba, T., Oga-
sawara, N., Hattori, M., Kuhara, S., and Hayashi, H. (2002). Complete genome
sequence of clostridium perfringens, an anaerobic flesh-eater. Proceedings of the
National Academy of Sciences (USA) 99, 996–1001.
Shine, J. and Dalgarno, L. (1974). The 3′-terminal sequence of Escherichia coli
16s ribosomal rna: complementarity to nonsense triplets and ribosome binding
sites. Proceedings of the National Academy of Sciences (USA) 71, 1342–1346.
Sinha, S. and Tompa, M. (2000). A statistical method for finding transcription
factor binding sites. In Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 8, 344–354.
Stirling, J. (1730). Methodus differentialis. William Bowyer, London.
Stormo, G. and Hartzell, G. (1989). Identifying protein-binding sites from un-
aligned dna fragments. Proceedings of the National Academy of Sciences (USA) 86,
1183–1187.
Strauch, M., Webb, V., Spiegelman, G., and Hoch, J. (1990). The spo0a protein
of bacillus subtilis is a repressor of the abrb gene. Proceedings of the National
Academy of Sciences (USA) 87, 1801–1805.
Sun, G., Sharkova, E., Chesnut, R., Birkey, S., Duggan, M., Sorokin, A., Pujic, P.,
Ehrlich, S., and Hulett, F. (1996). Regulators of aerobic and anaerobic respira-
tion in Bacillus subtilis. Journal of Bacteriology 178, 1374–1385.
Takami, H., Nakasone, K., Takaki, Y., Maeno, G., Sasaki, R., Masui, N., Fuji, F., Hi-
rama, C., Nakamura, Y., Ogasawara, N., Kuhara, S., and Horikoshi, K. (2000).
Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans
and genomic sequence comparison with Bacillus subtilis. Nucleic Acids Research
28, 4317–4331.
Takami, H., Takaki, Y., and Uchiyama, I. (2002). Genome sequence of oceanobacil-
lus iheyensis isolated from the iheya ridge and its unexpected adaptive capa-
bilities to extreme environments. Nucleic Acids Research 30, 3927–3935.
Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by
data augmentation. Journal of the American Statistical Association 82, 528–550.
Thompson, J., Higgins, D., and Gibson, T. (1994). CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence
weighting, position- specific gap penalties and weight matrix choice. Nucleic
Acids Research 22, 4673–80.
Tseng, G., Oh, M.-K., Liao, L. R. J., and Wong, W. (2001). Issues in cDNA microar-
ray analysis: quality filtering, channel normalization, models of variations and
assessment of gene effects. Nucleic Acids Research 29, 2549–2557.
van Helden, J., Andre, B., and Collado-Vides, J. (1998). Extracting regulatory
sites from the upstream region of yeast genes by computational analysis of
oligonucleotide frequencies. Journal of Molecular Biology 281, 827–842.
van Helden, J., Rios, A., and Collado-Vides, J. (2000). Discovering regulatory
elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids
Research 28, 1808–1818.
Velculescu, V., Zhang, L., Vogelstein, B., and Kinzler, K. (1995). Serial analysis of
gene expression. Science 270, 484–487.
Wang, T. and Stormo, G. (2003). Combining phylogenetic data with co-regulated
genes to identify regulatory motifs. Bioinformatics 19, 2369–2380.
Weickert, M. and Chambliss, G. (1990). Site-directed mutagenesis of a catabolite
repression operator sequence in Bacillus subtilis. Proceedings of the National
Academy of Sciences (USA) 87, 6238–6242.
Werner, T. (1999). Models for prediction and recognition of eukaryotic promoters.
Mamm. Genome 10, 168–175.
Winterling, K., Chafin, D., Hayes, J. J., Sun, J., Levine, A., Yasbin, R., and
Woodgate, R. (1998). The Bacillus subtilis DinR binding site: Redefinition of
the consensus sequence. Journal of Bacteriology 180, 2201–2211.
Winterling, K., Levine, A., Yasbin, R., and Woodgate, R. (1997). Characterization
of DinR, the Bacillus subtilis SOS repressor. Journal of Bacteriology 179, 1698–
1703.
Xing, E., Wu, W., Jordan, M., and Karp, R. (2003). Logos: A modular bayesian
model for de novo motif detection. In IEEE Computer Society Bioinformatics Con-
ference, CSB2003.