ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
PhD THESIS
Mustafa KARABULUT
EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
ADANA, 2011
PhD THESIS
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
We certify that the thesis titled above was reviewed and approved for the award of degree of the Doctor of Philosophy by the board of jury on 02/12/2011.
Asst. Prof. Dr. Turgay İBRİKCİ, SUPERVISOR
Prof. Dr. Elif Derya ÜBEYLİ, MEMBER
Prof. Dr. Hamza EROL, MEMBER
Assoc. Prof. Dr. Ulus ÇEVİK, MEMBER
Asst. Prof. Dr. Sami ARICA, MEMBER
This PhD Thesis is written at the Department of Institute of Natural And Applied Sciences of Çukurova University. Registration Number:
Prof. Dr. İlhami YEĞİNGİL Director Institute of Natural and Applied Sciences
Note: The usage of the specific declarations, tables, figures, and photographs presented
either in this thesis or in any other reference without citation is subject to "The Law of Arts and Intellectual Products" number 5846 of the Turkish Republic.
ABSTRACT
PhD THESIS
EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE
IDENTIFICATION
Mustafa KARABULUT
ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
Supervisor : Asst. Prof. Dr. Turgay İBRİKCİ
Year: 2011, Pages: 92
Jury : Asst. Prof. Dr. Turgay İBRİKCİ
     : Prof. Dr. Elif Derya ÜBEYLİ
     : Prof. Dr. Hamza EROL
     : Assoc. Prof. Dr. Ulus ÇEVİK
     : Asst. Prof. Dr. Sami ARICA
Identification of transcription factor binding sites (TFBSs) is a significant task in contemporary biology towards deciphering the genome functions and understanding gene regulatory networks. One way to identify them is via laboratory experiments which are laborious, time consuming and costly. Alternatively, computational methods based on pattern recognition techniques are proposed in the literature for automatic extraction of TFBS instances from given DNA sequences.
In this study, three different computational methods are proposed. The first approach is based on clustering all w-mers and attempting to find a statistically interesting local alignment via z-score testing. Four clustering methods, Self-organizing map, Fuzzy C-Means, K-means and Expectation Maximization with Gaussian Mixture Models, are considered in this context. The second technique is similar to the first one except that it has a Bayesian post-optimization procedure to fine-tune local alignments composed of position weight matrices. The third computational technique developed in the thesis adopts a different approach by utilizing a stochastic search procedure, namely particle swarm optimization. The developed methods, each of which offers novel contributions to the relevant literature, are evaluated on several types of datasets including DNA of lower and higher organisms. Moreover, they are compared to state-of-the-art motif-finding tools from the literature such as MEME and MDScan. Experimental results suggest that the proposed methods are highly promising for the DNA motif-finding task.
Key Words: Transcription factor binding site, DNA, motif discovery, machine learning, particle swarm optimization
ÖZ
PhD THESIS
EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION
Mustafa KARABULUT
ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
Supervisor : Asst. Prof. Dr. Turgay İBRİKCİ
Year: 2011, Pages: 94
Jury : Asst. Prof. Dr. Turgay İBRİKCİ
     : Prof. Dr. Elif Derya ÜBEYLİ
     : Prof. Dr. Hamza EROL
     : Assoc. Prof. Dr. Ulus ÇEVİK
     : Asst. Prof. Dr. Sami ARICA
Identification of transcription factor binding sites (TFBS) is an important process in modern biology for deciphering genome functions and gene regulatory networks. One way to identify these sites is through laboratory experiments, which are labor intensive, time-consuming and expensive. As an alternative to these experiments, the relevant literature also offers computational methods based on pattern recognition that automatically extract TFBS from given DNA sequences.
In this study, three different computational methods are proposed. The first proposed approach clusters all given w-mers and attempts to find a statistically interesting local alignment using a z-score test. Four clustering methods are evaluated in this context: Self-organizing map, Fuzzy C-Means, K-means and Expectation Maximization. The second technique is quite similar to the first, but additionally includes a post-clustering optimization procedure based on Bayes' theorem to refine local alignments composed of position weight matrices. The third computational technique developed in this thesis adopts a stochastic search procedure called particle swarm optimization. The developed methods were evaluated, with the aim of providing new contributions to the relevant literature, on many datasets containing DNA of lower and higher organisms. Moreover, they were also compared with advanced methods from the literature such as MEME and MDScan. Experimental results showed that the proposed methods are highly promising for the DNA motif-finding task.
Key Words: Transcription factor binding site, motif discovery, machine learning, particle swarm optimization
ACKNOWLEDGEMENTS
I would like to express my respect and deepest gratitude to my supervisor,
Asst. Prof. Dr. Turgay İbrikçi, for providing me with thoughtful guidance, remarkable
insights and continuous encouragement.
I also thank my committee members, Prof. Dr. Elif Derya ÜBEYLİ, Prof.
Dr. Hamza Erol, Assoc. Prof. Dr. Ulus ÇEVİK and Asst. Prof. Dr. Sami ARICA,
for their support and valuable discussions.
Special thanks go to my friends, especially my colleagues from the
University of Gaziantep, for supporting, motivating and encouraging me in my
study.
I would like to express my special appreciation to my wife, Esra
KARABULUT, for supporting me with patience all the time. This study wouldn’t
exist if she hadn’t supported me.
CONTENTS
ABSTRACT
ÖZ
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
    1.1. Protein Synthesis: Transcription and Translation
    1.2. Discovery of Transcriptional Regulatory Elements
    1.3. Problem definition of DNA Motif Discovery
    1.4. Computational representation of DNA Motifs
    1.5. Motivation and Goal of the Thesis
2. RELATED WORKS
3. MATERIAL AND METHODS
    3.1. Datasets Utilized In the Study
    3.2. Methods
        3.2.1. Fuzzy C-Means
        3.2.2. Expectation Maximization with Gaussian Mixture Models
        3.2.3. Self-Organizing Map
        3.2.4. K-Means
        3.2.5. Post-optimization for clustering approach
        3.2.6. Particle Swarm Optimization
4. RESEARCH AND DISCUSSION
    4.1. Evaluation Metrics
    4.2. Employing Fuzzy C-Means for DNA Motif Discovery
    4.3. Assessment of Clustering Algorithms for Motif Discovery
    4.4. Evaluation of Post-Optimization for EM/GMM Method
    4.5. Particle Swarm Optimization to Identify Regulatory Elements
5. CONCLUSIONS
REFERENCES
CURRICULUM VITAE
LIST OF TABLES
Table 1.1. Degenerate symbols for ambiguous letters
Table 3.1. Saccharomyces cerevisiae datasets
Table 3.2. The second group of datasets that consists of different species
Table 3.3. The third group of datasets
Table 3.4. Properties of synthetic datasets for different scenarios
Table 4.1. Performances of FCM, MEME and MDScan
Table 4.2. Predicted and known motifs in sequence logo format
Table 4.3. Experimental results of four clustering algorithms for each dataset
Table 4.4. Motif finding performance of MEME for each dataset
Table 4.5. Comparison of EM/GMM with other algorithms for Saccharomyces cerevisiae datasets in terms of MCC
Table 4.6. Comparison of four algorithms for third group of datasets
Table 4.7. Best results of GMM/EM, FCM and SOMBRERO
Table 4.8. Sequence logos of the known motifs and the predicted ones (a)
Table 4.9. Sequence logos of the known motifs with predicted ones, (a) and (b)
Table 4.10. Results of PSO variants for synthetic datasets in terms of F-Scores
Table 4.11. Performance comparison of motif-finding tools for synthetic datasets
Table 4.12. Performances of PSO variants for 8 real datasets
Table 4.13. Comparison of motif-finding tools for third group of datasets
LIST OF FIGURES
Figure 1.1. The three phases of transcription process resulting in RNA
Figure 1.2. Extraction of TFBS instances from given set of promoter sequences
Figure 1.3. Generating PFM and PWM from the alignment of subsequences: (a) a set of sequences, (b) PFM, (c) background frequencies, (d) PWM
Figure 1.4. A sample sequence logo
Figure 3.1. Extraction of subsequences by using the sliding-windows technique
Figure 3.2. Graphical representations of utilized population topologies: (a) GBest (b) Ring (c) Random (d) Von Neumann
Figure 3.3. A sample particle and its evaluation
Figure 3.4. Transformation of iS into ciS
Figure 3.5. Pseudocode of PSO-based proposed algorithm
Figure 3.6. A Single iteration of re-alignment and simultaneous shift operators
Figure 4.1. The effect of number of clusters over the performance for FCM
Figure 4.2. The processing time of FCM for each dataset
Figure 4.3. The performance of FCM per number of training cycles
Figure 4.4. Comparison of the three methods in terms of MCC
Figure 4.5. Correlation between performance and number of clusters
Figure 4.6. Average motif finding performances of clustering algorithms
Figure 4.7. Training time of each algorithm to cluster LEXA dataset
Figure 4.8. Performances of clustering algorithms and MEME for each species
Figure 4.9. Comparison of average performances of clustering algorithms and MEME for each species
Figure 4.10. Performance of the algorithm over the number of clusters for each dataset
Figure 4.11. Overall performances of the algorithms for second group datasets
Figure 4.12. Performance variance of algorithms over two parameters
Figure 4.13. Performance of PSO per number of particles
Figure 4.14. Consumed time by PSO with different number of particles
Figure 4.15. Consumed time by each PSO variant for third group of datasets
LIST OF ABBREVIATIONS
BMU : Best Matching Unit
DNA : Deoxyribonucleic acid
EM : Expectation Maximization
FCM : Fuzzy C-Means
FN : False Negative
FP : False Positive
GA : Genetic Algorithm
GMM : Gaussian Mixture Models
HMM : Hidden Markov Model
HMR : Human, mouse and rat
MCC : Matthews’ Correlation Coefficient
mRNA : Messenger RNA
PDF : Probability Density Function
PFM : Position Frequency Matrix
PPV : Positive Predictive Value
PSO : Particle Swarm Optimization
PSSM : Position Specific Scoring Matrix
PWM : Position Weight Matrix
RNA : Ribonucleic acid
RNAP : RNA polymerase
SOM : Self-Organizing Map
TF : Transcription Factor
TFBS : Transcription Factor Binding Site
TN : True Negative
TP : True Positive
Sn : Sensitivity
Sp : Specificity
1. INTRODUCTION Mustafa KARABULUT
1. INTRODUCTION
Throughout the genomic era, in which many genome sequencing projects have
unveiled large volumes of data, computational methods have received great
attention as a means of processing the high-throughput information available,
which has led to the emergence of a relatively new research field called
bioinformatics. This interdisciplinary field, which combines computer science
with biology, makes use of information technology to help comprehend biological
processes. Owing to rapid developments in genomic research technologies and the
resulting huge amount of available data, bioinformatics mostly deals with the
management of databases and with computational/statistical methods to process
genomic data (Luscombe, 2001). Bioinformatics researchers focus on the
application and development of computational methods based on machine learning,
data mining and artificial intelligence techniques. Major bioinformatics
research topics include sequence analysis, gene expression analysis, genome
annotation, comparative genomics, gene ontology and taxonomy, phylogenetics and
systems biology (Bioinformatics Wiki, 2011).
In early bioinformatics studies, often referred to as the pre-genomic era
(Guttmacher and Collins, 2003) or classical bioinformatics, efforts were
mostly directed at extracting the genomes of organisms, including human.
In the post-genomic era, however, the research focus has shifted towards
deciphering genome functions and understanding gene regulatory networks. Thus,
modeling and examining the transcription process, which is a vital step in gene
expression and protein synthesis, have recently become research topics of
significance.
1.1. Protein Synthesis: Transcription and Translation
Protein synthesis is a process in which a protein-coding gene is transcribed
into messenger RNA (mRNA) and this gene product is then translated into
the target protein. Although protein synthesis differs slightly between
prokaryotes and eukaryotes, it always includes these two steps, i.e.,
transcription and translation. Transcription is the process of making a
complementary copy of a DNA segment that includes a gene into RNA. If the
target gene is protein-coding, the complementary copy will be
mRNA, which in turn will eventually produce a protein. Depending on what the
gene encodes, the result of transcription may instead be another type of RNA
that functions as a part of the gene regulatory network. In either
case, transcription is initiated by transcription factors (TFs), which bind to
DNA, usually in upstream regions in close proximity to the target gene (the
promoter). TFs do not bind to random locations; instead, they bind to specific
DNA regions, considered transcription regulators, called Transcription Factor
Binding Sites (TFBS). Once a TF binds to a TFBS, it activates the enzyme RNA
polymerase (RNAP), which then produces the complementary RNA copy of the DNA.
Figure 1.1. The three phases of transcription process resulting in RNA
The second phase of protein synthesis is the translation of the obtained RNA into a
chain of amino acids called a polypeptide, which then folds into a protein. The
ribosome is in charge of decoding the RNA into the proper sequence of amino acids.
RNA, composed of A (Adenosine), C (Cytidine), G (Guanosine) and U (Uridine), is
translated according to the translation table, in which each group of three
consecutive nucleotides, called a codon, corresponds to one of the 20 possible
amino acids (or to a stop signal).
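As an illustration of the codon lookup described above, the following minimal Python sketch translates a short RNA string. The table is deliberately partial (only a handful of the 64 codons are listed), and the names CODON_TABLE and translate are hypothetical choices made for this example rather than anything defined in the thesis.

```python
# Partial codon table over the RNA alphabet; only a few of the 64 codons
# are listed here, purely for illustration.
CODON_TABLE = {
    "AUG": "M",                          # methionine (usual start codon)
    "UUU": "F", "UUC": "F",              # phenylalanine
    "GGC": "G",                          # glycine
    "GCU": "A",                          # alanine
    "AAA": "K",                          # lysine
    "UGG": "W",                          # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon is met."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        # "X" marks codons that are missing from this partial table.
        residue = CODON_TABLE.get(codon, "X")
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


print(translate("AUGUUUGGCAAAUAA"))  # prints MFGK
```

A complete implementation would list all 64 codons; the partial table only shows the three-letter lookup.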
1.2. Discovery of Transcriptional Regulatory Elements
In contemporary biology, identification of transcriptional regulatory
elements has been a crucial task for understanding the transcriptional regulatory
mechanism and the functional components of gene expression (Stormo, 2000). In this
context, locating TFBS genome-wide has been a matter of interest to
researchers. To achieve this goal, several methods of two types, computational
and experimental, have been proposed in the literature
(Elnitski et al., 2006). Common experimental laboratory techniques to identify TFBS
include DNaseI hypersensitivity assays, ChIP assays and the ChIP-chip
technique. Experimentally verified TFBS instances of several organisms are
stored in online databases such as JASPAR (Sandelin et al., 2004) and
TRANSFAC (Matys et al., 2003). These databases enable researchers to search
and match TFBS instances against unverified DNA sequences so that the presence
of a known instance can be detected.
As an alternative to these laboratory experiments, which are expensive
and labor intensive, computational methods mostly based on pattern
recognition and data mining techniques have also been proposed in the
literature to identify TFBS (Das and Dai, 2007). Since these methods attempt to
discover TFBS instances in given unaligned DNA sequences with little or no prior
knowledge, they are preferable to experimental methods. In general, computational
methods are used to find patterns of overrepresented TFBS instances in
promoter sequences of putatively co-regulated genes. Phylogenetic footprinting
using orthologous sequences is an alternative way of identifying TFBS
patterns residing in the sequences. Either way, using datasets that include
co-regulated or orthologous genes, the computationally performed task is known as
de novo motif discovery. In this study, the de novo motif discovery methods
developed and/or evaluated consider only datasets of putatively co-regulated genes.
1.3. Problem definition of DNA Motif Discovery
In the DNA motif discovery task, which is performed over a set of DNA
sequences that are promoters of putatively co-expressed genes, the goal is to find
recurring TFBS instances that form a common pattern of statistical significance
(Figure 1.2). This is, however, not a straightforward task since:
• The locations of the TFBS instances relative to the target genes are
unknown
• The common pattern, i.e., the sought motif, is also unknown
• The TFBS instances are not necessarily exact matches and each may have
variations such as insertions and deletions (Das and Dai, 2007)
The motif discovery task is considered an NP-hard problem. The
sequences are generally too long for an exhaustive search of all possible
locations; e.g., a TFBS generally resides within a few thousand bp of its
target gene. In addition to the NP-hardness, the given promoter sequences may
also contain statistically interesting alignments that are not
biologically relevant. Thus, motif discovery methods should also be capable of
reporting multiple candidate motifs.
Figure 1.2. Extraction of TFBS instances from given set of promoter sequences.
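To give a sense of scale for the exhaustive search mentioned above, the sketch below counts candidate alignments under simplifying assumptions introduced here for illustration only: exactly one instance per sequence, a fixed motif width w, and N sequences of equal length L, which gives (L - w + 1)^N combinations. The function name and the example dataset size are hypothetical.

```python
def alignment_search_space(num_sequences, seq_length, motif_width):
    """Number of ways to choose one start position in every sequence.

    Assumes one motif instance per sequence and equal-length sequences.
    """
    positions_per_sequence = seq_length - motif_width + 1
    return positions_per_sequence ** num_sequences


# Hypothetical dataset: 10 promoters of 1000 bp each, motif width 8.
size = alignment_search_space(10, 1000, 8)
print(f"{size:.3e} candidate alignments")
```

Even for this small hypothetical dataset the count grows exponentially with the number of sequences, which is why the methods discussed in this thesis rely on heuristic or probabilistic search.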
The practitioner of a motif discovery method is usually expected to
provide a set of DNA sequences that are upstream regions of supposedly
co-regulated genes. In the most common case, each given sequence contains a single
TFBS that regulates transcription of the target gene. However, in real-world
scenarios each sequence may contain more than one instance of the sought TFBS
pattern. Moreover, some of the given sequences may turn out not to belong to any
co-regulated gene, that is, they may not contain any TFBS of interest. The TFBS
instances that function in the regulation of co-expressed genes share a
common pattern which appears statistically interesting with regard to the
background distribution of nucleotides. Since a common pattern is sought, TFBS
instances are assumed to be of the same width during the search,
even though that assumption does not necessarily hold in all cases. The
goal is therefore simplified to finding the TFBS starting locations in the given DNA
sequences. Nonetheless, an ideal motif-finding tool should account for the
variability in TFBS width and in the number of TFBS instances in each given
sequence, as well as the presence of more than one motif pattern in the given dataset.
1.4. Computational representation of DNA Motifs
Computational methods perform motif search based on a model of the
sought pattern which represents the TFBS instances of interest. Each TFBS is a
string over the four-letter alphabet {A, C, G, T}. Since TFBS instances may
vary, the model used to represent the sought motif should be able to
capture this variation. In the literature, two models for motif representation are
prominent: the Position Weight Matrix (PWM) and the consensus sequence.
A PWM, also known as a Position Specific Scoring Matrix (PSSM), is a
4 × w matrix of scores that gives a weighted match to any subsequence of width w.
A PWM is derived from a Position Frequency Matrix (PFM), which holds the
probability with which each base (i.e., A, C, G and T) occurs at each position.
When a PFM is converted into a PWM, the background probability of each base is
used to calculate log-odds scores:
$$m_{ib} = \log\left(\frac{f_{ib}}{\Theta_b}\right) \qquad (1)$$
where $m_{ib}$ is the PWM element for base b at position i, $f_{ib}$ stands for the
corresponding probability value from the PFM and $\Theta_b$ is the background
probability of base b (A, C, G or T). The background probability of a specific letter is a part of
the background model of the organism whose DNA is analyzed. The background
model is characterized with the 3rd order Hidden Markov Model (HMM) of the
whole intergenic genome sequence of the organism. The calculation of a PWM
from a set of locally aligned TFBS instances is depicted in Figure 1.3.
Figure 1.3. Generating PFM and PWM from the alignment of subsequences: (a) a set of sequences, (b) PFM, (c) background frequencies, (d) PWM.
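The PFM-to-PWM conversion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the aligned sites are made up, the background is taken as uniform rather than the 3rd order model mentioned in the text, the natural logarithm is used because the text does not fix a base, and a pseudocount (not mentioned in the text) is added so that no frequency is zero.

```python
import math

BASES = "ACGT"


def position_frequency_matrix(sites, pseudocount=1.0):
    """Per-position base probabilities from equal-width aligned sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pfm = []
    for i in range(width):
        column = [site[i] for site in sites]
        # The pseudocount is an assumption of this sketch; it avoids log(0).
        pfm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pfm


def position_weight_matrix(pfm, background):
    """Log-odds scores m_ib = log(f_ib / theta_b), following Equation (1)."""
    return [{b: math.log(col[b] / background[b]) for b in BASES} for col in pfm]


def score(pwm, subsequence):
    """Sum of the per-position log-odds scores for one w-mer."""
    return sum(col[base] for col, base in zip(pwm, subsequence))


sites = ["TACGTA", "AAGGTG", "TACGTA", "AAGGTA"]  # hypothetical aligned instances
uniform = {b: 0.25 for b in BASES}                 # assumed uniform background
pwm = position_weight_matrix(position_frequency_matrix(sites), uniform)
print(score(pwm, "TACGTA"), score(pwm, "CCCCCC"))
```

A subsequence that resembles the aligned sites receives a positive total score, while one that matches the background better receives a negative score.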
Using a consensus string is another way of representing
a set of locally aligned TFBS instances. Composing a consensus is not very
different from constructing a PWM: the PFM of the alignment still has to
be calculated, and then the most probable letter in each column of the PFM is
selected for the corresponding position of the consensus string. At some
positions a specific letter may have a high probability with respect to the
others, that is, the letter is strictly required for the motif. At
other positions, no letter may be dominant enough (e.g., probability
< 60%), and in this case degenerate base symbols are used. Table 1.1 presents the
degenerate symbols according to International Union of Pure and Applied
Chemistry (IUPAC) notation.
Given the above, the consensus string of the motif
example in Figure 1.3 should be WASGTR. Consensus strings are easier to
compose and more readable than PWMs. A consensus
representation also requires less computational power to process.
However, they are not as accurate and sensitive a means of representing an alignment
as PWMs. Therefore, methods in the literature commonly use PWMs for internal
computations on a motif and use consensus strings to present human-readable
results. Nonetheless, there are plenty of methods, such as that of Stine (2003), that
use consensus strings instead of PWMs to computationally store and process
motif alignments.
Table 1.1. Degenerate symbols for ambiguous letters
Degenerate Symbol   Description   Ambiguous bases
A                   Adenosine     A
C                   Cytidine      C
G                   Guanosine     G
T                   Thymidine     T
U                   Uridine       U
W                   Weak          A and T
S                   Strong        C and G
M                   Amino         A and C
K                   Keto          G and T
R                   Purine        A and G
Y                   Pyrimidine    C and T
B                   Not A         C, G and T
D                   Not C         A, G and T
H                   Not G         A, C and T
V                   Not T         A, C and G
N                   Any base      A, C, G and T
For presenting a motif in a human-readable way, a sequence logo
(Schneider, 1990) is a more convenient alternative to a consensus string
for presenting and visualizing patterns sought in DNA sequences. A
sequence logo can be generated from a set of aligned sequences, in our case, a set
of aligned TFBS instances. The residues at a specific position are graphically
represented as a stack of letters, each with a height proportional to the frequency of
the base. The most conserved parts can therefore be distinguished visually. Figure
1.4 depicts the sequence logo corresponding to the motif sample given in
Figure 1.3.
Figure 1.4. A sample sequence logo
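The letter heights in a sequence logo can be computed with a short sketch. The text only states that heights are proportional to base frequency; the scaling used below, in which the whole stack has a height equal to the information content of the column (2 bits minus the Shannon entropy for DNA), is the common convention for sequence logos and is an assumption with respect to this thesis. The small-sample correction applied by logo generators is omitted, and the function name is hypothetical.

```python
import math


def logo_heights(column):
    """Letter heights in bits for one alignment column of base frequencies."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    information = 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA
    # Each letter takes a share of the stack proportional to its frequency.
    return {base: p * information for base, p in column.items()}


print(logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))
print(logo_heights({"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}))
```

Under this scaling a fully conserved position yields a single 2-bit letter, whereas a position split evenly between two bases yields two letters of 0.5 bits each, so conserved positions stand out visually.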
Throughout this study, motif discovery results are always given as
quantitative metrics and also as sequence logos where appropriate. The free,
publicly available sequence logo generator WebLogo (Crooks et al., 2004) is used to
generate the sequence logos.
1.5. Motivation and Goal of the Thesis
As mentioned in Section 1.2, discovery of TFBS via
computational tools is advantageous and preferable to labor-intensive
and expensive laboratory experiments. Thus, researchers from the fields of
computer science, mathematics and statistics have studied several methods and
proposed various computational tools to identify regulatory
regions solely via computational methods (see Section 2 for a more detailed
literature review). According to comparative studies (Tompa et al., 2005; Das and Dai,
2007; Sandve and Drablos, 2006) that evaluated these methods,
some tools were observed to be superior to others under some specific conditions
(e.g., for short motifs or datasets extracted from DNA sequences of lower
organisms), while under other conditions the opposite held. Nonetheless,
regardless of search strategy and motif representation, no single algorithm is
reported in these comparative studies to be sufficient for predicting optimal motifs
under every condition. Thus, research on computational methods for
TFBS identification remains an essential and challenging task.
The goal of this thesis is to develop computational tools based on data
mining methods including clustering and Particle Swarm Optimization (PSO).
The study also evaluates the proposed methods against
state-of-the-art bioinformatics tools developed for the DNA motif discovery task.
In this study, four clustering algorithms are considered: Fuzzy C-Means (FCM),
Self-Organizing Map (SOM), K-Means and Expectation Maximization (EM) with
Gaussian Mixture Models (GMM). The original clustering algorithms
and the modifications required to adapt them to a motif-finding strategy are
explained in Section 3. In addition to the clustering approach, PSO is also
considered for identifying TFBS patterns. Section 3 also explains
PSO and the modifications made to the original PSO algorithm to fit our needs.
Moreover, Section 4 reports the performance of these algorithms
using several measures, both quantitative and visual.
2. RELATED WORKS Mustafa KARABULUT
2. RELATED WORKS
The post-genomic era has experienced remarkable growth in
computational tools for processing the high-volume data made available. One major goal
of these tools is to support the understanding of gene regulatory networks. Therefore,
automatic identification of TFBS via computational methods has been of great
interest to researchers. The proposed methods often adopt one of two ways to
automatically extract motifs of the sought TFBS: (a) from a set of sequences that are
promoters of co-regulated genes, or (b) from a set of orthologous sequences that are
promoters of a single gene from different species, i.e., using phylogenetic
footprints (Das and Dai, 2007). This study focuses on the first type, as
the methods developed here use only co-regulated gene sequences.
Early studies of motif finding on given co-regulated gene promoter
sequences are generally based on probabilistic frameworks such as EM and Gibbs
sampling. One of the first probabilistic motif-finding techniques was proposed
by Hertz et al. (1990); by the mid-1990s, two methods had come
into prominence: MEME (Multiple Expectation Maximization for Motif
Elicitation) (Bailey and Elkan, 1995) and the Gibbs Sampler (Lawrence et al., 1993).
MEME has been a cornerstone among motif-finding algorithms since it
introduced advantages that had not been implemented before. First, it
was not limited to finding only “one-instance-per-sequence” motifs; it was also
able to find motifs whose instances are not distributed evenly across the given
sequences. Second, it adopted a strategy of erasing found instances in a
probabilistic manner, which enabled it to find more than one motif candidate in the
given sequences. As for methods that adopt the Gibbs sampling strategy, extensions
to the original Gibbs Sampler developed by Lawrence et al. (1993) have been
proposed in order to remove its drawbacks; examples include AlignACE
(Roth et al., 1998) and BioProspector (Liu et al., 2001).
In addition to probabilistic methods, algorithms based on exhaustive word
enumeration have also been considered by researchers. The first implementation
adopting this strategy was by Van Helden et al. (1998). Later, their simple
methodology of searching via enumeration of all possible motif locations
2. RELATED WORKS Mustafa KARABULUT
11
was improved by respective studies of Tompa (1999) and Sinha and Tompa
(2000). Moreover, Weeder (Pavesi et al., 2001) and WINNOWER (Pevzner and
Sze, 2000) are improved word-based methods incorporated with other approaches
such as suffix trees and graphs. MDScan (Liu et al., 2002) is such a hybrid
method that incorporates probabilistic approach with word-enumeration. When
word-enumerative methods are compared with probabilistic methods (Das and
Dai, 2007; Hu et al., 2005), the first group of methods are observed to be superior
for shorter motifs up to 5-10 nucleotides long. Probabilistic methods, however,
scale well for longer motifs and longer datasets although they do not guarantee
globally optimum results.
A more recent trend in the motif-finding literature, compared with
probabilistic and word-enumerative methods, is the use of machine learning
algorithms. The self-organizing map (SOM) is a good example. In two separate
studies (Liu et al., 2006; Mahony et al., 2005), SOM alone is used to find
TFBS instances in co-regulated gene sequences of the eukaryotic organism
Saccharomyces cerevisiae. In the first study, Liu et al. considered a SOM
structure composed of several layers, each of which performs a classification
of the given inputs. At the output layer, the inputs are classified as “motif”
or “non-motif”. Although the work of Mahony et al. also utilizes SOM, its
approach differs from that of Liu et al. In their method, called SOMBRERO, the
given sequences are broken into subsequences of a specific length, which are
then clustered into an appropriate number of PWMs. Each cluster is then
statistically tested to see whether its distribution of bases differs from the
background probabilities. In the statistical test, a z-score is calculated for
each PWM, and the PWMs with high z-scores are considered motif candidates.
Both SOM-based studies differ from previous probabilistic methods in that they
treat each subsequence independently of the promoter sequence it belongs to.
Therefore, any number of motif instances per sequence is allowed, which might
also lead to a high number of false-positive predictions. As an additional
shortcoming, neither method supports searching for motifs of variable length.
Recently, evolutionary algorithms and stochastic search procedures have
also been considered by researchers. Genetic algorithms (GA) and PSO are the
most prominent in this category. GAME (Wei and Jensen, 2006) and GALF-P
(Chan et al., 2008) are two recent GA-based de novo motif discovery methods. In
GAME, one location in each given sequence is presumed to be a TFBS starting
point, and these locations are then optimized against a Bayesian fitness
function via GA operators such as crossover and mutation. In the
post-processing phase, the method attempts to find additional TFBS instances
to allow more than one instance per sequence. GALF-P is similar to GAME in
several respects; however, it combines consensus-led search with PWM
optimization. Both methods are reported to be superior to MEME and MDScan in
experiments on real and synthetic datasets.
PSO is a proven algorithm in many fields. However, compared with GA,
PSO-based motif-finding methods are rarer. The main reason appears to be that
PSO is designed to work in continuous domains, whereas in motif discovery the
search domain is a discrete space of DNA sequences composed of the letters A,
C, G and T. Nonetheless, the study of Lei and Ruan (2010) is an example of a
PSO-based method in the computational motif discovery literature. In their
study, all possible w-mers (i.e., subsequences of length w) are extracted from
the given sequences, and a “word dissimilarity graph” that holds the
dissimilarity scores of w-mers to each other is constructed in order to
convert the discrete domain to a “semi-continuous” one. Within this new domain,
each particle keeps track of a vector of locations in the given sequences and
forms a consensus sequence. The fitness function of the algorithm is based on
scoring the number of mismatches. In another PSO-based study (Hardin and
Rouchka, 2005), a hybrid of PSO and EM was proposed in which PSO is used only
to seed the EM algorithm in order to detect motifs residing in regulatory
regions. Additionally, the HPSO algorithm (Zhou et al., 2005) is a hybrid
motif discovery algorithm in which PSO is supported with some features of GA
such as the recombination operator.
Although several methodologies have been considered, TFBS identification
remains a challenging task, especially for higher organisms such as metazoans
(Tompa et al., 2005). Hence, some researchers have attempted to combine known
and proven algorithms into ensembles to improve prediction accuracy. Hu et al.
(2006) developed the EMD algorithm, which takes advantage of five proven
motif-finding methods: AlignACE, BioProspector, MDScan, MEME and MotifSampler
(Thijs et al., 2001). According to the authors' report, the performance of the
ensemble method is always superior or at least equal to that of the five
underlying algorithms (Das and Dai, 2007). Other ensemble-based motif finders
include SCOPE (Carlson et al., 2007), BEST (Che et al., 2005), TAMO (Gordon et
al., 2005) and MotifVoter (Wijaya et al., 2008).
3. MATERIAL AND METHODS Mustafa KARABULUT
14
3. MATERIAL AND METHODS
3.1. Datasets Utilized In the Study
In this thesis, four groups of datasets are used to evaluate the
performance of both the developed methods and methods from the literature.
Only one group was produced by extracting sequences from the genome of the
relevant organism; the rest were taken from the publicly available online
material of other studies.
The first group of datasets includes data extracted from the genome of the
yeast Saccharomyces cerevisiae. The organism's whole genome with
gene annotations is available online at the “Saccharomyces Genome Database”
website (SGD Project, 2008). Each dataset consists of numerous promoter
sequences for one of five different genes: GAL4, GCN4, CBF1, RFX1 and HSF1.
The promoter sequences include TFBS instances that are experimentally validated
as playing a regulatory role in the expression of these genes. The
characteristics of the first group of datasets are given in Table 3.1. The
yeast datasets include both small and large datasets, which together cover
short and long motifs (7-17 nucleotides) and different numbers of motif
instances.
Table 3.1. Saccharomyces cerevisiae datasets
Dataset  Species       Motif length  Number of instances  Number of sequences  Dataset size (nucleotides)
GAL4     S.cerevisiae  17            10                   8                    1647
GCN4     S.cerevisiae  7             62                   60                   11640
CBF1     S.cerevisiae  7             65                   54                   12159
RFX1     S.cerevisiae  13            8                    6                    5922
HSF1     S.cerevisiae  13            27                   27                   5049
In addition to the yeast datasets, all of which belong to the same
species, a second group of datasets is used to evaluate the performance of the
proposed methods on data from different species. This group comprises eight
datasets from the genomes of four species: Saccharomyces cerevisiae
(yeast), Escherichia coli (E.coli), Drosophila melanogaster (fly) and Homo
sapiens (human). The yeast promoters are again extracted from the
“Saccharomyces Genome Database”, and the E.coli data is taken from “The EcoGene
Database of Escherichia coli Sequence and Function” website (ECOGENE,
2009). The fly and human datasets were first considered by Tompa et al. (2005)
and are taken from the public website of their study. Table 3.2 presents the
characteristics of the second group of datasets.
Table 3.2. The second group of datasets that consists of different species
Dataset  Species  Motif length  Number of TFBSs  Number of sequences  Dataset size (nucleotides)
GCN4     Yeast    7             10               10                   1831
RFX1     Yeast    13            8                6                    5465
ARGR1    E.coli   18            17               11                   1681
LEXA     E.coli   20            8                8                    4715
DM05     Fly      12            14               7                    7466
DM06     Fly      14            7                5                    3792
HM10     Human    8             11               10                   2949
HM17     Human    16            10               8                    5328
As for the third group of datasets, it is structurally similar to the second
group as it also includes eight datasets from separate species. However, some
datasets in this group contain mostly mammalian promoter sequences which make
it different from the second group of datasets.
Table 3.3. The third group of datasets
Dataset  Species    Motif length  Number of TFBSs  Number of sequences  Dataset size (nucleotides)
CREB     HMR        8             19               17                   3544
CRP      E.coli     22            24               18                   1512
E2F      Mammalian  11            27               25                   4750
ERE      Mammalian  13            25               25                   4700
MEF2     HMR        7             17               17                   3293
MYOD     HMR        6             21               17                   3315
SRF      HMR        10            35               20                   4127
TBP      HMR        6             93               95                   18525
These eight datasets were studied in earlier papers and were then
compiled and used by Wei and Jensen (2006) in their study. Only CRP comes from
prokaryotic DNA; the others are all eukaryotic. CREB, MEF2, MYOD, SRF and TBP
are extracted from the ABS database of annotated regulatory binding sites
(Blanco et al., 2006) and include promoter sequences of genes from human,
mouse and rat (HMR). The other eukaryotic datasets (E2F and ERE) are also
extracted from mammalian species, which are not specifically named in the
relevant paper (Frith et al., 2004). Table 3.3 presents the details of the
datasets.
The first three groups of datasets are all real datasets extracted from
different organisms. In addition to these, synthetic datasets are also used.
The fourth group consists of synthetic datasets taken from the public website
of the study of Chan et al. (2008). A total of 800 synthetic datasets were
generated artificially to cover 8 separate scenarios that mimic real
biological conditions, each with 100 datasets, by varying three parameters:
the motif width (short and long), the dataset length (small and large) and the
motif conservation ratio (high and low). Table 3.4 summarizes the properties
of the artificially generated datasets.
Table 3.4. Properties of synthetic datasets for different scenarios
(first three columns: artificial scenario; last three columns: dataset properties)
Motif width  Conservation  Dataset length  Motif length  Number of sequences  Dataset size (nucleotides)
Short        High          Small           8             20                   20000
Short        Low           Small           8             20                   20000
Short        High          Large           8             60                   60000
Short        Low           Large           8             60                   60000
Long         High          Small           16            20                   20000
Long         Low           Small           16            20                   20000
Long         High          Large           16            60                   60000
Long         Low           Large           16            60                   60000
For short motifs w = 8, whereas for long ones w = 16. The small datasets
include 20 sequences, each of which is 1000 nucleotides long, whereas the
number of sequences in large datasets is 60. In the high-conservation
scenario, the motif pattern is intended to be easily distinguishable: each
position in the motif has a clearly dominant nucleotide with a frequency of
91%. In the low-conservation scenario, which is designed to be noisy, this
frequency is reduced to 55%. In each synthetic dataset, a sequence has a 10%
probability of containing either no motif instance or more than one instance,
up to approximately 5-6 additional instances.
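As an illustration only, the following sketch shows how a dataset with these conservation levels could be generated. It is not the generator of Chan et al. (2008): the uniform background, the rule of exactly one planted instance per sequence (the 10% rule above is ignored), and the names `plant_motif` and `make_dataset` are all assumptions made for this example.

```python
import random

BASES = "ACGT"

def plant_motif(consensus, conservation, rng):
    """Copy the consensus, keeping each base with probability `conservation`."""
    out = []
    for base in consensus:
        if rng.random() < conservation:
            out.append(base)
        else:
            # Substitute one of the three non-dominant nucleotides.
            out.append(rng.choice([b for b in BASES if b != base]))
    return "".join(out)

def make_dataset(n_seqs, seq_len, consensus, conservation, seed=0):
    """Return (sequence, motif_start) pairs; assumes a uniform background model."""
    rng = random.Random(seed)
    w = len(consensus)
    dataset = []
    for _ in range(n_seqs):
        background = [rng.choice(BASES) for _ in range(seq_len)]
        start = rng.randrange(seq_len - w + 1)
        background[start:start + w] = plant_motif(consensus, conservation, rng)
        dataset.append(("".join(background), start))
    return dataset

# A "short, high conservation, small" scenario: 20 sequences of 1000 nucleotides.
small_high = make_dataset(20, 1000, "ACGTACGT", 0.91)
```

With `conservation=0.55` the same call would correspond to the noisy, low-conservation setting.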
3.2. Methods
3.2.1. Fuzzy C-Means
Fuzzy C-Means (FCM) is a popular clustering algorithm applied to
several problems in computational biology (Yaochu and Lipo, 2009), medicine
(Zhou et al., 2009) and data mining (Joisen et al., 2002). It is essentially an
iterative technique in which a desired number of cluster centers, of the same
dimensionality as the inputs, are fitted to the given input data, which is in
the general case a set of high-dimensional vectors. The algorithm repeats the
following steps until a satisfactory objective is
reached:
I. Calculating a membership value for each input and allowing the inputs to be
a member of multiple clusters,
II. Updating cluster centers with the given vectors by using the membership
values as a weighting factor.
More formally, let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of given input vectors and
$C = \{c_1, c_2, \ldots, c_M\}$ represent the cluster centers. The membership of each input $x_j$
to each cluster $c_i$ is calculated and stored in a membership matrix $U$ of size $M \times N$:
$$u_{ij} = \frac{\left(1 / d^2(x_j, c_i)\right)^{1/(q-1)}}{\sum_{k=1}^{M} \left(1 / d^2(x_j, c_k)\right)^{1/(q-1)}} \qquad (2)$$
where $d$ corresponds to the distance between the input and the cluster center, and
$q$ denotes the fuzziness value. In a multidimensional space, the distance
function is usually chosen as Euclidean; nonetheless, it can be any other
distance function that suits the input space. The fuzziness quantity $q$
directly affects the performance of the FCM algorithm and must be chosen
carefully. Because the exponent in Equation 2 is $1/(q-1)$, $q$ must be greater
than 1; in most applications it is taken as a real number between 1 and 2.
Note also that the total membership value of an individual input vector is
constrained by:
$$\sum_{i=1}^{M} u_{ij} = 1, \quad \text{where } j \in \{1, \ldots, N\} \qquad (3)$$
Once the membership matrix is constructed, the cluster centers, which are
vectors in the same space, should be updated so that they are moved towards the
inputs. The update of a specific cluster takes into account the whole set of inputs:
$$c_i = \frac{\sum_{j=1}^{N} (u_{ij})^q \, x_j}{\sum_{j=1}^{N} (u_{ij})^q} \qquad (4)$$
The algorithm is terminated when the objective is reached, e.g., the total sum of
distances of all inputs to all clusters (i.e., total dissimilarity) is minimized to a
certain threshold value ϕ . On the other hand, the algorithm could also be
terminated after a pre-determined maximum number of iterations are executed.
For certain input spaces such as DNA sequences used in this study, it is difficult
to define an objective function due to the complexity of the data type. Thus the
termination criterion is specified as reaching the maximum number of iterations.
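The generic iteration of Equations 2-4 can be sketched for ordinary numeric vectors as follows. This is a minimal illustration, assuming squared Euclidean distance, a fixed number of iterations, random initialisation from the inputs and a small floor on the distances to avoid division by zero; the function name `fcm` and its parameters are hypothetical and are not taken from the thesis.

```python
import numpy as np

def fcm(X, n_clusters, q=2.0, n_iter=50, seed=0):
    """Plain Fuzzy C-Means on the rows of X; returns (centers, U) with U of shape (M, N)."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are distinct inputs picked at random.
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared distances between every center and every input, shape (M, N).
        d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Equation 2: memberships; each column sums to 1 as required by Equation 3.
        inv = (1.0 / d2) ** (1.0 / (q - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # Equation 4: centers as membership-weighted means of all inputs.
        Uq = U ** q
        centers = (Uq @ X) / Uq.sum(axis=1, keepdims=True)
    # U holds the memberships computed just before the last center update.
    return centers, U
```

Stopping after a fixed number of iterations mirrors the termination criterion adopted in this study; a threshold on the total dissimilarity could be added where an objective is easy to define.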
The basic FCM algorithm explained above can be adapted to the DNA
motif discovery task, which can be viewed as searching for an unknown number
of w-length subsequences in an N-length DNA sequence. All w-length
subsequences of the given DNA sequence are therefore extracted so that they
can be statistically analyzed for the probability of being a transcription
factor binding site instance. This probability is inversely proportional to
the likelihood ratio of the nucleotide frequencies of the subsequence to the
background model. The FCM-based method proposed in this study handles these
issues by performing the following steps:
a) Clustering subsequences into a certain number of clusters, i.e., PWMs,
b) Testing each PWM to see whether it is statistically interesting or not.
The considered method requires the length of the sought motif instance to
be known prior to the analysis. Thus the only unknowns are the starting
positions of the motif instances. The algorithm considers all starting
positions from the beginning to the end of the whole DNA sequence by using the
sliding-windows technique. In this technique, a window of a specific length is
shifted one character at a time along the DNA sequence to obtain the next
window. If the sought motif is w-length and the whole DNA sequence is
N-length, the sliding-windows process produces N-w+1 subsequences, which form
the inputs to the algorithm. Figure 3.1 depicts the
technique.
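The sliding-windows extraction can be sketched in a few lines. The function name `sliding_windows` is an assumption made for this illustration; a sequence of length N yields N - w + 1 windows of width w.

```python
def sliding_windows(sequence, w):
    """Return every w-length subsequence, shifting the window by one character."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]

windows = sliding_windows("ACGTAC", 4)  # ['ACGT', 'CGTA', 'GTAC']
```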
Figure 3.1. Extraction of subsequences by using the sliding-windows technique
Once the inputs are extracted with the above-mentioned technique, FCM is
applied to cluster them. In this case, each cluster center is actually a PWM of
size 4 × w. While applying FCM by repeatedly evaluating Equations 2 and 4,
quantitative comparisons must be made between each subsequence and the
cluster centers, that is, a string must be compared with a PWM. Such
comparisons require a distance function, suited to the input space, that
measures the dissimilarity between DNA subsequences and PWMs.
The function D(x, m) given in Equation 5 is used for this purpose:
$$D(x, m) = 1 \Big/ \left( \sum_{i=1}^{w} \sum_{c \in \{A,C,G,T\}} e(x_i, m_{i,c}) \, m_{i,c} \right), \qquad e(x_i, m_{i,c}) = \begin{cases} 1, & x_i = c \\ 0, & x_i \neq c \end{cases} \qquad (5)$$
where m is a PWM of size 4 × w and x is a DNA sequence of length w. The PWMs
are randomly initialized and updated at each iteration of the clustering process.
Once the distances between the subsequences and the PWMs are calculated,
the PWMs can be updated according to Equation 4. Since the PWMs and the
subsequences involve different types of data, the application of Equation 4
requires the input x to be replaced with the function R(x, c), which is defined as:
$$R(x, c) = \sum_{i=1}^{l} \sum_{c \in \{A,C,G,T\}} eq(x_i, c), \qquad eq(c_1, c_2) = \begin{cases} 1, & c_1 = c_2 \\ 0, & c_1 \neq c_2 \end{cases} \qquad (6)$$
where $c_1$ and $c_2$ are members of the set $\{A, C, G, T\}$ and $l$ is the length of x. After the calculations are
performed, the PWMs are updated with Equation 1, which is given in Section 1.4.
It should be noted that the application of Equation 1 requires an
organism-specific background model, which is, in our case, a 3rd-order Hidden
Markov Model of the sequences.
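The comparison of a string with a PWM can be sketched as below. This is a hedged illustration rather than the thesis implementation: the PWM is assumed to be a 4 × w NumPy array with rows ordered A, C, G, T, the indicator matrix plays the role of R(x, c), and the small `eps` guard against a zero or negative score is an addition of this sketch. The names `one_hot` and `pwm_distance` are hypothetical.

```python
import numpy as np

BASES = "ACGT"
INDEX = {base: row for row, base in enumerate(BASES)}

def one_hot(x):
    """4 x w indicator matrix: entry (c, i) is 1 when position i of x holds base c."""
    R = np.zeros((4, len(x)))
    for i, base in enumerate(x):
        R[INDEX[base], i] = 1.0
    return R

def pwm_distance(x, m, eps=1e-9):
    """Equation 5: reciprocal of the summed PWM entries selected by the bases of x."""
    score = float((one_hot(x) * m).sum())
    return 1.0 / max(score, eps)  # eps is an assumption, not part of Equation 5
```

A subsequence that matches the high-valued entries of the PWM obtains a large score and therefore a small distance.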
A major feature of the FCM algorithm is its consideration of all inputs
when updating the clusters with the membership matrix as a weighting factor.
However, for the motif discovery task, the experiments of this study have shown
that updating each PWM with a selected group of subsequences, e.g., by selecting
some elements from the membership matrix by using a certain threshold, rather
than with all the subsequences resulted in better predictive performance. Thus, the
entries in each row of the membership matrix U are sorted and a certain number of
values are taken from the top of the sorted values. More formally, the x values in
Equation 4 are replaced with the Sel(x) function:
$$Sel(x_j) = \begin{cases} u_{ij}, & u_{ij} \in top(\max(u_{ij}), z) \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
where $u_{ij}$ is the membership value of $x_j$ to the cluster $c_i$ and z stands for the
number of top subsequences to be considered. In the experiments on the
datasets of this study, heuristically assigning z a value between 10
and 40 improved the performance of the algorithm regardless of the dataset
length. The main reason is that the number of transcription factor
binding site instances residing within the datasets of this study is close to
this value of z. Thus, this selection can be thought of as a weighting factor
that strengthens the influence of the binding site instances on the
clustering. The membership value calculations (Equations 2 and 5) and the
PWM updates (Equations 4, 6 and 7) are repeated a fixed
number of times. Once the training phase is completed, the inputs are
presented to the clusters one more time for a hard clustering. The final
forms of the PWMs are calculated after this assignment, with the PWM
updates performed as described earlier. This extra cycle of hard clustering
ensures that the clusters are at their best state to represent the subsequences.
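The top-z selection of Equation 7 can be sketched on a membership matrix whose rows are clusters and whose columns are inputs. The row/column orientation and the name `select_top_z` are assumptions made for this illustration.

```python
import numpy as np

def select_top_z(U, z):
    """Keep the z largest memberships in each row (cluster) of U and zero the rest."""
    z = min(z, U.shape[1])
    selected = np.zeros_like(U)
    # Column indices of the z largest entries of every row (order within them is arbitrary).
    top = np.argpartition(U, -z, axis=1)[:, -z:]
    rows = np.arange(U.shape[0])[:, None]
    selected[rows, top] = U[rows, top]
    return selected

U = np.array([[0.1, 0.9, 0.5, 0.3],
              [0.9, 0.1, 0.5, 0.7]])
kept = select_top_z(U, 2)
```

Only the retained memberships then contribute to the weighted PWM update, so each PWM is shaped by its z best-matching subsequences.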
Once the clustering is finished, a selection mechanism is invoked
to reveal the PWMs that potentially represent transcription factor binding
site instances among the PWMs that carry insignificant
information content. The statistical significance, as mentioned before, is
directly proportional to the unlikelihood under the background model. To
filter out uninteresting PWMs and mark those that may contain motif instances,
the PWMs are ranked by their z-scores, and the top PWMs of the sorted list are
considered potential motifs. The z-score of a specific PWM, i.e., cluster, is
calculated with
Equation 8:
$$z\text{-score} = \frac{O - E}{\sigma} \qquad (8)$$
where O represents the number of subsequences hard clustered to the PWM, E
is the number of subsequences expected to fall into the cluster by chance and
σ stands for the standard deviation of that chance count. As mentioned above,
once the training phase is completed, the subsequences are given to the
clusters one more time for a hard clustering so that the value of O can be
obtained for each cluster. To calculate the parameters E and σ, the following
steps are performed (Mahony et al., 2005):
a) Artificial sequences of the same length as the given DNA
sequence are produced by using the background model,
b) The artificial sequences are given to the clustering scheme in the same way
as the real sequence, i.e., with the sliding-windows technique,
c) The artificial windows are hard clustered so that the number of
subsequences associated with a cluster gives the value of E for that cluster,
d) To reduce the effect of chance, steps (a), (b)
and (c) are repeated a certain number of times (T), and the average
value of E is taken for each cluster.
The standard deviation is calculated by using the parameters E and T. Once the
z-scores of all PWMs are calculated, the PWMs are sorted by this score in
order to determine the motifs. The PWMs with the highest z-scores represent the
most probable motif candidates predicted by the algorithm (e.g., the 10
highest-scoring PWMs represent the predictions).
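The z-score ranking of Equation 8 can be sketched as follows. The text does not spell out how σ is obtained from E and T, so this sketch assumes the sample standard deviation of the per-cluster counts over the T background runs; that choice, the guard against a zero deviation, and the name `cluster_z_scores` are assumptions of this illustration.

```python
import numpy as np

def cluster_z_scores(observed, random_counts):
    """observed: (M,) counts O; random_counts: (T, M) counts from T background runs."""
    observed = np.asarray(observed, dtype=float)
    random_counts = np.asarray(random_counts, dtype=float)
    expected = random_counts.mean(axis=0)        # E, averaged over the T runs
    sigma = random_counts.std(axis=0, ddof=1)    # assumed estimator of sigma
    sigma = np.where(sigma > 0, sigma, 1.0)      # assumption: avoid division by zero
    return (observed - expected) / sigma

z = cluster_z_scores([30, 10], [[10, 10], [12, 8], [8, 12]])
ranking = np.argsort(z)[::-1]  # cluster indices from highest to lowest z-score
```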
3.2.2. Expectation Maximization with Gaussian Mixture Models
The Expectation Maximization algorithm, first introduced by
Dempster et al. (1977), has become a very popular way of estimating the
parameters of a statistical model from incomplete data. The main idea of the
algorithm is to start with a guess of the unknown parameters and then repeat
two steps:
a) E (Expectation) step, in which the expected values of the unknown
quantities are estimated by use of the known values,
b) M (Maximization) step in which the hidden parameters are re-estimated to
maximize the likelihood of the data.
The Gaussian Mixture Model (GMM) is one of the mathematical models that are
frequently used with the EM algorithm to estimate unknown parameter values in
various problems. The combination of GMM and EM is widely used for data
clustering as a part of several tasks in machine learning
(McLachlan and Krishnan, 1997) and bioinformatics (Do and Batzoglou, 2008).
Briefly, the EM algorithm for the particular problem of separating given
data vectors, $X = \{x_1, x_2, \ldots, x_N\}$, into clusters, $C = \{c_1, c_2, \ldots, c_M\}$, that is,
fitting the given X to an M-component GMM, can be characterized as:
1) E-Step:
$$p^t(m \mid n) = \frac{w_m^t \, f(x_n, m_m^t, \sigma_m^t)}{\sum_{k=1}^{M} w_k^t \, f(x_n, m_k^t, \sigma_k^t)} \qquad (9)$$
where the function $f(x_n, m_m^t, \sigma_m^t)$ corresponds to the Probability Density Function
(PDF) of a given input vector $x_n$ over the cluster $c_m$ at iteration $t$, and $w_m^t$
denotes the weight of a specific cluster within the M-component mixture under
the constraint $\sum_{k=1}^{M} w_k^t = 1$. The PDF value gives the membership value of
a specific input to a specific cluster. The calculation of the PDF is given
in Equation 10.
$$f(x, m, \sigma) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{D}} \, e^{-\frac{1}{2}\left(\frac{\|x - m\|}{\sigma}\right)^{2}} \qquad (10)$$
In Equation 10, D corresponds to the number of dimensions of the input data and
|| ... || is a distance function, e.g., Euclidean. Both Equation 9 and Equation 10
require the value of m, the mean vector of a specific cluster, and σ, which
stands for the variance of the cluster. Note that, for a multidimensional input
space such as 2-D or 3-D, the symbol σ is generally replaced with the symbol Σ,
which stands for the covariance matrix that holds the variances and
covariances between dimensions.
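2) M-Step:
$$m_m^{t+1} = \frac{\sum_{n=1}^{N} p^t(m \mid n) \, x_n}{\sum_{n=1}^{N} p^t(m \mid n)} \qquad (11)$$
$$\sigma_m^{t+1} = \frac{\sum_{n=1}^{N} p^t(m \mid n) \, \|x_n - m_m^{t+1}\|^2}{\sum_{n=1}^{N} p^t(m \mid n)} \qquad (12)$$
$$w_m^{t+1} = \frac{\sum_{n=1}^{N} p^t(m \mid n)}{N} \qquad (13)$$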
In Equations 11, 12 and 13, the means, variances and mixing weights of each
cluster are updated, respectively, with the values calculated in Equation 9.
The E and M steps are repeated one after another until the algorithm
terminates. Termination usually occurs when a satisfactory objective value is
reached or a maximum number of iterations has been executed. In this study, as
with FCM, the algorithm is run for a fixed number of iterations.
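For ordinary numeric vectors, the E and M steps of Equations 9-13 can be sketched as below. This is a minimal illustration with spherical Gaussians, not the thesis implementation. Two choices are assumptions of the sketch: the variance update divides by the dimension D and takes a square root so that `sigmas` holds standard deviations (a common textbook convention that may differ from Equation 12 as printed), and the initial means are supplied by the caller. The names `gaussian_pdf` and `em_gmm` are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mean, sigma):
    """Equation 10 for every row of X (spherical Gaussian in D dimensions)."""
    D = X.shape[1]
    dist2 = ((X - mean) ** 2).sum(axis=1)
    return np.exp(-0.5 * dist2 / sigma ** 2) / (np.sqrt(2 * np.pi) * sigma) ** D

def em_gmm(X, means, n_iter=50):
    """Fit an M-component spherical GMM starting from the given means."""
    N, D = X.shape
    means = np.array(means, dtype=float)
    M = len(means)
    sigmas = np.full(M, X.std() + 1e-6)
    weights = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step (Equation 9): responsibilities p(m | n), shape (M, N).
        dens = np.stack([weights[m] * gaussian_pdf(X, means[m], sigmas[m])
                         for m in range(M)])
        resp = dens / np.maximum(dens.sum(axis=0, keepdims=True), 1e-300)
        totals = np.maximum(resp.sum(axis=1), 1e-12)
        # M-step: means (Equation 11), spreads (cf. Equation 12), weights (Equation 13).
        means = (resp @ X) / totals[:, None]
        dist2 = ((X[None, :, :] - means[:, None, :]) ** 2).sum(axis=2)
        sigmas = np.sqrt((resp * dist2).sum(axis=1) / (D * totals)) + 1e-6
        weights = totals / N
    return means, sigmas, weights, resp
```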
Similar to FCM, EM/GMM is capable of clustering w-mers for motif
finding purposes. Since the original EM/GMM approach given above
(Equations 9-13) is suited to a high-dimensional numeric space, some
modifications must be applied to EM/GMM, as was done for FCM. The goal is to
cluster a set of subsequences into k clusters, which amounts to fitting
k PWMs to the inputs, where each PWM corresponds to a local
alignment of some subsequences. Unlike in FCM, each PWM (denoted m) is a
4 × w matrix characterized by a variant of Equation 1 without the log
function, $m_{ib} = f_{ib} / \Theta_b$. The log is removed from Equation 1
because each column of the PWM should be normalized when the GMM calculations
are done to obtain distances between PWMs and subsequences.
In the E step, in which the PDF calculations are done, a
specific distance function is needed for the motif discovery problem.
The function $D(x, m)$, given in Equation 14, is such a distance
function; it replaces every occurrence of $\|x - m\|$ in Equation 10
and Equation 12. This replacement overcomes the problem of
being unable to perform operations between inputs and clusters, since the two
operands do not have the same dimensionality. The function $D(x, m)$, where x is a
string of length w and m is a matrix of size 4 × w, is as follows:
$$D(x, m) = 1.0 - \frac{\sum_{i=1}^{w} S(x_i, m_i)}{w} \qquad (14)$$
where $S(x_i, m_i)$ gives the similarity score for a single position within the string:
$$S(x_i, m_i) = \sum_{b \in \{A,C,G,T\}} eq(x_i, m_{ib}) \, m_{ib}, \qquad eq(x_i, m_{ib}) = \begin{cases} 1, & x_i = b \\ 0, & x_i \neq b \end{cases} \qquad (15)$$
The function $S(x_i, m_i)$ gives the likelihood ratio of a given sequence to the PWM.
As mentioned before, the values in each column of both the PWM and the
probability matrix are restricted to sum to one, $\sum_{b \in \{A,C,G,T\}} m_{ib} = 1$.
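The distance used in the E step can be sketched as follows, reading Equation 14 as one minus the mean per-position similarity. The PWM is represented here as a list of per-column dictionaries whose values sum to one; that representation and the names `similarity` and `pwm_distance` are assumptions of this illustration.

```python
def similarity(base, column):
    """Equation 15: the entry of the PWM column that matches the observed base."""
    return column[base]

def pwm_distance(x, m):
    """Equation 14 (as read here): one minus the mean similarity over the w positions."""
    w = len(x)
    return 1.0 - sum(similarity(x[i], m[i]) for i in range(w)) / w

# A 3-column PWM whose columns each sum to one and favour the base A.
pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 3
close = pwm_distance("AAA", pwm)  # small distance for the favoured word
far = pwm_distance("CCC", pwm)    # larger distance for a poorly matching word
```

Because the columns are normalized, the distance stays within the interval [0, 1], which makes it a convenient stand-in for the norm inside the Gaussian of Equation 10.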
After the PDF value for each input and cluster is calculated, the M step
is executed to update the variance, mean and weight of each cluster center.
Each cluster can be seen as a node that holds a mean (i.e., a PWM), a variance
and a mixing weight. Some supporting variables, such as the probability matrix,
are also stored within a node. In the M step, each PWM is updated with the
corresponding PDF values of each input.
$$p_{ib} = \sum_{n=1}^{N} eq(i_b, x_n) \, x_n \, f(x_n, m, \sigma), \qquad eq(i_b, x_n) = \begin{cases} 1, & i_b = x_n \\ 0, & i_b \neq x_n \end{cases} \qquad (16)$$
Equation 16 shows the calculation of a specific element of the probability
matrix. After the calculations are done, the probability matrix is
normalized so that each column sums exactly to 1. Afterwards, the
calculation of the PWM is straightforward, as shown in Equations 11-13. In the
standard GMM approach, all inputs update every PWM with the corresponding
PDF values as weighting factors. However, in the experiments of this study
on the motif discovery problem, updating each PWM with a
selection of subsequences rather than the complete set resulted in
better performance. Thus, the term $x_n$ in Equation 16 is replaced with
$Sel(x_n)$:
$$Sel(x_n) = \begin{cases} a, & a \in top(z, \min(a)) \\ 0, & \text{else} \end{cases} \qquad (17)$$
where f is the PDF value, a stands for the $f(x_n)$ values for each x and z is the
number of top subsequences taken to update the PWM. For the genomic
datasets used in this study, setting z to the number of motif instances
produced the best results. Thus, supplying an expected
number of motif instances and performing the update with this parameter
improves the performance of the proposed method. Note that the $Sel(x_n)$
transformation is an optional step and may be omitted. When $Sel(x_n)$ is used,
however, the user must provide an expected number of motif
instances in order to run the algorithm. It should be noted that the favorable
contribution of Equation 17 to the motif discovery task is justified only
empirically. That is, applying Equation 17 may fail in
tasks where the goal is optimum clustering of vectors in a high-dimensional
space. Nonetheless, experiments have shown that it enhances motif
discovery performance. Since Equation 17 modifies the original GMM, the
resulting model may be considered a GMM variant.
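The weighted update of the probability matrix, with the optional top-z selection, can be sketched as follows. This is an interpretation rather than the thesis code: each subsequence adds its PDF value to the entries that match its bases, only the z subsequences with the largest PDF values are kept when `z` is given, and every column is then normalized. The name `update_probability_matrix` and the row order A, C, G, T are assumptions.

```python
import numpy as np

BASES = "ACGT"

def update_probability_matrix(subseqs, pdf_values, z=None):
    """Return a 4 x w matrix with columns summing to 1, weighted by PDF values."""
    pdf = np.asarray(pdf_values, dtype=float)
    if z is not None and z < len(pdf):
        # Keep only the z largest PDF values, zero the others (cf. Equation 17).
        mask = np.zeros_like(pdf)
        mask[np.argsort(pdf)[-z:]] = 1.0
        pdf = pdf * mask
    w = len(subseqs[0])
    P = np.zeros((4, w))
    for x, weight in zip(subseqs, pdf):
        for i, base in enumerate(x):
            P[BASES.index(base), i] += weight
    col = P.sum(axis=0, keepdims=True)
    return P / np.where(col > 0, col, 1.0)

P = update_probability_matrix(["AAA", "AAC", "TTT"], [0.5, 0.3, 0.01], z=2)
```

In this small example the weakly supported word TTT is discarded by the selection, so it leaves no trace in the resulting matrix.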
After E and M steps are executed for a certain number of times, the
resulting PWMs are statistically tested as mentioned in Section 3.2.1. Similarly,
the best PWMs with the highest z-scores are regarded as the most probable motif
candidates.
3.2.3. Self-Organizing Map
The Self-Organizing Map (SOM) is a type of neural network that is mainly used
for visualization, dimension reduction and data compression (Kohonen, 1998).
SOM generally takes inputs from a high-dimensional space and maps them
into a lower-dimensional space. To accomplish such a projection,
SOM employs an input layer of vectors with the same dimensionality as the
inputs and an output layer of nodes interconnected with the input layer. In
most applications the output layer is chosen to be low-dimensional, typically
a 2D planar grid of nodes, to allow easy interpretation of the transformation.
The basic SOM algorithm can be summarized as:
a) Randomly initialized weight vectors in the input layer are fed with one
input at a time,
b) Closest weight vector to the input, in other words winner node or best
matching unit (BMU) is chosen,
c) Winner node and its topological neighbors are updated with the input,
d) Steps a, b and c are repeated for each input for a number of times which is
called training.
More formally, the index of BMU is determined via:
$$c = \arg\min_{k} \big( dist(x_i, n_k) \big) \qquad (18)$$
where xi represents an input to be compared to nk which is a node on SOM output
layer; dist stands for an appropriate distance function, most commonly Euclidean.
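For numeric inputs, the BMU search of Equation 18 can be sketched in a couple of lines, assuming Euclidean distance and an array with one weight vector per row. The function name `best_matching_unit` is hypothetical.

```python
import numpy as np

def best_matching_unit(x, nodes):
    """Equation 18: index of the node whose weight vector is closest to x."""
    distances = np.linalg.norm(nodes - x, axis=1)
    return int(np.argmin(distances))

nodes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
winner = best_matching_unit(np.array([0.9, 1.2]), nodes)
```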
When SOM is applied to motif finding, the basic algorithm flow remains
almost the same. SOM, however, is mostly designed for numerical inputs, and
some modifications are therefore needed to make it work for subsequence
clustering with PWMs. For motif finding, where the input space consists of
subsequences extracted from the given promoter sequences of putatively
co-regulated genes, each node at the output layer of SOM is associated with a
randomly initialized PWM (Step a). Comparing an input subsequence x, which is
a w-length string of the nucleotides A, C, G and T in arbitrary order, with a
4 × w PWM in order to find the PWM closest to the given input requires a
likelihood function that adapts Equation 18 to the motif finding procedure:
$$L(x, m) = \sum_{i=1}^{l} \sum_{b=1}^{4} eq(x_i, b) \, m_{i,b}, \qquad eq(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases} \qquad (19)$$
With the function L(x, m), a subsequence and a PWM can be compared and their
similarity obtained quantitatively, which allows the BMU to be determined
during training (Step b). Consequently, the BMU and a number of its topological
neighbors are updated with the given input subsequence. For motif finding,
this is performed by first updating the frequency matrix and then, after all
the inputs have been associated with the clusters (i.e., nodes or PWMs),
updating the PWM that represents each node from its frequency matrix. To
update the frequency matrix, Equation 20 is utilized:
$$f^{c}_{i,b} = \varphi(c)\,\alpha(t)\,\frac{\sum_{k=1}^{h} eq(x_{k,i}, b)}{\sum_{y=1}^{4}\sum_{k=1}^{h} eq(x_{k,i}, y)} \qquad (20)$$
where $f^{c}_{i,b}$ corresponds to the frequency matrix element at column i and row b of
node c, $x_{k,i}$ is the nucleotide at position i of the k-th subsequence associated with
the node, and h stands for the number of subsequences associated with the node c.
The function $\varphi(c)$ controls the influence of the neighborhood as a function of
topological closeness to the BMU. The neighborhood function is usually chosen as a
Gaussian, since it produces continuous values that decrease
with decreasing closeness. The term $\alpha(t)$ stands for the learning rate at iteration t, where
the learning coefficient decreases through the algorithm iterations. Subsequent to the
frequency matrix calculation, the PWM of the node is obtained by
applying Equation 18 once at each iteration of the algorithm.
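The comparison and update steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the PWM is stored as a list of w columns with four entries each (the transpose of the 4×w layout in the text), the Gaussian form of the neighborhood function and all function names are hypothetical choices for this example, and the conversion from frequency matrix to PWM is not shown.

```python
import math

ALPHABET = "ACGT"

def likelihood(x, pwm):
    """Equation 19: add up the PWM entries selected by the nucleotides of x."""
    return sum(pwm[i][ALPHABET.index(ch)] for i, ch in enumerate(x))

def column_frequencies(subsequences, w):
    """The fraction in Equation 20: share of each nucleotide in every column."""
    h = len(subsequences)
    freq = [[0.0] * 4 for _ in range(w)]
    for s in subsequences:
        for i in range(w):
            freq[i][ALPHABET.index(s[i])] += 1.0 / h
    return freq

def gaussian_neighborhood(grid_distance, sigma):
    """One possible phi(c): equals 1 at the BMU and decays with grid distance."""
    return math.exp(-(grid_distance ** 2) / (2.0 * sigma ** 2))

def som_frequency_update(subsequences, w, grid_distance, sigma, alpha):
    """Scale the column frequencies by phi(c) * alpha(t), as in Equation 20."""
    scale = gaussian_neighborhood(grid_distance, sigma) * alpha
    return [[scale * f for f in column]
            for column in column_frequencies(subsequences, w)]
```

For the BMU itself the grid distance is zero, so the neighborhood factor is 1 and only the learning rate scales the frequencies; nodes farther away on the grid receive a proportionally weaker update.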
After SOM clustering, the procedure to analyze obtained PWMs to extract
potential motifs is the same as in previous clustering techniques. The results with
the highest z-scores will be selected as the prediction of the SOM algorithm.
3.2.4. K-Means
K-Means is a clustering algorithm which relies on
repeatedly moving a given number of cluster centers $C = \{c_1, c_2, \ldots, c_m\}$, which are generally
randomly initialized, towards the given inputs $X = \{x_1, x_2, \ldots, x_n\}$ to be
clustered. This procedure consists of executing two main steps until a satisfactory
convergence is reached:
a) Calculating the distances between inputs and cluster centers in order to
assign each input to the nearest cluster,
b) Updating each cluster center by taking the mean of the inputs assigned to it.
Thus, the objective is minimizing the following loss function:
$$V = \sum_{i=1}^{m} \sum_{j=1}^{n} \lVert x_j - c_i \rVert^{2} \qquad (21)$$
where || ... || is a distance function, usually Euclidean.
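As for the motif discovery application, the distance function is required to
be replaced with $1/L(x_j, c_i)$, which utilizes the similarity function from Equation
19. Consequently, each cluster center is updated with:
$$c_i = \frac{1}{n_i} \sum_{\forall x_j \in C_i} x_j \qquad (22)$$
where $C_i$ is the set of inputs assigned to the center $c_i$ and $n_i$ is the number of these inputs.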
In order to fit this mean updating task for motif finding, nucleotide
frequencies at each position of locally aligned sequences should be counted to
form frequency matrix. After establishing a hard association of each subsequence
to a cluster center, the frequency matrix is updated as follows:
$$f^{c}_{i,b} = \frac{\sum_{k=1}^{h} eq(x_{k,i}, b)}{\sum_{y=1}^{4}\sum_{k=1}^{h} eq(x_{k,i}, y)} \qquad (23)$$
where $x_{k,i}$ is the nucleotide at position i of the k-th subsequence and h corresponds
to the number of subsequences associated with the cluster
center c. The update step is very similar to SOM’s
update, which is given in Equation 20. Considering the given equations,
the trainings of SOM and K-means are very similar until the update
step, where K-means updates each cluster center only with the inputs hard-assigned to it,
whereas SOM updates the winner node and also its neighbors.
The neighborhood concept and the decreasing learning rate can be considered as the main
differences between the SOM and K-means learning schemes.
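A single K-means iteration adapted to w-mers may be sketched as below. The sketch makes two simplifying assumptions that should be noted: each cluster center is represented directly by its frequency matrix (the PWM derivation is omitted), and maximizing L is used in place of minimizing 1/L, which selects the same center whenever L is positive. All function names are hypothetical.

```python
ALPHABET = "ACGT"

def likelihood(x, center):
    """Similarity of a w-mer to a center, following Equation 19."""
    return sum(center[i][ALPHABET.index(ch)] for i, ch in enumerate(x))

def assign(subsequences, centers):
    """Hard-assign each w-mer to the center with the largest L (smallest 1/L)."""
    clusters = [[] for _ in centers]
    for s in subsequences:
        best = max(range(len(centers)), key=lambda c: likelihood(s, centers[c]))
        clusters[best].append(s)
    return clusters

def update(cluster, w, old_center):
    """Equation 23: column-wise nucleotide frequencies of the assigned w-mers."""
    if not cluster:
        return old_center  # an empty cluster keeps its previous center
    center = [[0.0] * 4 for _ in range(w)]
    for s in cluster:
        for i, ch in enumerate(s):
            center[i][ALPHABET.index(ch)] += 1.0 / len(cluster)
    return center

def kmeans_step(subsequences, centers, w):
    """One assignment step followed by one update step."""
    clusters = assign(subsequences, centers)
    new_centers = [update(cl, w, c) for cl, c in zip(clusters, centers)]
    return new_centers, clusters
```

Repeating kmeans_step until the assignments stop changing corresponds to the two-step loop described above; unlike the SOM update, no neighboring center is modified.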
3.2.5. Post-optimization for clustering approach
The clustering algorithms given in Section 3.2 are generally sensitive to
initialization and thus they may not necessarily reach the global optimum.
Additionally, the motif finding task is a multimodal problem with many
local optima in which the algorithm may be trapped. Therefore, a
post-optimization procedure to enhance the obtained PWMs is utilized after clustering
is completed. The optimization is performed for each PWM by use of the
Bayesian scoring function proposed by Jensen and Liu (2004):
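$$\psi(A) = |A| \left( \log \prod_{i=1}^{w} \prod_{b=1}^{4} \left( \frac{\hat{\theta}_{ib}}{\hat{\theta}_{0b}} \right)^{\hat{\theta}_{ib}} - 1 + \log \frac{\hat{p}_0}{1 - \hat{p}_0} \right) \qquad (24)$$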
where |A| denotes the number of TFBS instances that are locally aligned in the
cluster and $\hat{p}_0$ is the ratio of |A| to the total number of possible TFBS locations on X
(i.e., $\hat{p}_0 = |A| / (N - w)$). The symbol $\hat{\theta}_{ib}$ denotes the frequency of letter b at
position i of the alignment matrix and the symbol $\hat{\theta}_{0b}$ corresponds to the background
frequency of the same letter.
In clustering, each local alignment may include false positive
subsequences and they cannot be easily avoided by iteratively clustering
subsequences into most relevant centroids. Thus, we utilize the scoring function in
Equation 24 that gives us the ability to quantitatively measure the effects of
modifications on the alignment such as removal of a subsequence or addition of a
new w-mer from X. Therefore, in order to optimize each PWM, two steps of
operations are iteratively performed on the PWMs separately after clustering is
completed:
a) Re-alignment operation: Since most of the given sequences in X are
expected to contain an instance of the sought motif, the alignment that
represents a candidate motif model is also expected to contain instances
from most of the sequences. Thus, first, the sequences $S_i$ from X none of whose
w-mers are included in the local alignment are selected. Then, every
w-mer from each $S_i$ is considered for inclusion in the alignment by
calculating a corresponding score, $\psi(A')$. The w-mer is immediately
included in the alignment if $\psi(A') > \psi(A)$. Similarly, the subsequences
that are already aligned in the clustering phase are reconsidered to see
whether their removal or replacement with another w-mer from the same $S_i$
produces an improved score.
b) Shift operation: Let a be the starting position of a subsequence and let
$\vec{a} = \{a_1, a_2, a_3, \ldots, a_n\}$ represent a local alignment of n
subsequences. The shift operation checks whether a simultaneous shift of
the alignment in either direction results in an improvement in the score.
That is, for each $i \in \{-1, 1\}$, the shifted alignment
$\vec{a}' = \{a_1 + i, a_2 + i, a_3 + i, \ldots, a_n + i\}$ and the corresponding score $\psi(A')$ are
calculated; then $\vec{a}'$ is accepted as the new $\vec{a}$ if $\psi(A') > \psi(A)$.
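The shift operation can be illustrated with the short Python sketch below. It is written against an arbitrary scoring callable so that it does not depend on a particular form of the scoring function; the function name shift_alignment, the boundary check and the repeat-until-no-improvement loop are assumptions of this example.

```python
def shift_alignment(starts, score, lengths, w):
    """Greedy simultaneous shift: move all start positions by -1 or +1 while the score improves."""
    best, best_score = list(starts), score(starts)
    improved = True
    while improved:
        improved = False
        for step in (-1, 1):
            candidate = [a + step for a in best]
            # Skip shifts that would push any w-mer outside its own sequence.
            if any(a < 0 or a + w > n for a, n in zip(candidate, lengths)):
                continue
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
    return best, best_score
```

With a toy score that peaks when the first start position equals 5, for instance score = lambda a: -abs(a[0] - 5), the alignment [2, 3] is moved step by step to [5, 6] and then left unchanged.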
In addition to the re-alignment and shift operations, the post-optimization
phase also contains a procedure to find the optimal width of the sought motif. As
mentioned previously, the user of the proposed method is required to give three
parameters for a motif search: $w_0$, $w_{min}$ and $w_{max}$. Clustering is executed with the
assumption of $w = w_0$, and the optimal width is then found in the post-optimization
phase by varying the width between $w_{min}$ and $w_{max}$. The Bayesian scoring scheme
is modified to fit a variable-width motif search strategy and becomes:
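$$\psi(A, w) = \log p(w) + \log B(|A|, L - |A|) + \sum_{b=1}^{4} n_{0b} \log \theta_{0b} + \sum_{i=1}^{w} \log \left[ \frac{\Gamma(4\beta)}{\Gamma(|A| + 4\beta)} \prod_{b=1}^{4} \frac{\Gamma(n_{ib} + \beta)}{\Gamma(\beta)} \right] \qquad (25)$$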
where $p(w)$ is a prior distribution of w, such as Poisson, $\Gamma(x + 1) = x!$ and
$B(c, d) = \int_{0}^{1} x^{c} (1 - x)^{d}\, dx$. The term $n_{ib}$ refers to the count of nucleotide b at
position i and $n_{0b}$ denotes the background count of the same nucleotide. Please
refer to (Jensen et al. 2004) for more details.
The fitness function given in Equation 25 allows us to find an optimal
configuration of A, i.e., the local alignment, with respect to a given w. The width
$w'$, where $w_{min} \le w' \le w_{max}$, is accepted as the optimal width if $\psi(A, w')$ is superior
to all other $\psi(A, w_i)$ in the range of possible width options.
3.2.6. Particle Swarm Optimization
Particle Swarm Optimization (Eberhart and Kennedy, 1995) is a popular
metaheuristic optimization procedure that iteratively improves a swarm of
candidate solutions, which are called particles. Through the iterations of the
algorithm, each particle is moved within the search space in order to locate a
better position in terms of a problem-specific fitness value. The movement of the
particles is controlled by a social cognition mechanism in which the best position found
in the neighborhood of each particle influences its next position and
movement speed. Moreover, each particle’s record of its own best position so far is
also considered while moving the particle in the search space. By use of this basic
strategy, PSO has been shown (Poli, 2008) to be capable of finding good solutions for
various problems modeled mostly in high-dimensional spaces of numerical values.
More formally, let $S \subseteq \mathbb{R}^n$ be the search space in which each particle $p_k$
holds a vector of positions $\vec{x} = \{x_1, x_2, \ldots, x_n \mid \forall x_i \in S\}$ and accompanying
velocities $\vec{v} = \{v_1, v_2, \ldots, v_n \mid \forall v_i \in \mathbb{R}\}$, where an n-dimensional space is considered. At
the beginning of the algorithm, each position $x_i$ and velocity $v_i$ is set to a random
value. Then, at each iteration t, the velocities and the positions are updated
according to Equations 26 and 27, respectively.
$$v_i^{(t+1)} = \alpha v_i^{(t)} + \beta r_1 \left( x_i^{lb} - x_i^{(t)} \right) + \gamma r_2 \left( x_i^{nb} - x_i^{(t)} \right) \qquad (26)$$
$$x_i^{(t+1)} = x_i^{(t)} + v_i^{(t+1)} \qquad (27)$$
In Equation 26, $r_1$ and $r_2$ are independent random real values between 0 and 1.
The parameter α is called the inertia factor; it controls how much of the previous
velocity $v_i^{(t)}$ is retained for the iteration t+1. The terms β and γ are the
cognitive and social parameters that control the effect of $x_i^{lb}$, the local
best position ever found by the particle itself, and $x_i^{nb}$, the best position found by the informants of the
particle, i.e., the particle’s neighbors according to a population topology. Some
of the most popular topologies are GBest, Bidirectional ring, Random
and Von Neumann (Kennedy and Mendes, 2002). In the GBest topology, the
informants of each particle are the whole population, whereas in the other
topologies each particle is connected to k other particles, where
$1 \le k \le C_p$ in a swarm of size $C_p$. The selected topology determines the parameter
k and which particles are interconnected. In Bidirectional ring, all particles are
arranged sequentially to form a ring and each particle has k = 2 neighbors: the
previous particle and the next one on the ring. In the Random topology, an arbitrary k is
determined, where $1 \le k \le C_p$, and each particle is connected to k
randomly selected particles. In Von Neumann, a two-dimensional lattice of
particles is formed and each particle has k = 4 neighbors: the particles above, below, to the left and
to the right. Several studies (Kennedy and Mendes, 2002; Kennedy, 1999)
have evaluated the performance of different population topologies and,
along with other conclusions, have shown that including the particle itself in
its list of informants may increase the overall performance of the PSO algorithm.
Thus, in the above-mentioned population topologies, the parameter k should be
considered as k+1, since the particle is also taken as an informant to itself. Figure
3.2 depicts how particles are interconnected according to each
neighborhood topology utilized in this study.
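Equations 26 and 27, together with the Bidirectional ring topology, may be sketched in Python as follows. The function names are hypothetical, and drawing a single pair of random numbers per particle update (rather than per dimension) is an assumption of this sketch, since Equation 26 does not index the random terms.

```python
import random

def ring_informants(index, swarm_size):
    """Bidirectional ring: the previous particle, the particle itself and the next one."""
    return [(index - 1) % swarm_size, index, (index + 1) % swarm_size]

def update_particle(x, v, local_best, neighbor_best, alpha, beta, gamma, rng=random):
    """Apply Equation 26 (velocity) and then Equation 27 (position) to one particle."""
    r1, r2 = rng.random(), rng.random()
    new_v = [alpha * vi + beta * r1 * (lb - xi) + gamma * r2 * (nb - xi)
             for xi, vi, lb, nb in zip(x, v, local_best, neighbor_best)]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v
```

The informant list includes the particle itself, in line with the k+1 convention discussed above. For the motif discovery setting the resulting positions would additionally have to be rounded and clipped to valid locations, which is not shown here.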
In each iteration of the algorithm, the best position within the
neighborhood and each particle’s own best visited position are recalculated. The
algorithm is terminated when either a satisfactory convergence is reached or a
certain number of iterations has been executed. The convergence can be defined as
either the convergence of the swarm’s best particle to the global optimum or the
convergence of all particles to a single position, which may or may not be the global
optimum. PSO practitioners often prefer to monitor the change in the global fitness
or in the best particle’s fitness to terminate the algorithm. If the tracked value
does not change for a few iterations, then the swarm is considered to have collapsed or
converged, leading to immediate termination of the algorithm.
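The stagnation-based termination criterion can be expressed compactly. The following Python sketch assumes that fitness is maximized and that the best fitness of every iteration is appended to a list; the function name and the default values of patience and tolerance are hypothetical.

```python
def has_stagnated(best_fitness_history, patience=10, tolerance=0.0):
    """True when the tracked best fitness did not improve over the last `patience` iterations."""
    if len(best_fitness_history) <= patience:
        return False  # not enough iterations observed yet
    reference = best_fitness_history[-patience - 1]
    recent_best = max(best_fitness_history[-patience:])
    return recent_best - reference <= tolerance
```

A positive tolerance treats very small improvements as no change, which may be preferable when the fitness values are noisy.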
Figure 3.2. Graphical representations of utilized population topologies: (a) GBest
(b) Ring (c) Random (d) Von Neumann
In order to adapt PSO to the DNA motif discovery task, the solution space
and the motif representation should first be designed. We are given a set of DNA sequences
$S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ has an arbitrary length $l_i > 0$. We
seek the starting positions of TFBS instances of known width w, where the number of
instances $k_i$ in each sequence is unknown. If every position in $S_i$ is independently considered
as a potential TFBS instance, then $k_i$ ranges over $0 \le k_i < l_i - w + 1$, which makes the
solution space too large for an optimization procedure to process in a reasonable
time, since the complexity would be $O((2^{l_i - w})^n)$ (Wei and Jensen, 2006; Chan et
al., 2008). Thus, literature methods with similar approaches, such as those which
utilize a Genetic Algorithm, generally prefer to simplify the solution space by
restricting $k_i$ to $0 \le k_i \le 1$. With this restriction, each particle represents a
solution with a vector $\vec{x} = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ represents a possible
location of a TFBS instance in $S_i$ and is restricted to $0 \le x_i < l_i - w + 1$. In a
real-world scenario, this restricted search procedure is able to locate the
majority of the sought TFBS instances, but not all of them, since the motif
abundance per sequence may vary. Therefore, once the PWM of the sought model
is constructed, post-processing procedures can take place and find additional
motif instances that improve the motif model according to the fitness function.
Figure 3.3 shows a sample particle and its evaluation.
Figure 3.3. A sample particle and its evaluation
In this scheme, where each particle simply holds a vector $\vec{x}$ of possible
locations on each $S_i$, the PSO algorithm can be run as explained previously. However,
it may not reach the global optimum under these conditions. The problem is that, even
though the search space appears to be continuous, it actually is not.
Subsequences, i.e., w-mers, extracted from adjacent locations on $S_i$ are not
necessarily similar. For instance, let $S_i$ be the DNA sequence
ACGACCATCGATGG and w = 4. Then, all possible locations on $S_i$ are limited
to $0 \le x_i \le 10$. In this space, consider two successive locations, $x_i = 6$ and
$x_i = 7$; the w-mers corresponding to these locations are ATCG and TCGA.
Although the locations are adjacent, the corresponding w-mers are very
different and do not share a common pattern when aligned one under the
other. For PSO to approach an optimum iteratively, the search
space should be continuous in this sense. Otherwise, the particles move randomly
through the search space. To overcome this issue, we propose a transformation of
the original $S_i$ into a continuous space $S_i^c$. In order to construct $S_i^c$, all
w-mers from the original $S_i$ are extracted. These w-mers, which are also strings
over the alphabet {A, C, G, T}, are sorted in alphabetical order. The sorted
list is actually a new sequence of w-mers. At this point, we have a new continuous
space $S_i^c$ in which the w-mers present a gradient distribution. Figure 3.4 depicts the
process for the example given previously. The transformation is applied to each
particle by replacing each element of $\vec{x}$ with a simple function $C(x_i)$ that
translates the original location information $x_i$ into a new position $x_i^c$ in the sorted
list.
Figure 3.4. Transformation of $S_i$ into $S_i^c$
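The transformation into the sorted space can be sketched with a few lines of Python. The sketch assumes zero-based start positions and breaks ties between identical w-mers by their original position; the function name build_sorted_space and the dictionary used for the mapping $C(x_i)$ are choices made for this illustration.

```python
def build_sorted_space(sequence, w):
    """Return (sorted_wmers, to_sorted) for one sequence.

    sorted_wmers[c] is the pair (w-mer, original start position);
    to_sorted[x] is the rank of original start position x in the sorted list.
    """
    wmers = [(sequence[i:i + w], i) for i in range(len(sequence) - w + 1)]
    sorted_wmers = sorted(wmers)  # alphabetical order, ties by original position
    to_sorted = {start: rank for rank, (_, start) in enumerate(sorted_wmers)}
    return sorted_wmers, to_sorted

sorted_wmers, to_sorted = build_sorted_space("ACGACCATCGATGG", 4)
```

For the example sequence, the adjacent w-mers ATCG and TCGA (zero-based starts 6 and 7) are mapped to ranks 2 and 10 of the sorted list, whereas the neighbors of ATCG in the sorted space are ACGA and ATGG, which share its leading nucleotide.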
The chosen fitness function, which has been shown (Jensen and Liu, 2004; Wei
and Jensen, 2006) to be suitable and effective for the motif discovery task, is the
Bayesian scoring function given in Equation 24 in Section 3.2.5. The scoring
function is usable both for post-optimization in the clustering approach and as the fitness
function for the stochastic search procedure, PSO. To utilize this function in the PSO-based
algorithm, a corresponding PFM is calculated for each particle, which keeps track of the
TFBS locations in $\vec{x}$, in each iteration of the algorithm so that the
term $\hat{\theta}_{ib}$ can be obtained. Each $\hat{\theta}_{0b}$ can be calculated once, when the dataset is
loaded into the program, by counting the frequencies of the letters in the whole
dataset. The variant of the Bayesian function given in Equation 25 is also usable
for PSO. With this function, PSO can be fit into a variable-width motif
search strategy. Figure 3.5 summarizes the proposed PSO-based approach in
pseudocode format.
Initialize parameters
Load data and transform data $S_i$ into $S_i^c$
k ← 0
while (k < numMotifs) {
    i ← 0
    Set topology
    Initialize particles
    Initialize connections
    while (i < maxIterations AND not converged) {
        Update particles’ velocities
        Update particles’ positions
        Update particles’ fitnesses
        Select best particle
        if (best fitness stagnates for 10 iterations) {
            Perturb particles
            Initialize connections
        }
        i ← i + 1
    }
    Post-process (best particle)
    Add best particle to the output list
    k ← k + 1
}
Figure 3.5. Pseudocode of PSO-based proposed algorithm
Despite the fact that PSO is generally effective in exploring the search space
very quickly, it does not guarantee the global optimum and may result in
premature convergence. There may also be several reasons that cause the
algorithm to be trapped in local optima, such as ineffective parameter selection and
problem-specific issues. In addition to this section, Section 4.5 presents a
discussion about parameter selection and gives some other techniques to escape from
local optima. However, these strategies may not suffice to make the
algorithm perform exploration and exploitation effectively through the search
space. In the PSO literature, a recent trend to improve PSO’s convergence ability is to
perturb the particles by random mutations whenever all particles stagnate for a
predetermined number of iterations (Das et al., 2005; Xinchao, 2010). In this
study, perturbation is utilized and stagnation is defined as the stability of the
best particle’s fitness for 10 consecutive iterations. Once stagnation is
detected, for each particle, a few random elements of $\vec{x}$ are selected and shifted to
random locations. Hence, the swarm continues its movement. This procedure is
repeated whenever stagnation happens, up to a maximum of 100 times.
Simulations on both synthetic and real datasets showed that this procedure
improves the exploitation ability of the swarm. Furthermore, one more issue that
had to be considered was the presence of repetitive false positive sub-sequences
which constitute high-scoring local alignments. Besides, the given DNA sequences
may contain more than one DNA motif of interest. To handle these two situations,
motif discovery programs are often required to output multiple motif predictions,
ordered so that the most interesting, i.e., highest scoring, one is at the top and the one
that appears to be least relevant is at the bottom. Hence, the proposed PSO-based
method requires the user to give an additional parameter that determines how
many motif predictions are to be made. To prevent the program from
predicting the same motif at each iteration, the sub-sequences related to each
predicted motif are removed from the given dataset after each prediction is
performed. Therefore, the next iteration is executed with the modified sequence
data. With this approach, PSO is expected to eventually find the sought motif
even if some false positive sub-sequences are present in the given data.
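The perturbation step may be illustrated by the Python sketch below. The function name perturb, the parameter n_changes (how many elements count as "a few") and the uniform choice of the new location are assumptions of this example; max_positions[i] stands for the number of valid start positions in sequence i, i.e., $l_i - w + 1$.

```python
import random

def perturb(position, max_positions, n_changes, rng=random):
    """Move `n_changes` randomly chosen elements of a particle to random valid locations."""
    new_position = list(position)
    for i in rng.sample(range(len(position)), n_changes):
        # randrange(m) yields a location in 0 .. m-1, so the w-mer stays inside the sequence.
        new_position[i] = rng.randrange(max_positions[i])
    return new_position
```

In the scheme described above, such a function would be applied to every particle once the best fitness has remained unchanged for 10 consecutive iterations.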
The DNA motif discovery task can be thought of as a multimodal problem and
hence some local optima cannot be avoided with the methods described above.
Literature methods that utilize a stochastic search procedure such as a Genetic
Algorithm are generally reported to fall into these local optima (Wei and
Jensen, 2006; Liu et al., 2004; Larsson et al., 2007). As a heuristic search
procedure, PSO is not an exception.
Figure 3.6. A Single iteration of re-alignment and simultaneous shift operators
Among several techniques, some of which are variants of the same
type, we choose two methods that were previously utilized by researchers. These
operations are executed once PSO terminates and yields the best particle with
location vector $\vec{x}_{best}$.
a) Operations to make improvements on the best particle: The first operator,
i.e., the re-alignment, checks $\vec{x}_{best}$ for false positive sub-sequences, which
may often be included in noisy datasets. Each $x_i$ in $\vec{x}_{best}$ is checked
against each location j in $S_i$, where $0 \le j < l_i - w + 1$. If the new $\vec{x}'_{best}$ that is
constructed by setting $x_i = j$ improves the fitness of the particle, then
$\vec{x}'_{best}$ replaces $\vec{x}_{best}$. With this operation, false positive sub-sequences are
either removed or replaced with a true positive sub-sequence. The second
operator checks whether a simultaneous shift in all $x_i$ of $\vec{x}_{best}$ improves the
fitness of the particle. If $\vec{x}'_{best} = \{x_1 + 1, x_2 + 1, \ldots, x_n + 1\}$ or
$\vec{x}'_{best} = \{x_1 - 1, x_2 - 1, \ldots, x_n - 1\}$ and $\psi(\vec{x}'_{best}) > \psi(\vec{x}_{best})$, then $\vec{x}_{best}$ is replaced
with $\vec{x}'_{best}$, and the operator repeats this procedure until no
improvement is observed. Figure 3.6 depicts a single iteration of these two
operators.
b) Operation to find optimal motif width: Generally, a practitioner who is
interested in finding motifs in a set of promoter sequences may not know
the exact width of the sought motifs. Moreover, TFBS instances of the
same motif are not necessarily of the same width. Thus, in the proposed
method, the users are required to provide three parameters to run the
algorithm: the expected width $w_0$ of the motif, a minimum width $w_{min}$ and a
maximum width $w_{max}$. The PSO algorithm is executed with the
assumption of $w = w_0$ and the optimal width is then found in the post-processing
phase by varying the width between $w_{min}$ and $w_{max}$. The fitness function
given in Equation 25 allows us to find an optimal configuration of A, i.e.,
$\vec{x}$, with respect to a given w. The width $w'$, where $w_{min} \le w' \le w_{max}$, is
accepted as the optimal width if $\psi(A, w')$ is superior to all other $\psi(A, w_i)$
in the range of possible width options. This second post-optimization
procedure is similar to the one utilized for finding the optimal width in the
clustering approach, which is explained in Section 3.2.5.
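The re-alignment operator and the width selection can be sketched in Python as below. Both functions take the fitness as a callable, so the sketch is independent of the exact scoring function; the names realign, best_width and score_for_width are hypothetical, and accepting each improving replacement immediately is an assumption about the order of evaluation.

```python
def realign(best, lengths, w, fitness):
    """Try every valid start j in each sequence; keep a change only if the fitness improves."""
    best = list(best)
    best_fit = fitness(best)
    for i, length in enumerate(lengths):
        for j in range(length - w + 1):
            candidate = best[:i] + [j] + best[i + 1:]
            f = fitness(candidate)
            if f > best_fit:
                best, best_fit = candidate, f
    return best, best_fit

def best_width(w_min, w_max, score_for_width):
    """Return the width in [w_min, w_max] with the highest score."""
    return max(range(w_min, w_max + 1), key=score_for_width)
```

In practice score_for_width would re-evaluate the alignment with Equation 25 for each candidate width; here it is left as a user-supplied function.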
In addition to the above-mentioned methods to overcome local optima, the
parameter selection also has a significant effect on the performance of the algorithm.
Relevant parameter selection strategies for the DNA motif discovery task are given in
Section 4.5.
4. RESEARCH AND DISCUSSION Mustafa KARABULUT
4. RESEARCH AND DISCUSSION
In this section, the results of the developed tools and the relevant discussion are
given. First, the evaluation metrics that are utilized throughout the experimental
studies are given in Section 4.1. Secondly, by using these metrics, the clustering
algorithms are evaluated in Sections 4.2 and 4.3. Afterwards, the
post-optimized clustering approach is discussed in Section 4.4. Finally,
the experimental study related to the PSO-based algorithm is given.
4.1. Evaluation Metrics
In order to assess motif discovery prediction performance, some
measures have been studied in several papers such as those of Burset and Guigó
(1996), Pevzner and Sze (2000), and Tompa et al. (2005). The metrics presented
in this section require some prior information about the
datasets, such as the locations of the motifs present in the dataset. With this information, the
quantities on which the utilized evaluation metrics depend, namely the True
Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN)
counts, can be obtained. All the methods discussed in the study are capable of
presenting multiple motif results; however, the TP, TN, FP and FN counts are
based on a single motif prediction. Ideally, the top motif, i.e., the first one in the
list of multiple predictions sorted with a scoring scheme such as z-score or
p-value, is considered as the result of the program. However, literature methods
most commonly allow the best result among the first 10-20 results to be selected as the result
of an evaluated motif-finder program.
The first metric, Sensitivity (Sn), measures the rate of correct predictions.
It is useful for measuring how good the motif-finder program is at catching TFBS
instances. However, it does not reflect the proportion of false predictions. Sensitivity is
also known as Recall and is calculated as:
$$Sn = \frac{TP}{TP + FN} \qquad (28)$$
Specificity (Sp) is complementary to Sn: it measures the
proportion of correctly identified non-motif instances.
$$Sp = \frac{TN}{TN + FP} \qquad (29)$$
Additionally, Positive Predictive Value (PPV) measures the proportion of
predicted TFBS instances which are correctly identified. PPV is also known as
Precision.
$$PPV = \frac{TP}{TP + FP} \qquad (30)$$
In comparison to the above metrics, Matthews’ Correlation Coefficient
(MCC) is a more sophisticated way of measuring the performance of a predictor
since it takes into account the TP, TN, FP and FN counts in a balanced way. The MCC
score of a motif-finder method is high only when both the proportion of true
predictions is high and the proportion of false predictions is low.
$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (31)$$
A similar measure to MCC is F-Score (Shaw et al., 1997), which is often
utilized in motif-finding literature (Wei and Jensen, 2006). It is based on the harmonic
mean of Precision and Recall.
$$F\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (32)$$
Since the methods developed in this study are compared to some literature
methods whose performance was reported in terms of F-Score, we had
to utilize both metrics even though they produce very similar values.
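For reference, Equations 28 to 32 can be computed from the four counts as in the following Python sketch. The function names and the convention of reporting an undefined ratio (zero denominator) as 0 are assumptions of this example, not part of the evaluation protocol described above.

```python
import math

def _ratio(numerator, denominator):
    # Convention assumed here: an undefined ratio (zero denominator) is reported as 0.
    return numerator / denominator if denominator else 0.0

def evaluation_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, PPV, MCC and F-Score (Equations 28-32) for one prediction."""
    sn = _ratio(tp, tp + fn)    # sensitivity / recall
    sp = _ratio(tn, tn + fp)    # specificity
    ppv = _ratio(tp, tp + fp)   # positive predictive value / precision
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    f_score = _ratio(2 * ppv * sn, ppv + sn)
    return {"Sn": sn, "Sp": sp, "PPV": ppv, "MCC": mcc, "F-Score": f_score}
```

For example, a prediction with TP = 8, TN = 90, FP = 2 and FN = 0 has Sn = 1 and PPV = 0.8; its MCC and F-Score are both below 1 because of the two false positives.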
4.2. Employing Fuzzy C-Means for DNA Motif Discovery
Section 3.2.1 explains how FCM, which is actually designed for numerical
space, can be adapted for DNA motif finding task. In this section, experiments
over given promoter sequences of Saccharomyces cerevisiae (i.e., first dataset
group) are presented and discussed. Primarily, some parameter values should be
provided for the proposed algorithm in order to operate over a DNA sequence:
a) The length of the sought pattern
b) The number of clusters to group the given sequence in.
The choice of an appropriate number of clusters to group the data is a
common difficulty encountered in clustering algorithms, and FCM is not an
exception to this. In order to understand the correlation between number of
clusters and dataset length, several numbers of clusters are tried and relevant
performances in quantitative terms (MCC) are stored.
Figure 4.1. The effect of number of clusters over the performance for FCM
The experiments are performed over different segments of the same
sequence with varying length, containing a number of motif instances
proportional to the sequence length. According to Figure 4.1, the algorithm
becomes reasonably accurate when the number of clusters is between 50 and
1000. Therefore, there seems to be only a loose connection between the number of
clusters and the performance. Except for the 1000-nucleotide-long DNA
sequence, the algorithm performs consistently when the cluster number is between
80 and 200; beyond this range, the performance may slightly fall. The results
for the 1000 bp long segment appear to fluctuate with the number of clusters
since it contains a small number of instances, and capturing or missing even a
single instance results in a high deviation in the performance. Accordingly, the GAL4,
GCN4, CBF1, RFX1 and HSF1 sequences are run with 80, 180, 180, 100 and 100
clusters, respectively.
Another issue about the number of clusters is the computational load. This
aspect is evaluated by presenting the processing time of the proposed algorithm
for two cases:
a) The number of clusters is set proportional to the dataset length, i.e.,
the above-mentioned numbers are used (Settings 1),
b) All sequences are analyzed by using the same number of clusters, 80
(Settings 2).
Figure 4.2. The processing time of FCM for each dataset
As a result, a tenfold increase in the dataset length causes the
computational load to rise about five times. On the other hand, when the number of
clusters is doubled, a tenfold increase is observed in the processing time (Figure
4.2). The experiments show that the number of clusters affects the computational
load and the processing time more than the length of the processed DNA sequence
does.
As mentioned in Section 3.2.1, for the clustering approach including FCM, the
algorithm is terminated after a certain number of iterations are
executed. Nonetheless, this “certain” number is another parameter to be decided
carefully in order to prevent over-training and under-training. The
experiments over the datasets, with a cluster number between 5 and 100, show
that a training of approximately 100 cycles seems to be sufficient as far as the
datasets used in this study are concerned.
Figure 4.3. The performance of FCM per number of training cycles
According to Figure 4.3, although datasets with dissimilar features are
given to the method, the algorithm seems to converge at about 50-60 iterations of
training regardless of the characteristics of the dataset. After 50-60 iterations, the
performance improvement is observed to be slight. Nonetheless, to ensure the best
predictions, the performance measurements for the datasets of this study are
calculated after 100 cycles of training for each dataset is executed. Consequently,
the proposed algorithm is run with the above mentioned settings and the site level
prediction performance is calculated in accordance with the performance measure
indices which are given in the relevant section. Table 4.1 presents the results of
the algorithms, FCM and the two compared methods, MEME and MDScan. In
Table 4.1, the best values of a measure for each dataset are given in bold.
Table 4.1. Performances of FCM, MEME and MDScan
           FCM                     MEME                    MDScan
Dataset    Sn     Sp     MCC       Sn     Sp     MCC       Sn     Sp     MCC
GAL4       1.000  0.999  0.912     1.000  0.999  0.912     0.800  0.999  0.842
GCN4       0.774  0.998  0.732     0.161  1.000  0.338     0.468  0.999  0.604
CBF1       1.000  1.000  0.978     0.203  1.000  0.450     0.985  1.000  0.962
RFX1       1.000  0.999  0.784     0.875  0.999  0.714     0.875  1.000  0.825
HSF1       0.815  0.997  0.704     0.138  0.999  0.232     0.483  0.999  0.611
The predictions of the proposed FCM-based method are close to perfect for
the datasets GAL4, CBF1 and RFX1, since the predictions included either all or
almost all of the sought transcription factor binding sites with few irrelevant
subsequences identified as motif instances. One of the reasons for these false positives is
the existence of some w-mers residing in the data that are similar to the motif
pattern but are not biologically annotated as playing any role in the transcription
process.
The other methods, MEME and MDScan, are run online with the most
similar settings available. Since our algorithm assumes any number of motif
instances may occur per sequence, MEME and MDScan are adjusted to search the
motifs in any sequence with the consideration of just the given DNA strand. The
first 10 predictions of the studied methods are considered as the results and the
most relevant prediction of the methods are included in Table 4.1. MEME, in
general, was successful at finding significant motif patterns in the given
sequences. On the other hand, MDScan is observed to be more successful than
MEME. However, the proposed algorithm outperforms both of the methods in
most of the performance measures considered in this study.
As pointed out in Section 3, if a single measure must be selected to
compare methods, it should be MCC, because it is the most informative of the
performance measures: it takes into account all aspects of the prediction.
The proposed algorithm also outperforms the others in five out of seven MCC
values, as can be seen in Figure 4.4. In addition to the quantitative
evaluation of the proposed method, sequence logos are generated, for the
purpose of visualization, from the local alignments of the subsequences
associated with the clusters with the highest z-scores.
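As a minimal sketch of how the three indices relate, the snippet below computes Sn, Sp and MCC from confusion counts. It assumes the standard definitions (Sn = TP/(TP+FN), Sp = TN/(TN+FP), and the usual Matthews formula), which may differ in detail from those given in Section 3; the function name and the example counts are hypothetical.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Return (Sn, Sp, MCC) from confusion counts (assumed standard definitions)."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc

# Hypothetical counts: 10 true sites, 8 recovered, 2 false predictions.
sn, sp, mcc = confusion_metrics(tp=8, fp=2, tn=988, fn=2)
print(f"Sn={sn:.3f} Sp={sp:.3f} MCC={mcc:.3f}")
```

Because true negatives dominate in long sequences, Sp stays close to 1 even when several false positives occur, which is one reason MCC is the more discriminating single index.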
4. RESEARCH AND DISCUSSION Mustafa KARABULUT
Figure 4.4. Comparison of the three methods in terms of MCC

The sequence logos in Table 4.2 also demonstrate the proficiency of the
proposed method at finding the sought motif patterns in an unsupervised way.
To interpret the sequence logos, one should primarily consider the capability
of the motif finding tool to present the most significant parts of the original motif
pattern. The relatively tall letters are the core components of the sought pattern,
whereas positions with no distinguished letter are ambiguous, meaning that any
nucleotide can occur at that position. The height of the letters is
proportional to the information content of the PWM being constructed. Higher
information content means that the found pattern differs considerably from the
background model and that there is an overrepresentation of sequences.
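The relation between letter height and information content can be illustrated with a short sketch. It assumes the common relative-entropy definition of per-position information content against a uniform background (0.25 per base); the thesis may use a different background model, and the function and variable names are hypothetical.

```python
import math

def column_information(col, background=None):
    """Information content (bits) of one PWM column: sum_b p_b * log2(p_b / q_b)."""
    bases = "ACGT"
    background = background or {b: 0.25 for b in bases}  # assumed uniform background
    return sum(col[b] * math.log2(col[b] / background[b])
               for b in bases if col[b] > 0)

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}      # one dominant letter
ambiguous = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # no distinguished letter

print(column_information(conserved))  # maximum for DNA under a uniform background
print(column_information(ambiguous))  # zero: the position carries no information
```

A fully conserved position reaches the maximum of 2 bits, whereas a position that matches the background contributes nothing and therefore shows no tall letter in the logo.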
Although the proposed FCM-based method is generally observed to be
effective for the studied datasets, it does not always produce perfect
predictions. It may fail when the information content of the sought
pattern is not very high, or when the number of transcription factor binding
sites is very low with respect to the sequences in which they reside. In such
cases, the number of false positives tends to be high, or irrelevant patterns
are identified as motifs by the algorithm. Motifs that are overrepresented and
statistically diverged from the background model in promoters of putatively
co-regulated genes are the primary targets of the proposed clustering-based
motif finding algorithm. To overcome
such difficulties, the proposed algorithm may alternatively be combined with
other motif finding methods as a part of an ensemble application. Studies
(Yanover et al., 2009; Chakravarty et al., 2007; Wijaya et al., 2008) have shown
that such approaches are effective for the application of motif discovery.
Table 4.2. Predicted and known motifs in sequence logo format
PREDICTED MOTIF KNOWN MOTIF
GAL4
GCN4
CBF1
RFX1
HSF1
4.3. Assessment of Clustering Algorithms for Motif Discovery
In the previous section, an FCM-based motif-finding method was evaluated.
The FCM-based method was inspired by the study of Mahony et al. (2005), which
reports that a pure SOM-based strategy is efficient and satisfactory for de
novo motif discovery. Thus, in this section, in addition to FCM and SOM, the
motif-finding performances of some other well-known clustering algorithms are
evaluated and compared. Clustering algorithms, however, are mainly designed
to work in a high-dimensional vector space of numerical values. The
modifications that adapt them to the motif-finding task are given in Sections
3.2.1 to 3.2.4. The selected clustering algorithms, FCM, SOM, K-Means and
EM/GMM, are run against the second group of datasets (see Section 3.1), which
contains datasets from yeast, E. coli, fly and human.
As with the FCM-based motif-finding method, one issue in applying
clustering algorithms is how the algorithms are initialized. In addition to
algorithm-specific parameters, the common initialization parameters that must
be decided beforehand are the number of clusters and the termination
criterion. For many clustering algorithms, deciding the number of clusters for
given data is a challenging issue, since the result may change depending on
how many partitions are desired. In the similar motif finding study of Mahony
et al. (2005), the ideal number of clusters for the SOM algorithm is generally
observed to be 1/10 of the number of given subsequences. With this in mind, in
order to reveal a quantitative relationship between the number of subsequences
and the number of clusters, each algorithm is run with a number of clusters
varying from 1/5 to 1/20 of the number of inputs.
Figure 4.5. Correlation between performance and number of clusters
As a result, the most efficient ratios of the number of clusters to the
number of subsequences are experimentally observed to be 1/12 for SOM, 1/10
for FCM, and 1/15 for K-Means and EM/GMM. It should be noted that these ratios
are not definitive values and should be regarded as starting points when using
clustering algorithms for motif finding. As far as Figure 4.5 is concerned,
there appears to be a direct, though not strictly linear, proportion between
the cluster and subsequence counts.
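As a rough illustration of the sweep described above, the sketch below derives candidate cluster counts from the number of input subsequences for every ratio between 1/5 and 1/20. The function name, the rounding rule and the example input size are assumptions made for illustration only.

```python
def cluster_counts(n_subsequences, denominators=range(5, 21)):
    """Map each ratio 1/d (d = 5..20) to a cluster count of at least 1."""
    return {d: max(1, round(n_subsequences / d)) for d in denominators}

# Hypothetical dataset with 1000 w-mers.
counts = cluster_counts(1000)
for d in (5, 10, 12, 15, 20):
    print(f"ratio 1/{d}: k = {counts[d]}")
```

For 1000 subsequences, the reported ratios 1/12, 1/10 and 1/15 would correspond to roughly 83, 100 and 67 clusters, respectively.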
After deciding the number of clusters, the PWMs are randomly initialized
with values ranging between 0.0 and 1.0 under the constraint
$\sum_{b \in \{A,C,G,T\}} m_{ib} = 1$ for every position $i$. All
algorithms are trained for a fixed number of iterations, generally between 50
and 100, which is observed to be sufficient. All algorithms are then run with
these settings, and their performances are calculated in terms of a selection
of assessment indices including Sn, Sp and MCC.
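The random initialization under the sum-to-one constraint can be sketched as follows. This is only an illustration: the normalization step, the function name and the dictionary layout of a PWM column are assumptions, not the implementation used in the thesis.

```python
import random

BASES = "ACGT"

def random_pwm(width, rng=None):
    """Random PWM: one column per position, base probabilities summing to 1."""
    rng = rng or random.Random()
    pwm = []
    for _ in range(width):
        # Draw raw values in (0, 1]; the small offset avoids an all-zero column.
        raw = [rng.random() + 1e-9 for _ in BASES]
        total = sum(raw)
        pwm.append({b: v / total for b, v in zip(BASES, raw)})
    return pwm

pwm = random_pwm(width=8, rng=random.Random(42))
print([round(sum(col.values()), 6) for col in pwm])
```

Normalizing each column after drawing the raw values is one simple way to satisfy the constraint while keeping every entry between 0.0 and 1.0.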
Table 4.3. Experimental results of four clustering algorithms for each dataset

Algorithm  Index  GCN4   RFX1   ARGR1  LEXA   DM05   DM06   HM10   HM17
SOM        Sn     0.800  0.875  0.647  0.875  0.117  0.428  0.363  0.700
           Sp     0.999  0.999  0.994  0.999  0.999  0.996  0.996  0.999
           MCC    0.842  0.745  0.577  0.874  0.277  0.325  0.331  0.737
K-Means    Sn     0.800  1.000  0.352  0.875  0.071  0.714  0.272  0.800
           Sp     0.998  0.999  0.999  0.999  0.997  0.998  0.998  0.998
           MCC    0.799  0.784  0.547  0.824  0.061  0.628  0.339  0.675
FCM        Sn     0.900  0.875  0.764  0.875  0.142  0.571  0.454  0.800
           Sp     1.000  0.999  0.993  1.000  0.997  0.999  0.992  0.999
           MCC    0.948  0.874  0.639  0.935  0.117  0.675  0.291  0.842
EM/GMM     Sn     0.700  0.750  0.705  0.750  0.357  0.571  0.454  0.700
           Sp     1.000  0.999  0.997  1.000  0.994  0.998  0.983  0.999
           MCC    0.836  0.670  0.702  0.865  0.194  0.502  0.163  0.666
Table 4.3 presents the motif finding performances of the evaluated
algorithms in terms of quantitative measures, where the best values for each
dataset are indicated in bold. None of the algorithms performed better than
the others for every studied dataset and measure. The difficulty of modeling
motifs based on overrepresentation and background probabilities is a major
reason for this. It suggests that no clustering algorithm alone is perfect for
DNA motif discovery. Nonetheless, it is observed that FCM produces 15 out of
24 best performances, whereas K-Means has 7, EM/GMM has 5 and SOM has 2 of the
best scores. Thus, FCM can be distinguished from the others since it
noticeably outperforms the other
clustering algorithms for nearly all metrics. In addition, comparing the
algorithms by their average performance over all datasets supports the same
conclusion: all algorithms except FCM perform nearly equally well, whereas
FCM performs better.
Figure 4.6. Average motif finding performances of clustering algorithms
In general, the soft-clustering methods (FCM, SOM and EM/GMM) seem
superior to the hard-clustering approach (K-Means). These three clustering
algorithms use a type of thresholding strategy when updating clusters with a
given input in order to ensure convergence. To this end, SOM updates the
neighbors of the BMU, whereas FCM and EM/GMM update a selection of nodes, as
described in Sections 3.2.1 and 3.2.2. Seemingly, the selection mechanism
utilized by FCM and EM/GMM improves the performance of FCM. EM/GMM, on the
other hand, generally suffers from the random initialization and thus does not
perform as well as FCM. Despite this performance improvement, FCM and EM/GMM,
which utilize the methods given in Equations 7 and 17, respectively, are
slower in terms of run-time. Thus, SOM and K-Means are faster than FCM and
EM/GMM when compared by the time taken to complete a given number of
iterations over a DNA sequence of a specific length. Figure 4.7 presents the
training time in seconds for each algorithm in this study. It should be noted
that the time taken for SOM, FCM and EM/GMM to complete training may vary
according to the selected parameters, such as neighborhood size and number of
clusters.
Figure 4.7. Training time of each algorithm to cluster LEXA dataset
To further evaluate the performances of the clustering algorithms, they
are compared with a well-known motif-finding tool, MEME. This tool can be run
online with adjustable settings, so that a motif search can be performed with
settings most similar to those of the clustering algorithms in this study.
According to the results presented in Table 4.4, MEME is fairly successful at
finding motifs for lower organisms; however, it generally fails for complex
ones such as fly and human. LEXA is the only dataset in which MEME, with an
MCC score of 0.824, is better than all the clustering algorithms. As for GCN4,
RFX1 and ARGR1, the results of MEME are observed to be close to those of the
clustering algorithms. On the other hand, the clustering algorithms, on
average, outperform MEME for the rest of the datasets, DM05, DM06, HM10 and
HM17.
Table 4.4. Motif finding performance of MEME for each dataset

Dataset  Sn     Sp     MCC
GCN4     0.200  1.000   0.446
RFX1     0.875  0.999   0.685
ARGR1    0.882  1.000   0.938
LEXA     0.875  0.999   0.824
DM05     0.000  0.998  -0.001
DM06     0.000  0.998  -0.002
HM10     0.000  0.993  -0.004
HM17     0.400  0.993   0.205
The results of MEME are in good agreement with the fact that finding
motifs in DNA sequences of eukaryotic organisms still challenges researchers
(Tompa et al., 2005; Das and Dai, 2007; Hu et al., 2005). The main reasons
are the relatively low signal-to-noise ratio and the low complexity of the
TFBSs of higher organisms when compared to their background model.
Nonetheless, as far as the datasets of this study are concerned, the clustering
algorithms perform better than MEME on those difficult datasets.
Figure 4.8. Performances of clustering algorithms and MEME for each species
As can be seen in Figure 4.8, the superiority of the clustering
algorithms over the eight tested datasets from four organisms is also apparent
when the average performances of the clustering algorithms for each organism
dataset group are compared to those of MEME. As Figure 4.9 depicts, the
average performance of the clustering approach exceeds that of MEME for three
out of four organisms, whereas MEME is best only for the E. coli datasets.
Figure 4.9. Comparison of average performances of clustering algorithms and MEME for each species
4.4. Evaluation of Post-Optimization for EM/GMM Method
In Section 4.2, the clustering approach was evaluated with the FCM
algorithm adapted to the DNA motif-finding task. Subsequently, in Section 4.3,
three more well-known clustering algorithms, i.e., SOM, K-Means and EM/GMM,
were evaluated in addition to FCM and compared to each other. As a result, the
SOM, K-Means and EM/GMM based methods were observed to perform similarly,
whereas FCM clearly outperformed the three methods in average MCC score on the
second group of datasets.
In this section, the contribution of the post-optimization procedure based
on a Bayesian framework to the clustering approach is evaluated against the
first (i.e., Saccharomyces cerevisiae datasets) and third groups of datasets
(see Section 3.1 for details of the datasets). EM/GMM is the algorithm to
which this post-optimization is applied. Its performance is then compared to
FCM, SOM and two methods from the literature, MEME and MDScan. This time,
SOMBRERO (Mahony et al., 2005) is the SOM implementation used for comparison
purposes.
In this new scheme, the method is composed of three steps, two of which
are the same as those explained in Section 3.2.1. The new step, which is
explained in Section 3.2.5, is inserted between the two previously explained steps:
a) Clustering, or locally aligning, w-length windows
b) Fine-tuning the motif models (i.e., PWMs) by using the Bayesian scoring
scheme
c) Selecting the most statistically interesting alignments, i.e., clusters or
PWMs, from the set of clusters by using z-score scheme
Before covering the experimental results of the new scheme, the algorithm
initialization should be addressed. Section 4.3 discussed the importance of
algorithm initialization and proposed initial parameters for the EM/GMM
method. According to that proposal, a number of clusters-to-dataset length
ratio of 1/15 performs reasonably well for the EM/GMM method. Since the
methodology has changed this time, the effect of the number of clusters is
reinvestigated. The ratio proposed in the previous section is taken as a
starting point, and a range of number of clusters-to-dataset length ratios
from 1/100 to 1/5 is investigated.
Figure 4.10. Performance of the algorithm over the number of clusters for each
dataset
As expected, a ratio of 1/100 resulted in poorer performance for most of
the datasets. As can be seen in Figure 4.10, performance keeps increasing with
the number of clusters in most cases as the ratio is increased up to 1/25. In
the range of ratios from 1/25 to 1/5, the performance is observed to be stable
for some datasets and to improve slightly for the others. Therefore,
considering the heavy computational load that the number of clusters
brings, the most suitable choice of the ratio is observed to be 1/10 for almost
all cases. Notably, this new ratio is close to the one proposed in the
previous section.
Each cluster is initialized randomly with a two-step procedure. First, all
w-mers are distributed among the clusters, and then the two model parameters
of each cluster are calculated in accordance with Equations 11 and 12,
respectively. As for the other initial parameters, the length of the sought
motif should also be provided by the practitioner through three values,
$w_w$, $w_{min}$ and $w_{max}$. The algorithm thus operates to discover
optimal motifs of length $w_w$, and then, in the post-optimization phase, it
attempts to find the optimal width of the obtained motif in the range from
$w_{min}$ to $w_{max}$.
Table 4.5. Comparison of EM/GMM with other algorithms for Saccharomyces cerevisiae datasets in terms of MCC

Dataset  EM/GMM  FCM   SOMBRERO  MEME  MDScan
GAL4     0.91    0.91  0.88      0.91  0.84
HSF1     0.74    0.70  0.51      0.23  0.61
RFX1     0.80    0.78  0.26      0.71  0.83
GCN4     1.00    0.73  0.54      0.34  0.60
CBF1     1.00    0.98  0.89      0.45  0.96
With the above-mentioned settings, the post-optimized EM/GMM, along with
FCM, SOMBRERO, MEME and MDScan, is run against the Saccharomyces cerevisiae
datasets. SOMBRERO is a downloadable console application that runs under
different operating systems, whereas both MEME and MDScan are web-based
motif-finding programs. Since any sequence in the given datasets may contain
more than one instance of the sought motif, MEME is specifically run in “Any
number of repetitions per sequence” mode, whereas the other algorithms did not
require any setting for such a search. All the algorithms, including the
EM/GMM method, required the motif width to be given. Among the set of
algorithms, only MDScan and FCM were not suitable for a variable motif-width
search. With these settings, all algorithms are run to make 10 motif
predictions, and the best of the 10 results in terms of MCC is selected as the
algorithm’s prediction. The best results
of each algorithm per dataset are given in Table 4.5, in which the best value
for each dataset is marked in bold.
As can be seen in Table 4.5, the results of the EM/GMM method for the
Saccharomyces cerevisiae datasets are generally favorable in comparison to
those of the others. Except for the RFX1 dataset, its MCC is the highest (or
tied for the highest, as in GAL4) for all the datasets. In comparison to MEME
and MDScan, the average performance of the three clustering-based methods is
observed to be better, which shows the superiority of simultaneous
motif-finding techniques over the others. As for the comparison of EM/GMM and
FCM, FCM was clearly superior to the EM/GMM method without the
post-optimization (see Section 4.3); the post-optimized EM/GMM, in contrast,
outperforms FCM. Moreover, the improved performance of EM/GMM over SOMBRERO
may also be attributed to the post-optimization procedures employed in the
proposed method. On the other hand, the self-organizing map, on which SOMBRERO
is based, is actually a topological map of the given inputs and thus does not
necessarily produce optimal clusters. Therefore, the poorer performance of
SOMBRERO compared to EM/GMM may also be related to SOM’s non-optimal
clustering approach.
In the Saccharomyces cerevisiae datasets, all algorithms generally
produced reasonable results, especially for the shorter datasets. In
relatively longer datasets, where the number of motif instances is also high,
MEME, which operates on the principle of estimating one motif at a time,
performed worst, whereas the others still performed reasonably well. Notably,
EM/GMM was the best in all datasets except RFX1, and MDScan’s overall
performance, next to that of FCM, was one of the closest to it. MDScan is a
hybrid algorithm that combines features of word-enumerative and PWM-based
stochastic methods. It also incorporates a Bayesian scoring function
(Liu et al., 2002) to optimize the found motif candidates in a post-processing
step. From this point of view, the superiority of the EM/GMM method over
MDScan supports the effectiveness of the clustering approach, given that both
of them utilize an iterative post-optimization step based on different but
related Bayesian functions. Also, the experimental results on the yeast
datasets reveal that the potential of the proposed method in identifying more
than one instance per sequence is
obvious particularly in the cases of GCN4 and CBF1 since Sensitivity is obtained
as 100% in both datasets in which some sequences are expected to contain more
than one instance of the sought motif (i.e., the number of instances is greater than
the number of sequences).
Secondly, all five algorithms are evaluated against the third group of
datasets, which consists of promoter sequences from various organisms
including mammals and E. coli. The experiments are done with settings similar
to those used for the Saccharomyces cerevisiae datasets. For EM/GMM, FCM and
SOMBRERO, the number of clusters (k) is again set to 1/10 of the dataset
length. Similarly, the algorithms are executed for 10 motif predictions, and
the best result among the 10 is selected as the result of the algorithm.
Table 4.6 presents the results of the second experiment in terms of MCC.
Table 4.6. Comparison of five algorithms for third group of datasets

Dataset  EM/GMM  FCM   SOMBRERO  MEME  MDScan
CREB     0.69    0.47  0.12      0.59  0.69
CRP      0.85    0.65  0.56      0.67  0.50
E2F      0.83    0.57  0.73      0.76  0.73
ERE      0.76    0.14  0.50      0.71  0.66
MEF2     0.46    0.32  0.37      0.88  0.00
MYOD     0.83    0.21  0.23      0.00  0.00
SRF      0.72    0.57  0.75      0.67  0.76
TBP      0.45    0.29  0.15      0.36  0.49
AVERAGE  0.70    0.40  0.43      0.58  0.48
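The AVERAGE row of Table 4.6 can be reproduced from the per-dataset MCC values. The following short check copies the values from the table; only the variable names are introduced here for illustration.

```python
# MCC values copied from Table 4.6, in the order
# CREB, CRP, E2F, ERE, MEF2, MYOD, SRF, TBP.
table_4_6 = {
    "EM/GMM":   [0.69, 0.85, 0.83, 0.76, 0.46, 0.83, 0.72, 0.45],
    "FCM":      [0.47, 0.65, 0.57, 0.14, 0.32, 0.21, 0.57, 0.29],
    "SOMBRERO": [0.12, 0.56, 0.73, 0.50, 0.37, 0.23, 0.75, 0.15],
    "MEME":     [0.59, 0.67, 0.76, 0.71, 0.88, 0.00, 0.67, 0.36],
    "MDScan":   [0.69, 0.50, 0.73, 0.66, 0.00, 0.00, 0.76, 0.49],
}

# Mean MCC per algorithm, rounded to two decimals as in the table.
averages = {name: round(sum(v) / len(v), 2) for name, v in table_4_6.items()}
for name, avg in averages.items():
    print(f"{name:9s} {avg:.2f}")
```

The computed means agree with the AVERAGE row of the table when rounded to two decimals.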
In general, the performance of the algorithms on higher organism datasets
is lower than on the Saccharomyces cerevisiae datasets, which is a known issue
for motif-finding programs (Tompa et al., 2005; Hu et al., 2005). Specifically,
FCM, which performed reasonably well on the yeast datasets, performed
relatively poorly on higher organism DNA sequences. The performances of FCM,
SOMBRERO and MDScan appear to be comparable to each other, whereas MEME is
observed to be superior to them. The overall performance of MEME on both
higher and lower organism datasets is observed to be moderately stable at a
mediocre level, whereas those of the others are not. On the other hand, EM/GMM
still performs reasonably well when compared to the others, including MEME. The
reason behind this fact is connected with the advantages of the simultaneous
motif-finding strategy that EM/GMM adopts. In simultaneous motif-finding, all
possible alignments are extracted and the best alignments are then selected
via a scoring scheme, whereas in the “one motif at a time” approach, some of
the best alignments may be missed in the first trials (e.g., 10 trials in our
case) due to the local maxima present in the data. The poor performance of FCM
and SOMBRERO, which are also simultaneous motif-finders, may appear to
disprove this claim. However, the problem with FCM and SOMBRERO lies not so
much in their local alignment performance as in their selection of the best
motifs. When the best results of FCM, SOMBRERO and EM/GMM among all motifs
(i.e., $k$ motifs, where $k = \frac{1}{10} N_w$, one tenth of the dataset
length) are taken rather than choosing strictly from the top 10 ranked motifs,
their averages improve considerably (see Table 4.7).
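The difference between the two selection protocols (best among the top 10 ranked motifs versus best among all k motifs) can be sketched as below. The function name and the example MCC values are hypothetical; the list is assumed to be ordered by the ranking score, with the first entry being the motif ranked #1.

```python
def best_prediction(mccs_by_rank, top_n=None):
    """Return (best MCC, 1-based rank) among the first top_n motifs, or among all."""
    candidates = mccs_by_rank if top_n is None else mccs_by_rank[:top_n]
    best_idx = max(range(len(candidates)), key=lambda i: candidates[i])
    return candidates[best_idx], best_idx + 1

# Hypothetical MCC values of 38 ranked motifs for one dataset.
scores = [0.50, 0.76, 0.40] + [0.10] * 34 + [0.88]

print(best_prediction(scores, top_n=10))  # restricted to the top 10 ranked motifs
print(best_prediction(scores))            # best over all motifs, with its rank
```

In this made-up example the most accurate motif sits at rank 38, so it is only found when the selection is not restricted to the top 10, which mirrors the kind of gap between Tables 4.6 and 4.7.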
Table 4.7. Best results of GMM/EM, FCM and SOMBRERO

         GMM/EM             FCM                SOMBRERO
Dataset  MCC   Motif rank   MCC   Motif rank   MCC   Motif rank
CREB     0.69  #2           0.47  #7           0.67  #20
CRP      0.85  #1           0.73  #14          0.82  #31
E2F      0.83  #1           0.57  #1           0.73  #3
ERE      0.88  #38          0.24  #11          0.50  #1
MEF2     0.82  #107         0.35  #39          0.81  #28
MYOD     0.83  #6           0.65  #274         0.56  #27
SRF      0.72  #2           0.57  #4           0.82  #117
TBP      0.83  #43          0.42  #666         0.62  #614
AVERAGE  0.80               0.50               0.69
Although EM/GMM, FCM and SOMBRERO utilize a similar z-scoring scheme, the
improved prediction of EM/GMM, including its better selection performance,
relies on the post-optimization procedure, which increases the distinction of
the sought motifs from non-motif alignments. Nonetheless, as far as the best
results of the two clustering approaches are considered, the superiority of the
simultaneous motif-finding approach over the others is clearly seen. However,
in order to make a fair comparison, the best results among 10 predictions of
the algorithms are
considered. Figure 4.11 depicts the overall performance of the algorithms for all
datasets.
There are a few factors that limit the performance of motif discovery
programs. Dataset length is the leading example for such factors (Hu et al., 2005;
Das and Dai, 2007). Accordingly, the performances of the considered algorithms
for TBP dataset, which is a few times larger than average of the other datasets, are
not as satisfactory as those in other datasets. However, length is not the only
limiting factor. As the complexity of the organism increases, the performance of
motif discovery algorithms decreases (See Figure 4.12). In contrast, as the length
of the sought motif increases the performances of the algorithms are generally
observed to increase, as well.
Figure 4.11. Overall performances of the algorithms for second group datasets
Figure 4.12. Performance variance of algorithms over two parameters
Table 4.8. Sequence logos of the known motifs and the predicted ones (a)
Dataset   Known motif   Predicted motif (a)
GAL4
HSF1
RFX1
GCN4
CBF1
CREB
CRP
E2F
ERE
MEF2
MYOD
SRF
TBP
The experiments presented so far are performed with a fixed-width motif
search; that is, the three user-provided parameters regarding the sought motif
width, i.e., $w_w$, $w_{min}$ and $w_{max}$, are all given the same value
$w_k$, which is the known motif width. Application of Equation 25 in the
post-optimization step enables the proposed method to find the optimal width
with respect to these three parameters. In order to evaluate the ability of
the method to find the optimal motif length, three sets of experiments are
done. In the first set (a), all three parameters are set to $w_k$. In the
second set (b), $w_w$ is given the known motif width, whereas $w_{min}$ and
$w_{max}$ are given $w_k - 4$ and $w_k + 4$, respectively. In the third set of experiments
(c), $w_w$ is given a random value $w_r$, where $w_k - 3 \le w_r \le w_k + 3$,
while $w_{min}$ and $w_{max}$ are given $w_r - 4$ and $w_r + 4$, respectively.
The motifs resulting from these three sets of experiments are visualized by
means of sequence logos (Crooks et al., 2004). The sequence logos of the first
set of experiments (a) are given in Table 4.8, whereas the logos of the other
sets, (b) and (c), are given in Table 4.9.
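The three experimental settings can be summarized in a small sketch that returns the seed width together with the minimum and maximum widths for each set. It assumes a width range of plus or minus 4 around the seed width and an inclusive range of plus or minus 3 around the known width for the random seed in set (c); the function name and the example width are hypothetical.

```python
import random

def width_settings(known_width, mode, rng=None):
    """Return (w_w, w_min, w_max) for experiment set 'a', 'b' or 'c'."""
    rng = rng or random.Random()
    if mode == "a":    # fixed-width search at the known width
        return known_width, known_width, known_width
    if mode == "b":    # known seed width, variable final width
        return known_width, known_width - 4, known_width + 4
    if mode == "c":    # random seed width near the known width (assumed inclusive)
        w_r = rng.randint(known_width - 3, known_width + 3)
        return w_r, w_r - 4, w_r + 4
    raise ValueError("mode must be 'a', 'b' or 'c'")

for mode in "abc":
    print(mode, width_settings(12, mode, rng=random.Random(1)))
```

Set (c) is the hardest case, since neither the seed width nor the search range is centered on the known motif width.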
According to the sequence logos, the proposed method performed with high
accuracy for most of the datasets (i.e., GAL4, HSF1, CREB, E2F and MYOD),
since the sought width is found correctly in each of the three cases without
any shift in the motif pattern. In the RFX1, GCN4, MEF2 and TBP datasets, the
fixed-width motif search (a) results in a slight shift in the sought pattern.
However, in these four datasets, the post-optimization procedure for finding
the optimal width extends the width of the found pattern so that it includes
the bases that were missed as a result of the undesired shift.
Therefore, even though the width becomes larger than the original one, the
new pattern includes the whole sought pattern. In the CRP and ERE datasets,
where the start or end of the sought pattern contains weakly conserved
residues, i.e., gaps, the results of the fixed-width motif search contain
shifts in the pattern, while the variable-width results disregard the gaps at
the ends and narrow the width. In such cases, the variable-width search
appears to fail, since it attempts to fit highly conserved bases at the
starting or ending locations. Nevertheless, the overall performance of the
procedure for finding the optimal width in the two variable-width cases, (b)
and (c), is observed to be satisfactory.
As for the fixed-width motif search, some of its results exhibit a shift.
This is due to the tendency of the proposed algorithm to primarily align the
most conserved parts of the motif. Thus, once the conserved nucleotides are
aligned, the rest may shift by up to 2-3 nucleotides, since they do not have a
significant impact on the score. However, such a shift is not considered a
performance reduction in the motif literature (Tompa et al., 2005).
Table 4.9. Sequence logos of the known motifs with predicted ones, (b) and (c)
Dataset   Known motif   Predicted motif (b)   Predicted motif (c)
GAL4
HSF1
RFX1
GCN4
CBF1
CREB
CRP
E2F
ERE
MEF2
MYOD
SRF
TBP
The execution time of almost every motif discovery program is more or less
directly proportional to the dataset length, and the EM/GMM method is no
exception. An increase in the number of clusters, k, also increases the run-time,
since the calculation of Equations 9-14 depends on the choice of k. However, it is
observed that the motif length does not have a significant impact on the
run-time. When the three clustering methods, EM/GMM, FCM and SOM, are compared
in terms of run-time, SOM appears to be much faster than the others: it
completes a given number of iterations for an input approximately 5000 bp long
in 70 seconds, while the others finish the same task in over 120 seconds. The
increased run-time of EM/GMM with respect to SOM is connected with Equation
17, which boosts the performance in terms of motif discovery metrics. The
selection mechanism in Equation 17 requires the PDF values for each input to
be sorted in each iteration and consequently causes the algorithm to take
longer to finish the clustering task. Moreover, the post-processing procedures
that attempt to improve the alignments and find the optimal width also
increase the run-time of the algorithm.
4.5. Particle Swarm Optimization to Identify Regulatory Elements
In the previous sections, motif-finding methods based on clustering
algorithms were evaluated. Moreover, in Section 4.4, the effectiveness of the
Bayesian scoring scheme was discussed and demonstrated on various datasets.
Besides post-optimization, the Bayesian scoring scheme has also proven
effective as a fitness function for motif-finding methods from the literature
that are based on stochastic search algorithms such as GA (Wei and Jensen,
2006). To the best of our knowledge, the Bayesian fitness function has never
been utilized with PSO in the motif-finding literature. Section 3.2.6
presented how PSO and the Bayesian fitness function can be combined for the
motif-finding task. In this section, the proposed PSO-based method is
evaluated against the third and fourth groups of datasets explained in
Section 3.1.
When assessing PSO with these datasets, an essential point to be decided
was the selection of the input parameters, which is widely discussed in the
literature (Shi and Eberhart, 1998) and known to have a significant impact on
the performance of the algorithm. The parameter $\alpha$, the so-called
inertia weight, controls the velocity of the particle. As discussed by many
researchers, a high value for this parameter causes the particles to explore a
greater space, but with the risk of jumping over the optimal regions, while a
small value for this parameter
facilitates exploitation of the local search area. To balance the exploration
and exploitation abilities of the particles, researchers generally start the
population with a relatively high value of $\alpha$ and reduce it gradually as
the algorithm iterates. In this study, it is observed that starting with
$0.4 \le \alpha \le 0.8$ and gradually reducing it to $\alpha = 0.1$ results
in good exploration ability while providing sufficient fine-tuning capability.
According to the experiments, taking even a slightly greater value, for
instance $\alpha > 1.0$, yielded reduced performance for our application.
As for $\beta$ and $\gamma$, most papers in the literature take equal values
for these parameters, for instance $0.8 \le \beta \le 1.2$ and the same for
$\gamma$. Therefore, in our applications, we consistently used $\beta = 0.8$
and $\gamma = 0.8$. Furthermore, the population size is another parameter
known to affect the performance of the PSO algorithm. We tried several options
from 10 to 300 particles and measured the performance. As expected, small
populations, e.g., 10 particles, result in poorer exploration of the space,
while a larger population, e.g., 300 particles, provides better exploration
and exploitation but with much greater running times (see Figures 4.13 and
4.14). To balance this trade-off, the swarm size is set to 100 throughout the
experiments.
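To make the role of the three parameters concrete, the sketch below runs a canonical global-best PSO with a linearly decreasing inertia weight on a toy continuous function. It is not the motif-specific method of Section 3.2.6: the objective, the search range, the linear decay schedule and all names are assumptions chosen for illustration, and only the parameter values (inertia from 0.4 down to 0.1, both acceleration coefficients at 0.8, 100 particles) follow the text.

```python
import random

def pso_minimize(f, dim, n_particles=100, iters=200,
                 alpha_start=0.4, alpha_end=0.1, beta=0.8, gamma=0.8, seed=0):
    """Canonical global-best PSO (toy sketch) with linearly decaying inertia."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for t in range(iters):
        # Inertia weight shrinks from alpha_start to alpha_end over the run.
        alpha = alpha_start + (alpha_end - alpha_start) * t / max(1, iters - 1)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (alpha * vel[i][d]
                             + beta * r1 * (pbest[i][d] - pos[i][d])   # cognitive term
                             + gamma * r2 * (gbest[d] - pos[i][d]))    # social term
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)

best, best_val = pso_minimize(sphere, dim=2)
print(best, best_val)
```

A larger inertia weight keeps more of the previous velocity and so favors exploration, while the decay toward 0.1 lets the swarm settle and fine-tune around the best positions found.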
With the above-mentioned parameters ($\alpha = 0.4$, $\beta = 0.8$,
$\gamma = 0.8$ and swarm size = 100), four PSO variants are executed
separately for each of the 800 synthetic datasets (the fourth group of
datasets). The quantitative assessment of the experimental results is done by
calculating the F-Score (Wei and Jensen, 2006; Shaw et al., 1997), which
measures both Precision and Recall at the same time. The results of the PSO
variants in terms of F-Score are given in Table 4.10, where the best score for
each scenario is given in bold.
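For reference, a minimal sketch of the F-Score computation is given below. It assumes the balanced form (the harmonic mean of Precision and Recall); the exact variant used in the cited studies may differ, and the function name and counts are hypothetical.

```python
def f_score(tp, fp, fn):
    """Balanced F-Score: harmonic mean of precision and recall (assumed form)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction: 8 true sites found, 2 false sites, 2 sites missed.
print(round(f_score(tp=8, fp=2, fn=2), 2))
```

Unlike Sp, the F-Score ignores true negatives, so it is not inflated by the large number of non-motif positions in a dataset.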
According to the table, in the scenarios where the sought motif is long,
all variants performed well and produced exactly the same F-Scores. In the
short motif scenarios, Bidirectional Ring achieved the best score in three out
of four dataset groups, whereas GBest and Von Neumann were each the best for
only one dataset group. On average, Bidirectional Ring is observed to be the
best and Random the worst. According to Kennedy (1999), topologies with fewer
connections may perform better in multimodal problems where
there are many local optima. In Bidirectional Ring, each particle has only three
connections, including itself, and thus the social information flow is the
slowest of all. Hence, throughout the experiments, even though Von Neumann
appeared to produce the best fitness values on average, Bidirectional Ring was
the best in terms of predicting the true positive subsequences that form the
sought motif.
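The three-connection neighborhood of the Bidirectional Ring can be sketched as follows. The indexing scheme, the function names and the convention that a higher fitness is better are assumptions for illustration.

```python
def ring_neighbors(i, n_particles):
    """Indices visible to particle i: left neighbor, itself, right neighbor."""
    return [(i - 1) % n_particles, i, (i + 1) % n_particles]

def local_best(i, fitness):
    """Index of the fittest particle in i's ring neighborhood (higher is better)."""
    return max(ring_neighbors(i, len(fitness)), key=lambda j: fitness[j])

print(ring_neighbors(0, 100))                 # the ring wraps around at the ends
print(local_best(0, [0.1, 0.5, 0.2, 0.9]))    # neighbor 3 holds the best fitness
```

Since each particle only sees its two adjacent neighbors, a good solution needs many iterations to propagate around the ring, which slows convergence and can help the swarm avoid settling early in a local optimum.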
Table 4.10. Results of PSO variants for synthetic datasets in terms of F-Scores

Motif Width  Cons.  Dataset Length  GBest  B.Ring  Random  Von Neumann
Short        High   Small           0.78   0.80    0.79    0.79
Short        Low    Small           0.54   0.50    0.46    0.48
Short        High   Large           0.83   0.84    0.83    0.84
Short        Low    Large           0.41   0.46    0.38    0.41
Long         High   Small           0.96   0.96    0.96    0.96
Long         Low    Small           0.85   0.85    0.85    0.85
Long         High   Large           0.98   0.98    0.98    0.98
Long         Low    Large           0.90   0.90    0.90    0.90
AVERAGE                             0.78   0.79    0.77    0.78
In each of the 800 experiments, PSO is run to extract the top 30 motifs ranked
by fitness value, and the result with the highest F-Score is then taken. In
approximately 92 percent of the experiments, the result with the best F-Score
was also the top motif by fitness value. Similarly, in 98 percent of the
experiments, the result with the best F-Score was ranked in the top 10 by
fitness value. This observation shows that the utilized fitness function is
appropriate for the motif discovery application and is in accordance with the
sought motif model. In brief, the simulations over the synthetic datasets put
forward two topologies: Von Neumann, with high average fitness values, and
Bidirectional Ring, with the best F-Scores.
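The agreement check described above can be sketched as follows. For each experiment, the candidate motifs are ranked by fitness and the rank of the candidate with the best F-Score is recorded; the share of experiments in which that rank falls within the top k is then reported. The helper names and the toy numbers are hypothetical, and the sketch assumes that a higher fitness value is better.

```python
def fitness_rank_of_best_f(candidates):
    """candidates: list of (fitness, f_score) pairs for one experiment.

    Returns the 1-based fitness rank of the candidate with the highest F-Score.
    If several candidates share that F-Score, the best-ranked one is used.
    """
    by_fitness = sorted(candidates, key=lambda c: c[0], reverse=True)
    best_f = max(f for _, f in by_fitness)
    for rank, (_, f) in enumerate(by_fitness, start=1):
        if f == best_f:
            return rank


def share_within_top(ranks, k):
    """Fraction of experiments whose best-F-Score motif is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Two toy experiments (made-up values, for illustration only)
    experiments = [
        [(10.0, 0.8), (9.0, 0.6)],
        [(5.0, 0.4), (4.0, 0.7), (3.0, 0.2)],
    ]
    ranks = [fitness_rank_of_best_f(e) for e in experiments]
    print(ranks, share_within_top(ranks, 1), share_within_top(ranks, 10))
```

A share close to 1 for small k indicates that the fitness function orders candidate motifs in nearly the same way as the F-Score computed from the known sites.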
We also compared these results with those of GAME, MEME and BioProspector as
reported by Chan et al. (2008). Table 4.11 presents the comparison of the
proposed method with these tools. In the table, two separate columns are given
for PSO. The first PSO column collects the best result among the PSO variants
for each synthetic dataset group, whereas the second contains only the results
of Bidirectional Ring, which performed best among the four PSO topologies.
Table 4.11. Performance comparison of motif-finding tools for synthetic datasets

Motif Width  Cons.  Dataset Length  PSO (Best)*  PSO (B.Ring)  GAME  MEME  BioPros.
Short        High   Small           0.80         0.80          0.75  0.85  0.78
Short        Low    Small           0.54         0.50          0.30  0.39  0.39
Short        High   Large           0.84         0.84          0.83  0.83  0.76
Short        Low    Large           0.46         0.46          0.36  0.42  0.45
Long         High   Small           0.96         0.96          0.97  0.98  0.97
Long         Low    Small           0.85         0.85          0.82  0.88  0.83
Long         High   Large           0.98         0.98          0.98  0.98  0.96
Long         Low    Large           0.90         0.90          0.90  0.90  0.80
AVERAGE                             0.82         0.79          0.79  0.78  0.74
* Selection of best PSO results from Table 4.10 for each dataset

When both sets of PSO results are compared to the other tools, PSO appears to
be superior overall. On the long motif datasets, where all methods performed
fairly well, MEME outperforms the others on average F-Score. However, on the
short motif datasets, which are noisier and more challenging, PSO performs much
better. Additionally, averaged over all datasets, PSO is still the best with
F-Scores of 0.82 and 0.79. Another notable point is that GAME is a GA-based
motif discovery method that resembles the proposed method in that it utilizes
the same Bayesian framework. Although they share the same fitness model, the
results of the proposed method are clearly better than those of GAME; hence, in
these experiments PSO compares favorably with GA.
In addition to the first experiment over the synthetic datasets, PSO and the
above-mentioned algorithms are also evaluated on the third group of datasets of
the study. With these datasets, all PSO variants are executed separately while
the input parameters of PSO used in the synthetic dataset experiments are kept
the same (α = 0.4, β = 0.8, γ = 0.8 and swarm size = 100). The respective
results of the PSO variants are given in Table 4.12. As seen in the table, on
average scores Bidirectional Ring performed the best and, as in the synthetic
dataset experiments, Von Neumann was second. This time, GBest did not perform
as well as Von Neumann, and the Random topology scored the worst, although in
a few cases it was the best. Notably, Von Neumann achieved the top score (ties
included) on the largest number of datasets. Nonetheless, the average F-Scores
of all topologies except Random were comparable to each other.
Table 4.12. Performances of PSO variants for 8 real datasets

Dataset  GBest  B.Ring  Random  Von Neumann
CREB     0.70   0.72    0.54    0.72
CRP      0.77   0.86    0.78    0.86
E2F      0.61   0.82    0.56    0.82
ERE      0.65   0.65    0.67    0.60
MEF2     0.91   0.79    0.88    0.79
MyOD     0.43   0.38    0.41    0.34
SRF      0.73   0.73    0.79    0.73
TBP      0.81   0.82    0.76    0.84
AVERAGE  0.70   0.72    0.67    0.71

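The two ways of reading Table 4.12, by average F-Score and by number of winning scores, can be reproduced with a short sketch. The values are transcribed from the table; the function names are illustrative, and counting a tie as a win for every tied variant is an assumption about how "winning scores" are tallied.

```python
# F-Scores transcribed from Table 4.12; column order follows the table.
VARIANTS = ("GBest", "B.Ring", "Random", "Von Neumann")
F_SCORES = {
    "CREB": (0.70, 0.72, 0.54, 0.72),
    "CRP":  (0.77, 0.86, 0.78, 0.86),
    "E2F":  (0.61, 0.82, 0.56, 0.82),
    "ERE":  (0.65, 0.65, 0.67, 0.60),
    "MEF2": (0.91, 0.79, 0.88, 0.79),
    "MyOD": (0.43, 0.38, 0.41, 0.34),
    "SRF":  (0.73, 0.73, 0.79, 0.73),
    "TBP":  (0.81, 0.82, 0.76, 0.84),
}


def column_means(table):
    """Unrounded mean F-Score of each variant over all datasets."""
    n = len(table)
    return {name: sum(row[i] for row in table.values()) / n
            for i, name in enumerate(VARIANTS)}


def win_counts(table):
    """Number of datasets on which each variant reaches the top score (ties count for all)."""
    wins = dict.fromkeys(VARIANTS, 0)
    for row in table.values():
        top = max(row)
        for name, score in zip(VARIANTS, row):
            if score == top:
                wins[name] += 1
    return wins


if __name__ == "__main__":
    for name, mean in column_means(F_SCORES).items():
        print(f"{name}: {mean:.4f}")
    print(win_counts(F_SCORES))
```

Under these assumptions the means agree with the AVERAGE row to within rounding, and the win counts come out as 4 for Von Neumann, 3 for Bidirectional Ring and 2 each for GBest and Random, which matches the observation that Bidirectional Ring leads on average while Von Neumann leads on winning scores.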
As mentioned before, the number of particles has a significant influence on
the performance. For the 8 real datasets, several population sizes are tested
separately while the other input settings are preserved. As a result, 100
particles appeared to be the best choice since it brings reasonable performance
for all datasets tested. Figure 4.13 depicts the F-Score performance of
Bidirectional Ring, which has appeared to be the best PSO variant so far, on the
8 real datasets.
Figure 4.13 also shows that PSO performs reasonably well even with 10
particles on all of the 8 real datasets except MyOD and TBP. It is not
surprising to see PSO with 10 particles fail on the TBP dataset since it
contains 95 sequences of 200 nucleotides each. For this relatively large
dataset, in which the sought motif is short, PSO required at least 100
particles to perform satisfactorily. In MyOD, although the dataset is not so
large, the sought motif is again very short, only 6 nucleotides. The overall
performance of PSO remains poor for MyOD even when the population size is
increased to 300 particles.
Figure 4.13. Performance of PSO per number of particles
Figure 4.14. Consumed time by PSO with different number of particles
Although increasing the population size generally appears to improve the
performance, it is not always the best option since it brings a heavy
computational burden leading to much longer processing times. Figure 4.14
clearly presents the direct relationship between the population size and the
amount of time consumed by PSO to process each dataset with a maximum of 3000
iterations. Throughout the experiments, once PSO converges, the algorithm is
terminated without executing all 3000 iterations. Hence, for some datasets the
consumed time is observed to decrease as the population size is increased,
since with a larger population PSO may reach the optimum in fewer iterations.
In this context, the CRP dataset is a special case in which PSO required the
most time. The most distinguishing characteristic of CRP is the width of the
sought motif, which is 22 nucleotides, nearly double that of the other
relatively long motifs. The reason behind this highly increased running time
is the PFM and subsequent PWM calculations from the starting locations of the
predicted sites. In order to show the computational cost of each PSO variant,
Figure 4.15 gives the time consumed by each PSO variant, while Figure 4.14
shows the average time over all PSO variants. Since the information flow in
Bidirectional Ring is quite slow, it only terminates at or near the maximum
number of iterations. Hence, it is the most time-consuming topology of all. As
for GBest, convergence is fast because of the immediate communication between
particles; hence, it is the least time-consuming topology among the four.
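To illustrate why a wide motif such as that of CRP is costly, the sketch below builds a position frequency matrix (PFM) from the starting locations of predicted sites and converts it to a position weight matrix (PWM). It is a minimal illustration rather than the implementation used in the experiments: the function names, the one-site-per-sequence interface, the pseudocount and the uniform background are all assumptions, and the PWM is taken in the common log-odds form.

```python
import math

ALPHABET = "ACGT"


def build_pfm(sequences, starts, width):
    """Count nucleotides column by column over one predicted site per sequence."""
    pfm = [[0] * width for _ in ALPHABET]
    for seq, start in zip(sequences, starts):
        site = seq[start:start + width]
        for col, base in enumerate(site):
            pfm[ALPHABET.index(base)][col] += 1
    return pfm


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Log-odds PWM: log2 of the smoothed column frequency over the background frequency."""
    if background is None:
        background = [0.25] * len(ALPHABET)   # assumed uniform background
    width = len(pfm[0])
    pwm = [[0.0] * width for _ in ALPHABET]
    for col in range(width):
        total = sum(pfm[row][col] for row in range(len(ALPHABET)))
        total += len(ALPHABET) * pseudocount
        for row in range(len(ALPHABET)):
            prob = (pfm[row][col] + pseudocount) / total
            pwm[row][col] = math.log2(prob / background[row])
    return pwm


if __name__ == "__main__":
    seqs = ["ACGTAC", "TACGTA", "GGACGT"]
    pfm = build_pfm(seqs, starts=[0, 1, 2], width=4)
    print(pfm)
    print(pfm_to_pwm(pfm))
```

Both loops visit every column of every site, so the work per fitness evaluation grows with the number of sequences multiplied by the motif width. Since this is repeated for each particle in each iteration, a motif about twice as wide roughly doubles this part of the running time.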
Figure 4.15. Consumed time by each PSO variant for third group of datasets
Finally, we compared the performance of PSO with the other motif-finding tools
analyzed previously. As can be seen in Table 4.13, the average performances of
both PSO results, which are based on the best results of all variants and on
Bidirectional Ring, respectively, are superior to those of GAME, MEME and
BioProspector.
Table 4.13. Comparison of motif-finding tools for third group of datasets

Dataset  PSO(Best)*  PSO(B.Ring)  GAME  MEME  BioPros.
CREB     0.72        0.72         0.58  0.59  0.67
CRP      0.86        0.86         0.79  0.67  0.78
E2F      0.82        0.82         0.87  0.76  0.46
ERE      0.67        0.65         0.69  0.71  0.68
MEF2     0.91        0.79         0.71  0.88  0.71
MyOD     0.43        0.38         0.31  0.00  0.00
SRF      0.79        0.73         0.79  0.67  0.70
TBP      0.84        0.82         0.81  0.36  0.71
AVERAGE  0.76        0.72         0.69  0.58  0.59
* Selection of best PSO results from Table 4.12 for each dataset

In real organism datasets (i.e., third group of datasets), GAME appears to
be performing better than MEME and BioProspector. However, in synthetic
datasets, MEME was superior to GAME and BioProspector. The point is that none of
the other methods performed well in both the synthetic and the real dataset
simulations, while PSO was consistently better in both. Even with a small
number of particles, it showed the ability to reach a satisfactory optimum where
the datasets were highly multimodal, i.e., there were several local optima
composed of false positive sub-sequences. Along with several strategies to
alleviate trapping at local optima, four PSO neighbourhood topologies are
utilized and compared to each other. It should be noted that research in this
context (Kennedy and Mendes, 2002) points out that none of the population
topologies is superior to the others for all application fields. Nonetheless,
the results of this study suggest Bidirectional Ring as the best performing PSO
topology for the motif discovery application.
5. CONCLUSIONS
This thesis studies the effectiveness of DNA motif discovery methods developed
on the basis of data mining techniques. The methods proposed in the thesis can
be categorized into three types: a) clustering based methods, b) post-optimized
clustering methods, c) a stochastic search procedure based method. All proposed
methods are tested against four groups of datasets in total, including
Saccharomyces cerevisiae, synthetic and other organism datasets. Moreover, the
effectiveness of the developed tools is assessed against state-of-the-art
methods from the literature such as MEME and MDScan.
The clustering approach to motif finding mainly relies on the
overrepresentation of the motif instances within the DNA sequences. The idea is
that overrepresented subsequences can be gathered into the same cluster after an
adequate amount of training is performed. Consequently, ranking the statistical
significance of the clusters, considering that the motif instances diverge from
the background model, will reveal statistically interesting subsequence
alignments, i.e., motifs. This approach is implemented in four clustering
algorithms, namely FCM, SOM, K-means and EM/GMM. The problem with the clustering
approach is that the original algorithms are designed for vectors in a
numerical space, not for subsequence clustering in the discrete space of DNA
sequences. Therefore, the original clustering algorithms are modified for this
particular task. These algorithm adaptations and updates are presented in
Section 3. With these updates, FCM is evaluated on its own in Section 4.2 and
observed to be promising when compared to MEME and MDScan. Subsequently, all
clustering algorithms are compared to each other and to MEME in Section 4.3.
All the algorithms except FCM performed similarly, and they all performed
reasonably well on lower organism datasets. However, they, including FCM,
generally fail on higher organism DNA datasets such as those of human and fly,
which is a known issue for computational DNA motif discovery tools (Tompa et
al., 2005; Hu et al., 2005). When compared to MEME and other statistical
methods, the clustering technique appears to be more capable of catching weak
motifs residing in the given sequences, thanks to its simultaneous motif-finding
approach. Remarkably, in motif finding, soft clustering algorithms such as FCM
and SOM are observed to be more successful than hard clustering algorithms such
as K-means. GMM, which can be counted as a member of the former group, did not
perform as well as FCM and SOM since it was more sensitive to initial parameters
than they were.
Secondly, a post-optimization procedure based on a Bayesian fitness function
(Jensen and Liu, 2004) is evaluated. The post-optimization is applied to the
EM/GMM approach, which appeared to be less effective than FCM and SOM. According
to the experimental results over two different groups of datasets in Section
4.4, the post-optimization improves the results of the clustering approach to
such an extent that the post-optimized EM/GMM clearly outperforms FCM and the
other compared tools in terms of motif-finding performance measures. This
result demonstrates the effectiveness of the post-optimization procedure based
on the Bayesian framework. In the clustering approach, any number of motif
instances in each sequence is treated equally by the methods, that is, the
w-mers are considered regardless of the sequence they belong to. However, in
real-world scenarios most of the given sequences should contain at least one
TFBS instance of a common motif. In the post-processing step, the motif models
extracted from the DNA sequences via clustering are optimized with this
biological reality taken into account. Additionally, the Bayesian function is
used as a scoring system to find the optimal width of the sought motif by
varying the width between two user-provided values.
The merits of the Bayesian framework are not specific to the clustering
approach. There are methods in the literature that also demonstrate the
effectiveness and versatility of the Bayesian function, such as the
BioOptimizer program (Jensen and Liu, 2004) and GAME (Wei and Jensen, 2006), a
GA-based motif-finding tool. Therefore, thirdly, a PSO-based method that
utilizes the Bayesian function to test the fitness of particles is proposed and
evaluated against a large number of datasets including synthetic and real data.
The results of the experiments show that the proposed PSO-based algorithm is
highly promising and effective for the motif-finding task in DNA sequences. It
performed well in comparison to MEME, MDScan and GAME. The experiments also
lead to another conclusion: for motif finding, the Bidirectional Ring topology
appeared to be outstanding when compared to the other topologies, GBest, Random
and Von Neumann. The literature related to PSO population topologies (Kennedy
and Mendes, 2002; Kennedy, 1999) suggests that the superior performance of
Bidirectional Ring is connected to the small number of connections between its
particles, which leads to slow but mature convergence in multimodal problem
domains.
As briefly discussed above, although the developed algorithms performed well
on most of the datasets, there are still some drawbacks that they cannot avoid.
First, none of the algorithms guarantees the global optimum, that is, they all
tend to fall into local optima. Secondly, due to inadequate mathematical
modeling of TFBS patterns in both lower and higher organisms, computational
methods are still not reasonably accurate for all situations. As the
mathematical models of the biological facts are improved, the accuracy of the
computational tools will eventually be enhanced as well. Nonetheless,
researchers propose various algorithm enhancements, some of which might be
useful in future studies of the methods proposed in this thesis. For instance,
combining relatively weak methods to form a stronger prediction, i.e., ensemble
methods, is a proven methodology (Wijaya et al., 2008). That is, the proposed
methods may yield better performance if utilized in an ensemble strategy.
Secondly, incorporating biological knowledge beyond the DNA sequence itself,
such as phylogenetic data (Wang and Stormo, 2003), may also enhance the
performance.
Furthermore, from the computer science point of view, the literature on
improvements to the utilized algorithms, including FCM, EM and PSO, in other
fields may also be useful in the motif-finding field. For instance, for the
PSO-based method, several PSO variants have been proposed to enhance the
original PSO and remove its drawbacks. The fully informed PSO (Mendes et al.,
2004), multi-objective PSO (Reyes-Sierra and Coello, 2006) and PSO with
constricted parameters (Shi and Eberhart, 1998) are popular instances in this
context. We believe that these variants should also be evaluated to further
improve the performance of the proposed PSO-based method.
REFERENCES
BAILEY, T. L. and ELKAN, C., 1995. The value of prior knowledge in
discovering motifs with MEME. Proceedings of International Conference
on Intelligent Systems for Molecular Biology, 3: 21-9.
BEZDEK, J. C., 1981. Pattern recognition with fuzzy objective function
algorithms, New York, Plenum Press.
BIOINFORMATICS Wiki, 2011, http://bioinformatics.org/wiki
BLANCO, E., FARRE, D., ALBA, M. M., MESSEGUER, X. and GUIGO, R.,
2006. ABS: a database of Annotated regulatory Binding Sites from
orthologous promoters. Nucleic acids research, 34: D63-7.
BURSET, M. and GUIGO, R., 1996. Evaluation of gene structure prediction
programs. Genomics, 34: 353-67.
CHAKRAVARTY, A., CARLSON, J. M., KHETANI, R. S. and GROSS, R. H.,
2007. A novel ensemble learning method for de novo computational
identification of DNA binding sites. BMC bioinformatics, 8: 249.
CARLSON, J. M., CHAKRAVARTY, A., DEZIEL, C. E. and GROSS, R. H.,
2007. SCOPE: a web server for practical de novo motif discovery. Nucleic
acids research, 35: 259-64.
CHAN, T. M., LEUNG, K. S. and LEE, K. H., 2008. TFBS identification based
on genetic algorithm with combined representations and adaptive post-
processing. Bioinformatics, 24: 341-9.
CHE, D., JENSEN, S., CAI, L. and LIU, J. S., 2005. BEST: binding-site
estimation suite of tools. Bioinformatics, 21: 2909-11.
CROOKS, G. E., HON, G., CHANDONIA, J. M. and BRENNER, S. E., 2004.
WebLogo: a sequence logo generator. Genome research, 14: 1188-90.
DAS, M. K. and DAI, H. K., 2007. A survey of DNA motif finding algorithms.
BMC bioinformatics, 8 Suppl 7: S21.
DAS, S., KONAR, A. and CHAKRABORTY, U. K. 2005. Improving particle
swarm optimization with differentially perturbed velocity. Proceedings of
the 2005 conference on Genetic and evolutionary computation.
Washington DC, USA: 177-184
DO, C. B. and BATZOGLOU, S., 2008. What is the expectation maximization
algorithm? Nature biotechnology, 26: 897-9.
DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B., 1977. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society series B, 39:1–38
EBERHART, R. and KENNEDY, J., 1995. A new optimizer using particle swarm
theory. Proceedings of the Sixth International Symposium on Micro
Machine and Human Science: 39-43.
ECOGENE, 2009. The EcoGene Database of Escherichia coli Sequence and
Function, http://ecogene.org
ELNITSKI, L., JIN, V. X., FARNHAM, P. J. and JONES, S. J., 2006. Locating
mammalian transcription factor binding sites: a survey of computational
and experimental techniques. Genome research, 16: 1455-64.
FRITH, M. C., HANSEN, U., SPOUGE, J. L. and WENG, Z., 2004. Finding
functional sequence elements by multiple local alignment. Nucleic acids
research, 32: 189-200.
GORDON, D. B., NEKLUDOVA, L., MCCALLUM, S. and FRAENKEL, E.,
2005. TAMO: a flexible, object-oriented framework for analyzing
transcriptional regulation using DNA-sequence motifs. Bioinformatics, 21:
3164-5.
GUTTMACHER, A. E. and COLLINS, F. S., 2003. Welcome to the genomic era.
The New England journal of medicine, 349: 996-8.
HARDIN, C. T. and ROUCHKA, E. C., 2005. DNA Motif Detection Using
Particle Swarm Optimization and Expectation-Maximization. Proceedings
of the IEEE Swarm Intelligence Symposium, 2005: 181-184.
HERTZ, G. Z., HARTZELL, G. W., 3RD and STORMO, G. D., 1990.
Identification of consensus patterns in unaligned DNA sequences known
to be functionally related. Computer applications in the biosciences :
CABIOS, 6: 81-92.
HU, J., LI, B. and KIHARA, D., 2005. Limitations and potentials of current motif
discovery algorithms. Nucleic acids research, 33: 4899-913.
HU, J., YANG, Y. D. and KIHARA, D., 2006. EMD: an ensemble algorithm for
discovering regulatory motifs in DNA sequences. BMC bioinformatics, 7:
342.
JENSEN, S. T. and LIU, J. S., 2004. BioOptimizer: a Bayesian scoring function
approach to motif discovery. Bioinformatics, 20: 1557-64.
JOISEN, K., LIU, M. C., LIAO, T. W. and TRIANTAPHYLLON, E. 2002. An
evaluation of sampling methods for data mining with fuzzy C-means.
Kluwer Academic Publishers.
KENNEDY, J., 1999. Effects of neighborhood topology on particle swarm
performance, Proceedings of the 1999 Congress on Evolutionary
Computation, pp. 1938
KENNEDY, J. and MENDES, R. 2002. Population structure and particle swarm
performance. Proceedings of the Evolutionary Computation on 2002, 2:
1671-1676
KOHONEN, T., 1998. The self-organizing map, Neurocomputing, 21(1-3): 1-6
LARSSON, E., LINDAHL, P. and MOSTAD, P., 2007. HeliCis: a DNA motif
discovery tool for colocalized motif pairs with periodic spacing. BMC
bioinformatics, 8: 418.
LAWRENCE, C. E., ALTSCHUL, S. F., BOGUSKI, M. S., LIU, J. S.,
NEUWALD, A. F. and WOOTTON, J. C., 1993. Detecting subtle
sequence signals: a Gibbs sampling strategy for multiple alignment.
Science, 262: 208-14.
LEI, C. and RUAN, J., 2010. A particle swarm optimization-based algorithm for
finding gapped motifs. BioData mining, 3: 9.
LIU, F. F. M., TSAI, J. J. P., CHEN, R. M., CHEN, S. N. and SHIH, S. H., 2004.
FMGA: Finding Motifs by Genetic Algorithm. Proceedings of the 4th IEEE
Symposium on Bioinformatics and Bioengineering. IEEE Computer
Society: 459
LIU, D., XIONG, X., DASGUPTA, B. and ZHANG, H., 2006. Motif discoveries
in unaligned molecular sequences using self-organizing neural networks.
IEEE transactions on neural networks / a publication of the IEEE Neural
Networks Council, 17: 919-28.
LIU, X., BRUTLAG, D. L. and LIU, J. S., 2001. BioProspector: discovering
conserved DNA motifs in upstream regulatory regions of co-expressed
genes. Pacific Symposium on Biocomputing. Pacific Symposium on
Biocomputing: 127-38.
LIU, X. S., BRUTLAG, D. L. and LIU, J. S., 2002. An algorithm for finding
protein-DNA binding sites with applications to chromatin-
immunoprecipitation microarray experiments. Nature biotechnology, 20:
835-9.
LUSCOMBE, N. M., GREENBAUM, D. and GERSTEIN, M., 2001. What is
bioinformatics? A proposed definition and overview of the field. Methods
of information in medicine, 40: 346-58.
MAHONY, S., HENDRIX, D., GOLDEN, A., SMITH, T. J. and ROKHSAR, D.
S., 2005. Transcription factor binding site identification using the self-
organizing map. Bioinformatics, 21: 1807-14.
MATYS, V., FRICKE, E., GEFFERS, R., GOSSLING, E., HAUBROCK, M.,
HEHL, R., HORNISCHER, K., KARAS, D., KEL, A. E., KEL-
MARGOULIS, O. V., KLOOS, D. U., LAND, S., LEWICKI-POTAPOV,
B., MICHAEL, H., MUNCH, R., REUTER, I., ROTERT, S., SAXEL, H.,
SCHEER, M., THIELE, S. and WINGENDER, E., 2003. TRANSFAC:
transcriptional regulation, from patterns to profiles. Nucleic acids research,
31: 374-8.
MCLACHLAN, G. M. and KRISHNAN, T., 1997. The EM Algorithm and
Extensions, Wiley series in probability and statistics. John Wiley and
Sons.
MENDES, R., KENNEDY, J., and NEVES, J., 2004. The Fully Informed Particle
Swarm: Simpler, Maybe Better. In Proceedings of IEEE Trans.
Evolutionary Computation, 204-210.
PAVESI, G., MAURI, G. and PESOLE, G., 2001. An algorithm for finding
signals of unknown length in DNA sequences. Bioinformatics, 17 Suppl 1:
S207-14.
PEVZNER, P. A. and SZE, S. H., 2000. Combinatorial approaches to finding
subtle signals in DNA sequences. Proceedings of International Conference
on Intelligent Systems for Molecular Biology, ISMB, 8: 269-78.
POLI, R., 2008. Analysis of the publications on the applications of particle swarm
optimisation. J. Artif. Evol. App., 2008: 1-10.
REYES-SIERRA, M. and COELLO, C. A. C., 2006. Multi-Objective Particle
Swarm Optimizers: A Survey of the State-of-the-Art. International Journal
of Computational Intelligence Research, 2 (3).
ROTH, F. P., HUGHES, J. D., ESTEP, P. W. and CHURCH, G. M., 1998.
Finding DNA regulatory motifs within unaligned noncoding sequences
clustered by whole-genome mRNA quantitation. Nature biotechnology,
16: 939-45.
SANDELIN, A., ALKEMA, W., ENGSTROM, P., WASSERMAN, W. W. and
LENHARD, B., 2004. JASPAR: an open-access database for eukaryotic
transcription factor binding profiles. Nucleic acids research, 32: D91-4.
SANDVE, G. K. and DRABLOS, F., 2006. A survey of motif discovery methods
in an integrated framework. Biology direct, 1: 11.
SCHNEIDER, T. D. and STEPHENS, R. M., 1990. Sequence logos: a new way to
display consensus sequences. Nucleic acids research, 18: 6097-100.
SGD Project, 2008. "Saccharomyces Genome Database"
http://www.yeastgenome.org/
SHAW, W. M. J., BURGIN, R. and HOWELL, P., 1997. Performance standards
and evaluations in IR test collections: cluster-based retrieval models. Inf.
Process. Manage., 33: 1-14.
SHI, Y. and EBERHART, R. C. 1998. Parameter Selection in Particle Swarm
Optimization. Proceedings of the 7th International Conference on
Evolutionary Programming VII. Springer-Verlag: 591-600
SINHA, S. and TOMPA, M., 2000. A statistical method for finding transcription
factor binding sites. Proceedings of International Conference on Intelligent
Systems for Molecular Biology, 8: 344-54.
STINE, M., 2003. Motif discovery in upstream sequences of coordinately
expressed genes. CEC’03, USA, 1596–1603
STORMO, G. D., 2000. DNA binding sites: representation and discovery.
Bioinformatics, 16: 16-23.
THIJS, G., LESCOT, M., MARCHAL, K., ROMBAUTS, S., DE MOOR, B.,
ROUZE, P. and MOREAU, Y., 2001. A higher-order background model
improves the detection of promoter regulatory elements by Gibbs
sampling. Bioinformatics, 17: 1113-22.
TOMPA, M., 1999. An exact method for finding short motifs in sequences, with
application to the ribosome binding site problem. International Conference
on Intelligent Systems for Molecular Biology: 262-71.
TOMPA, M., LI, N., BAILEY, T. L., CHURCH, G. M., DE MOOR, B., ESKIN,
E., FAVOROV, A. V., FRITH, M. C., FU, Y., KENT, W. J., MAKEEV,
V. J., MIRONOV, A. A., NOBLE, W. S., PAVESI, G., PESOLE, G.,
REGNIER, M., SIMONIS, N., SINHA, S., THIJS, G., VAN HELDEN, J.,
VANDENBOGAERT, M., WENG, Z., WORKMAN, C., YE, C. and
ZHU, Z., 2005. Assessing computational tools for the discovery of
transcription factor binding sites. Nature biotechnology, 23: 137-44.
WANG, T. and STORMO, G. D., 2003. Combining phylogenetic data with co-
regulated genes to identify regulatory motifs. Bioinformatics, 19: 2369-80.
WEI, Z. and JENSEN, S. T., 2006. GAME: detecting cis-regulatory elements
using a genetic algorithm. Bioinformatics, 22: 1577-84.
WIJAYA, E., YIU, S. M., SON, N. T., KANAGASABAI, R. and SUNG, W. K.,
2008. MotifVoter: a novel ensemble method for fine-grained integration of
generic motif finders. Bioinformatics, 24: 2288-95.
XINCHAO, Z., 2010. A perturbed particle swarm algorithm for numerical
optimization. Appl. Soft Comput., 10: 119-124.
VAN HELDEN, J., ANDRE, B. and COLLADO-VIDES, J., 1998. Extracting
regulatory sites from the upstream region of yeast genes by computational
analysis of oligonucleotide frequencies. Journal of molecular biology, 281:
827-42.
YAOCHU, J., LIPO, W., 2009. Fuzzy Systems in Bioinformatics and
Computational Biology, Springer Berlin / Heidelberg
YANOVER, C., SINGH, M. and ZASLAVSKY, E., 2009. M are better than one:
an ensemble-based motif finder and its application to regulatory element
prediction. Bioinformatics, 25: 868-74.
ZHOU, H., SCHAEFER, G. and SHI, C. 2009. Fuzzy C-Means Techniques for
Medical Image Segmentation. Springer Berlin / Heidelberg.
ZHOU, W., ZHOU, C., LIU, G., HUANG, Y., 2005. Identification of
Transcription Factor Binding Sites Using Hybrid Particle Swarm
Optimization. Rough Sets, Fuzzy Sets, Data Mining, and Granular
Computing, Springer Berlin / Heidelberg, 438-445
CURRICULUM VITAE
Mustafa KARABULUT was born on March 29th, 1979, in Gaziantep,
TURKIYE. He received BSc degree in Computer Engineering from Çanakkale 18
Mart University in 2001 and then, the MSc degree in Electrical-Electronics
Engineering from Kahramanmaraş Sütçü İmam University in 2007. He worked as
a software developer between 2001 and 2003 in an IT company. Since 2003, he
has been working as an instructor at Vocational School of Higher Education
department in University of Gaziantep.