ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED ... · Mustafa KARABULUT EMPLOYING DATA...

92
ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES PhD THESIS Mustafa KARABULUT EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING ADANA, 2011


ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES

PhD THESIS

Mustafa KARABULUT

EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

ADANA, 2011


ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES

EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION

Mustafa KARABULUT

PhD THESIS

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

We certify that the thesis titled above was reviewed and approved for the award of the degree of Doctor of Philosophy by the board of jury on 02/12/2011.

Asst. Prof. Dr. Turgay İBRİKCİ, SUPERVISOR
Prof. Dr. Elif Derya ÜBEYLİ, MEMBER
Prof. Dr. Hamza EROL, MEMBER
Assoc. Prof. Dr. Ulus ÇEVİK, MEMBER
Asst. Prof. Dr. Sami ARICA, MEMBER

This PhD Thesis is written at the Department of Institute of Natural And Applied Sciences of Çukurova University. Registration Number:

Prof. Dr. İlhami YEĞİNGİL Director Institute of Natural and Applied Sciences

Note: The usage of the presented specific declarations, tables, figures, and photographs either in this thesis or in any other reference without citation is subject to "The Law of Arts and Intellectual Products" number 5846 of the Turkish Republic.


ABSTRACT

PhD THESIS

EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION

Mustafa KARABULUT

ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

Supervisor: Asst. Prof. Dr. Turgay İBRİKCİ

Year: 2011, Pages: 92

Jury: Asst. Prof. Dr. Turgay İBRİKCİ
      Prof. Dr. Elif Derya ÜBEYLİ
      Prof. Dr. Hamza EROL
      Assoc. Prof. Dr. Ulus ÇEVİK
      Asst. Prof. Dr. Sami ARICA

Identification of transcription factor binding sites (TFBSs) is a significant task in contemporary biology towards deciphering the genome functions and understanding gene regulatory networks. One way to identify them is via laboratory experiments which are laborious, time consuming and costly. Alternatively, computational methods based on pattern recognition techniques are proposed in the literature for automatic extraction of TFBS instances from given DNA sequences.

In this study, three different computational methods are proposed. The first approach is based on clustering all w-mers and attempting to find a statistically interesting local alignment via z-score testing. Four clustering methods, Self-Organizing Map, Fuzzy C-Means, K-Means and Expectation Maximization with Gaussian Mixture Models, are considered in this context. The second technique is similar to the first one except that it has a Bayesian post-optimization procedure to fine-tune local alignments composed of position weight matrices. The third computational technique developed in the thesis adopts a different approach by utilizing a stochastic search procedure, namely Particle Swarm Optimization. The developed methods, each of which offers novel contributions to the relevant literature, are evaluated against several types of datasets including DNA of lower and higher organisms. Moreover, they are compared to state-of-the-art motif-finding tools from the literature such as MEME and MDScan. Experimental results suggest that the proposed methods are highly promising for the DNA motif-finding task.

Key Words: Transcription factor binding site, DNA, motif discovery, machine learning, particle swarm optimization


ÖZ

PhD THESIS

EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE IDENTIFICATION

Mustafa KARABULUT

ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

Supervisor: Asst. Prof. Dr. Turgay İBRİKCİ

Year: 2011, 94 Pages

Jury: Asst. Prof. Dr. Turgay İBRİKCİ
      Prof. Dr. Elif Derya ÜBEYLİ
      Prof. Dr. Hamza EROL
      Assoc. Prof. Dr. Ulus ÇEVİK
      Asst. Prof. Dr. Sami ARICA

Identification of transcription factor binding sites (TFBSs) is an important process in modern biology towards deciphering genome functions and gene regulatory networks. One way to identify these sites is through laboratory experiments, which are labor intensive, time consuming and expensive. As an alternative to these experiments, computational methods based on pattern recognition that automatically extract TFBS regions from given DNA sequences are also available in the relevant literature.

In this study, three different computational methods are proposed. The first proposed approach clusters all given w-mers and attempts to find a statistically interesting local alignment using a z-score test. Four clustering methods are evaluated in this context: the Self-Organizing Map, Fuzzy C-Means, K-Means and Expectation Maximization algorithms. The second technique is quite similar to the first one but, in contrast, includes a post-clustering optimization procedure based on Bayes' theorem to improve local alignments composed of position weight matrices. The third computational technique developed in this thesis adopts a stochastic search procedure called particle swarm optimization. The developed methods were evaluated, with the aim of making new contributions to the relevant literature, using many datasets containing DNA of lower and higher organisms. Moreover, they were also compared with advanced methods from the literature such as MEME and MDScan. Experimental results showed that the proposed methods are highly promising for the DNA motif-finding task.

Key Words: Transcription factor binding site, motif discovery, machine learning, particle swarm optimization


ACKNOWLEDGEMENTS

I would like to express my respect and deepest gratitude to my supervisor, Asst. Prof. Dr. Turgay İbrikçi, for providing me with thoughtful guidance, remarkable insights and continuous encouragement.

I also thank my committee members, Prof. Dr. Elif Derya ÜBEYLİ, Prof. Dr. Hamza Erol, Assoc. Prof. Dr. Ulus ÇEVİK and Asst. Prof. Dr. Sami ARICA, for their support and valuable discussions.

Special thanks go to my friends, especially my colleagues from the University of Gaziantep, for supporting, motivating and encouraging me in my study.

I would like to express my special appreciation to my wife, Esra KARABULUT, for supporting me with patience all the time. This study wouldn’t exist if she hadn’t supported me.


CONTENTS

ABSTRACT
ÖZ
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
    1.1. Protein Synthesis: Transcription and Translation
    1.2. Discovery of Transcriptional Regulatory Elements
    1.3. Problem definition of DNA Motif Discovery
    1.4. Computational representation of DNA Motifs
    1.5. Motivation and Goal of the Thesis
2. RELATED WORKS
3. MATERIAL AND METHODS
    3.1. Datasets Utilized In the Study
    3.2. Methods
        3.2.1. Fuzzy C-Means
        3.2.2. Expectation Maximization with Gaussian Mixture Models
        3.2.3. Self-Organizing Map
        3.2.4. K-Means
        3.2.5. Post-optimization for clustering approach
        3.2.6. Particle Swarm Optimization
4. RESEARCH AND DISCUSSION
    4.1. Evaluation Metrics
    4.2. Employing Fuzzy C-Means for DNA Motif Discovery
    4.3. Assessment of Clustering Algorithms for Motif Discovery
    4.4. Evaluation of Post-Optimization for EM/GMM Method
    4.5. Particle Swarm Optimization to Identify Regulatory Elements
5. CONCLUSIONS
REFERENCES
CURRICULUM VITAE


LIST OF TABLES

Table 1.1. Degenerate symbols for ambiguous letters
Table 3.1. Saccharomyces cerevisiae datasets
Table 3.2. The second group of datasets that consists of different species
Table 3.3. The third group of datasets
Table 3.4. Properties of synthetic datasets for different scenarios
Table 4.1. Performances of FCM, MEME and MDScan
Table 4.2. Predicted and known motifs in sequence logo format
Table 4.3. Experimental results of four clustering algorithms for each dataset
Table 4.4. Motif finding performance of MEME for each dataset
Table 4.5. Comparison of EM/GMM with other algorithms for Saccharomyces cerevisiae datasets in terms of MCC
Table 4.6. Comparison of four algorithms for third group of datasets
Table 4.7. Best results of GMM/EM, FCM and SOMBRERO
Table 4.8. Sequence logos of the known motifs and the predicted ones (a)
Table 4.9. Sequence logos of the known motifs with predicted ones, (a) and (b)
Table 4.10. Results of PSO variants for synthetic datasets in terms of F-Scores
Table 4.11. Performance comparison of motif-finding tools for synthetic datasets
Table 4.12. Performances of PSO variants for 8 real datasets
Table 4.13. Comparison of motif-finding tools for third group of datasets


LIST OF FIGURES

Figure 1.1. The three phases of transcription process resulting in RNA
Figure 1.2. Extraction of TFBS instances from given set of promoter sequences
Figure 1.3. Generating PFM and PWM from the alignment of subsequences: (a) a set of sequences, (b) PFM, (c) background frequencies, (d) PWM
Figure 1.4. A sample sequence logo
Figure 3.1. Extraction of subsequences by using the sliding-windows technique
Figure 3.2. Graphical representations of utilized population topologies: (a) GBest (b) Ring (c) Random (d) Von Neumann
Figure 3.3. A sample particle and its evaluation
Figure 3.4. Transformation of iS into ciS
Figure 3.5. Pseudocode of PSO-based proposed algorithm
Figure 3.6. A Single iteration of re-alignment and simultaneous shift operators
Figure 4.1. The effect of number of clusters over the performance for FCM
Figure 4.2. The processing time of FCM for each dataset
Figure 4.3. The performance of FCM per number of training cycles
Figure 4.4. Comparison of the three methods in terms of MCC
Figure 4.5. Correlation between performance and number of clusters
Figure 4.6. Average motif finding performances of clustering algorithms
Figure 4.7. Training time of each algorithm to cluster LEXA dataset
Figure 4.8. Performances of clustering algorithms and MEME for each species
Figure 4.9. Comparison of average performances of clustering algorithms and MEME for each species
Figure 4.10. Performance of the algorithm over the number of clusters for each dataset
Figure 4.11. Overall performances of the algorithms for second group datasets
Figure 4.12. Performance variance of algorithms over two parameters
Figure 4.13. Performance of PSO per number of particles
Figure 4.14. Consumed time by PSO with different number of particles
Figure 4.15. Consumed time by each PSO variant for third group of datasets


LIST OF ABBREVIATIONS

BMU : Best Matching Unit

DNA : Deoxyribonucleic acid

EM : Expectation Maximization

FCM : Fuzzy C-Means

FN : False Negative

FP : False Positive

GA : Genetic Algorithm

GMM : Gaussian Mixture Models

HMM : Hidden Markov Model

HMR : Human, mouse and rat

MCC : Matthews’ Correlation Coefficient

mRNA : Messenger RNA

PDF : Probability Density Function

PFM : Position Frequency Matrix

PPV : Positive Predictive Value

PSO : Particle Swarm Optimization

PSSM : Position Specific Scoring Matrix

PWM : Position Weight Matrix

RNA : Ribonucleic acid

RNAP : RNA polymerase

SOM : Self-Organizing Map

TF : Transcription Factor

TFBS : Transcription Factor Binding Site

TN : True Negative

TP : True Positive

Sn : Sensitivity

Sp : Specificity


1. INTRODUCTION

In the genomic era, in which many genome sequencing projects have unveiled large volumes of data, computational methods have received great attention as a means of processing the available high-throughput information, and this has led to the emergence of a relatively new research field called bioinformatics. This interdisciplinary field, which combines computer science with biology, makes use of information technology to help comprehend biological processes. Due to rapid developments in genomic research technologies and the resulting huge amount of available data, bioinformatics mostly deals with the management of databases and with computational/statistical methods to process genomic data (Luscombe, 2001). Bioinformatics researchers focus on the application and development of computational methods based on machine learning, data mining and artificial intelligence techniques. Major bioinformatics research topics include sequence analysis, gene expression analysis, genome annotation, comparative genomics, gene ontology and taxonomy, phylogenetics and systems biology (Bioinformatics Wiki, 2011).

In early bioinformatics studies, often referred to as the pre-genomic era (Guttmacher and Collins, 2003) or classical bioinformatics, the efforts were mostly directed at obtaining the genome sequences of organisms, including human. In the post-genomic era, however, the research focus has shifted towards deciphering genome functions and understanding gene regulatory networks. Thus, modeling and examining the transcription process, which is a vital step in gene expression and protein synthesis, have recently become research topics of significance.

1.1. Protein Synthesis: Transcription and Translation

Protein synthesis is a process in which a protein-coding gene is transcribed into messenger RNA (mRNA) and this gene product is then translated into the target protein. Although protein synthesis differs slightly between prokaryotes and eukaryotes, it always includes the two mentioned steps, i.e., transcription and translation. Transcription is the process of making a complementary RNA copy of a DNA segment that includes a gene. If the target gene is protein-coding, the complementary copy is mRNA, which in turn eventually produces a protein. Depending on what the gene encodes, the result of transcription may instead be some sort of RNA other than mRNA, which functions as a part of the gene regulatory network. In either case, transcription is initiated by transcription factors (TFs) which bind to DNA, usually in the upstream region (promoter) in close proximity to the target gene. The TFs do not bind to random locations; instead, they bind to specific DNA regions called Transcription Factor Binding Sites (TFBSs), which are considered transcription regulators. Once a TF binds to a TFBS, it activates the enzyme RNA polymerase (RNAP), which then produces the complementary RNA copy of the DNA.

Figure 1.1. The three phases of transcription process resulting in RNA

The second phase of protein synthesis is the translation of the obtained RNA into a chain of amino acids called a polypeptide, which then folds into a protein. The ribosome is in charge of decoding the RNA into the proper sequence of amino acids. The RNA, composed of A (Adenosine), C (Cytidine), G (Guanosine) and U (Uridine), is translated according to the translation table, in which every three sequential nucleotides, called a codon, correspond to a specific amino acid out of the 20 possible amino acids.
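The codon-to-amino-acid mapping described above can be sketched in a few lines of Python. This is only an illustration, not part of the thesis: it assumes the standard genetic code, uses one-letter amino acid codes, and starts from the coding strand (whose sequence equals the mRNA with T replaced by U) rather than modeling RNA polymerase. The function names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
# '*' marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """Return the mRNA that has the same sequence as the DNA coding strand."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

The table has 64 codons that map onto 20 distinct amino acids plus the stop signal, which is why several codons encode the same amino acid.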

1.2. Discovery of Transcriptional Regulatory Elements

In contemporary biology, identification of transcriptional regulatory elements has been a crucial task for understanding the transcriptional regulatory mechanism and the functional components of gene expression (Stormo, 2000). In this context, locating TFBSs genome-wide has been a matter of interest to researchers. So far, to achieve this goal, several methods of two types, computational and experimental, have been proposed in the literature (Elnitski et al., 2006). Common experimental laboratory approaches to identify TFBSs include exploiting DNaseI hypersensitivity, ChIP assays and the ChIP-chip technique. Experimentally verified TFBS instances of several organisms are stored in online databases such as JASPAR (Sandelin et al., 2004) and TRANSFAC (Matys et al., 2003). These databases enable researchers to search and match TFBS instances against unverified DNA sequences so that the presence of a known instance can be detected.

As an alternative to these laboratory experiments, which are expensive and labor intensive, computational methods mostly based on pattern recognition and data mining techniques have also been proposed in the literature to identify TFBSs (Das and Dai, 2007). Since these methods attempt to discover TFBS instances in given unaligned DNA sequences with little or no prior knowledge, they are preferable to experimental methods. In general, computational methods are used to find patterns of overrepresented TFBS instances in promoter sequences of putatively co-regulated genes. Phylogenetic footprinting using orthologous sequences is an alternative way of identifying TFBS patterns residing in the sequences. Either way, whether the datasets include co-regulated or orthologous genes, the computationally performed task is known as de novo motif discovery. In this study, the de novo motif discovery methods developed and/or evaluated consider only datasets of putatively co-regulated genes.

1.3. Problem definition of DNA Motif Discovery

In the DNA motif discovery task, which is performed over a set of DNA sequences that are promoters of putatively co-expressed genes, the goal is to find recurring TFBS instances that form a common pattern of statistical significance (Figure 1.2). This is, however, not a straightforward task since:

• The locations of the TFBS instances relative to the target genes are unknown

• The common pattern, i.e., the sought motif, is also unknown

• The TFBS instances aren’t necessarily exact matches and each may have variations such as insertions and deletions (Das and Dai, 2007)

The motif discovery task is considered an NP-hard problem, and the sequences are generally too long for an exhaustive search of all possible locations; e.g., each TFBS generally resides within a few thousand bp of its target gene. In addition to the NP-hardness, the given promoter sequences may also contain statistically interesting alignments that are not biologically relevant. Thus, motif discovery methods should also be capable of presenting multiple motif results.

Figure 1.2. Extraction of TFBS instances from given set of promoter sequences.

The practitioner of a motif discovery method is usually expected to provide a set of DNA sequences that are upstream regions of supposedly co-regulated genes. In the most common case, each given sequence contains a single TFBS that regulates transcription of the target gene. However, in real-world scenarios each sequence may contain more than one instance of the sought TFBS pattern. Moreover, some of the given sequences may turn out not to belong to a co-regulated gene, that is, they may not contain any TFBS of interest. The TFBS instances that function in the regulation of co-expressed genes share a common pattern which appears statistically interesting with regard to the background distribution of nucleotides. Since a common pattern is sought, TFBS instances are considered to be of the same width while performing the search, even though that assumption is not necessarily true for all cases. Therefore, the goal is simplified to finding the TFBS starting locations on the given DNA sequences. Nonetheless, an ideal motif-finding tool should consider the variability in TFBS width and in the number of TFBS instances in each given sequence, as well as the presence of more than one motif pattern in the given dataset.
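To give a feel for the simplified problem (fixed width w, exactly one instance per sequence), the sketch below enumerates candidate start positions with a sliding window and counts how many alignments an exhaustive search would have to examine. The toy sequences and the function names `extract_wmers` and `count_alignments` are made up for illustration and do not come from the thesis.

```python
import math

def extract_wmers(sequence, w):
    """Return (start, w-mer) pairs obtained by sliding a window of width w."""
    return [(i, sequence[i:i + w]) for i in range(len(sequence) - w + 1)]

def count_alignments(sequences, w):
    """Number of ways to pick exactly one start position in every sequence."""
    return math.prod(max(len(s) - w + 1, 0) for s in sequences)

# Made-up toy promoters, 10 bp each, so every sequence has 5 candidate 6-mers.
promoters = ["ACGTACGTAC", "TTGACGTCAA", "GGACGTTTCC"]
print(extract_wmers(promoters[0], 6)[:2])
print(count_alignments(promoters, 6))  # 5 * 5 * 5 candidate alignments
```

With realistic inputs the count grows very quickly: ten promoters of 1000 bp each and w = 8 already give 993^10, about 9 × 10^29 candidate alignments, which is why heuristic search strategies are used instead of enumeration.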

1.4. Computational representation of DNA Motifs

Computational methods perform motif search based on a model of the sought pattern, which represents the TFBS instances of interest. Each TFBS is a string composed of the four letters [A, C, G, T]. Since TFBS instances may vary, the model utilized to represent the sought motif should be able to account for the variation. In the literature, two models for motif representation are prominent: the Position Weight Matrix (PWM) and the consensus sequence.

A PWM, also known as a Position Specific Scoring Matrix (PSSM), is a 4 × w matrix of scores that gives a weighted match to any subsequence of width w. A PWM is derived from a Position Frequency Matrix (PFM), which holds the probabilities of how often each base (i.e., A, C, G and T) occurs at each position. When the PFM is converted into a PWM, the background probability of each base is used to calculate log-odds scores:

$$m_{ib} = \log\left(f_{ib} / \Theta_b\right) \qquad (1)$$

where $m_{ib}$ is the PWM element for base b at position i, $f_{ib}$ stands for the corresponding probability value from the PFM and $\Theta_b$ is the background probability of base b (A, C, G or T). The background probability of a specific letter is a part of the background model of the organism whose DNA is analyzed. The background model is characterized by a 3rd order Hidden Markov Model (HMM) of the whole intergenic genome sequence of the organism. The calculation of a PWM from a set of locally aligned TFBS instances is also depicted in Figure 1.3.
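The PFM-to-PWM conversion of Equation (1) can be illustrated as follows. This is a minimal sketch with several assumptions that are not taken from the thesis: the aligned sites are made-up, a pseudocount is added so that no probability is zero, the logarithm is natural, and the background is a flat 0.25 per base instead of the 3rd order model of the organism. All function names are hypothetical.

```python
import math

BASES = "ACGT"

def position_frequency_matrix(sites, pseudocount=1.0):
    """Per-position base probabilities of equally long aligned sites."""
    total = len(sites) + 4 * pseudocount
    return [
        {b: (sum(site[i] == b for site in sites) + pseudocount) / total
         for b in BASES}
        for i in range(len(sites[0]))
    ]

def position_weight_matrix(pfm, background):
    """Log-odds scores m_ib = log(f_ib / Theta_b), as in Equation (1)."""
    return [{b: math.log(col[b] / background[b]) for b in BASES} for col in pfm]

def score(pwm, subsequence):
    """Sum of the PWM entries selected by the letters of the subsequence."""
    return sum(col[b] for col, b in zip(pwm, subsequence))

sites = ["ACGT", "ACGA", "ACGT", "TCGT"]      # made-up aligned instances
uniform = {b: 0.25 for b in BASES}            # assumed flat background
pwm = position_weight_matrix(position_frequency_matrix(sites), uniform)
print(score(pwm, "ACGT") > score(pwm, "TTTT"))
```

A subsequence that resembles the aligned sites collects positive log-odds at most positions, whereas one that resembles the background scores near or below zero.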

Figure 1.3. Generating PFM and PWM from the alignment of subsequences: (a) a set of sequences, (b) PFM, (c) background frequencies, (d) PWM.

Consensus strings, on the other hand, are another way of representing a set of locally aligned TFBS instances. Composing a consensus is not too different from constructing a PWM: the PFM of the alignment still needs to be calculated, and then the most probable letter of each column in the PFM is selected for the corresponding position of the consensus string. At some positions, a specific letter may have a high probability with respect to the others, that is, the letter is strictly required for the motif. At other positions, none of the letters may be dominant enough (e.g., probability < 60%), and in this case degenerate base symbols are utilized. Table 1.1 presents degenerate symbols according to the International Union of Pure and Applied Chemistry (IUPAC) notation.
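One possible way to turn a PFM into a degenerate consensus string is sketched below. The selection rule is an assumption made for illustration, since the text only gives the 60% figure as an example: for each column, the most probable bases are collected until their combined probability reaches the threshold, and the resulting set is mapped to its IUPAC symbol. The PFM values and the function name `consensus` are hypothetical.

```python
# IUPAC symbols for every non-empty set of DNA bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus(pfm, threshold=0.6):
    """Build a consensus string from a list of per-position probability dicts."""
    symbols = []
    for column in pfm:
        chosen, covered = set(), 0.0
        # Add bases from most to least probable until the threshold is reached.
        for base in sorted("ACGT", key=lambda b: -column[b]):
            chosen.add(base)
            covered += column[base]
            if covered >= threshold:
                break
        symbols.append(IUPAC[frozenset(chosen)])
    return "".join(symbols)

# Made-up PFM, one dict per motif position.
pfm = [
    {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
]
print(consensus(pfm))
```

Other rules are equally defensible, for example treating every base above a fixed per-base probability as allowed; the choice changes how many positions end up degenerate.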

Considering the above, the consensus string of the motif example given in Figure 1.3 should be WASGTR. Consensus strings are easier to compose and more readable than PWMs. Also, the consensus representation requires less computational power to process. However, consensus strings are not as accurate and sensitive a means of representing an alignment as PWMs. Therefore, methods in the literature commonly use PWMs for internal computations on a motif and use consensus strings to present human-readable results. Nonetheless, there are plenty of methods, such as that of Stine (2003), that use consensus strings instead of PWMs to computationally store and process motif alignments.

Table 1.1. Degenerate symbols for ambiguous letters

Degenerate Symbol    Description    Ambiguous bases
A                    Adenosine      A
C                    Cytidine       C
G                    Guanosine      G
T                    Thymidine      T
U                    Uridine        U
W                    Weak           A and T
S                    Strong         C and G
M                    Amino          A and C
K                    Keto           G and T
R                    Purine         A and G
Y                    Pyrimidine     C and T
B                    Not A          C, G and T
D                    Not C          A, G and T
H                    Not G          A, C and T
V                    Not T          A, C and G
N                    Any base       A, C, G and T

For presenting a motif in a human-readable way, a sequence logo (Schneider, 1990) is actually a more convenient way than a consensus string to present and visualize patterns sought in DNA sequences. A sequence logo can be generated from a set of aligned sequences, in our case, a set of aligned TFBS instances. The residues at a specific position are graphically represented as a stack of letters in which the height of each letter is proportional to the frequency of the base. Therefore, the most conserved parts can be distinguished visually. Figure 1.4 depicts the sequence logo corresponding to the motif sample given in Figure 1.3.
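The letter heights of a logo can be computed directly from a PFM column. The sketch below assumes the common convention in which the total stack height at a position equals its information content in bits (2 minus the Shannon entropy for DNA) and each letter takes a share of the stack proportional to its frequency; it ignores small-sample correction and assumes a uniform background. The function names are hypothetical, and this is not necessarily how WebLogo is configured in the thesis.

```python
import math

def column_information(column):
    """Information content in bits of one PFM column (dict base -> probability)."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA

def letter_heights(column):
    """Height of each letter: its frequency times the column information."""
    info = column_information(column)
    return {base: p * info for base, p in column.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
ambiguous = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
print(letter_heights(conserved))
print(letter_heights(ambiguous))
```

A fully conserved position yields a single letter two bits tall, while a position split evenly between two bases yields two letters of half a bit each, which is what makes conserved positions stand out in the logo.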


Figure 1.4. A sample sequence logo

Throughout this study, motif discovery results are always given in quantitative metrics and also as sequence logos where appropriate. The free, public sequence logo generator WebLogo (Crooks et al., 2004) is utilized to generate the sequence logos.

1.5. Motivation and Goal of the Thesis

As mentioned in Section 1.2, discovery of TFBSs via computational tools is advantageous and preferable to labor-intensive and expensive laboratory experiments. Thus, researchers from the fields of computer science, mathematics and statistics have studied several methods and proposed various computational tools to identify regulatory regions solely via computational methods (see Section 2 for a more detailed literature review). According to comparative studies (Tompa et al., 2005; Das and Dai, 2007; Sandve and Drablos, 2006) that have evaluated these methods, some tools were observed to be superior to the others under some specific conditions (e.g., for short motifs or for datasets extracted from DNA sequences of lower organisms), while under other conditions the opposite could hold. Nonetheless, regardless of search strategy and motif representation, no single algorithm is reported in these comparative studies to be sufficient for predicting optimal motifs under every condition. Thus, research on computational methods for TFBS identification remains an essential and challenging task.

The goal of this thesis is to develop computational tools based on data mining methods, including clustering and Particle Swarm Optimization (PSO). The study also evaluates the developed methods against state-of-the-art bioinformatics tools for DNA motif discovery.


In this study, four clustering algorithms are considered: Fuzzy C-Means (FCM), Self-Organizing Map (SOM), K-Means, and Expectation Maximization (EM) with Gaussian Mixture Models (GMM). The original clustering algorithms and the modifications required to adapt them to motif finding are explained in Section 3. In addition to the clustering approach, PSO is also considered for identifying TFBS patterns; Section 3 describes PSO and the modifications made to the original algorithm to fit our needs. Section 4 then reports the performance of these algorithms using several quantitative and visual measures.


2. RELATED WORKS

The post-genomic era has seen remarkable growth in computational tools for processing the high-volume data now available. One major goal of these tools is to support the understanding of gene regulatory networks, so the automatic identification of TFBSs by computational methods has been of great interest to researchers. Proposed methods usually extract motifs of the sought TFBSs in one of two ways: (a) from a set of promoter sequences of co-regulated genes, or (b) from a set of orthologous promoter sequences of a single gene in different species, i.e., by phylogenetic footprinting (Das and Dai, 2007). This study focuses on the first type, as the methods developed here use only co-regulated gene sequences.

Early studies of motif finding in promoter sequences of co-regulated genes are generally based on probabilistic frameworks such as EM and Gibbs sampling. By the mid-1990s, shortly after one of the first probabilistic motif-finding techniques had been proposed by Hertz et al. (1990), two methods came into prominence: MEME (Multiple Expectation Maximization for Motif Elicitation) (Bailey and Elkan, 1995) and the Gibbs Sampler (Lawrence et al., 1993). MEME has been a cornerstone of motif finding because it introduced features that had not been implemented before. First, it was not limited to “one-instance-per-sequence” motifs; it could also find motifs whose instances are not distributed equally among the given sequences. Second, it erased found instances in a probabilistic manner, which enabled it to find more than one motif candidate in the given sequences. For the Gibbs sampling strategy, extensions to the original Gibbs Sampler of Lawrence et al. (1993) have been proposed to remove its drawbacks; examples are AlignACE (Roth et al., 1998) and BioProspector (Liu et al., 2001).

In addition to probabilistic methods, algorithms based on exhaustive word enumeration have also been considered. The first implementation of this strategy is due to Van Helden et al. (1998). Their simple methodology of searching by enumerating all possible motif locations was later improved by Tompa (1999) and by Sinha and Tompa (2000). Weeder (Pavesi et al., 2001) and WINNOWER (Pevzner and Sze, 2000) are improved word-based methods that incorporate other approaches such as suffix trees and graphs. MDScan (Liu et al., 2002) is a hybrid method that combines a probabilistic approach with word enumeration. When word-enumerative methods are compared with probabilistic methods (Das and Dai, 2007; Hu et al., 2005), the former are observed to be superior for shorter motifs of up to 5-10 nucleotides. Probabilistic methods, in contrast, scale well to longer motifs and larger datasets, although they do not guarantee globally optimal results.

A more recent trend in the motif-finding literature, compared with probabilistic and word-enumerative methods, is the use of machine learning algorithms. The self-organizing map is a good example. In two separate studies (Liu et al., 2006; Mahony et al., 2005), SOM alone is used to find TFBS instances in co-regulated gene sequences of the yeast Saccharomyces cerevisiae. Liu et al. considered a SOM structure composed of several layers, each of which classifies the given inputs; at the output layer, inputs are classified as “motif” or “non-motif”. The work of Mahony et al. also uses SOM, but its approach is different. In their method, called SOMBRERO, the given sequences are broken into subsequences of a specific length, which are clustered into an appropriate number of PWMs. Each cluster is then tested statistically to see whether its distribution of bases differs from the background probabilities: a z-score is calculated for each PWM, and the PWMs with high z-scores are considered motif candidates. Both SOM-based studies differ from the earlier probabilistic methods in that they treat each subsequence independently of the promoter sequence it belongs to. Any number of motif instances per sequence is therefore allowed, which may also lead to a high number of false-positive predictions. As an additional shortcoming, neither method supports searching for motifs of variable length.


Recently, evolutionary algorithms and stochastic search procedures have also been considered. Genetic algorithms (GA) and PSO are the most prominent in this category. GAME (Wei and Jensen, 2006) and GALF-P (Chan et al., 2008) are two recent GA-based de novo motif discovery methods. In GAME, one location in each given sequence is presumed to be a TFBS starting point, and these locations are optimized against a Bayesian fitness function with GA operators such as crossover and mutation. In a post-processing phase, the method attempts to find additional TFBS instances so that more than one instance per sequence is allowed. GALF-P is similar to GAME but combines consensus-led search with PWM optimization. Both methods are reported to be superior to MEME and MDScan in experiments on real and synthetic datasets.

PSO is a proven algorithm in many fields, yet PSO-based motif-finding methods are rarer than GA-based ones. The main reason appears to be that PSO is designed for continuous domains, whereas in motif discovery the search domain is a discrete space of DNA sequences composed of the letters A, C, G and T. Nonetheless, the study of Lei and Ruan (2010) is an example of a PSO-based method in the computational motif discovery literature. In their study, all possible w-mers (i.e., subsequences of length w) are extracted from the given sequences, and a “word dissimilarity graph” holding the dissimilarity scores of the w-mers to each other is constructed in order to convert the discrete domain into a “semi-continuous” one. Within this new domain, each particle keeps track of a vector of locations, one in each given sequence, which together form a consensus sequence. The fitness function is based on the number of mismatches. In another PSO-based study (Hardin and Rouchka, 2005), a hybrid of PSO and EM was proposed in which PSO is used only to seed the EM algorithm that detects motifs in regulatory regions. The HPSO algorithm (Zhou et al., 2005) is likewise a hybrid motif discovery algorithm, in which PSO is supported by GA features such as the recombination operator.

Although several methodologies have been considered, TFBS identification remains a challenging task, especially for higher organisms such as metazoans (Tompa et al., 2005). Some researchers have therefore combined known and proven algorithms into ensembles to improve prediction accuracy. Hu et al. (2006) developed the EMD algorithm, which takes advantage of five proven motif-finding methods: AlignACE, BioProspector, MDScan, MEME and MotifSampler (Thijs et al., 2001). According to the authors’ report, the performance of the ensemble is always superior or at least equal to that of the five underlying algorithms (Das and Dai, 2007). Other ensemble-based motif finders include SCOPE (Carlson et al., 2007), BEST (Che et al., 2005), TAMO (Gordon et al., 2005) and MotifVoter (Wijaya et al., 2008).


3. MATERIAL AND METHODS

3.1. Datasets Utilized In the Study

In this thesis, four groups of datasets are used to evaluate the performance of both the developed methods and methods from the literature. Only one group was produced by extracting sequences from the genome of the relevant organism; the rest were taken from the publicly available online material of other studies.

The first group consists of data extracted from the genome of the yeast Saccharomyces cerevisiae. The whole genome with gene annotations is available online at the “Saccharomyces Genome Database” website (SGD Project, 2008). Each dataset consists of promoter sequences for one of five different genes: GAL4, GCN4, CBF1, RFX1 and HSF1. The promoter sequences include TFBS instances that have been experimentally validated as playing a regulatory role in the expression of these genes. The characteristics of the first group are given in Table 3.1. The yeast datasets include both small and large datasets, which together contain short and long motifs (7-17 nucleotides) with different numbers of motif instances.

Table 3.1. Saccharomyces cerevisiae datasets

Dataset  Species       Motif length  Number of instances  Number of sequences  Dataset size (nucleotides)
GAL4     S.cerevisiae  17            10                   8                    1647
GCN4     S.cerevisiae  7             62                   60                   11640
CBF1     S.cerevisiae  7             65                   54                   12159
RFX1     S.cerevisiae  13            8                    6                    5922
HSF1     S.cerevisiae  13            27                   27                   5049

In addition to the yeast datasets, all of which belong to the same species, a second group of datasets is used to evaluate the proposed methods on data from different species. This group comprises eight datasets from the genomes of four species: Saccharomyces cerevisiae (yeast), Escherichia coli (E.coli), Drosophila melanogaster (fly) and Homo sapiens (human). The yeast promoters are again extracted from the “Saccharomyces Genome Database”, and the E.coli data are taken from “The EcoGene Database of Escherichia coli Sequence and Function” website (ECOGENE, 2009). The fly and human datasets were first considered by Tompa et al. (2005) and are taken from the public website of their study. Table 3.2 lists the characteristics of the second group of datasets.

Table 3.2. The second group of datasets that consists of different species

Dataset  Species  Motif length  Number of TFBSs  Number of sequences  Dataset size (nucleotides)
GCN4     Yeast    7             10               10                   1831
RFX1     Yeast    13            8                6                    5465
ARGR1    E.coli   18            17               11                   1681
LEXA     E.coli   20            8                8                    4715
DM05     Fly      12            14               7                    7466
DM06     Fly      14            7                5                    3792
HM10     Human    8             11               10                   2949
HM17     Human    16            10               8                    5328

The third group is structurally similar to the second, as it also includes eight datasets from several species. However, some datasets in this group consist mostly of mammalian promoter sequences, which distinguishes it from the second group.

Table 3.3. The third group of datasets

Dataset  Species    Motif length  Number of TFBSs  Number of sequences  Dataset size (nucleotides)
CREB     HMR        8             19               17                   3544
CRP      E.coli     22            24               18                   1512
E2F      Mammalian  11            27               25                   4750
ERE      Mammalian  13            25               25                   4700
MEF2     HMR        7             17               17                   3293
MYOD     HMR        6             21               17                   3315
SRF      HMR        10            35               20                   4127
TBP      HMR        6             93               95                   18525

These eight datasets were studied previously in the literature and then compiled and used by Wei and Jensen (2006) in their study. Only CRP is a gene from prokaryotic DNA; the others are all eukaryotic. CREB, MEF2, MYOD, SRF and TBP were extracted from the ABS database of annotated regulatory binding sites (Blanco et al., 2006) and include promoter sequences of genes from human, mouse and rat (HMR). The other eukaryotic datasets (E2F and ERE) were extracted from various mammalian species that are not specified in the relevant paper (Frith et al., 2004). Table 3.3 presents the details of these datasets.

The first three groups are all real datasets extracted from different organisms. In addition to these, synthetic datasets are also used. The fourth group consists of synthetic datasets taken from the public website of the study by Chan et al. (2008). A total of 800 synthetic datasets were generated to cover 8 scenarios that mimic real biological conditions, with 100 datasets per scenario. The scenarios are produced by varying three parameters: the motif width (short or long), the dataset length (small or large) and the motif conservation ratio (high or low). Table 3.4 summarizes the properties of the synthetic datasets.

Table 3.4. Properties of synthetic datasets for different scenarios

Artificial scenario                        Dataset properties
Motif width  Conservation  Dataset length  Motif length  Number of sequences  Dataset size (nucleotides)
Short        High          Small           8             20                   20000
Short        Low           Small           8             20                   20000
Short        High          Large           8             60                   60000
Short        Low           Large           8             60                   60000
Long         High          Small           16            20                   20000
Long         Low           Small           16            20                   20000
Long         High          Large           16            60                   60000
Long         Low           Large           16            60                   60000

For short motifs $w = 8$, whereas for long motifs $w = 16$. The small datasets contain 20 sequences, each 1000 nucleotides long, whereas the large datasets contain 60 sequences. In the high-conservation scenario, the motif pattern is intended to be clearly distinguishable: each position in the motif has a dominant nucleotide with a frequency of 91%. In the low-conservation scenario, which is designed to be noisy, this frequency is reduced to 55%. In each synthetic dataset, a sequence has a 10% probability of containing either no motif instance or more than one instance, with up to approximately 5-6 additional instances.

3.2. Methods

3.2.1. Fuzzy C-Means

Fuzzy C-Means is a popular clustering algorithm that has been applied to problems in computational biology (Yaochu and Lipo, 2009), medicine (Zhou et al., 2009) and data mining (Joisen et al., 2002). It is an iterative technique in which a desired number of cluster centers, of the same dimensionality as the inputs, are fitted to the given input data, which in the general case is a set of high-dimensional vectors. The algorithm repeats the following steps until a satisfactory objective is reached:

I. Calculating a membership value for each input and allowing the inputs to be

a member of multiple clusters,

II. Updating cluster centers with the given vectors by using the membership

values as a weighting factor.

More formally, let $X = \{x_1, x_2, \ldots, x_N\}$ be the set of given input vectors and let $C = \{c_1, c_2, \ldots, c_M\}$ represent the cluster centers. The membership of each input $x_j$ to each cluster $c_i$ is calculated and stored in a membership matrix $U$ of size $N \times M$:

$$u_{ij} = \frac{\left(\dfrac{1}{d^{2}(x_j, c_i)}\right)^{1/(q-1)}}{\sum_{k=1}^{M} \left(\dfrac{1}{d^{2}(x_j, c_k)}\right)^{1/(q-1)}} \tag{2}$$

where $d$ is the distance between the input and the cluster center and $q$ denotes the fuzziness value. In a multidimensional space the distance function is usually Euclidean; nonetheless, any other distance function appropriate to the input space can be used. The fuzziness value $q$ directly affects the performance of the FCM algorithm and must be chosen carefully. Because the exponent $1/(q-1)$ in Equation 2 requires $q \neq 1$, $q$ is in practice taken as a real number greater than 1, and in most applications no larger than 2. Note also that the total membership of an individual input vector is constrained by:

$$\sum_{i=1}^{M} u_{ij} = 1, \quad \text{where } j \in \{1, \ldots, N\} \tag{3}$$

Once the membership matrix is constructed, the cluster centers, which are

vectors in the same space, should be updated so that they are moved towards the

inputs. The update of a specific cluster takes into account the whole set of inputs:

$$c_i = \frac{\sum_{j=1}^{N} (u_{ij})^{q}\, x_j}{\sum_{j=1}^{N} (u_{ij})^{q}} \tag{4}$$

The algorithm terminates when the objective is reached, e.g., when the total sum of distances of all inputs to all clusters (i.e., the total dissimilarity) falls below a threshold value $\phi$. Alternatively, it can be terminated after a predetermined maximum number of iterations. For certain input spaces, such as the DNA sequences used in this study, it is difficult to define an objective function because of the complexity of the data type. The termination criterion is therefore taken to be the maximum number of iterations.
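The basic iteration of Equations 2 to 4 can be sketched in plain Python for numeric vectors. This is only an illustrative sketch of standard FCM with Euclidean distance and a fixed iteration count, not the modified method of this study; the function name `fcm`, the explicit initial centers and the small floor on the squared distance are assumptions made for the example.

```python
def fcm(inputs, centers, q=2.0, n_iter=50):
    """Plain fuzzy c-means on numeric vectors with Euclidean distance.

    inputs  : list of equal-length numeric tuples
    centers : initial cluster centers (assumed to be supplied by the caller)
    Returns the final centers and the memberships u[j][i] (input j, cluster i).
    """
    centers = [list(c) for c in centers]
    dim = len(inputs[0])
    u = []
    for _ in range(n_iter):
        # Equation 2: membership of every input to every cluster.
        u = []
        for x in inputs:
            # Squared distances, floored to avoid division by zero (assumption).
            d2 = [max(sum((a - b) ** 2 for a, b in zip(x, c)), 1e-12) for c in centers]
            inv = [(1.0 / d) ** (1.0 / (q - 1.0)) for d in d2]
            total = sum(inv)
            u.append([v / total for v in inv])  # Equation 3: each row sums to 1
        # Equation 4: each center becomes the membership-weighted mean of all inputs.
        for i in range(len(centers)):
            weights = [row[i] ** q for row in u]
            norm = sum(weights)
            centers[i] = [sum(wj * x[k] for wj, x in zip(weights, inputs)) / norm
                          for k in range(dim)]
    return centers, u

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
final_centers, memberships = fcm(points, centers=[points[0], points[3]])
print(final_centers)
```

Passing the initial centers explicitly keeps the sketch deterministic; a random initialization is equally possible.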

The basic FCM algorithm explained above can be adapted to DNA motif discovery, which can be viewed as searching for an unknown number of w-length subsequences in an N-length DNA sequence. All w-length subsequences of the given DNA sequence are therefore extracted so that they can be analyzed statistically for the probability of being a transcription factor binding site instance. This probability is inversely related to how likely the nucleotide frequencies of the subsequence are under the background model. The FCM-based method proposed in this study handles these issues in two steps:

a) Clustering the subsequences into a certain number of clusters, i.e., PWMs,

b) Testing each PWM to see whether it is statistically interesting.

The method requires the length of the sought motif to be known prior to the analysis, so the only unknowns are the starting positions of the motif instances. The algorithm considers every starting position from the beginning to the end of the DNA sequence by using the sliding-window technique: a window of a fixed length is shifted one character at a time along the DNA sequence to obtain the next window. If the sought motif has length w and the DNA sequence has length N, the sliding-window process produces N-w+1 subsequences, which form the inputs to the algorithm. Figure 3.1 depicts the technique.
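A minimal Python sketch of the sliding-window extraction is given here for illustration; the function name `sliding_windows` and the toy sequence are hypothetical. Because both the first and the last valid start position are included, a sequence of length N yields N - w + 1 windows.

```python
def sliding_windows(sequence: str, w: int):
    """Return every w-length subsequence together with its start position."""
    return [(start, sequence[start:start + w])
            for start in range(len(sequence) - w + 1)]

windows = sliding_windows("ACGTACGTAC", 4)  # toy sequence, N = 10, w = 4
print(len(windows))   # 7
print(windows[:3])    # [(0, 'ACGT'), (1, 'CGTA'), (2, 'GTAC')]
```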

Figure 3.1. Extraction of subsequences by using the sliding-windows technique

Once the inputs have been extracted with this technique, FCM is applied to cluster them. In this case, each cluster center is a PWM of size $4 \times w$. Applying FCM by repeatedly evaluating Equations 2 and 4 requires quantitative comparisons between each subsequence and the cluster centers, that is, between a string and a PWM. Such comparisons need a distance function, suited to the input space, that measures the dissimilarity between DNA subsequences and PWMs. The function D(x,m) given in Equation 5 is used for this purpose:


$$D(x, m) = \frac{1}{\sum_{i=1}^{w} \sum_{c \in \{A,C,G,T\}} e(x_i, m_{i,c})\, m_{i,c}}, \qquad e(x_i, m_{i,c}) = \begin{cases} 1, & x_i = c \\ 0, & x_i \neq c \end{cases} \tag{5}$$

where $m$ is the PWM of size $4 \times w$ and $x$ is a DNA sequence of length $w$. The PWMs are initialized randomly and updated at each iteration of the clustering process. Once the distances between the subsequences and the PWMs have been calculated, the PWMs can be updated according to Equation 4. Because the PWMs and the subsequences are different types of data, applying Equation 4 requires the input $x$ to be replaced with the function R(x, c), which is defined as:

$$R(x, c) = \sum_{i=1}^{l} \sum_{c \in \{A,C,G,T\}} eq(x_i, c), \qquad eq(c_1, c_2) = \begin{cases} 1, & c_1 = c_2 \\ 0, & c_1 \neq c_2 \end{cases} \tag{6}$$

where $c_1$ and $c_2$ are members of the set $\{A, C, G, T\}$. After these calculations, the PWMs are updated with Equation 1, which is given in Section 1.4. Note that applying Equation 1 requires an organism-specific background model, which in our case is a 3rd-order Hidden Markov Model of the sequences.
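The string-to-PWM comparison and the indicator encoding can be illustrated with a short Python sketch. It assumes that a PWM is stored as a list of per-position dictionaries whose selected entries sum to a positive number; the names `distance`, `one_hot` and `weighted_counts` are hypothetical. The conversion of the accumulated counts into a PWM with Equation 1 and the background model is not shown.

```python
BASES = "ACGT"

def distance(x: str, pwm) -> float:
    """Equation 5: inverse of the summed PWM entries selected by the bases of x.

    pwm[i][b] is the entry for base b at position i.
    Assumption: the selected entries sum to a positive number.
    """
    score = sum(pwm[i][base] for i, base in enumerate(x))
    return 1.0 / score

def one_hot(x: str):
    """Equation 6 read as an indicator encoding: one row of 0/1 values per position."""
    return [{b: 1.0 if b == base else 0.0 for b in BASES} for base in x]

def weighted_counts(windows, weights):
    """Accumulate indicator-encoded windows with per-window weights.

    This corresponds to the numerator of Equation 4 with x replaced by R(x, c).
    """
    counts = [{b: 0.0 for b in BASES} for _ in range(len(windows[0]))]
    for x, weight in zip(windows, weights):
        for i, row in enumerate(one_hot(x)):
            for b in BASES:
                counts[i][b] += weight * row[b]
    return counts

# Toy PWM of width 2 with made-up entries.
pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
print(distance("AC", pwm), distance("GT", pwm))  # the better match has the smaller distance
```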

A major feature of the FCM algorithm is that it considers all inputs when updating the clusters, with the membership matrix acting as a weighting factor. For the motif discovery task, however, the experiments of this study have shown that updating each PWM with only a selected group of subsequences, e.g., those whose membership values exceed a certain threshold, gives better predictive performance than updating with all subsequences. The entries in each row of the membership matrix U are therefore sorted, and a certain number of values are taken from the top of the sorted list. More formally, the x values in Equation 4 are replaced with the Sel(x) function:


$$Sel(x_j) = \begin{cases} u_{ij}, & u_{ij} \in top(\max(u_{ij}), z) \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $u_{ij}$ is the membership value of $x_j$ to the cluster $c_i$ and $z$ is the number of top subsequences to be considered. In the experiments on the datasets of this study, heuristically setting $z$ to a value between 10 and 40 improves the performance of the algorithm regardless of the dataset length. The main reason is that the number of transcription factor binding site instances in these datasets is close to the proposed value of $z$. The selection can thus be thought of as a weighting factor that strengthens the influence of the binding site instances on the clustering. The membership calculations (Equations 2 and 5) and the PWM updates (Equations 4, 6 and 7) are repeated for a fixed number of iterations. Once this training phase is completed, the inputs are presented to the clusters one more time for a hard clustering, and the final PWMs are calculated from this assignment with the PWM update described earlier. This extra cycle of hard clustering ensures that the clusters are in their best state to represent the subsequences.
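The top-z selection of Equation 7 can be sketched as follows for the memberships of a single cluster. The function name `select_top` and the toy values are hypothetical; ties are broken by input order, which is an assumption of the sketch.

```python
def select_top(memberships, z):
    """Equation 7: keep the z largest memberships of one cluster, zero the rest.

    memberships[j] is u_ij for a fixed cluster i and input j.
    """
    order = sorted(range(len(memberships)),
                   key=lambda j: memberships[j], reverse=True)
    keep = set(order[:z])
    return [u if j in keep else 0.0 for j, u in enumerate(memberships)]

print(select_top([0.10, 0.80, 0.05, 0.60, 0.30], z=2))  # [0.0, 0.8, 0.0, 0.6, 0.0]
```

The returned values can then be used as the weights of the PWM update in place of the full row of memberships.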

Once the clustering is finished, a selection mechanism is needed to reveal the PWMs that potentially represent transcription factor binding site instances among the many PWMs that carry insignificant information content. As mentioned before, the statistical significance of a PWM increases with its unlikelihood under the background model. To filter out uninteresting PWMs and mark those that may contain motif instances, the PWMs are ranked by their z-scores, and the top PWMs of the sorted list are considered potential motifs. The z-score of a specific PWM, i.e., cluster, is calculated with Equation 8:

$$z\text{-}score = \frac{O - E}{\sigma} \tag{8}$$


where $O$ is the number of subsequences hard clustered to the PWM, $E$ is the number of subsequences expected to fall into the same cluster by chance and $\sigma$ is the standard deviation of that chance count. As mentioned above, once the training phase is completed, the subsequences are presented to the clusters one more time for a hard clustering, which yields the value of $O$ for each cluster. The parameters $E$ and $\sigma$ are calculated with the following steps (Mahony et al., 2005):

a) Artificial sequences of the same length as the given DNA sequence are generated from the background model.

b) The artificial sequences are presented to the clustering scheme in the same way as the real sequence, i.e., with the sliding-window technique.

c) The artificial windows are hard clustered, so that the number of subsequences associated with a cluster gives the value of $E$ for that cluster.

d) To reduce the effect of chance, steps (a), (b) and (c) are repeated a certain number of times (T), and the average value of $E$ is taken for each cluster.

The standard deviation is calculated by using the parameters E and T. Once the z-score of each PWM has been calculated, the PWMs are sorted by this score to determine the motifs. The PWMs with the highest z-scores are the most probable motif candidates predicted by the algorithm (e.g., the 10 highest-scoring PWMs represent the predictions).
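The ranking step can be sketched in Python as follows. The sketch assumes that the hard-clustering counts for the real data and for T artificial runs are already available, and it takes the standard deviation as the population standard deviation of the T artificial counts; the text does not spell out this formula, so that choice, like the name `z_scores` and the toy numbers, is an assumption.

```python
import statistics

def z_scores(observed, background_counts):
    """Equation 8 per cluster: (O - E) / sigma.

    observed[i]             : O, windows hard clustered to PWM i in the real data
    background_counts[t][i] : count for PWM i in the t-th artificial run (T runs)
    """
    scores = []
    for i, o in enumerate(observed):
        runs = [counts[i] for counts in background_counts]
        e = statistics.mean(runs)        # E: average count over the T runs
        sigma = statistics.pstdev(runs)  # assumed estimator of the chance spread
        if sigma > 0:
            scores.append((o - e) / sigma)
        else:
            # Degenerate case with no spread: treat any excess as maximally surprising.
            scores.append(float("inf") if o > e else 0.0)
    return scores

observed = [30, 10]                           # toy counts for two PWMs
background = [[10, 10], [12, 9], [8, 11]]     # T = 3 artificial runs
scores = z_scores(observed, background)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```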

3.2.2. Expectation Maximization with Gaussian Mixture Models

The Expectation Maximization algorithm, first introduced by Dempster et al. (1977), has become a very popular way of estimating the parameters of a statistical model from incomplete data. The main idea is to start with a guess of the unknown parameters and then repeat two steps:

a) E (Expectation) step, in which the expected values of the unknown parameters are estimated from the known values,

b) M (Maximization) step, in which the hidden parameters are re-estimated to maximize the likelihood of the data.

The Gaussian Mixture Model is one of the mathematical models frequently used with the EM algorithm to estimate unknown parameter values in various problems. The combination of GMM and EM is widely used for data clustering in machine learning (McLachlan and Krishnan, 1997) and bioinformatics (Do and Batzoglou, 2008).

Briefly, the EM algorithm for separating given data vectors $X = \{x_1, x_2, \ldots, x_N\}$ into clusters $C = \{c_1, c_2, \ldots, c_M\}$, that is, for fitting an M-component GMM to the given X, can be characterized as:

1) E-Step:

$$p^{t}(m \mid n) = \frac{w_m^{t}\, f(x_n, m_m^{t}, \sigma_m^{t})}{\sum_{k=1}^{M} w_k^{t}\, f(x_n, m_k^{t}, \sigma_k^{t})} \tag{9}$$

where the function $f(x_n, m_m^{t}, \sigma_m^{t})$ is the Probability Density Function (PDF) of a given input vector $x_n$ over the cluster $c_m$ at iteration $t$, and $w_m^{t}$ denotes the weight of that cluster within the M-component mixture under the constraint $\sum_{k=1}^{M} w_k^{t} = 1$. The PDF value in effect gives the membership value of a specific input to a specific cluster. The PDF is calculated as given in Equation 10.

$$f(x, m, \sigma) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{D}}\; e^{-\frac{1}{2}\left(\frac{\lVert x - m \rVert}{\sigma}\right)^{2}} \tag{10}$$

In Equation 10, $D$ is the number of dimensions of the input data and $\lVert \cdot \rVert$ is a distance function, e.g., Euclidean. Both Equation 9 and Equation 10 require the value of $m$, the mean vector of a specific cluster, and of $\sigma$, which stands for the variance of the cluster. Note that for a higher-dimensional input space, such as 2-D or 3-D, the symbol $\sigma$ is generally replaced with the symbol $\Sigma$, which stands for the covariance matrix holding the variances and covariances between dimensions.

2) M-Step:

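$$m_m^{t+1} = \frac{\sum_{n=1}^{N} p^{t}(m \mid n)\, x_n}{\sum_{n=1}^{N} p^{t}(m \mid n)} \tag{11}$$

$$\sigma_m^{t+1} = \frac{\sum_{n=1}^{N} p^{t}(m \mid n)\, \lVert x_n - m_m^{t+1} \rVert^{2}}{\sum_{n=1}^{N} p^{t}(m \mid n)} \tag{12}$$

$$w_m^{t+1} = \frac{\sum_{n=1}^{N} p^{t}(m \mid n)}{N} \tag{13}$$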

In Equations 11, 12 and 13, the means, variances and mixing weights of each cluster are updated, respectively, using the membership values calculated in Equation 9. The E and M steps are repeated in alternation until the algorithm terminates. The usual termination criteria are reaching a satisfactory objective value or executing a maximum number of iterations. In this study, as with FCM, the algorithm is run for a fixed number of iterations.
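The E and M steps of Equations 9-13 can be illustrated for ordinary numeric inputs with a minimal sketch. It is not the implementation used in this study. The function names, the `min_sigma` floor and the reading of Equation 12 as a squared scale (so that the scale used in Equation 10 is `sqrt(var / D)`) are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Isotropic Gaussian density of Equation 10 for a D-dimensional point."""
    dim = len(x)
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (math.sqrt(2.0 * math.pi) * sigma) ** dim
    return math.exp(-0.5 * dist_sq / sigma ** 2) / norm


def em_gmm(data, means, sigmas, weights, iterations=20, min_sigma=1e-3):
    """Run a fixed number of E/M iterations and return the fitted parameters."""
    means = [list(m) for m in means]
    sigmas, weights = list(sigmas), list(weights)
    n, dim = len(data), len(data[0])
    for _ in range(iterations):
        # E-step (Equation 9): membership p(m | n) of every input in every cluster.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)]
            total = sum(joint)
            if total == 0.0:  # every density underflowed: fall back to the weights
                joint, total = list(weights), sum(weights)
            resp.append([j / total for j in joint])
        # M-step (Equations 11-13): means, variances and mixing weights.
        for c in range(len(means)):
            r_sum = sum(resp[i][c] for i in range(n))
            weights[c] = r_sum / n
            if r_sum == 0.0:
                continue  # empty cluster: keep its previous mean and scale
            means[c] = [sum(resp[i][c] * data[i][k] for i in range(n)) / r_sum
                        for k in range(dim)]
            var = sum(resp[i][c] * sum((a - b) ** 2 for a, b in zip(data[i], means[c]))
                      for i in range(n)) / r_sum
            # Assumption: Equation 12 is a squared scale, so sigma = sqrt(var / D).
            sigmas[c] = max(math.sqrt(var / dim), min_sigma)
    return means, sigmas, weights
```

Running the loop for a fixed number of iterations mirrors the termination rule stated above; a tolerance on the change in the objective value could be used instead.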

Similar to FCM, EM/GMM is capable of clustering w-mers for motif finding purposes. Since the original EM/GMM approach given above (Equations 9-13) is designed for a high-dimensional numeric space, some modifications are needed, as was the case for FCM. The goal is to cluster a set of subsequences into k clusters, which amounts to fitting k PWMs to the inputs, where each PWM corresponds to a local alignment of some subsequences. Unlike in FCM, each PWM (denoted by $m$) is a $4 \times w$ matrix characterized by a variant of Equation 1 without the log function, $m_{ib} = f_{ib} / \Theta_b$. The log is removed from Equation 1 because each column of the PWM should be normalized when the GMM calculations are performed to obtain distances between PWMs and subsequences.

In the E step, in which the PDF calculations are done, a specific distance function is needed for the motif discovery problem. The function $D(x, m)$, given in Equation 14, is such a distance function; it replaces every occurrence of $\|x - m\|$ in Equation 10 and Equation 12. This replacement is needed because inputs and clusters do not have the same dimensionality, so arithmetic operations between them are not defined. The function $D(x, m)$, where $x$ is a string of length $w$ and $m$ is a $4 \times w$ matrix, is as follows:

$$D(x, m) = 1.0 - \frac{\sum_{i=1}^{w} S(x_i, m_i)}{w} \tag{14}$$

where $S(x_i, m_i)$ gives the similarity score for a single position within the string:

$$S(x_i, m_i) = \sum_{b \in \{A,C,G,T\}} eq(x_i, m_{ib})\, m_{ib}, \qquad eq(x_i, m_{ib}) = \begin{cases} 1, & x_i = b \\ 0, & x_i \neq b \end{cases} \tag{15}$$

The function $S(x_i, m_i)$ gives the likelihood ratio of a given sequence to the PWM. As mentioned before, the values in each column of both the PWM and the probability matrix are restricted to sum to one, $\sum_{b \in \{A,C,G,T\}} m_{ib} = 1$.
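As an illustration of Equations 14 and 15, the distance between a w-mer and a column-normalized PWM might be computed as in the sketch below. The PWM is represented as a list of per-position dictionaries; this representation and the function names are assumptions made for the example.

```python
NUCLEOTIDES = "ACGT"


def position_score(letter, column):
    """S(x_i, m_i) of Equation 15: the column entry selected by the nucleotide."""
    return sum(column[b] for b in NUCLEOTIDES if b == letter)


def distance(wmer, pwm):
    """D(x, m) of Equation 14: one minus the mean per-position score."""
    width = len(wmer)
    total = sum(position_score(letter, column) for letter, column in zip(wmer, pwm))
    return 1.0 - total / width
```

Because every column sums to 1, the distance stays within the interval from 0 (every position hits a column entry of 1) to 1 (every position hits a column entry of 0).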

After the PDF value for each input and cluster is calculated, the M step is executed to update the variance, mean and weight of each cluster center. Each cluster can be seen as a node that holds a mean (i.e., a PWM), a variance and a mixing weight. Some supporting variables, such as the probability matrix, are also stored within a node. In the M step, each PWM is updated with the corresponding PDF values of each input.

$$p_{ib} = \sum_{n=1}^{N} eq(i, x_n^{b})\, x_n^{b}\, f(x_n, m, \sigma), \qquad eq(i, x_n^{b}) = \begin{cases} 1, & i = x_n^{b} \\ 0, & i \neq x_n^{b} \end{cases} \tag{16}$$

Equation 16 shows the calculation of a specific element of the probability matrix. After the calculations are done, the probability matrix is normalized so that each column sums exactly to 1. Afterwards, the calculation of the PWM is straightforward, as shown in Equations 11-13. In the standard GMM approach, every input updates every PWM, with the corresponding PDF value acting as a weighting factor. However, in the experiments of this study on the motif discovery problem, updating each PWM with a selection of subsequences rather than the complete set resulted in better performance. Thus, the term $x_n$ in Equation 16 is replaced with $Sel(x_n)$:

$$Sel(x_n) = \begin{cases} a, & a \in top(z, \min(a)) \\ 0, & \text{else} \end{cases} \tag{17}$$

where $f$ is the PDF value, $a$ stands for the $f(x_n)$ value of each $x$, and $z$ is the number of top-ranked subsequences used to update the PWM. For the genomic datasets utilized in this study, setting $z$ to the number of motif instances produced the best results. Thus, supplying an expected number of motif instances and performing the update with this parameter improves the performance of the proposed method. Note that the $Sel(x_n)$ transformation is an optional step and may be omitted. When $Sel(x_n)$ is used, however, the user must provide an expected number of motif instances in order to run the algorithm. It should be noted that the favorable contribution of Equation 17 is justified empirically and only for the motif discovery task. That is, applying Equation 17 may fail for tasks in which the goal is the optimum clustering of vectors in a high-dimensional space. Nonetheless, the experiments have shown that it enhances motif discovery performance. Since Equation 17 modifies the original GMM, the resulting model may be considered a GMM variant.
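The selection of Equation 17 and the weighted, column-normalized probability matrix described above can be sketched as follows. This is one possible reading, in which only the z inputs with the highest PDF values contribute to the update; the function names and the uniform fallback for a column that receives no weight are assumptions.

```python
NUCLEOTIDES = "ACGT"


def select_top_z(pdf_values, z):
    """Sel of Equation 17: keep the z largest PDF values, set the others to 0."""
    ranked = sorted(range(len(pdf_values)), key=lambda n: pdf_values[n], reverse=True)
    keep = set(ranked[:max(z, 0)])
    return [value if n in keep else 0.0 for n, value in enumerate(pdf_values)]


def weighted_probability_matrix(wmers, weights):
    """Accumulate weighted nucleotide counts per position, then normalize columns."""
    width = len(wmers[0])
    matrix = [{b: 0.0 for b in NUCLEOTIDES} for _ in range(width)]
    for wmer, weight in zip(wmers, weights):
        for i, letter in enumerate(wmer):
            matrix[i][letter] += weight
    for column in matrix:
        total = sum(column.values())
        for b in NUCLEOTIDES:
            # Assumption: a column that received no weight becomes uniform.
            column[b] = column[b] / total if total > 0.0 else 1.0 / len(NUCLEOTIDES)
    return matrix
```

With `z` equal to the number of inputs, `select_top_z` leaves the PDF values unchanged, which corresponds to omitting the optional selection step.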

After E and M steps are executed for a certain number of times, the

resulting PWMs are statistically tested as mentioned in Section 3.2.1. Similarly,

the best PWMs with the highest z-scores are regarded as the most probable motif

candidates.

3.2.3. Self-Organizing Map

Self-Organizing Map (SOM) is a type of neural network that is mainly used for visualization, dimension reduction and data compression (Kohonen, 1998). SOM generally takes inputs from a high-dimensional space and maps them into a lower-dimensional space. To accomplish such a projection, SOM employs an input layer of vectors with the same dimensionality as the inputs and an output layer of nodes interconnected with the input layer. In most applications the output layer is chosen to be low-dimensional, typically a 2-D planar grid of nodes, to make the result of the transformation easy to interpret. The basic SOM algorithm can be summarized as:

a) Randomly initialized weight vectors in the input layer are fed with one

input at a time,

b) Closest weight vector to the input, in other words winner node or best

matching unit (BMU) is chosen,

c) Winner node and its topological neighbors are updated with the input,

d) Steps a, b and c are repeated for each input for a number of times which is

called training.

More formally, the index of BMU is determined via:


$$c = \arg\min\big(dist(x_i, n_k)\big) \tag{18}$$

where $x_i$ represents an input to be compared to $n_k$, which is a node on the SOM output layer; $dist$ stands for an appropriate distance function, most commonly Euclidean.

In the application of motif finding, the basic SOM algorithm flow remains almost the same. SOM, however, is mostly designed for numerical inputs, and thus some modifications are required to make it work for subsequence clustering with PWMs. For motif finding, where the input space consists of subsequences extracted from the given promoter sequences of putatively co-regulated genes, each node at the output layer of SOM is associated with a randomly initialized PWM (Step a). Comparing an input subsequence $x$, which is a string of length $w$ over the nucleotides A, C, G and T in an arbitrary order, with a $4 \times w$ PWM in order to find the closest PWM to the given input requires a likelihood function that adapts Equation 18 to the motif finding procedure:

$$L(x, m) = \sum_{i=1}^{l} \sum_{b=1}^{4} eq(x_i, b)\, m_{i,b}, \qquad eq(a, b) = \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases} \tag{19}$$

With the function $L(x, m)$, a subsequence and a PWM can be compared, so their similarity can be obtained quantitatively, which allows the BMU to be determined during training (Step b). Consequently, the BMU and a number of its topological neighbors are updated with the given input subsequence. For the motif finding application, this task is performed by first updating the frequency matrix and then, after all the inputs are associated with the clusters (i.e., nodes or PWMs), updating the PWM that represents each node from its frequency matrix. To update the frequency matrix, Equation 20 is utilized:


$$f_{i,b}^{c} = \varphi(c)\,\alpha(t)\, \frac{\sum_{k=1}^{h} eq(x_{k,i}, b)}{\sum_{y=1}^{4} \sum_{k=1}^{h} eq(x_{k,i}, y)} \tag{20}$$

where $f_{i,b}^{c}$ corresponds to the frequency matrix element at column $i$ and row $b$ of node $c$, and $h$ stands for the number of subsequences associated with node $c$. The function $\varphi(c)$ controls the influence of the neighborhood as a function of topological closeness to the BMU. A common choice for the neighborhood function is the Gaussian distribution function, since it produces continuous values that decrease as closeness to the BMU decreases. The term $\alpha(t)$ stands for the learning rate at iteration $t$; the learning coefficient decreases over the iterations of the algorithm. Subsequent to the frequency matrix calculation, the PWM of the node is obtained by applying Equation 18 once at each iteration of the algorithm.
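The BMU selection of Step b and the two scaling factors of Equation 20 can be sketched as below. Since $L(x, m)$ is a similarity rather than a distance, the sketch picks the node with the highest likelihood instead of the smallest distance. The grid is a dictionary from (row, column) coordinates to PWMs; this layout, the linear learning-rate schedule and all names are assumptions made for illustration.

```python
import math


def likelihood(wmer, pwm):
    """L(x, m) of Equation 19: add up the PWM entry selected by each nucleotide."""
    return sum(column[letter] for letter, column in zip(wmer, pwm))


def best_matching_unit(wmer, grid):
    """Grid coordinate of the node whose PWM is most similar to the w-mer."""
    return max(grid, key=lambda node: likelihood(wmer, grid[node]))


def neighborhood(node, bmu, radius):
    """Gaussian neighborhood factor: 1 at the BMU, decaying with grid distance."""
    dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
    return math.exp(-dist_sq / (2.0 * radius ** 2))


def learning_rate(t, max_iterations, initial=0.5):
    """Linearly decreasing learning rate alpha(t) (an assumed schedule)."""
    return initial * (1.0 - t / max_iterations)
```

For a node at coordinate `node`, the product `neighborhood(node, bmu, radius) * learning_rate(t, max_iterations)` plays the role of $\varphi(c)\,\alpha(t)$ in Equation 20.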

After SOM clustering, the procedure to analyze obtained PWMs to extract

potential motifs is the same as in previous clustering techniques. The results with

the highest z-scores will be selected as the prediction of the SOM algorithm.

3.2.4. K-Means

K-Means is a clustering algorithm that relies on repeatedly moving a given number of cluster centers $C = \{c_1, c_2, ..., c_m\}$, which are generally randomly initialized, towards the given inputs $X = \{x_1, x_2, ..., x_n\}$ to be clustered. This procedure consists of executing two main steps until a satisfactory convergence is reached:

a) Calculating the distances between inputs and cluster centers in order to

assign each input to the nearest cluster,

b) Updating cluster centers by taking the means of assigned inputs to each.

Thus, the objective is minimizing the following loss function:

$$V = \sum_{i=1}^{M} \sum_{j=1}^{n} \| x_j - c_i \|^{2} \tag{21}$$


where || ... || is a distance function, usually Euclidean.

As for the motif discovery application, the distance function must be replaced with $1 / L(x_j, c_i)$, which utilizes the similarity function from Equation 19. Consequently, each cluster center is updated with:

$$c_i = \frac{\sum_{\forall x_j \in C_i}^{n} x_j}{n} \tag{22}$$

In order to fit this mean updating task for motif finding, nucleotide

frequencies at each position of locally aligned sequences should be counted to

form frequency matrix. After establishing a hard association of each subsequence

to a cluster center, the frequency matrix is updated as follows:

$$f_{i,b}^{c} = \frac{\sum_{k=1}^{h} eq(x_{k,i}, b)}{\sum_{y=1}^{4} \sum_{k=1}^{h} eq(x_{k,i}, y)} \tag{23}$$

where $h$ corresponds to the number of subsequences associated with the cluster center $c$. The update step is very similar to SOM's update given in Equation 20. Considering the given equations, the training procedures of SOM and K-means are very similar up to the update step, where K-means updates each cluster center only with the inputs that are hard-assigned to it, whereas SOM updates the winner node and its neighbors as well. The neighborhood concept and the decreasing learning rate can be considered the main differences between the SOM and K-means learning schemes.
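One K-means iteration adapted to w-mers (hard assignment followed by the frequency matrix update of Equation 23) might look like the following sketch. Choosing the center with the largest $L(x, c)$ is used in place of minimizing $1 / L(x, c)$, which gives the same assignment while avoiding a division by zero. The names and the rule of keeping an empty cluster unchanged are assumptions.

```python
NUCLEOTIDES = "ACGT"


def likelihood(wmer, pwm):
    """L(x, m) of Equation 19 for a PWM stored as a list of per-position dicts."""
    return sum(column[letter] for letter, column in zip(wmer, pwm))


def frequency_matrix(wmers, width):
    """Equation 23: per-position nucleotide frequencies of the assigned w-mers."""
    matrix = []
    for i in range(width):
        counts = {b: 0 for b in NUCLEOTIDES}
        for wmer in wmers:
            counts[wmer[i]] += 1
        matrix.append({b: counts[b] / len(wmers) for b in NUCLEOTIDES})
    return matrix


def kmeans_step(wmers, centers):
    """Assign every w-mer to its most similar center, then rebuild the centers."""
    width = len(wmers[0])
    clusters = [[] for _ in centers]
    for wmer in wmers:
        nearest = max(range(len(centers)), key=lambda c: likelihood(wmer, centers[c]))
        clusters[nearest].append(wmer)
    # Assumption: a center with no assigned w-mers is left unchanged.
    new_centers = [frequency_matrix(members, width) if members else center
                   for members, center in zip(clusters, centers)]
    return new_centers, clusters
```

Repeating `kmeans_step` until the clusters stop changing corresponds to the two-step procedure listed above.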


3.2.5. Post-optimization for clustering approach

The clustering algorithms given throughout Section 3.2 are generally sensitive to initialization and thus may not reach the global optimum. Additionally, motif finding is a multimodal problem with many local optima in which an algorithm may be trapped. Therefore, a post-optimization procedure that enhances the obtained PWMs is applied after clustering is completed. The optimization is performed for each PWM by use of the Bayesian scoring function proposed by Jensen and Liu (2004):

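$$\psi(A) = A \left( \log \prod_{i=1}^{w} \prod_{b=1}^{4} \left( \frac{\hat{\theta}_{ib}}{\hat{\theta}_{0b}} \right)^{\hat{\theta}_{ib}} - 1 + \log \frac{\hat{p}_0}{1 - \hat{p}_0} \right) \tag{24}$$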

where $A$ denotes the number of TFBS instances that are locally aligned in the cluster and $\hat{p}_0$ is the ratio of $A$ to the total number of possible TFBS locations on $X$ (i.e., $\hat{p}_0 = A / (N - w)$). The symbol $\hat{\theta}_{ib}$ denotes the frequency of letter $b$ at position $i$ of the alignment matrix, and the symbol $\hat{\theta}_{0b}$ corresponds to the background frequency of the same letter.
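A direct evaluation of the score in Equation 24 can be sketched as follows, using the identity that the log of the double product equals the double sum of $\hat{\theta}_{ib} \log(\hat{\theta}_{ib} / \hat{\theta}_{0b})$. The function name, the argument `total_length` (standing for $N$) and the convention that a zero frequency contributes nothing are assumptions of this example.

```python
import math


def bayesian_score(aligned, background, total_length):
    """psi(A) of Equation 24 for a list of equally long, locally aligned w-mers."""
    a = len(aligned)                 # number of aligned instances, A
    w = len(aligned[0])              # motif width
    p0 = a / (total_length - w)      # estimated motif abundance, A / (N - w)
    info = 0.0
    for i in range(w):
        for b in "ACGT":
            theta = sum(1 for wmer in aligned if wmer[i] == b) / a
            if theta > 0.0:          # 0 * log(0) is treated as 0
                info += theta * math.log(theta / background[b])
    return a * (info - 1.0 + math.log(p0 / (1.0 - p0)))
```

A fully conserved alignment maximizes the information term, so it scores higher than an alignment of the same size whose columns match the background distribution.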

In clustering, each local alignment may include false positive

subsequences and they cannot be easily avoided by iteratively clustering

subsequences into most relevant centroids. Thus, we utilize the scoring function in

Equation 24 that gives us the ability to quantitatively measure the effects of

modifications on the alignment such as removal of a subsequence or addition of a

new w-mer from X. Therefore, in order to optimize each PWM, two steps of

operations are iteratively performed on the PWMs separately after clustering is

completed:

a) Re-alignment operation: Since most of the given sequences in $X$ are expected to contain an instance of the sought motif, the alignment that represents a candidate motif model is also expected to contain instances from most of the sequences. Thus, first, the sequences $S_i$ from $X$ that contribute no w-mer to the local alignment are selected. Then, every w-mer of each such $S_i$ is considered for inclusion in the alignment by calculating the corresponding score, $\psi(A')$. The w-mer is immediately included in the alignment if $\psi(A') > \psi(A)$. Similarly, the subsequences that were already aligned in the clustering phase are reconsidered to see whether their removal, or their replacement with another w-mer from the same $S_i$, produces an improved score.

b) Shift operation: Let $a$ be the starting position of a subsequence, and let $\vec{a} = \{a_1, a_2, a_3, ..., a_n\}$ represent a local alignment of $n$ subsequences. The shift operation checks whether a simultaneous shift of the alignment in either direction improves the score. That is, for each $i \in \{-1, 1\}$, $\vec{a}' = \{a_1 + i, a_2 + i, a_3 + i, ..., a_n + i\}$ and the corresponding score $\psi(A')$ are calculated; $\vec{a}'$ is then accepted as the new $\vec{a}$ if $\psi(A') > \psi(A)$.
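The shift operation in item (b) can be sketched as below. The scoring function is passed in as a callable, so any score such as Equation 24 can be plugged in. Skipping a shift that would move a w-mer outside its sequence is an assumption added for the example, as are the names.

```python
def shift_alignment(starts, lengths, width, score):
    """Try a simultaneous shift of all start positions by -1 and by +1.

    starts  -- start position of the aligned w-mer in each sequence
    lengths -- length of each sequence, used for the boundary check
    score   -- callable mapping a list of start positions to a number
    """
    best, best_score = list(starts), score(starts)
    for step in (-1, 1):
        candidate = [a + step for a in starts]
        # Assumption: a shift that leaves any sequence is skipped.
        if any(a < 0 or a + width > n for a, n in zip(candidate, lengths)):
            continue
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Calling `shift_alignment` repeatedly until the returned positions no longer change applies the operation iteratively, as described for the post-optimization phase.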

In addition to the re-alignment and shift operations, the post-optimization phase also contains a procedure to find the optimal width of the sought motif. As mentioned previously, the user of the proposed method is required to give three parameters for a motif search: $w_w$, $w_{min}$ and $w_{max}$. Clustering is executed under the assumption $w = w_w$, and the optimal width is then found in the post-optimization phase by varying the width between $w_{min}$ and $w_{max}$. The Bayesian scoring scheme is modified to fit a variable-width motif search strategy and becomes:

$$\psi(A, w) = \log p(w) + \log B(|A|, L - |A|) + \sum_{b=1}^{4} n_{0b} \log \theta_{0b} + \sum_{i=1}^{w} \log \frac{\Gamma(4\beta) \prod_{b} \Gamma(n_{ib} + \beta)}{\Gamma(\beta)^{4}\, \Gamma(|A| + 4\beta)} \tag{25}$$

where $p(w)$ is a prior distribution on $w$ such as the Poisson distribution, $\Gamma(x + 1) = x!$ and $B(c, d) = \int_0^1 x^{c} (1 - x)^{d}\, dx$. The term $n_{ib}$ refers to the count of nucleotide $b$ at position $i$, and $n_{0b}$ denotes the background count of the same nucleotide. Please refer to (Jensen et al. 2004) for more details.

The fitness function given in Equation 25 allows us to find an optimal configuration of $A$, i.e., the local alignment, with respect to a given $w$. The width $w'$, where $w_{min} \le w' \le w_{max}$, is accepted as the optimal width if $\psi(A, w')$ is superior to all other $\psi(A, w_i)$ in the range of possible width options.

3.2.6. Particle Swarm Optimization

Particle Swarm Optimization (Eberhart and Kennedy, 1995) is a popular

metaheuristic optimization procedure that iteratively improves a swarm of

candidate solutions which are called particles. Through the iterations of the

algorithm, each particle is moved within the search space in order to locate a

better position in terms of a problem-specific fitness value. The movement of the

particles is controlled by a social cognition mechanism in which best found

position in the neighborhood of each particle influences its next position and

movement speed. Moreover, each particle's record of its own best position so far is also considered while moving the particle in the search space. With this basic strategy, PSO has been shown (Poli, 2008) to be capable of finding good solutions for various problems, mostly modeled in high-dimensional spaces of numerical values.

More formally, let $S \subseteq \mathbb{R}^n$ be the search space in which each particle $p_k$ holds a vector of positions $\vec{x} = \{x_1, x_2, ..., x_n \mid \forall x_i \in S\}$ and accompanying velocities $\vec{v} = \{v_1, v_2, ..., v_n \mid \forall v_i \in \mathbb{R}\}$, where an $n$-dimensional space is considered. At the beginning of the algorithm, each position $x_i$ and velocity $v_i$ is set to a random value. Then, at each iteration $t$, the velocities and the positions are updated according to Equations 26 and 27, respectively.

$$v_i^{(t+1)} = \alpha v_i^{(t)} + \beta r_1 \left( x_i^{lb} - x_i^{(t)} \right) + \gamma r_2 \left( x_i^{nb} - x_i^{(t)} \right) \tag{26}$$

$$x_i^{(t+1)} = x_i^{(t)} + v_i^{(t+1)} \tag{27}$$


In Equation 26, $r_1$ and $r_2$ are independent random real values between 0 and 1. The parameter $\alpha$ is called the inertia factor and controls how much of the original velocity $v_i^{(t)}$ is retained for iteration $t+1$. The terms $\beta$ and $\gamma$ are the cognitive and social parameters, which control the effect of $x_i^{lb}$, the best position the particle itself has found so far, and $x_i^{nb}$, the best position found by the informants of the particle, i.e., the particle's neighbors according to a population topology. Among the most popular topologies are GBest, Bidirectional ring, Random and Von Neumann (Kennedy and Mendes, 2002). In the GBest topology, the informants of each particle are the whole population, whereas in the other topologies each particle is connected to $k$ other particles, where $1 \le k \le C_p$ in a swarm of size $C_p$. The selected topology determines the parameter $k$ and which particles are interconnected. In the Bidirectional ring, all particles are arranged sequentially to form a ring and each particle has $k = 2$ neighbors: the previous particle and the next one on the ring. In the Random topology, an arbitrary $k$ with $1 \le k \le C_p$ is chosen and each particle is connected to $k$ randomly selected particles. In Von Neumann, a two-dimensional lattice of particles is formed and each particle has $k = 4$ neighbors: the particles above, below, left and right. Several studies (Kennedy and Mendes, 2002; Kennedy, 1999) have evaluated the performance of different population topologies and, along with other conclusions, have shown that including the particle itself in its list of informants may increase the overall performance of the PSO algorithm. Thus, in the above-mentioned population topologies, the number of informants should be taken as $k+1$, since the particle is also an informant to itself. Figure 3.2 depicts how particles are interconnected according to each neighborhood topology utilized in this study.
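The update rules of Equations 26 and 27 with a Bidirectional ring topology (each particle also informing itself) can be sketched as follows for a fitness that is maximized. The default values of `alpha`, `beta` and `gamma` are illustrative assumptions and are not the settings of this study; the function names are likewise hypothetical.

```python
import random


def ring_informants(index, swarm_size):
    """Bidirectional ring: previous particle, the particle itself, next particle."""
    return [(index - 1) % swarm_size, index, (index + 1) % swarm_size]


def pso_step(positions, velocities, own_best, fitness,
             alpha=0.7, beta=1.5, gamma=1.5, rng=random):
    """One synchronous update of all particles (Equations 26 and 27)."""
    size = len(positions)
    new_positions, new_velocities = [], []
    for p in range(size):
        # x^nb: best position remembered by any informant of particle p.
        guide = max((own_best[q] for q in ring_informants(p, size)), key=fitness)
        r1, r2 = rng.random(), rng.random()
        velocity = [alpha * v + beta * r1 * (lb - x) + gamma * r2 * (nb - x)
                    for v, x, lb, nb in zip(velocities[p], positions[p],
                                            own_best[p], guide)]
        new_velocities.append(velocity)
        new_positions.append([x + v for x, v in zip(positions[p], velocity)])
    return new_positions, new_velocities


def update_bests(positions, own_best, fitness):
    """Keep, for every particle, the better of its new and remembered positions."""
    return [x if fitness(x) > fitness(b) else b
            for x, b in zip(positions, own_best)]
```

Swapping `ring_informants` for a function that returns every index gives the GBest topology without changing the rest of the loop.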

In each iteration of the algorithm, the best position within the neighborhood and each particle's own best position visited are recalculated. The algorithm is terminated when either a satisfactory convergence is reached or a certain number of iterations have been executed. Convergence can be defined as either the convergence of the swarm's best particle to the global optimum or the convergence of all particles to a single position, which may or may not be the global optimum. PSO practitioners often monitor the change in the global fitness or in the best particle's fitness to terminate the algorithm. If the tracked value does not change for a few iterations, the swarm is considered to have collapsed or converged, leading to immediate termination of the algorithm.

Figure 3.2. Graphical representations of utilized population topologies: (a) GBest

(b) Ring (c) Random (d) Von Neumann

In order to adapt PSO to the DNA motif discovery task, the solution space and the motif representation should be designed first. We are given a set of DNA sequences $S = \{S_1, S_2, ..., S_n\}$, where each $S_i$ is of arbitrary length $l_i$ with $l_i > 0$. We seek the starting positions of TFBS instances of known width $w$, where the number of instances $k_i$ in each sequence is unknown. If every position in $S_i$ is independently considered as a possible TFBS instance, then $0 \le k_i < l_i - w + 1$, which makes the solution space too large for an optimization procedure to process in a reasonable time, since the complexity would be $O((2^{l_i - w})^n)$ (Wei and Jensen, 2006; Chan et al., 2008). Thus, methods in the literature with similar approaches, such as those based on Genetic Algorithms, generally simplify the solution space by restricting $k_i$ to $0 \le k_i \le 1$. With this restriction, each particle represents a solution as a vector $\vec{x} = \{x_1, x_2, ..., x_n\}$, where each $x_i$ represents a possible location of a TFBS instance in $S_i$ and is restricted to $0 \le x_i < l_i - w + 1$. In real-world scenarios, this restricted search procedure is able to locate the majority of the sought TFBS instances, but not all of them, since the motif abundance per sequence may vary. Therefore, once the PWM of the sought model is constructed, post-processing procedures can find additional motif instances that improve the motif model according to the fitness function. Figure 3.3 shows a sample particle and its evaluation.

Figure 3.3. A sample particle and its evaluation

In this scheme, where each particle simply holds a vector $\vec{x}$ of possible locations on each $S_i$, the PSO algorithm can be run as explained previously. However, it may not reach the global optimum under these conditions. The problem is that, even though the search space appears to be continuous, it is not. Subsequences, i.e., w-mers, extracted from adjacent locations on $S_i$ are not necessarily similar. For instance, let $S_i$ be the DNA sequence ACGACCATCGATGG and $w = 4$. Then, counting positions from 1, all possible locations on $S_i$ are limited to $1 \le x_i \le 11$. In this space, consider two successive locations, $x_i = 7$ and $x_i = 8$; the corresponding w-mers are ATCG and TCGA. Although the locations are adjacent, the corresponding w-mers are very different and do not share a common pattern when aligned one under the other. For PSO to move iteratively towards an optimum, neighboring positions in the search space should correspond to similar candidates. Otherwise, the particles move through the search space essentially at random. To overcome this issue, we propose a transformation of the original $S_i$ into a continuous space $S_i^c$. To construct $S_i^c$, all w-mers are extracted from the original $S_i$. These w-mers, which are strings over the alphabet {A, C, G, T}, are sorted in alphabetical order. The sorted list is in effect a new sequence of w-mers. At this point, we have a new continuous space $S_i^c$ in which the w-mers present a gradient distribution. Figure 3.4 depicts the process for the example given previously. The transformation is applied to each particle by replacing each element of $\vec{x}$ with a simple function $C(x_i)$ that translates the original location $x_i$ into a new position $x_i^c$ in the sorted list.
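The construction of the sorted space and of the mapping $C(x_i)$ can be sketched for a single sequence as follows. The sketch uses 0-based positions, whereas the worked example in the text counts from 1, and it breaks ties between identical w-mers by their original position; the tie rule and all names are assumptions.

```python
def build_sorted_space(sequence, w):
    """Sort all w-mers of one sequence alphabetically.

    Returns (sorted_wmers, to_sorted, to_original) where
    to_sorted[s]    is the rank of the w-mer starting at original position s,
    to_original[r]  is the original start position of the w-mer at rank r.
    """
    starts = range(len(sequence) - w + 1)
    # Assumption: identical w-mers keep the order of their original positions.
    to_original = sorted(starts, key=lambda s: (sequence[s:s + w], s))
    sorted_wmers = [sequence[s:s + w] for s in to_original]
    to_sorted = {s: rank for rank, s in enumerate(to_original)}
    return sorted_wmers, to_sorted, to_original


wmers, to_sorted, to_original = build_sorted_space("ACGACCATCGATGG", 4)
print(wmers)
# ['ACCA', 'ACGA', 'ATCG', 'ATGG', 'CATC', 'CCAT', 'CGAC', 'CGAT', 'GACC', 'GATG', 'TCGA']
```

In the sorted list, ATCG (rank 2) sits next to ATGG (rank 3), with which it shares a prefix, whereas TCGA moves to the far end (rank 10).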

Figure 3.4. Transformation of $S_i$ into $S_i^c$

The chosen fitness function, which has been shown (Jensen and Liu, 2004; Wei and Jensen, 2006) to be suitable and effective for the motif discovery task, is the Bayesian scoring function given in Equation 24 in Section 3.2.5. The scoring function is usable both for post-optimization in the clustering approach and as the fitness function of the stochastic search procedure, PSO. To utilize this function in the PSO-based algorithm, a PFM corresponding to the TFBS locations in $\vec{x}$ is calculated for each particle in each iteration of the algorithm, so that the term $\hat{\theta}_{ib}$ can be obtained. Each $\hat{\theta}_{0b}$ can be calculated once, when the dataset is loaded into the program, by counting the frequencies of the letters in the whole dataset. The variant of the Bayesian function given in Equation 25 can also be used with PSO. With this function, PSO can be fitted to a variable-width motif search strategy. Figure 3.5 summarizes the proposed PSO-based approach in pseudocode format.

Initialize parameters
Load data and transform each S_i into S_i^c
k ← 0
while (k < numMotifs) {
    i ← 0
    Set topology
    Initialize particles
    Initialize connections
    while (i < maxIterations AND not converged) {
        Update particles' velocities
        Update particles' positions
        Update particles' fitnesses
        Select best particle
        if (best fitness stagnates for 10 iterations) {
            Perturb particles
            Initialize connections
        }
        i ← i + 1
    }
    Post-process (best particle)
    Add best particle to the output list
    k ← k + 1
}

Figure 3.5. Pseudocode of PSO-based proposed algorithm

Despite the fact that PSO is generally effective at exploring the search space very quickly, it does not guarantee the global optimum and may converge prematurely. Several factors may cause the algorithm to be trapped in local optima, such as ineffective parameter selection and problem-specific issues. Section 4.5 presents a discussion of parameter selection, along with some other techniques to escape from local optima. However, these strategies may not suffice to make the algorithm perform exploration and exploitation effectively throughout the search space. In the PSO literature, a recent trend to improve PSO's convergence ability is to perturb the particles by random mutations whenever all particles stagnate for a predetermined number of iterations (Das et al., 2005; Xinchao, 2010). In this study, perturbation is utilized and stagnation is defined as no change in the best particle's fitness for 10 consecutive iterations. Once stagnation is detected, a few random elements of $\vec{x}$ are selected for each particle and shifted to random locations. Hence, the swarm continues its movement. This procedure is applied whenever stagnation happens, up to a maximum of 100 times. Simulations on both synthetic and real datasets showed that this procedure improves the exploitation ability of the swarm. One more issue that should be considered is the presence of repetitive false positive sub-sequences that constitute high-scoring local alignments. Besides, the given DNA sequences may contain more than one DNA motif of interest. To handle these two situations, motif discovery programs are often required to output multiple motif predictions, ordered so that the most interesting, i.e., highest scoring, one is at the top and the one that appears least relevant is at the bottom. Hence, the proposed PSO-based method requires the user to give an additional parameter that determines how many motif predictions are to be made. To prevent the program from predicting the same motif in every round, the sub-sequences related to each predicted motif are removed from the given dataset after each prediction. The next round is therefore executed on the modified sequence data. With this approach, the sought motif can still be found in a later round even if some false positive sub-sequences are present in the given data.
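The stagnation test and the perturbation described above can be sketched as follows. The number of elements moved per particle (`count`) is not specified in the text beyond "a few", so its default value here is an assumption, as are the function names.

```python
import random


def has_stagnated(best_fitness_history, patience=10):
    """True when the best fitness is unchanged over the last `patience` iterations."""
    if len(best_fitness_history) <= patience:
        return False
    recent = best_fitness_history[-(patience + 1):]
    return all(value == recent[0] for value in recent)


def perturb(position, lengths, width, count=2, rng=random):
    """Move `count` randomly chosen elements of a particle to random valid locations.

    position -- one start location per sequence
    lengths  -- length of each sequence, so that 0 <= x_i < l_i - w + 1 holds
    """
    mutated = list(position)
    for i in rng.sample(range(len(mutated)), min(count, len(mutated))):
        mutated[i] = rng.randrange(lengths[i] - width + 1)
    return mutated
```

A counter around the call to `perturb` can enforce the upper limit of 100 perturbations mentioned in the text.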

The DNA motif discovery task can be regarded as a multimodal problem, and hence some local optima cannot be avoided with the methods described above. Methods in the literature that utilize a stochastic search procedure such as the Genetic Algorithm are generally reported to fall into these local optima (Wei and Jensen, 2006; Liu et al., 2004; Larsson et al., 2007). As a heuristic search procedure, PSO is no exception.


Figure 3.6. A Single iteration of re-alignment and simultaneous shift operators

Among several techniques, some of which are variants of the same type, we choose two methods that were previously utilized by researchers. These operations are executed once PSO terminates and yields the best particle with location vector $\vec{x}_{best}$.

a) Operations to make improvements on the best particle: The first operator, i.e., the re-alignment, checks $\vec{x}_{best}$ for false positive sub-sequences that may often be included in noisy datasets. Each $x_i$ in $\vec{x}_{best}$ is checked against each location $j$ in $S_i$, where $0 \le j \le l_i - w + 1$. If the new $\vec{x}'_{best}$ that is constructed by setting $x_i = j$ improves the fitness of the particle, then $\vec{x}'_{best}$ replaces $\vec{x}_{best}$. With this operation, false positive sub-sequences are either removed or replaced with a true positive sub-sequence. The second operator checks whether a simultaneous shift in all $x_i$ of $\vec{x}_{best}$ improves the fitness of the particle. If $\vec{x}'_{best} = \{x_1 + 1, x_2 + 1, \ldots, x_n + 1\}$ or $\vec{x}'_{best} = \{x_1 - 1, x_2 - 1, \ldots, x_n - 1\}$ and $\psi(\vec{x}'_{best}) > \psi(\vec{x}_{best})$, then $\vec{x}_{best}$ is replaced with $\vec{x}'_{best}$, and the operator repeats this procedure until no improvement is observed. Figure 3.6 depicts a single iteration of these two operators.
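The two operators can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, start positions are 0-based (so the last valid start in a sequence of length l is l - w), and `consensus_fitness` is only a toy stand-in for the fitness function ψ of Equation 25.

```python
from typing import Callable, List, Sequence

Fitness = Callable[[List[int]], float]


def consensus_fitness(seqs: Sequence[str], w: int) -> Fitness:
    """Toy stand-in for psi (assumption): for each alignment column,
    count the most frequent base, then sum over the w columns."""
    def fitness(x: List[int]) -> float:
        windows = [s[p:p + w] for s, p in zip(seqs, x)]
        return float(sum(max(col.count(b) for b in "ACGT")
                         for col in zip(*windows)))
    return fitness


def realign(x_best: List[int], seqs: Sequence[str], w: int,
            fitness: Fitness) -> List[int]:
    """Re-alignment: for each sequence i, try every start j and keep the
    move only if it strictly improves the fitness."""
    best = list(x_best)
    best_score = fitness(best)
    for i, s in enumerate(seqs):
        for j in range(len(s) - w + 1):
            if j == best[i]:
                continue
            candidate = best.copy()
            candidate[i] = j
            score = fitness(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best


def simultaneous_shift(x_best: List[int], seqs: Sequence[str], w: int,
                       fitness: Fitness) -> List[int]:
    """Shift all starts by +1 or -1 together; repeat while fitness improves."""
    best = list(x_best)
    best_score = fitness(best)
    improved = True
    while improved:
        improved = False
        for delta in (1, -1):
            candidate = [p + delta for p in best]
            # Skip shifts that push any window outside its sequence.
            if any(p < 0 or p > len(s) - w for p, s in zip(candidate, seqs)):
                continue
            score = fitness(candidate)
            if score > best_score:
                best, best_score, improved = candidate, score, True
                break
    return best


# Hypothetical toy data: the motif GACGTC starts at 2, 1 and 3.
seqs = ["TTGACGTCAA", "AGACGTCTTT", "AAAGACGTCA"]
psi = consensus_fitness(seqs, 6)
fixed = realign([2, 1, 0], seqs, 6, psi)
shifted = simultaneous_shift([1, 0, 2], seqs, 6, psi)
```

In the toy data, the first call repairs a single false positive start, and the second call recovers an alignment that is off by one position in every sequence.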


b) Operation to find optimal motif width: Generally, a practitioner who is interested in finding motifs in a set of promoter sequences may not know the exact width of the sought motifs. Moreover, TFBS instances of the same motif are not necessarily of the same width. Thus, in the proposed method, the users are required to provide three parameters to run the algorithm: the expected width $w_0$ of the motif, a minimum width $w_{min}$ and a maximum width $w_{max}$. The PSO algorithm is executed with the assumption $w = w_0$, and the optimal width is then found in the post-processing phase by varying the width between $w_{min}$ and $w_{max}$. The fitness function given in Equation 25 allows us to find an optimal configuration of A, i.e., $\vec{x}$, with respect to a given w. The width $w'$, where $w_{min} \le w' \le w_{max}$, is accepted as the optimal width if $\psi(A, w')$ is superior to all other $\psi(A, w_i)$ in the range of possible width options. This second post-optimization procedure is similar to the one utilized for finding the optimal width in the clustering approach, which is explained in Section 3.2.5.
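The width search described above amounts to evaluating the fitness at every candidate width and keeping the best one. A minimal sketch follows, assuming a hypothetical callable `psi_at_width` that returns ψ(A, w) for a given width and assuming that scores are comparable across widths; neither name comes from the thesis.

```python
from typing import Callable, Tuple


def select_width(psi_at_width: Callable[[int], float],
                 w_min: int, w_max: int) -> Tuple[int, float]:
    """Return the width in [w_min, w_max] with the highest fitness,
    together with that fitness. Ties go to the smallest width."""
    if w_min > w_max:
        raise ValueError("w_min must not exceed w_max")
    scores = {w: psi_at_width(w) for w in range(w_min, w_max + 1)}
    best_w = max(scores, key=scores.get)
    return best_w, scores[best_w]


# Hypothetical fitness profile that peaks at width 8.
best_w, best_score = select_width(lambda w: -(w - 8) ** 2, 6, 12)
```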

In addition to the above-mentioned methods for overcoming local minima, parameter selection also has a significant effect on the performance of the algorithm. Relevant parameter selection strategies for the DNA motif discovery task are given in Section 4.5.


4. RESEARCH AND DISCUSSION

In this section, the results of the developed tools and the relevant discussion are given. First, the evaluation metrics that are utilized throughout the experimental studies are given in Section 4.1. Secondly, by using these metrics, clustering algorithms are evaluated in Sections 4.2 and 4.3. Afterwards, the post-optimized clustering approach is discussed in Section 4.4. Finally, the experimental study related to the PSO-based algorithm is given.

4.1. Evaluation Metrics

In order to assess motif discovery prediction performance, several measures have been studied in papers such as those of Burset and Guigó (1996), Pevzner and Sze (2000), and Tompa et al. (2005). The metrics presented in this section require some prior information about the datasets, such as the locations of the motifs present in the dataset. With this information, the quantities on which the evaluation metrics depend, namely the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts, can be obtained. All the methods discussed in the study are capable of presenting multiple motif results; however, the TP, TN, FP and FN counts are based on a single motif prediction. Ideally, the top motif, i.e., the first one in the list of multiple predictions sorted with a scoring scheme such as z-score or p-value, is considered as the result of the program. However, methods in the literature most commonly allow the best result among the first 10-20 results to be selected as the result of an evaluated motif-finder program.

The first metric, Sensitivity (Sn), measures the proportion of true TFBS instances that are correctly predicted. It is useful for measuring how good the motif-finder program is at catching TFBS instances. However, it does not reflect the proportion of false predictions. Sensitivity is also known as Recall and is calculated as:

$Sn = \dfrac{TP}{TP + FN}$ (28)


Specificity (Sp) is in a sense the counterpart of Sn: it measures the proportion of correctly identified non-motif instances.

$Sp = \dfrac{TN}{TN + FP}$ (29)

Additionally, Positive Predictive Value (PPV) measures the proportion of predicted TFBS instances that are correctly identified. PPV is also known as Precision.

$PPV = \dfrac{TP}{TP + FP}$ (30)

In comparison to the above metrics, Matthews' Correlation Coefficient (MCC) is a more sophisticated way of measuring the performance of a predictor since it takes into account the TP, TN, FP and FN counts in a balanced way. The MCC score of a motif-finder method is high only when both the proportion of true predictions is high and the proportion of false predictions is low.

$MCC = \dfrac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (31)

A measure similar to MCC is the F-Score (Shaw et al., 1997), which is often utilized in the motif-finding literature (Wei and Jensen, 2006). It is based on the harmonic mean of Precision and Recall.

$F\text{-}Score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$ (32)
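The five metrics of Equations 28 to 32 can be computed directly from the four counts. The sketch below is illustrative only; the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the thesis does not specify.

```python
import math
from typing import Dict


def motif_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sn, Sp, PPV, MCC and F-Score from site-level counts.
    Zero denominators yield 0.0 (an assumed convention)."""
    sn = tp / (tp + fn) if tp + fn else 0.0            # Equation 28
    sp = tn / (tn + fp) if tn + fp else 0.0            # Equation 29
    ppv = tp / (tp + fp) if tp + fp else 0.0           # Equation 30
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Equation 31
    f = 2 * ppv * sn / (ppv + sn) if ppv + sn else 0.0   # Equation 32
    return {"Sn": sn, "Sp": sp, "PPV": ppv, "MCC": mcc, "F": f}


# Hypothetical counts: 8 sites found, 2 missed, 2 spurious predictions.
m = motif_metrics(tp=8, tn=988, fp=2, fn=2)
```

Because non-motif positions vastly outnumber motif positions, Sp stays close to 1 even for weak predictors, which is one reason MCC and F-Score are the more informative summary values here.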

Since the methods developed in this study are compared to some methods in the literature whose performance was reported in terms of F-Score, we had to utilize both metrics even though they produce very similar values.


4.2. Employing Fuzzy C-Means for DNA Motif Discovery

Section 3.2.1 explains how FCM, which is originally designed for numerical spaces, can be adapted for the DNA motif finding task. In this section, experiments over the given promoter sequences of Saccharomyces cerevisiae (i.e., the first dataset group) are presented and discussed. First, some parameter values should be provided for the proposed algorithm in order to operate over a DNA sequence:

a) The length of the sought pattern

b) The number of clusters to group the given sequence in.

The choice of an appropriate number of clusters to group the data is a common difficulty encountered in clustering algorithms, and FCM is no exception to this. In order to understand the relationship between the number of clusters and the dataset length, several numbers of clusters are tried and the corresponding performances in quantitative terms (MCC) are recorded.

Figure 4.1. The effect of number of clusters over the performance for FCM

The experiments are performed over different segments of the same

sequence with varying length containing different number of motif instances

proportional to the sequence length. According to Figure 4.1, the algorithm

becomes reasonably accurate when the number of clusters is between 50 and

1000. Therefore, there seems to be a loose connection between the number of


clusters and the performance. Except for the 1000-nucleotide-long DNA sequence, the algorithm performs consistently when the number of clusters is between 80 and 200; beyond this range, the performance may fall slightly. The results for the 1000 bp long segment appear to fluctuate with the number of clusters since it contains a small number of instances, and capturing or missing even a single instance results in a large deviation in the performance. Accordingly, the GAL4, GCN4, CBF1, RFX1 and HSF1 sequences are run with 80, 180, 180, 100 and 100 clusters, respectively.

Another issue regarding the number of clusters is the computational load. This aspect is evaluated by presenting the processing time of the proposed algorithm for two cases:

a) The number of clusters is set in proportion to the dataset length, i.e., the above-mentioned numbers are used (Settings 1),

b) All sequences are analyzed by using the same number of clusters, 80 (Settings 2).

Figure 4.2. The processing time of FCM for each dataset

As a result, a tenfold increase in the dataset length causes the computational load to rise about fivefold. On the other hand, when the number of clusters is doubled, a tenfold increase is observed in the processing time (Figure 4.2). The experiments show that the number of clusters affects the computational load and the processing time more than the length of the processed DNA sequence does.


As mentioned in Section 3.2.1, for the clustering approaches including FCM, the algorithm is terminated after a certain number of iterations have been executed. Nonetheless, this number is another parameter that must be chosen carefully to avoid both over-training and under-training. The experiments over the datasets, with a cluster number between 5 and 100, show that approximately 100 training cycles seem to be sufficient as far as the datasets used in this study are concerned.

Figure 4.3. The performance of FCM per number of training cycles

According to Figure 4.3, although datasets with dissimilar features are given to the method, the algorithm seems to converge at about 50-60 iterations of training regardless of the characteristics of the dataset. After 50-60 iterations, the performance improvement is observed to be slight. Nonetheless, to ensure the best predictions, the performance measurements for the datasets of this study are calculated after 100 cycles of training have been executed for each dataset. Subsequently, the proposed algorithm is run with the above-mentioned settings, and the site-level prediction performance is calculated in accordance with the performance measures given in Section 4.1. Table 4.1 presents the results of FCM and the two compared methods, MEME and MDScan. In Table 4.1, the best values of a measure for each dataset are given in bold.


Table 4.1. Performances of FCM, MEME and MDScan

         FCM                    MEME                   MDScan
Dataset  Sn     Sp     MCC      Sn     Sp     MCC      Sn     Sp     MCC
GAL4     1.000  0.999  0.912    1.000  0.999  0.912    0.800  0.999  0.842
GCN4     0.774  0.998  0.732    0.161  1.000  0.338    0.468  0.999  0.604
CBF1     1.000  1.000  0.978    0.203  1.000  0.450    0.985  1.000  0.962
RFX1     1.000  0.999  0.784    0.875  0.999  0.714    0.875  1.000  0.825
HSF1     0.815  0.997  0.704    0.138  0.999  0.232    0.483  0.999  0.611

The predictions of the proposed FCM-based method are close to perfect for the datasets GAL4, CBF1 and RFX1, since the predictions included either all or almost all of the sought transcription factor binding sites with few irrelevant subsequences identified as motif instances. One of the reasons for these few false positives is the existence of some w-mers in the data that are similar to the motif pattern but are not biologically marked as playing any role in the transcription process.

The other methods, MEME and MDScan, are run online with the most similar settings available. Since our algorithm assumes that any number of motif instances may occur per sequence, MEME and MDScan are adjusted to search for the motifs in any sequence while considering only the given DNA strand. The first 10 predictions of the studied methods are considered as the results, and the most relevant prediction of each method is included in Table 4.1. MEME, in general, was successful at finding significant motif patterns in the given sequences. MDScan, on the other hand, is observed to be more successful than MEME. However, the proposed algorithm outperforms both of the methods in most of the performance measures considered in this study.

As pointed out in Section 3, if a single measure should be selected to compare methods, it should be MCC, since it is the most precise of the performance measures and takes into account all the aspects of the prediction. The proposed algorithm also outperforms the others in five out of seven MCC measure values, as can be seen in Figure 4.4. In addition to the quantitative evaluation of the proposed method, sequence logos are generated for visualization from the local alignments of the subsequences associated with the clusters with the highest z-scores.


Figure 4.4. Comparison of the three methods in terms of MCC

The sequence logos in Table 4.2 also demonstrate the proficiency of the proposed method at finding the sought motif patterns in an unsupervised way. To interpret the sequence logos, one should primarily consider the capability of the motif finding tool to present the most significant parts of the original motif pattern. The relatively high letters are the core components of the sought pattern, whereas the positions where there is no distinguished letter indicate an ambiguity, i.e., any nucleotide can occur at that position. The height of the letters, on the other hand, is proportional to the information content of the PWM being constructed. A higher information content means that the found pattern is considerably different from the background model and that there is an overrepresentation of sequences.
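As a rough illustration of how letter height relates to information content, the sketch below computes the per-position information of a PWM column as the relative entropy against a background distribution, in bits. This is a common convention rather than the exact formula used in the thesis, and the function name and the uniform default background are assumptions.

```python
import math
from typing import Dict, Optional


def column_information(column: Dict[str, float],
                       background: Optional[Dict[str, float]] = None) -> float:
    """Relative entropy (bits) of one PWM column against a background.
    Assumes a uniform background when none is given."""
    bg = background or {b: 0.25 for b in "ACGT"}
    return sum(p * math.log2(p / bg[b]) for b, p in column.items() if p > 0)


# A fully conserved position carries 2 bits; a uniform one carries none.
conserved = column_information({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0})
ambiguous = column_information({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
```

Positions with values near 2 bits correspond to the tall letters of a logo, and positions near 0 bits correspond to the ambiguous positions described above.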

Although the proposed FCM-based method is generally observed to be efficient for the studied datasets, it does not always produce perfect predictions. It may fail in cases where the information content of the sought pattern is not very high or the number of transcription factor binding sites is very low relative to the sequence in which they reside. In such cases, the number of false positives tends to be high, or irrelevant patterns are identified as motifs by the algorithm. Motifs that are overrepresented and statistically divergent from the background model in promoters of putatively co-regulated genes are the primary targets of the proposed clustering-based motif finding algorithm. To overcome


such difficulties, the proposed algorithm may alternatively be combined with

other motif finding methods as a part of an ensemble application. Studies

(Yanover et al., 2009; Chakravarty et al., 2007; Wijaya et al., 2008) have shown

that such approaches are effective for the application of motif discovery.

Table 4.2. Predicted and known motifs in sequence logo format

PREDICTED MOTIF KNOWN MOTIF

GAL4

GCN4

CBF1

RFX1

HSF1

4.3. Assessment of Clustering Algorithms for Motif Discovery

In the previous section, an FCM-based motif-finding method was evaluated. The FCM-based method was in fact inspired by the study of Mahony et al. (2005), which reports that a pure SOM-based strategy is efficient and satisfactory for de novo motif discovery. Thus, in this section, in addition to FCM and SOM, the performances of some well-known clustering algorithms at motif finding are evaluated and compared. The clustering algorithms, however, are mainly designed to work within a high-dimensional vector space of numerical values. Modifications to adapt them for the motif-finding task are given in Sections 3.2.1 to 3.2.4. The selected clustering algorithms, FCM, SOM, K-Means and EM/GMM, are run against the second group of datasets (see Section 3.1), which contains datasets of the organisms yeast, E. coli, fly and human.


As in the FCM-based motif-finding method, one issue in the application of clustering algorithms is how the algorithms are initialized. In addition to algorithm-specific ones, the common initialization parameters that should be decided beforehand are the number of clusters and how the algorithm terminates. For many clustering algorithms, deciding the number of clusters for a given dataset is generally challenging since the result may change depending on how many partitions are desired. In the similar motif finding study of Mahony et al. (2005), the ideal number of clusters for the SOM algorithm is generally observed to be 1/10 of the number of given subsequences. Taking this into consideration, in order to reveal a quantitative relationship between the number of subsequences and the number of clusters, each algorithm is run with a number of clusters that varies from 1/5 to 1/20 of the number of inputs.

Figure 4.5. Correlation between performance and number of clusters

As a result, the most efficient ratios of the number of clusters to the number of subsequences are experimentally observed to be 1/12 for SOM, 1/10 for FCM, and 1/15 for K-Means and EM/GMM. It should be noted that these ratios are not definitive values and should be regarded as starting points when using clustering algorithms for motif finding. As far as Figure 4.5 is concerned, the number of clusters appears to be roughly, though not strictly, proportional to the number of subsequences.
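As a small worked example of how such a ratio translates into a cluster count, the sketch below counts the w-length windows in a set of sequences and applies a ratio to them. The helper names are hypothetical, only the given strand is counted, and rounding to the nearest integer is an assumption.

```python
from typing import Sequence


def count_wmers(seq_lengths: Sequence[int], w: int) -> int:
    """Number of w-length windows over all sequences (given strand only)."""
    return sum(max(0, n - w + 1) for n in seq_lengths)


def clusters_for_ratio(n_subsequences: int, ratio: float) -> int:
    """Cluster count for a clusters-to-subsequences ratio, at least 1."""
    return max(1, round(n_subsequences * ratio))


# Two hypothetical 100 bp promoters scanned with w = 10 give 182 windows.
n = count_wmers([100, 100], 10)
k_fcm = clusters_for_ratio(n, 1 / 10)     # ratio reported above for FCM
k_kmeans = clusters_for_ratio(n, 1 / 15)  # ratio reported for K-Means, EM/GMM
```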


After deciding the number of clusters, the PWMs are randomly initialized with values ranging between 0.0 and 1.0 under the constraint $\sum_{b \in \{A,C,G,T\}} m_{ib} = 1$. All algorithms are trained for a certain number of iterations, generally between 50 and 100, which is observed to be sufficient. Subsequently, all algorithms are run with these settings and their performances are calculated in terms of a selection of assessment indices including Sn, Sp and MCC.
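The random initialization under the sum-to-one constraint can be sketched as drawing four values per position and normalizing them. This is only one way to satisfy the constraint; the function name and the draw-then-normalize scheme are assumptions, since the thesis does not state how the random values are generated.

```python
import random
from typing import Dict, List, Optional


def random_pwm(w: int, rng: Optional[random.Random] = None) -> List[Dict[str, float]]:
    """Random PWM of width w: each position holds four values in [0, 1]
    that sum to 1 (draw uniformly, then normalize)."""
    rng = rng or random.Random()
    pwm = []
    for _ in range(w):
        # The tiny offset avoids a zero total in the degenerate case.
        raw = [rng.random() + 1e-12 for _ in "ACGT"]
        total = sum(raw)
        pwm.append({b: v / total for b, v in zip("ACGT", raw)})
    return pwm


# One PWM per cluster; a fixed seed makes the run reproducible.
pwm = random_pwm(8, random.Random(0))
```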

Table 4.3. Experimental results of four clustering algorithms for each dataset

Algorithm  Index  GCN4   RFX1   ARGR1  LEXA   DM05   DM06   HM10   HM17
SOM        Sn     0.800  0.875  0.647  0.875  0.117  0.428  0.363  0.700
           Sp     0.999  0.999  0.994  0.999  0.999  0.996  0.996  0.999
           MCC    0.842  0.745  0.577  0.874  0.277  0.325  0.331  0.737
K-Means    Sn     0.800  1.000  0.352  0.875  0.071  0.714  0.272  0.800
           Sp     0.998  0.999  0.999  0.999  0.997  0.998  0.998  0.998
           MCC    0.799  0.784  0.547  0.824  0.061  0.628  0.339  0.675
FCM        Sn     0.900  0.875  0.764  0.875  0.142  0.571  0.454  0.800
           Sp     1.000  0.999  0.993  1.000  0.997  0.999  0.992  0.999
           MCC    0.948  0.874  0.639  0.935  0.117  0.675  0.291  0.842
EM/GMM     Sn     0.700  0.750  0.705  0.750  0.357  0.571  0.454  0.700
           Sp     1.000  0.999  0.997  1.000  0.994  0.998  0.983  0.999
           MCC    0.836  0.670  0.702  0.865  0.194  0.502  0.163  0.666

Table 4.3 presents the motif finding performances of the evaluated algorithms in terms of quantitative measures, where the best values for a single dataset are indicated in bold. It is clear that none of the algorithms performed better than the others in every studied dataset and measure. The difficulty of modeling motifs based on overrepresentation and background probabilities is a major reason for this. This suggests that no clustering algorithm alone could be perfect for DNA motif discovery. Nonetheless, it is observed that FCM produces 15 out of 24 best performances, whereas K-Means has 7, EM/GMM has 5 and SOM has 2 of the best scores. Thus, FCM can be distinguished among the others since it noticeably outperforms the other


clustering algorithms for nearly all metrics. In addition, comparing the algorithms by their average performance over all datasets supports the same conclusion: all algorithms except FCM perform nearly as well as one another.

Figure 4.6. Average motif finding performances of clustering algorithms

In general, soft-clustering methods seem superior to the hard-clustering approach; FCM, SOM and EM/GMM can be counted in the former group of algorithms. These three clustering algorithms use a type of thresholding strategy when updating clusters with a given input in order to ensure convergence. To this end, SOM updates the neighbors of the BMU, while FCM and EM/GMM update a selection of nodes as described in Sections 3.2.1 and 3.2.2. Seemingly, the selection mechanism, which is utilized by FCM and EM/GMM, improves the performance of FCM. On the other hand, EM/GMM generally suffers from the random initialization and thus does not perform as well as FCM. Despite this performance improvement, FCM and EM/GMM, which utilize the methods given in Equations 7 and 17, respectively, are slower in terms of run-time. Thus, SOM and K-Means seem faster than FCM and EM/GMM when compared by the time taken to complete a given number of iterations over a DNA sequence of a specific length. Figure 4.7 presents the training time in seconds for each algorithm studied. It should be noted that the time taken for SOM, FCM and EM/GMM to complete training may vary according to selected parameters such as neighborhood size, number of clusters and so forth.


Figure 4.7. Training time of each algorithm to cluster LEXA dataset

To further evaluate the performances of the clustering algorithms, they are compared with a well-known motif-finding tool, MEME. This tool can be run online with adjustable settings to perform a motif search with the settings most similar to those of the clustering algorithms studied here. According to the results presented in Table 4.4, MEME is fairly successful at finding motifs for lower organisms; it, however, generally fails for complex ones such as fly and human. LEXA is the only dataset in which MEME with an MCC score of 0.824 is better than all the clustering algorithms. As for GCN4, RFX1 and ARGR1, the results of MEME are observed to be close to those of the clustering algorithms. On the other hand, the clustering algorithms, on average, outperform MEME for the rest of the datasets, DM05, DM06, HM10 and HM17.

Table 4.4. Motif finding performance of MEME for each dataset

Dataset  Sn     Sp     MCC
GCN4     0.200  1.000  0.446
RFX1     0.875  0.999  0.685
ARGR1    0.882  1.000  0.938
LEXA     0.875  0.999  0.824
DM05     0.000  0.998  -0.001
DM06     0.000  0.998  -0.002
HM10     0.000  0.993  -0.004
HM17     0.400  0.993  0.205


The results of MEME are in good agreement with the fact that finding motifs in DNA sequences of eukaryotic organisms still challenges researchers (Tompa et al., 2005; Das and Dai, 2007; Hu et al., 2005). The main reasons for this are the relatively low signal-to-noise ratio and the low-complexity TFBSs of higher organisms when compared to their background model. Nonetheless, as far as the datasets of this study are concerned, the clustering algorithms perform better than MEME on those difficult datasets.

Figure 4.8. Performances of clustering algorithms and MEME for each species

As can be seen in Figure 4.8, the superiority of the clustering algorithms over the eight tested datasets of four separate organisms is also apparent when the average performances of the clustering algorithms for each organism dataset group are compared to those of MEME. As Figure 4.9 depicts, the average performance of the clustering approach exceeds the corresponding results of MEME for three out of four organism dataset groups, whereas MEME is best only for the E. coli datasets.


Figure 4.9. Comparison of average performances of clustering algorithms and MEME for each species

4.4. Evaluation of Post-Optimization for EM/GMM Method

In Section 4.2, the clustering approach is evaluated with the FCM algorithm adapted to the DNA motif-finding task. Subsequently, in Section 4.3, in addition to FCM, three more well-known clustering algorithms, i.e., SOM, K-Means and EM/GMM, are evaluated and compared to each other. As a result, the SOM, K-Means and EM/GMM based methods are observed to perform similarly, whereas FCM clearly outperforms the three methods on average MCC score against the second group of datasets.

In this section, contribution of the post-optimization procedure based on a

Bayesian framework to clustering approach is evaluated against the first (i.e.,

Saccharomyces cerevisiae datasets) and fourth groups of datasets (See Section

3.1 for details of the datasets). EM/GMM method is the selected algorithm on

which this post-optimization is applied. Then its performance will be compared to

FCM, SOM and two literature methods, MEME and MDScan. This time,

SOMBRERO (Mahony et al., 2005) is the SOM implementation that will be

utilized for comparison purposes.

In this new scheme, the method is composed of three steps, two of which are the same as explained in Section 3.2.1. The new step, which is explained in Section 3.2.5, is inserted between the two previously explained steps:

a) Clustering, or locally aligning, w-length windows


b) Fine-tuning the motif models (i.e., PWMs) by using the Bayesian scoring

scheme

c) Selecting the most statistically interesting alignments, i.e., clusters or

PWMs, from the set of clusters by using z-score scheme

Before covering the experimental results of the new scheme, the algorithm initialization should be mentioned. In Section 4.3, the importance of algorithm initialization is discussed and a proposal for the initial parameters of the EM/GMM method is given. According to the proposal, a ratio of 1/15 between the number of clusters and the dataset length performs reasonably well for the EM/GMM method. Since the methodology has now changed, the effect of the number of clusters is reinvestigated. The proportion proposed in the previous section is taken as a starting point, and a range of clusters-to-data-length ratios from 1/100 to 1/5 is investigated.

Figure 4.10. Performance of the algorithm over the number of clusters for each

dataset

As expected, a ratio of 1/100 resulted in poorer performance for most of the datasets. As can be seen in Figure 4.10, performance keeps increasing with the number of clusters in most cases as the proportion is increased to 1/25. In the range of ratios from 1/25 to 1/5, the performance is observed to be stable for some datasets and to improve slightly for the others. Therefore, with the consideration of the heavy computational load that the number of clusters


brings, the most suitable choice of the proportion is observed to be 1/10 for almost all cases. Notably, this new proportion is very close to the one proposed in the previous section.

Each cluster is initialized randomly with a two-step procedure. First, all w-mers are apportioned among the clusters, and then the two parameters of each cluster are calculated in accordance with Equations 11 and 12, respectively. As for the other initial parameters, the length of the sought motif should also be provided by the practitioner as three values, ww, wmin and wmax. Thus, the algorithm operates to discover optimal motifs of length ww, and then in the post-optimization phase it attempts to find the optimal width of the obtained motif in the range from wmin to wmax.

Table 4.5. Comparison of EM/GMM with other algorithms for Saccharomyces cerevisiae datasets in terms of MCC

Dataset  EM/GMM  FCM   SOMBRERO  MEME  MDScan
GAL4     0.91    0.91  0.88      0.91  0.84
HSF1     0.74    0.70  0.51      0.23  0.61
RFX1     0.80    0.78  0.26      0.71  0.83
GCN4     1.00    0.73  0.54      0.34  0.60
CBF1     1.00    0.98  0.89      0.45  0.96

With the above-mentioned settings, the post-optimized EM/GMM, along with FCM, SOMBRERO, MEME and MDScan, is run against the Saccharomyces cerevisiae datasets. SOMBRERO is a downloadable console application that may run under different operating systems. Both MEME and MDScan, on the other hand, are web-based motif-finding programs. Since all sequences in the given datasets may contain more than one instance of the sought motif, MEME is specifically run in “Any number of repetitions per sequence” mode, whereas the other algorithms did not require any setting for such a search. All the algorithms, including the EM/GMM method, required the motif width to be given. Among the set of algorithms, only MDScan and FCM were not suitable for a variable motif-width search. With these settings, all algorithms are run to make 10 motif predictions. For comparison purposes, the best result in terms of MCC from the 10 results is selected as the algorithm's prediction. The best results


of each algorithm per dataset are given in Table 4.5, in which the best value for each dataset is marked in bold.

As can be seen in Table 4.5, the results of the EM/GMM method for the Saccharomyces cerevisiae datasets are generally favorable in comparison to those of the others. Except for the RFX1 dataset, its performance is the best for all the datasets in terms of MCC. In comparison to MEME and MDScan, the average performance of the three clustering-based methods is observed to be better, which suggests the superiority of simultaneous motif-finding techniques over the others. As for the comparison of EM/GMM and FCM, FCM was clearly superior to the EM/GMM method without the post-optimization (see Section 4.3); the post-optimized EM/GMM, in contrast, outperforms FCM. Moreover, the improved performance of EM/GMM over SOMBRERO may also be attributed to the post-optimization procedures employed in the proposed method. On the other hand, the self-organizing map, upon which SOMBRERO is based, is actually a topological map of the given inputs and thus does not necessarily produce optimal clusters. Therefore, the poorer performance of SOMBRERO when compared to EM/GMM may also be related to SOM's non-optimal clustering approach.

In the Saccharomyces cerevisiae datasets, all algorithms generally produced reasonable results, especially for the shorter datasets. In the relatively longer datasets, where the number of motif instances is also high, MEME, which operates on the principle of estimating one motif at a time, performed worst, whereas the others still performed reasonably well. Notably, EM/GMM was the best overall, and MDScan's overall performance, next to that of FCM, was one of the closest to it. MDScan is a hybrid algorithm that combines features of word-enumerative and PWM-based stochastic methods. It also incorporates a Bayesian scoring function (Liu et al., 2002) to optimize the found motif candidates in a post-processing step. From this point of view, the superiority of the EM/GMM method over MDScan supports the effectiveness of the clustering approach, since both of them utilize an iterative post-optimization step based on different but related Bayesian functions. Also, the experimental results over the yeast datasets show that the potential of the proposed method to identify more than one instance per sequence is

Page 68: ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED ... · Mustafa KARABULUT EMPLOYING DATA MINING TECHNIQUES ON BIOLOGICAL SEQUENCES FOR TRANSCRIPTION FACTOR BINDING SITE ... Prof.

4. RESEARCH AND DISCUSSION Mustafa KARABULUT

59

obvious particularly in the cases of GCN4 and CBF1 since Sensitivity is obtained

as 100% in both datasets in which some sequences are expected to contain more

than one instance of the sought motif (i.e., the number of instances is greater than

the number of sequences).

Secondly, all five algorithms are evaluated against the third group of
datasets, which consists of promoter sequences from various organisms including
mammals and E. coli. The experiments are done with settings similar to those used
for the Saccharomyces cerevisiae datasets. For GMM/EM, FCM and SOMBRERO, the
number of clusters (k) is again set to 1/10 of the dataset length. Similarly, the
algorithms are executed for 10 motif predictions and the best result among the 10
is selected as the result of the algorithm. Table 4.6 presents the results of the
second experiment in terms of MCC.

Table 4.6. Comparison of five algorithms for third group of datasets

Dataset    EM/GMM   FCM    SOMBRERO   MEME   MDScan
CREB       0.69     0.47   0.12       0.59   0.69
CRP        0.85     0.65   0.56       0.67   0.50
E2F        0.83     0.57   0.73       0.76   0.73
ERE        0.76     0.14   0.50       0.71   0.66
MEF2       0.46     0.32   0.37       0.88   0.00
MYOD       0.83     0.21   0.23       0.00   0.00
SRF        0.72     0.57   0.75       0.67   0.76
TBP        0.45     0.29   0.15       0.36   0.49
AVERAGE    0.70     0.40   0.43       0.58   0.48

In general, the performance of the algorithms on higher organism datasets is
lower than on the Saccharomyces cerevisiae datasets, which is a known issue
for motif-finding programs (Tompa et al., 2005; Hu et al., 2005). Specifically,
FCM, which performed reasonably well on the yeast datasets, performed relatively
poorly against higher organism DNA sequences. The performances of FCM,
SOMBRERO and MDScan appear to be comparable to each other, whereas
MEME is observed to be superior to them. The overall performance of MEME on
both higher and lower organism datasets is observed to be moderately stable at a
mediocre level, whereas those of the others are not. On the other hand, GMM/EM
still performs reasonably well when compared to the others, including MEME. The
reason is connected with the advantages of the simultaneous motif-finding strategy
that EM/GMM adopts. In simultaneous motif-finding, all possible alignments are
extracted and then the best alignments are selected via a scoring scheme, whereas
in the “one motif at a time” approach, some of the best alignments may be missed
in the first trials (e.g., 10 trials in our case) due to the local maxima present
in the data. The poor performance of FCM and SOMBRERO, which are also
simultaneous motif-finders, may appear to disprove this claim. However, the
problem with FCM and SOMBRERO lies not so much in their local alignment
performance as in their selection of the best motifs. When the best results of
FCM, SOMBRERO and GMM/EM among all motifs (i.e., all k motifs, where k is the
number of clusters) are taken rather than choosing strictly from the top 10 ranked
motifs, their averages improve considerably (see Table 4.7).

Table 4.7. Best results of GMM/EM, FCM and SOMBRERO

           GMM/EM               FCM                  SOMBRERO
Dataset    MCC    Motif rank    MCC    Motif rank    MCC    Motif rank
CREB       0.69   #2            0.47   #7            0.67   #20
CRP        0.85   #1            0.73   #14           0.82   #31
E2F        0.83   #1            0.57   #1            0.73   #3
ERE        0.88   #38           0.24   #11           0.50   #1
MEF2       0.82   #107          0.35   #39           0.81   #28
MYOD       0.83   #6            0.65   #274          0.56   #27
SRF        0.72   #2            0.57   #4            0.82   #117
TBP        0.83   #43           0.42   #666          0.62   #614
AVERAGE    0.80                 0.50                 0.69
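The selection issue can be made concrete by counting, for each method, how many of its best motifs in Table 4.7 already sit within the top 10 ranks. The sketch below only re-reads the (MCC, rank) pairs of Table 4.7; the dictionary layout and the helper name `count_in_top` are illustrative choices, not part of the thesis.

```python
# (MCC, rank) of each method's best motif per dataset, copied from Table 4.7.
best_motifs = {
    "GMM/EM": {"CREB": (0.69, 2), "CRP": (0.85, 1), "E2F": (0.83, 1),
               "ERE": (0.88, 38), "MEF2": (0.82, 107), "MYOD": (0.83, 6),
               "SRF": (0.72, 2), "TBP": (0.83, 43)},
    "FCM": {"CREB": (0.47, 7), "CRP": (0.73, 14), "E2F": (0.57, 1),
            "ERE": (0.24, 11), "MEF2": (0.35, 39), "MYOD": (0.65, 274),
            "SRF": (0.57, 4), "TBP": (0.42, 666)},
    "SOMBRERO": {"CREB": (0.67, 20), "CRP": (0.82, 31), "E2F": (0.73, 3),
                 "ERE": (0.50, 1), "MEF2": (0.81, 28), "MYOD": (0.56, 27),
                 "SRF": (0.82, 117), "TBP": (0.62, 614)},
}

def count_in_top(per_dataset, cutoff=10):
    """Number of datasets whose best motif is ranked within the cutoff."""
    return sum(1 for _, rank in per_dataset.values() if rank <= cutoff)

top10_counts = {method: count_in_top(d) for method, d in best_motifs.items()}
print(top10_counts)  # {'GMM/EM': 5, 'FCM': 3, 'SOMBRERO': 2}
```

Under this reading, GMM/EM places its best motif in the top 10 for five of the eight datasets, against three for FCM and two for SOMBRERO.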

Although EM/GMM, FCM and SOMBRERO utilize a similar z-scoring
scheme, the improved prediction of EM/GMM, including its better selection
performance, relies upon the post-optimization procedure, which sharpens the
distinction between sought motifs and non-motif alignments. Nonetheless, as far as
the best results of the clustering approaches are considered, the advantage of the
simultaneous motif-finding approach over the others is clearly seen. However, in
order to make a fair comparison, only the best results among the 10 predictions of
each algorithm are considered. Figure 4.11 depicts the overall performance of the
algorithms for all datasets.

There are a few factors that limit the performance of motif discovery
programs. Dataset length is the leading example of such factors (Hu et al., 2005;
Das and Dai, 2007). Accordingly, the performances of the considered algorithms
on the TBP dataset, which is a few times larger than the average of the other
datasets, are not as satisfactory as those on the other datasets. However, length
is not the only limiting factor. As the complexity of the organism increases, the
performance of motif discovery algorithms decreases (see Figure 4.12). In
contrast, as the length of the sought motif increases, the performances of the
algorithms are generally observed to increase as well.

Figure 4.11. Overall performances of the algorithms for second group datasets

Figure 4.12. Performance variance of algorithms over two parameters


Table 4.8. Sequence logos of the known motifs and the predicted ones (a)
Columns: Dataset, Known motif, Predicted motif (a)
Rows: GAL4, HSF1, RFX1, GCN4, CBF1, CREB, CRP, E2F, ERE, MEF2, MYOD, SRF, TBP
(sequence logo images not reproduced)

The experiments presented so far are performed with fixed-width motif
search, that is, the three user-provided parameters regarding the sought motif
width, i.e., $w_w$, $w_{min}$ and $w_{max}$, are given the same value $w_k$, which
is the known motif width. Application of Equation 25 in the post-optimization step
enables the proposed method to find the optimal width with respect to these three
parameters. In order to evaluate the ability of the method to find the optimal
motif length, three sets of experiments are done. In the first set (a), all three
parameters are set to $w_k$. In the second set (b), $w_w$ is given the known motif
width whereas $w_{min}$ and $w_{max}$ are given $w_k - 4$ and $w_k + 4$,
respectively. In the third set of experiments (c), $w_w$ is given a random value
$w_r$, where $w_k - 3 \le w_r \le w_k + 3$, while $w_{min}$ and $w_{max}$ are
given $w_r - 4$ and $w_r + 4$, respectively. The motifs obtained from these three
sets of experiments are visualized by use of sequence logos (Crooks et al., 2004).
Sequence logos of the first set of experiments (a) are given in Table 4.8,
whereas the logos resulting from the other sets, (b) and (c), are given in
Table 4.9.
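The three width configurations can be summarized in a short sketch. It assumes that the offsets in the text are read as minus/plus 4 around the search width and minus/plus 3 for the random draw; the function name `width_settings` and the tuple layout are hypothetical.

```python
import random

def width_settings(w_k, case, rng=None):
    """Return (w_w, w_min, w_max) for experiment set 'a', 'b' or 'c'.

    w_k is the known motif width. The +/-4 and +/-3 offsets are an
    assumed reading of the experiment description, not a verified one.
    """
    if case == "a":  # fixed-width search
        return w_k, w_k, w_k
    if case == "b":  # known width, variable range around it
        return w_k, w_k - 4, w_k + 4
    if case == "c":  # random starting width near the known one
        rng = rng or random.Random()
        w_r = rng.randint(w_k - 3, w_k + 3)  # inclusive on both ends
        return w_r, w_r - 4, w_r + 4
    raise ValueError(f"unknown experiment set: {case!r}")

print(width_settings(10, "a"))  # (10, 10, 10)
print(width_settings(10, "b"))  # (10, 6, 14)
```

For a very short known width, the lower bound in sets (b) and (c) can become small, so a real implementation would presumably clamp it to a sensible minimum.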

According to the sequence logos, the proposed method performed highly
accurately for most of the datasets (i.e., GAL4, HSF1, CREB, E2F and MYOD),
since the sought width is found correctly in each of the three separate cases
without any shift in the motif pattern. In the RFX1, GCN4, MEF2 and TBP datasets,
the fixed-width motif search (a) results in a slight shift in the sought pattern.
However, in these four datasets, the post-optimization procedure for finding the
optimal width extends the width of the found pattern so that it may include the
bases that were missed as a result of the undesired shift.

Therefore, even though the width becomes larger than the original one, the
new pattern is inclusive of the whole sought pattern. In the CRP and ERE datasets,
where the start or end of the sought pattern contains poorly conserved
residues, i.e., gaps, the results of the fixed-width motif search contain shifts
in the pattern, while the variable-width results disregard the gaps at the ends
and narrow the width. In such cases, the variable-width search appears to fail,
since it attempts to fit highly conserved bases at the start or end locations.
Nevertheless, the overall performance of the procedure for finding the optimal
width in the two variable-width cases, (b) and (c), is observed to be
satisfactory.

As for the fixed-width motif search, there is a shift issue in some of the
results. This is due to the tendency of the proposed algorithm to primarily
align the most conserved parts of the motif. Thus, when the conserved nucleotides
are aligned, the rest may shift by up to 2-3 nucleotides since they do not have a
significant impact on the score. However, this situation is not considered a
performance reduction in the motif literature (Tompa et al., 2005).


Table 4.9. Sequence logos of the known motifs with predicted ones, (b) and (c)
Columns: Dataset, Known motif, Predicted motif (b), Predicted motif (c)
Rows: GAL4, HSF1, RFX1, GCN4, CBF1, CREB, CRP, E2F, ERE, MEF2, MYOD, SRF, TBP
(sequence logo images not reproduced)

The execution time of almost every motif discovery program is more or less
directly proportional to the dataset length, and the GMM/EM method is not an
exception. An increase in the number of clusters, k, also increases the run-time
since the calculation of Equations 9-14 depends on the choice of k. However, it is
observed that motif length does not have a significant impact on the run-time.
When the three clustering methods, GMM/EM, FCM and SOM, are compared in terms of
run-time, SOM appears to be much faster than the others, as it processes a certain
number of iterations for an input approximately 5000 bp long in 70 seconds,
while the others finish the same task in over 120 seconds. The increased run-time
of GMM/EM with respect to SOM is connected with Equation 17, which boosts the
performance in terms of motif discovery metrics. The selection mechanism in
Equation 17 requires the PDF values for each input to be sorted in each iteration
and consequently causes the algorithm to finish the clustering task in a longer
time. Moreover, the post-processing procedures that attempt to improve the
alignments and find the optimal width also increase the run-time of the algorithm.

4.5. Particle Swarm Optimization to Identify Regulatory Elements

In the previous sections, motif-finding methods based on clustering
algorithms are evaluated. Moreover, in Section 4.4, the effectiveness of the
Bayesian scoring scheme is discussed and demonstrated on various datasets. Beyond
post-optimization, the Bayesian scoring scheme has also proven to be effective as
a fitness function for motif-finding methods from the literature that are based on
stochastic search algorithms such as GA (Wei and Jensen, 2006). To the best of our
knowledge, the Bayesian fitness function has never been utilized with PSO in the
motif-finding literature. Section 3.2.6 presented how PSO and the Bayesian fitness
function can be combined for the motif-finding task. In this section, the proposed
PSO-based method is evaluated against the third and fourth groups of datasets
explained in Section 3.1.

When assessing PSO with these datasets, an essential point to be decided
was the selection of the input parameters, which is widely discussed in the
literature (Shi and Eberhart, 1998) and known to have a significant impact on the
performance of the algorithm. The parameter $\alpha$, the so-called inertia
weight, controls the velocity of the particle. As discussed by many researchers, a
high value for this parameter causes the particles to explore a greater space but
with the risk of jumping over the optimal regions, while a small value facilitates
exploitation of the local search area. To balance the exploration and
exploitation abilities of the particles, researchers generally start the
population with a relatively high value of $\alpha$ and reduce it gradually as the
algorithm iterates. In this study, it is observed that starting with
$0.4 \le \alpha \le 0.8$ and gradually reducing it to $\alpha = 0.1$ results in
good exploration ability while providing sufficient fine-tuning capability.
According to the experiments, taking a greater value, for instance
$\alpha > 1.0$, yielded reduced performance for our application.
As for $\beta$ and $\gamma$, most papers in the literature take equal values for
these parameters, for instance $0.8 \le \beta \le 1.2$ and the same for $\gamma$.
Therefore, in our applications, we consistently used $\beta = 0.8$ and
$\gamma = 0.8$. Furthermore, the population size is another parameter known to
affect the performance of the PSO algorithm. We tried several options from 10 to
300 particles and measured the performance. As expected, small populations, e.g.,
10 particles, result in poorer exploration of the space, while a larger
population, e.g., 300 particles, provides better exploration and exploitation but
with much greater running times (see Figures 4.13 and 4.14). To have a balanced
solution under this trade-off, the swarm size is set to 100 throughout the
experiments.
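The role of the three parameters can be illustrated with the textbook inertia-weight PSO update. This is a sketch under stated assumptions: the thesis's own update rule is the one given in Section 3.2.6, the linear decay schedule is only one possible reading of "gradually reducing", and the function names are hypothetical.

```python
import random

def inertia(iteration, max_iter, alpha_start=0.4, alpha_end=0.1):
    """Linearly decay the inertia weight from alpha_start to alpha_end."""
    frac = iteration / max(1, max_iter - 1)
    return alpha_start + (alpha_end - alpha_start) * frac

def update_velocity(v, x, pbest, nbest, alpha, beta=0.8, gamma=0.8, rng=random):
    """Textbook PSO velocity update for one particle (lists of floats).

    alpha scales the previous velocity, beta the pull towards the
    particle's own best position, gamma the pull towards the best
    position known in its neighbourhood.
    """
    new_v = []
    for vi, xi, pi, ni in zip(v, x, pbest, nbest):
        r1, r2 = rng.random(), rng.random()
        new_v.append(alpha * vi + beta * r1 * (pi - xi) + gamma * r2 * (ni - xi))
    return new_v

print(inertia(0, 3000))  # 0.4
print(update_velocity([1.0], [2.0], [2.0], [2.0], alpha=0.4))  # [0.4]
```

When a particle already sits on both its personal and neighbourhood best, only the inertia term remains, which is why a small $\alpha$ lets the swarm settle.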

With the above-mentioned parameters ($\alpha = 0.4$, $\beta = 0.8$,
$\gamma = 0.8$ and swarm size = 100), four PSO variants are executed separately
for each of the 800 synthetic datasets (the fourth group of datasets). The
quantitative assessment of the experimental results is done by calculating the
F-Score (Wei and Jensen, 2006; Shaw et al., 1997), which combines Precision and
Recall in a single measure. The results of the PSO variants in terms of F-Score
are given in Table 4.10, where the best score for each scenario is given in bold.
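As a reminder of how such a score combines the two measures, the sketch below computes the balanced F-Score (harmonic mean of Precision and Recall). It is assumed here that the balanced form is the one intended; the counts in the example are hypothetical.

```python
def f_score(tp, fp, fn):
    """Balanced F-Score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction: 16 true sites found, 4 false sites, 4 sites missed.
print(round(f_score(tp=16, fp=4, fn=4), 2))  # 0.8
```

Because it is a harmonic mean, the score stays low whenever either Precision or Recall is low, so a method cannot do well by over-predicting sites.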

According to the table, in the scenario where the sought motif is
long, all variants performed well and produced exactly the same F-Scores. In the
short motif search scenario, Bidirectional Ring outperformed or matched the others
in three out of four dataset groups, whereas GBest and Von Neumann were each among
the best for only one dataset group. On average, Bidirectional Ring is observed to
be the best and Random the worst. According to Kennedy (1999), topologies
with fewer connections may perform better in multimodal problems where
there are many local optima. In Bidirectional Ring, each particle has only three
connections including itself, and thus the social information flow is the slowest
of all. Hence, throughout the experiments, even though Von Neumann appeared to
produce the best fitness values on average, Bidirectional Ring was the best in
terms of predicting the true positive sub-sequences that form the sought motif.
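The difference in information flow between the two extreme topologies can be sketched as follows. The particle indexing and the function names are illustrative assumptions; only the "three connections including itself" property of the Bidirectional Ring and the fully connected nature of GBest are taken from the text.

```python
def ring_neighbours(i, n):
    """Particles that particle i exchanges information with in a bidirectional ring."""
    return [(i - 1) % n, i, (i + 1) % n]

def gbest_neighbours(i, n):
    """In the GBest topology every particle sees the whole swarm."""
    return list(range(n))

def neighbourhood_best(i, fitness, neighbours):
    """Index of the fittest particle among i's neighbours (higher is better)."""
    return max(neighbours(i, len(fitness)), key=lambda j: fitness[j])

fitness = [0.2, 0.9, 0.1, 0.4, 0.3]  # hypothetical fitness values
print(neighbourhood_best(3, fitness, ring_neighbours))   # 3
print(neighbourhood_best(3, fitness, gbest_neighbours))  # 1
```

In the ring, particle 3 does not yet know about the strong particle 1, so the swarm is pulled towards a single candidate more slowly than under GBest.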

Table 4.10. Results of PSO variants for synthetic datasets in terms of F-Scores

Datasets                               PSO Variants
Motif Width   Cons.   Dataset Length   GBest   B.Ring   Random   Von Neumann
Short         High    Small            0.78    0.80     0.79     0.79
Short         Low     Small            0.54    0.50     0.46     0.48
Short         High    Large            0.83    0.84     0.83     0.84
Short         Low     Large            0.41    0.46     0.38     0.41
Long          High    Small            0.96    0.96     0.96     0.96
Long          Low     Small            0.85    0.85     0.85     0.85
Long          High    Large            0.98    0.98     0.98     0.98
Long          Low     Large            0.90    0.90     0.90     0.90
AVERAGE                                0.78    0.79     0.77     0.78

In each of the 800 experiments, PSO is run to extract the top 30 motifs
ranked according to their fitness values, and subsequently the result with the
highest F-Score is taken. In approximately 92 percent of the experiments, the
result with the best F-Score was also the top motif with the highest fitness
value. Similarly, in 98 percent of them, the result with the highest F-Score was
ranked in the top 10 according to the fitness value. This observation shows that
the utilized fitness function is appropriate for the motif discovery application
and is in accordance with the sought motif model. Briefly, the simulations over
synthetic datasets put forward two topologies: Von Neumann, with high average
fitness values, and Bidirectional Ring, with the best F-Scores.

We also compared these results with those of GAME, MEME and
BioProspector as reported by Chan et al. (2008). Table 4.11 presents the
comparison of the proposed method with these tools. In the table, two separate
results are given for PSO alongside the results of the other tools. The first
PSO column collects the best results of the PSO variants for each synthetic
dataset group, whereas the second column contains only the results of
Bidirectional Ring, which performed best among the four PSO topologies.

Table 4.11. Performance comparison of motif-finding tools for synthetic datasets

Datasets                               Methods
Motif Width   Cons.   Dataset Length   PSO (Best)*   PSO (B.Ring)   GAME   MEME   BioPros.
Short         High    Small            0.80          0.80           0.75   0.85   0.78
Short         Low     Small            0.54          0.50           0.30   0.39   0.39
Short         High    Large            0.84          0.84           0.83   0.83   0.76
Short         Low     Large            0.46          0.46           0.36   0.42   0.45
Long          High    Small            0.96          0.96           0.97   0.98   0.97
Long          Low     Small            0.85          0.85           0.82   0.88   0.83
Long          High    Large            0.98          0.98           0.98   0.98   0.96
Long          Low     Large            0.90          0.90           0.90   0.90   0.80
AVERAGE                                0.82          0.79           0.79   0.78   0.74

* Selection of best PSO results from Table 4.10 for each dataset

When both PSO results are compared to the other tools, PSO appears to be
superior overall. In the long motif datasets, where all methods
performed fairly well, MEME outperforms the others on average F-Score.
However, in the short motif width datasets, which are noisier and more
challenging, PSO performs much better. Additionally, on the average of all dataset
results, PSO is still the best with F-Scores of 0.82 and 0.79. Another notable
point is that GAME is a GA-based motif-discovery method that has similarities with
the proposed method, in that it utilizes the same Bayesian framework. Although
they share the same fitness model, the results of the proposed method are clearly
better than those of GAME; hence, PSO compares favorably with GA in this setting.

In addition to the first experiment over the synthetic datasets, PSO and
the above-mentioned algorithms are also evaluated against the third group of
datasets of the study. With these datasets, all PSO variants are executed
separately while the input parameters of PSO used in the synthetic dataset
experiments are kept the same ($\alpha = 0.4$, $\beta = 0.8$, $\gamma = 0.8$ and
swarm size = 100). The respective results of the PSO variants for the datasets are
given in Table 4.12. As seen in the table, on average scores, Bidirectional Ring
performed the best and, similar to the synthetic dataset experiments, the second
in the comparison was Von Neumann. This time, GBest could not perform as well as
Von Neumann did, and the Random topology scored the worst, although in a few cases
it was the best. Also notably, Von Neumann was the best in terms of the number of
winning scores in the comparison. Nonetheless, the performances of all the
topologies except Random were comparable to each other on average F-Scores.

Table 4.12. Performances of PSO variants for 8 real datasets

Dataset    GBest   B.Ring   Random   Von Neumann
CREB       0.70    0.72     0.54     0.72
CRP        0.77    0.86     0.78     0.86
E2F        0.61    0.82     0.56     0.82
ERE        0.65    0.65     0.67     0.60
MEF2       0.91    0.79     0.88     0.79
MyOD       0.43    0.38     0.41     0.34
SRF        0.73    0.73     0.79     0.73
TBP        0.81    0.82     0.76     0.84
AVERAGE    0.70    0.72     0.67     0.71

As mentioned before, the number of particles has a significant influence on
the performance. For the 8 real datasets, several population configurations are
tested separately while the other input settings are preserved. As a result, 100
particles appeared to be the best choice since it brings reasonable
performance for all datasets tested. Figure 4.13 depicts the F-Score performances
of Bidirectional Ring, which has appeared to be the best PSO variant so far, for
the 8 real datasets.

Figure 4.13 also shows a remarkable point: PSO performs reasonably
well even with 10 particles for all of the 8 real datasets except MyOD and TBP. It
is not surprising to see PSO with 10 particles fail on the TBP dataset, since it
contains 95 sequences of length 200 nucleotides. For this relatively very large
dataset, in which the sought motif is by contrast short, PSO required more
particles, at least 100, to perform satisfactorily. In MyOD, although the dataset
is not so large, the sought motif is again very short, only 6 nucleotides. The
overall performance of PSO remains poor for MyOD even when the population size is
increased to 300 particles.
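To give a feel for the size of such a dataset, the number of positions at which a motif of a given width could start can be counted directly. This is a back-of-the-envelope sketch: the helper name is hypothetical, and the width of 6 used for the TBP-sized example is an illustrative value (the text states 6 nucleotides only for MyOD).

```python
def candidate_sites(num_sequences, seq_length, motif_width):
    """Number of distinct start positions a motif of this width can occupy."""
    per_sequence = max(0, seq_length - motif_width + 1)
    return num_sequences * per_sequence

# 95 sequences of 200 nucleotides, with an assumed motif width of 6.
print(candidate_sites(95, 200, 6))  # 18525
```

A swarm of 10 particles samples only a tiny fraction of these candidates per iteration, which is consistent with the need for a larger population on this dataset.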


Figure 4.13. Performance of PSO per number of particles

Figure 4.14. Consumed time by PSO with different number of particles

Although increasing the population size generally appears to improve the
performance, it is not always the best option since it brings a heavy
computational burden leading to highly increased processing times. Figure 4.14
clearly presents the direct relationship between the population size and the
amount of time consumed by PSO to process each dataset with a maximum of 3000
iterations. Throughout the experiments, once PSO converges, the algorithm is
terminated without executing all 3000 iterations. Hence, for some datasets the
consumed time is observed to decrease while the population size is increased,
since with a larger population PSO may reach the optimum in fewer
iterations. In this context, the CRP dataset is a special case in which PSO
required the greatest amount of time. The most distinguishing characteristic of
CRP is the width of the sought motif, which is 22 nucleotides long, nearly double
that of the other relatively long motifs. The reason behind this highly increased
operation time was the PFM and consequent PWM calculations from the starting
locations of the predicted sites. In order to understand the operational costs of
each PSO variant, Figure 4.15 gives the time consumption values for each PSO
variant, while Figure 4.14 shows the average amount of time over all PSO variants.
Since the information flow in Bidirectional Ring is quite slow, it only terminates
at or near the maximum number of iterations. Hence, it is the most time-consuming
topology of all. As for GBest, the convergence is fast because of the immediate
communication between particles; hence, it is the least time-consuming topology
among the four.
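The dependence of this step on the motif width can be seen in a minimal sketch of building a position frequency matrix (PFM) and a log-odds position weight matrix (PWM) from predicted start positions. The uniform background, the pseudocount and the one-site-per-sequence layout are simplifying assumptions made for illustration; they are not the Bayesian scoring used in the thesis.

```python
import math

BASES = "ACGT"

def pfm_from_sites(sequences, starts, width):
    """Count each base at each motif column, taking one site per sequence."""
    pfm = [{b: 0 for b in BASES} for _ in range(width)]
    for seq, start in zip(sequences, starts):
        for col, base in enumerate(seq[start:start + width]):
            pfm[col][base] += 1  # assumes sequences contain only A, C, G, T
    return pfm

def pwm_from_pfm(pfm, background=0.25, pseudocount=1.0):
    """Log-odds weights against a uniform background, with pseudocounts."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        pwm.append({b: math.log2((column[b] + pseudocount) / total / background)
                    for b in BASES})
    return pwm

seqs = ["ACGTAC", "TTACGT", "GACGTT"]          # toy sequences
pfm = pfm_from_sites(seqs, [0, 2, 1], width=4)  # every site reads ACGT
pwm = pwm_from_pfm(pfm)
print(pfm[0]["A"], round(pwm[0]["A"], 2))
```

Both functions loop over every motif column, so the work per fitness evaluation grows with the width, which fits the longer run-times reported for the 22-nucleotide CRP motif.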

Figure 4.15. Consumed time by each PSO variant for third group of datasets

Finally, we compared the performance of PSO with the other motif-finding
tools previously analyzed in this study. As can be seen in Table 4.13, the average
performances of both PSO results, which are based on the best results of all
variants and on Bidirectional Ring, respectively, are superior to those of GAME,
MEME and BioProspector.


Table 4.13. Comparison of motif-finding tools for third group of datasets

Dataset    PSO(Best)*   PSO(B.Ring)   GAME   MEME   BioPros.
CREB       0.72         0.72          0.58   0.59   0.67
CRP        0.86         0.86          0.79   0.67   0.78
E2F        0.82         0.82          0.87   0.76   0.46
ERE        0.67         0.65          0.69   0.71   0.68
MEF2       0.91         0.79          0.71   0.88   0.71
MyOD       0.43         0.38          0.31   0.00   0.00
SRF        0.79         0.73          0.79   0.67   0.70
TBP        0.84         0.82          0.81   0.36   0.71
AVERAGE    0.76         0.72          0.69   0.58   0.59

* Selection of best PSO results from Table 4.12 for each dataset

In real organism datasets (i.e., third group of datasets), GAME appears to

be performing better than MEME and BioProspector. However, in the synthetic
datasets, MEME was superior to GAME and BioProspector. The point is that none of
the other methods performed well in both the synthetic and the real dataset
simulations, while PSO was consistently better in both. Even with a small number
of particles, it showed the ability to reach a satisfactory optimum where the
datasets were highly multimodal, i.e., where there were several local optima
composed of false positive sub-sequences. Along with several strategies to
alleviate trapping at local optima, four PSO neighbourhood topologies are utilized
and compared to each other. It should be noted that research in this context
(Kennedy and Mendes, 2002) points out that none of the population topologies is
superior to the others for all application fields. Nonetheless, the results of
this study suggest Bidirectional Ring as the best performing PSO topology for the
motif discovery application.


5. CONCLUSIONS

This thesis studies the effectiveness of DNA motif discovery methods
developed on the basis of data mining techniques. The proposed methods can be
categorized into three types: a) clustering-based methods, b) post-optimized
clustering methods, and c) a method based on a stochastic search procedure.
All proposed methods are tested against four groups of datasets in total,
including Saccharomyces cerevisiae, synthetic and other organism datasets.
Moreover, the effectiveness of the developed tools is assessed against
state-of-the-art methods from the literature such as MEME and MDScan.

The clustering approach to motif finding mainly relies on the
overrepresentation of the motif instances within the DNA sequences. The idea is
that overrepresented subsequences can be gathered into the same cluster after an
adequate amount of training is performed. Consequently, ranking the statistical
significance of the clusters, taking into account that the motif instances diverge
from the background model, will reveal statistically interesting subsequence
alignments, i.e., motifs. This approach is implemented in four clustering
algorithms, namely FCM, SOM, K-means and EM/GMM. The problem with the clustering
approach is that the original algorithms are designed for vectors in a numerical
space and are not suitable for subsequence clustering in the discrete space of DNA
sequences. Therefore, the original clustering algorithms are modified for
this particular task. These algorithm adaptations and updates are presented in
Section 3. With the algorithm updates, FCM alone is evaluated in Section 4.2 and
observed to be promising when compared to MEME and MDScan. Subsequently,
all clustering algorithms are compared to each other and to MEME in Section 4.3.
All the algorithms except FCM performed similarly, and they all performed
reasonably well with lower organism datasets. However, all of them, including FCM,
generally fail in higher organism DNA datasets such as those of human and fly,
which is a known issue for computational DNA motif discovery tools (Tompa et al.,
2005; Hu et al., 2005). When compared to MEME and other statistical methods, the
clustering technique appears to be more capable of catching weak motifs residing
in the given sequences, thanks to the simultaneous motif-finding approach.
Remarkably, in motif-finding, soft clustering algorithms such as FCM and SOM are
observed to be more successful than hard clustering algorithms such as K-means.
GMM, which can be counted as a member of the former group, did not perform as
well as FCM and SOM since it was more sensitive to the initial parameters than SOM
and FCM were.
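The gap between discrete subsequences and numerical clustering inputs can be illustrated with a small sketch that pools all overlapping w-mers and maps each to a numeric vector. The one-hot encoding shown here is a common generic choice and is only an assumption for illustration; the adaptations actually used are the ones presented in Section 3.

```python
def extract_wmers(sequences, w):
    """All overlapping subsequences of length w, pooled across sequences."""
    return [seq[i:i + w] for seq in sequences for i in range(len(seq) - w + 1)]

def one_hot(wmer, alphabet="ACGT"):
    """Flatten a w-mer into a 4*w numeric vector (one indicator per base)."""
    return [1.0 if base == letter else 0.0 for base in wmer for letter in alphabet]

wmers = extract_wmers(["ACGTA", "GG"], 3)  # the second sequence is too short
print(wmers)          # ['ACG', 'CGT', 'GTA']
print(one_hot("AC"))  # [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Pooling the w-mers in this way discards which sequence each one came from, which is the property that the post-optimization step later compensates for.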

Secondly, a post-optimization procedure based on a Bayesian fitness function
(Jensen and Liu, 2004) is evaluated. The post-optimization is applied to the
EM/GMM approach, which appeared to be less effective in comparison to FCM and SOM.
According to the experimental results over two different groups of datasets in
Section 4.4, the post-optimization improves the results of the clustering approach
to such an extent that the post-optimized EM/GMM clearly outperforms FCM and the
other compared tools in terms of motif-finding performance measures. This
demonstrates the effectiveness of the post-optimization procedure based on the
Bayesian framework. In the clustering approach, any number of motif instances in
each sequence is considered equally by the methods, that is, the w-mers are
considered regardless of the sequence they belong to. However, in real-world
scenarios most of the given sequences should contain at least one TFBS instance of
a common motif. In the post-processing step, the motif models extracted from the
DNA sequences via clustering are optimized with this biological reality taken into
account. Additionally, the Bayesian function is used as a scoring system to
find the optimal width of the sought motif by varying the length between two
user-provided values.

The merits of the Bayesian framework are not specific to the clustering
approach. Methods in the literature also demonstrate the effectiveness and
versatility of the Bayesian function, such as the BioOptimizer program (Jensen
and Liu, 2004) and GAME (Wei and Jensen, 2006), a GA-based motif-finding tool.
Therefore, thirdly, a PSO-based method that uses the Bayesian function to
evaluate the fitness of particles was proposed and evaluated on a large number
of datasets, including synthetic and real data. The experimental results show
that the proposed PSO-based algorithm is promising and effective for the
motif-finding task in DNA sequences. It performed fairly well in comparison to
MEME, MDScan and GAME. The experiments also lead to another conclusion: for
motif-finding, the Bidirectional Ring topology outperformed the other
topologies, namely GBest, Random and Von Neumann. The literature on PSO
population topologies (Kennedy and Mendes, 2002; Kennedy, 1999) suggests that
the superior performance of Bidirectional Ring is due to the small number of
connections between its particles, which leads to slow but mature convergence
in multimodal problem domains.
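The effect of the topology can be seen from how each particle's neighborhood is built. The sketch below uses common textbook definitions (each particle is included in its own neighborhood, and the Von Neumann grid wraps around); these conventions and the function names are assumptions and may differ from the thesis implementation. The Random topology is omitted.

```python
def ring_neighbors(i, n):
    """Bidirectional ring: left neighbor, the particle itself, right neighbor."""
    return [(i - 1) % n, i, (i + 1) % n]

def gbest_neighbors(i, n):
    """GBest: every particle is connected to every other particle."""
    return list(range(n))

def von_neumann_neighbors(i, rows, cols):
    """Von Neumann: self plus up, down, left, right on a wrapped grid."""
    r, c = divmod(i, cols)
    return [i,
            ((r - 1) % rows) * cols + c,
            ((r + 1) % rows) * cols + c,
            r * cols + (c - 1) % cols,
            r * cols + (c + 1) % cols]

def neighborhood_best(neighbors, fitness):
    """Index of the fittest particle among the given neighbors."""
    return max(neighbors, key=lambda j: fitness[j])

if __name__ == "__main__":
    fitness = [0.1, 0.5, 0.2, 0.9, 0.3]
    # Particle 0 sees the global best under GBest but only a local best on the ring.
    print(neighborhood_best(gbest_neighbors(0, 5), fitness))
    print(neighborhood_best(ring_neighbors(0, 5), fitness))
```

With the ring, information about the best particle reaches distant particles only after several iterations, which is consistent with the slow but mature convergence noted above.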

As briefly discussed above, although the developed algorithms performed well on
most of the datasets, they still have drawbacks that cannot be avoided. First,
none of the algorithms guarantees the global optimum; that is, they all tend to
fall into local optima. Secondly, because the mathematical models of TFBS
patterns in both lower and higher organisms are inadequate, computational
methods are still not reasonably accurate in all situations. As the
mathematical models of the underlying biology improve, the accuracy of the
computational tools will eventually improve as well. Nonetheless, researchers
have proposed various algorithm enhancements, some of which might be useful in
future studies of the methods proposed in this thesis. For instance, combining
relatively weak methods to form a stronger predictor, i.e., ensemble methods,
is a proven methodology (Wijaya et al., 2008). That is, the methods proposed in
the thesis may perform better if used within an ensemble strategy. Secondly,
incorporating biological knowledge beyond the DNA sequence itself, such as
phylogenetic data (Wang and Stormo, 2003), may also enhance the performance.

Furthermore, from the computer science point of view, the literature on
improvements to the utilized algorithms, including FCM, EM and PSO, in other
fields may also be useful in the motif-finding field. For instance, for the
PSO-based method, several PSO variants have been proposed to enhance the
original PSO and remove its drawbacks. The fully informed PSO (Mendes et al.,
2004), multi-objective PSO (Reyes-Sierra and Coello, 2006) and the PSO with
constricted parameters (Shi and Eberhart, 1998) are popular examples. We
believe that these variants should also be evaluated to further improve the
performance of the proposed PSO-based method.
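These variants all modify the basic velocity update. As a point of reference, a minimal sketch of the widely used inertia-weight form of that update follows. It is the textbook formulation rather than the exact update used in the thesis; the default values of w, c1 and c2 are placeholders, and the function name is hypothetical.

```python
import random

def update_velocity(v, x, pbest, nbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Inertia-weight PSO velocity update for a single particle.

    v, x, pbest and nbest are equal-length lists (velocity, position,
    personal best, neighborhood best). rng only needs a random() method
    returning a float in [0, 1); fresh draws are made per dimension.
    """
    new_v = []
    for vd, xd, pd, nd in zip(v, x, pbest, nbest):
        r1, r2 = rng.random(), rng.random()
        new_v.append(w * vd              # inertia: keep part of the old velocity
                     + c1 * r1 * (pd - xd)   # pull toward the personal best
                     + c2 * r2 * (nd - xd))  # pull toward the neighborhood best
    return new_v
```

Which particle supplies `nbest` depends on the population topology, so the same update can be reused unchanged across the GBest, ring and Von Neumann settings.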


REFERENCES

BAILEY, T. L. and ELKAN, C., 1995. The value of prior knowledge in

discovering motifs with MEME. Proceedings of International Conference

on Intelligent Systems for Molecular Biology, 3: 21-9.

BEZDEK, J. C., 1981. Pattern recognition with fuzzy objective function

algorithms, New York, Plenum Press.

BIOINFORMATICS Wiki, 2011, http://bioinformatics.org/wiki

BLANCO, E., FARRE, D., ALBA, M. M., MESSEGUER, X. and GUIGO, R.,

2006. ABS: a database of Annotated regulatory Binding Sites from

orthologous promoters. Nucleic acids research, 34: D63-7.

BURSET, M. and GUIGO, R., 1996. Evaluation of gene structure prediction

programs. Genomics, 34: 353-67.

CHAKRAVARTY, A., CARLSON, J. M., KHETANI, R. S. and GROSS, R. H.,

2007. A novel ensemble learning method for de novo computational

identification of DNA binding sites. BMC bioinformatics, 8: 249.

CARLSON, J. M., CHAKRAVARTY, A., DEZIEL, C. E. and GROSS, R. H.,

2007. SCOPE: a web server for practical de novo motif discovery. Nucleic

acids research, 35: 259-64.

CHAN, T. M., LEUNG, K. S. and LEE, K. H., 2008. TFBS identification based

on genetic algorithm with combined representations and adaptive post-

processing. Bioinformatics, 24: 341-9.

CHE, D., JENSEN, S., CAI, L. and LIU, J. S., 2005. BEST: binding-site

estimation suite of tools. Bioinformatics, 21: 2909-11.

CROOKS, G. E., HON, G., CHANDONIA, J. M. and BRENNER, S. E., 2004.

WebLogo: a sequence logo generator. Genome research, 14: 1188-90.

DAS, M. K. and DAI, H. K., 2007. A survey of DNA motif finding algorithms.

BMC bioinformatics, 8 Suppl 7: S21.

DAS, S., KONAR, A. and CHAKRABORTY, U. K. 2005. Improving particle

swarm optimization with differentially perturbed velocity. Proceedings of

the 2005 conference on Genetic and evolutionary computation.

Washington DC, USA: 177-184


DO, C. B. and BATZOGLOU, S., 2008. What is the expectation maximization

algorithm? Nature biotechnology, 26: 897-9.

DEMPSTER, A. P., LAIRD, N. M. and RUBIN, D. B., 1977. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal
Statistical Society, Series B, 39: 1-38.

EBERHART, R. and KENNEDY, J., 1995. A new optimizer using particle swarm

theory. Proceedings of the Sixth International Symposium on Micro

Machine and Human Science: 39-43.

ECOGENE, 2009. The EcoGene Database of Escherichia coli Sequence and

Function, http://ecogene.org

ELNITSKI, L., JIN, V. X., FARNHAM, P. J. and JONES, S. J., 2006. Locating

mammalian transcription factor binding sites: a survey of computational

and experimental techniques. Genome research, 16: 1455-64.

FRITH, M. C., HANSEN, U., SPOUGE, J. L. and WENG, Z., 2004. Finding

functional sequence elements by multiple local alignment. Nucleic acids

research, 32: 189-200.

GORDON, D. B., NEKLUDOVA, L., MCCALLUM, S. and FRAENKEL, E.,

2005. TAMO: a flexible, object-oriented framework for analyzing

transcriptional regulation using DNA-sequence motifs. Bioinformatics, 21:

3164-5.

GUTTMACHER, A. E. and COLLINS, F. S., 2003. Welcome to the genomic era.

The New England journal of medicine, 349: 996-8.

HARDIN, C. T. and ROUCHKA, E. C., 2005. DNA Motif Detection Using

Particle Swarm Optimization and Expectation-Maximization. Proceedings

of the IEEE Swarm Intelligence Symposium, 2005: 181-184.

HERTZ, G. Z., HARTZELL, G. W., 3RD and STORMO, G. D., 1990.

Identification of consensus patterns in unaligned DNA sequences known

to be functionally related. Computer applications in the biosciences :

CABIOS, 6: 81-92.

HU, J., LI, B. and KIHARA, D., 2005. Limitations and potentials of current motif

discovery algorithms. Nucleic acids research, 33: 4899-913.


HU, J., YANG, Y. D. and KIHARA, D., 2006. EMD: an ensemble algorithm for

discovering regulatory motifs in DNA sequences. BMC bioinformatics, 7:

342.

JENSEN, S. T. and LIU, J. S., 2004. BioOptimizer: a Bayesian scoring function

approach to motif discovery. Bioinformatics, 20: 1557-64.

JOISEN, K., LIU, M. C., LIAO, T. W. and TRIANTAPHYLLON, E. 2002. An

evaluation of sampling methods for data mining with fuzzy C-means.

Kluwer Academic Publishers.

KENNEDY, J., 1999. Effects of neighborhood topology on particle swarm

performance, Proceedings of the 1999 Congress on Evolutionary

Computation, pp. 1938

KENNEDY, J. and MENDES, R., 2002. Population structure and particle swarm
performance. Proceedings of the 2002 Congress on Evolutionary Computation, 2:
1671-1676.

KOHONEN, T., 1998. The self-organizing map, Neurocomputing, 21(1-3): 1-6

LARSSON, E., LINDAHL, P. and MOSTAD, P., 2007. HeliCis: a DNA motif

discovery tool for colocalized motif pairs with periodic spacing. BMC

bioinformatics, 8: 418.

LAWRENCE, C. E., ALTSCHUL, S. F., BOGUSKI, M. S., LIU, J. S.,

NEUWALD, A. F. and WOOTTON, J. C., 1993. Detecting subtle

sequence signals: a Gibbs sampling strategy for multiple alignment.

Science, 262: 208-14.

LEI, C. and RUAN, J., 2010. A particle swarm optimization-based algorithm for

finding gapped motifs. BioData mining, 3: 9.

LIU, F. F. M., TSAI, J. J. P., CHEN, R. M., CHEN, S. N. and SHIH, S. H., 2004.
FMGA: Finding Motifs by Genetic Algorithm. Proceedings of the 4th IEEE
Symposium on Bioinformatics and Bioengineering. IEEE Computer
Society: 459

LIU, D., XIONG, X., DASGUPTA, B. and ZHANG, H., 2006. Motif discoveries

in unaligned molecular sequences using self-organizing neural networks.

IEEE transactions on neural networks / a publication of the IEEE Neural

Networks Council, 17: 919-28.


LIU, X., BRUTLAG, D. L. and LIU, J. S., 2001. BioProspector: discovering
conserved DNA motifs in upstream regulatory regions of co-expressed
genes. Pacific Symposium on Biocomputing: 127-38.

LIU, X. S., BRUTLAG, D. L. and LIU, J. S., 2002. An algorithm for finding

protein-DNA binding sites with applications to chromatin-

immunoprecipitation microarray experiments. Nature biotechnology, 20:

835-9.

LUSCOMBE, N. M., GREENBAUM, D. and GERSTEIN, M., 2001. What is

bioinformatics? A proposed definition and overview of the field. Methods

of information in medicine, 40: 346-58.

MAHONY, S., HENDRIX, D., GOLDEN, A., SMITH, T. J. and ROKHSAR, D.

S., 2005. Transcription factor binding site identification using the self-

organizing map. Bioinformatics, 21: 1807-14.

MATYS, V., FRICKE, E., GEFFERS, R., GOSSLING, E., HAUBROCK, M.,

HEHL, R., HORNISCHER, K., KARAS, D., KEL, A. E., KEL-

MARGOULIS, O. V., KLOOS, D. U., LAND, S., LEWICKI-POTAPOV,

B., MICHAEL, H., MUNCH, R., REUTER, I., ROTERT, S., SAXEL, H.,

SCHEER, M., THIELE, S. and WINGENDER, E., 2003. TRANSFAC:

transcriptional regulation, from patterns to profiles. Nucleic acids research,

31: 374-8.

MCLACHLAN, G. M. and KRISHNAN, T., 1997. The EM Algorithm and
Extensions, Wiley series in probability and statistics. John Wiley and
Sons.

MENDES, R., KENNEDY, J. and NEVES, J., 2004. The Fully Informed Particle
Swarm: Simpler, Maybe Better. IEEE Transactions on Evolutionary
Computation, 8: 204-210.

PAVESI, G., MAURI, G. and PESOLE, G., 2001. An algorithm for finding

signals of unknown length in DNA sequences. Bioinformatics, 17 Suppl 1:

S207-14.


PEVZNER, P. A. and SZE, S. H., 2000. Combinatorial approaches to finding

subtle signals in DNA sequences. Proceedings of International Conference

on Intelligent Systems for Molecular Biology, ISMB, 8: 269-78.

POLI, R., 2008. Analysis of the publications on the applications of particle swarm

optimisation. J. Artif. Evol. App., 2008: 1-10.

REYES-SIERRA, M. and COELLO, C. A. C., 2006. Multi-Objective Particle

Swarm Optimizers: A Survey of the State-of-the-Art. International Journal

of Computational Intelligence Research, 2 (3).

ROTH, F. P., HUGHES, J. D., ESTEP, P. W. and CHURCH, G. M., 1998.

Finding DNA regulatory motifs within unaligned noncoding sequences

clustered by whole-genome mRNA quantitation. Nature biotechnology,

16: 939-45.

SANDELIN, A., ALKEMA, W., ENGSTROM, P., WASSERMAN, W. W. and

LENHARD, B., 2004. JASPAR: an open-access database for eukaryotic

transcription factor binding profiles. Nucleic acids research, 32: D91-4.

SANDVE, G. K. and DRABLOS, F., 2006. A survey of motif discovery methods

in an integrated framework. Biology direct, 1: 11.

SCHNEIDER, T. D. and STEPHENS, R. M., 1990. Sequence logos: a new way to

display consensus sequences. Nucleic acids research, 18: 6097-100.

SGD Project, 2008. "Saccharomyces Genome Database"

http://www.yeastgenome.org/

SHAW, W. M. J., BURGIN, R. and HOWELL, P., 1997. Performance standards

and evaluations in IR test collections: cluster-based retrieval models. Inf.

Process. Manage., 33: 1-14.

SHI, Y. and EBERHART, R. C. 1998. Parameter Selection in Particle Swarm

Optimization. Proceedings of the 7th International Conference on

Evolutionary Programming VII. Springer-Verlag: 591-600

SINHA, S. and TOMPA, M., 2000. A statistical method for finding transcription

factor binding sites. Proceedings of International Conference on Intelligent

Systems for Molecular Biology, 8: 344-54.

STINE, M., 2003. Motif discovery in upstream sequences of coordinately

expressed genes. CEC’03, USA, 1596–1603


STORMO, G. D., 2000. DNA binding sites: representation and discovery.

Bioinformatics, 16: 16-23.

THIJS, G., LESCOT, M., MARCHAL, K., ROMBAUTS, S., DE MOOR, B.,

ROUZE, P. and MOREAU, Y., 2001. A higher-order background model

improves the detection of promoter regulatory elements by Gibbs

sampling. Bioinformatics, 17: 1113-22.

TOMPA, M., 1999. An exact method for finding short motifs in sequences, with

application to the ribosome binding site problem. International Conference

on Intelligent Systems for Molecular Biology: 262-71.

TOMPA, M., LI, N., BAILEY, T. L., CHURCH, G. M., DE MOOR, B., ESKIN,

E., FAVOROV, A. V., FRITH, M. C., FU, Y., KENT, W. J., MAKEEV,

V. J., MIRONOV, A. A., NOBLE, W. S., PAVESI, G., PESOLE, G.,

REGNIER, M., SIMONIS, N., SINHA, S., THIJS, G., VAN HELDEN, J.,

VANDENBOGAERT, M., WENG, Z., WORKMAN, C., YE, C. and

ZHU, Z., 2005. Assessing computational tools for the discovery of

transcription factor binding sites. Nature biotechnology, 23: 137-44.

WANG, T. and STORMO, G. D., 2003. Combining phylogenetic data with co-

regulated genes to identify regulatory motifs. Bioinformatics, 19: 2369-80.

WEI, Z. and JENSEN, S. T., 2006. GAME: detecting cis-regulatory elements

using a genetic algorithm. Bioinformatics, 22: 1577-84.

WIJAYA, E., YIU, S. M., SON, N. T., KANAGASABAI, R. and SUNG, W. K.,

2008. MotifVoter: a novel ensemble method for fine-grained integration of

generic motif finders. Bioinformatics, 24: 2288-95.

XINCHAO, Z., 2010. A perturbed particle swarm algorithm for numerical

optimization. Appl. Soft Comput., 10: 119-124.

VAN HELDEN, J., ANDRE, B. and COLLADO-VIDES, J., 1998. Extracting

regulatory sites from the upstream region of yeast genes by computational

analysis of oligonucleotide frequencies. Journal of molecular biology, 281:

827-42.

YAOCHU, J., LIPO, W., 2009. Fuzzy Systems in Bioinformatics and

Computational Biology, Springer Berlin / Heidelberg


YANOVER, C., SINGH, M. and ZASLAVSKY, E., 2009. M are better than one:

an ensemble-based motif finder and its application to regulatory element

prediction. Bioinformatics, 25: 868-74.

ZHOU, H., SCHAEFER, G. and SHI, C. 2009. Fuzzy C-Means Techniques for

Medical Image Segmentation. Springer Berlin / Heidelberg.

ZHOU, W., ZHOU, C., LIU, G., HUANG, Y., 2005. Identification of

Transcription Factor Binding Sites Using Hybrid Particle Swarm

Optimization. Rough Sets, Fuzzy Sets, Data Mining, and Granular

Computing, Springer Berlin / Heidelberg, 438-445


CURRICULUM VITAE

Mustafa KARABULUT was born on March 29th, 1979, in Gaziantep, Türkiye. He
received the BSc degree in Computer Engineering from Çanakkale 18 Mart
University in 2001 and the MSc degree in Electrical-Electronics Engineering
from Kahramanmaraş Sütçü İmam University in 2007. He worked as a software
developer in an IT company between 2001 and 2003. Since 2003, he has been
working as an instructor at the Vocational School of Higher Education of the
University of Gaziantep.